Search through PDF files with PHP

I'm trying to find a way to search inside PDF files. I came across the PHP PDF class, but I can't seem to find any function for reading or searching a file stream.
So, naive as I am, I tried to simply get a stream using file_get_contents(); obviously the output looks encrypted ;)
So my question: is there any way to search through PDF files? I'm looking for script-only, free, open source solutions rather than buying some expensive commercial library.

XPDF?
There is a blog post here that may be of help.
There seems to be some code here that could help - a simple class that reads a PDF into plaintext. Unsure if it supports decryption.
There are also a number of resources in the PHP documentation that may help you.
FPDF and FPDI may also help. Probably your best bet after some research.

A PHP search engine called Sphider has the option of adding PDF search via XPDF. You can then customise the result templates to fit in with the rest of your site (if applicable).
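For the XPDF route, a minimal sketch might look like the following. It assumes the pdftotext command-line tool (shipped with Xpdf and Poppler) is installed and on the PATH, and that the server is Unix-like (the 2>/dev/null redirect). The function names pdfToText and searchText are made up for illustration. It will not find anything in scanned PDFs that have no text layer.

```php
<?php
// Sketch only: assumes the `pdftotext` binary (Xpdf/Poppler) is installed.

/**
 * Extract the text layer of a PDF by shelling out to pdftotext.
 * Returns null if the command fails (missing binary, unreadable file, ...).
 */
function pdfToText(string $pdfPath): ?string
{
    // "-" as the output file tells pdftotext to write to stdout.
    $cmd = 'pdftotext ' . escapeshellarg($pdfPath) . ' - 2>/dev/null';
    exec($cmd, $lines, $exitCode);

    return $exitCode === 0 ? implode("\n", $lines) : null;
}

/**
 * Case-insensitive search; returns matching lines keyed by 1-based line number.
 */
function searchText(string $text, string $needle): array
{
    $hits = [];
    foreach (explode("\n", $text) as $i => $line) {
        if (stripos($line, $needle) !== false) {
            $hits[$i + 1] = trim($line);
        }
    }

    return $hits;
}

// Usage (hypothetical file name):
// $text = pdfToText('manual.pdf');
// if ($text !== null) {
//     print_r(searchText($text, 'invoice'));
// }
```

Keeping the extraction and the search separate means the extracted text can be cached or indexed once instead of re-running pdftotext for every query.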

Related

Extracting text from PDFs in PHP

I'm creating a PHP-based web application which allows the user to upload a PDF file. This file will then be read and checked for certain data (text).
The problem is I can't figure out how to even open a PDF file in PHP. There are some PDF libraries, mainly for creating PDFs, but they don't seem to be very good at reading them.
An alternative would be to use an existing solution in Python or something else (as described in other threads on this site), but I'd really like to stay in PHP as much as possible, as I intend to export the data to MySQL etc. later.
Any input on how to read a PDF and extract data from it would be much appreciated.
I personally haven't tried this out, but it looks like this one works: http://www.pdfparser.org/documentation
It's just a matter of downloading and telling your code to include it, just like the documentation shows.
Or you could try the class.pdf2text.php found at http://www.phpclasses.org/browse/file/31030.html
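As a rough sketch of the first suggestion: the library documented at pdfparser.org is distributed as smalot/pdfparser, and the example below assumes it was installed with Composer and that vendor/autoload.php has already been required. The function names and the list of required phrases are made up for illustration.

```php
<?php
// Sketch only: assumes smalot/pdfparser is installed via Composer and
// vendor/autoload.php has been required before these functions are called.

/**
 * Return the plain text of an uploaded PDF.
 */
function extractPdfText(string $pdfPath): string
{
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($pdfPath);

    return $pdf->getText();
}

/**
 * Report which of the required phrases occur in the extracted text.
 * Whitespace is collapsed first because PDF text often breaks lines
 * in the middle of a phrase.
 */
function findRequiredData(string $text, array $required): array
{
    $normalized = preg_replace('/\s+/', ' ', $text);

    $result = [];
    foreach ($required as $phrase) {
        $result[$phrase] = stripos($normalized, $phrase) !== false;
    }

    return $result;
}

// Usage (hypothetical upload field and phrases):
// $text = extractPdfText($_FILES['document']['tmp_name']);
// $found = findRequiredData($text, ['student name', 'grade']);
```

The resulting array can then be validated and written to MySQL like any other form data.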

Dynamic PDF generation

This is what I'm trying to do. I have a Student Result application in which I'd like to print a PDF version of a specially designed result sheet:
http://www.4shared.com/photo/yg8vCjYe/results_layout.html
My question: is it possible to send all the HTML, CSS and PHP variables from the final result sheet to the PDF engine, or should I design a new result_printout.php page and implement the PDF engine on that page?
I'd appreciate your honest opinions.
Thanks for your help.
Honestly, I haven't done this before, but I think this should help:
http://www.rustyparts.com/pdf.php
Since you already have the HTML, I would suggest using wkhtmltopdf. There are also some wrappers for PHP. It's always a bit tricky to get it all working the right way, especially with page breaks.
But I usually find it easier to use than all the other PHP PDF creation classes/libraries.
I have only worked with mPDF, and you can pretty much send all the HTML, CSS and PHP variables to the PDF engine and tell it either to force a download or to send the PDF to the browser, where the user can save, print or just view it. You can also email the generated PDF.
You can read the documentation here: http://mpdf1.com/manual/
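A minimal sketch of that approach, assuming a current mPDF (version 7 or later, installed with Composer, which uses the \Mpdf\Mpdf class). The function names and the shape of the $scores array are made up for illustration; the real result sheet would have its own markup and CSS.

```php
<?php
// Sketch only: assumes mPDF 7+ is installed via Composer and
// vendor/autoload.php has been required before sendResultPdf() is called.

/**
 * Build the result sheet HTML from PHP variables.
 * Values are escaped so that names like "A & B" cannot break the markup.
 */
function renderResultSheet(string $studentName, array $scores): string
{
    $rows = '';
    foreach ($scores as $subject => $score) {
        $rows .= sprintf(
            '<tr><td>%s</td><td>%d</td></tr>',
            htmlspecialchars($subject, ENT_QUOTES, 'UTF-8'),
            $score
        );
    }

    return '<h1>Results for ' . htmlspecialchars($studentName, ENT_QUOTES, 'UTF-8') . '</h1>'
        . '<table border="1"><tr><th>Subject</th><th>Score</th></tr>' . $rows . '</table>';
}

/**
 * Hand the HTML to mPDF and send the PDF to the client.
 */
function sendResultPdf(string $html, bool $download = true): void
{
    $mpdf = new \Mpdf\Mpdf();
    $mpdf->WriteHTML($html);
    // 'D' forces a download, 'I' shows the PDF inline in the browser.
    $mpdf->Output('result.pdf', $download ? 'D' : 'I');
}

// Usage, e.g. in result_printout.php (hypothetical data):
// sendResultPdf(renderResultSheet('Ann', ['Math' => 90, 'Physics' => 85]));
```

Building the HTML in one function means the same markup can serve both the on-screen result page and the PDF printout.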

PHP PDF template library with PDF output? [closed]

Is there any PHP PDF library that can replace placeholder variables in an existing PDF, ODT or DOCX document, and generate a PDF file as the end result, without screwing up the layout?
Requirements:
Needs no 3rd party web service
Ability to run on shared web hosting would be ideal (no binary installations / packages required)
Mind you, a library that is able to load an existing PDF file and insert text programmatically at a specific position is not enough for my use case.
As far as my research shows, there is no library that can do this:
TCPDF can only generate documents from scratch
FPDI can read existing PDF templates, but can only add contents programmatically (no template variable replacement)
There are various DOCX/ODT template libraries out there but they don't output PDF
PHPDOCx claims to be able to do exactly what I need - but they don't offer a trial version and I'm not going to buy a cat in a bag, especially not when there seems to be no other product on the web that does this. I find it hard to believe they can do this without problems - if you have successfully done this using the product, please drop a line here.
Am I overlooking something?
Is there a way to do this using PDF forms? I am creating the source documents in OpenOffice 3.
I may be able to use standard Linux commands (pdftk is available for example, trying that out right now.)
Update: *Argh!* I was called out of the office and the bounty expired in the meantime. Starting a new bounty: As far as my testing shows, no solution works for me perfectly yet.
Update II: I will be looking into the pdftk approach soon, but I am also starting another bounty for one more round of collecting additional input. This question has now seen 1300 rep points in bounties, must be some kind of a record :)
This is not very practical, but for completeness: If you already have an ODT template, then you might very well retain that as template. Modifying the OpenDocument content.xml and replacing placeholders therein is pretty simple. If so, you could use unoconv or pyodconverter to transform the ODT into a final PDF.
unoconv -f pdf -o final.pdf template.odt
Very obviously this requires a full OpenOffice setup (UNO and Writer) on the webserver. And obviously not every webhoster would go with that! haha. Even if it's simple on any Debian or Fedora setup. The execution speed would probably not be stellar either. But then it might be the cleanest approach, since OOo governs both formats way better than any PHP class ever could.
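The "replace placeholders in content.xml" step could be sketched like this. It assumes PHP's zip extension is available and that placeholders are written as {{name}} in the template; that convention is made up here. It only works if the placeholder was typed in one go, because Writer may otherwise split it across several XML elements.

```php
<?php
// Sketch only: assumes ext-zip and a hypothetical {{placeholder}} convention.

/**
 * Replace {{key}} placeholders in an XML string with escaped values.
 */
function fillPlaceholders(string $xml, array $values): string
{
    $map = [];
    foreach ($values as $key => $value) {
        // Escape the value so user data cannot break the XML document.
        $map['{{' . $key . '}}'] = htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    }

    return strtr($xml, $map);
}

/**
 * Copy the ODT template and rewrite content.xml inside the copy.
 */
function fillOdtTemplate(string $templatePath, string $outputPath, array $values): bool
{
    if (!copy($templatePath, $outputPath)) {
        return false;
    }

    $zip = new ZipArchive();
    if ($zip->open($outputPath) !== true) {
        return false;
    }

    $xml = $zip->getFromName('content.xml');
    if ($xml === false) {
        $zip->close();
        return false;
    }

    // addFromString() overwrites the existing entry of the same name.
    $zip->addFromString('content.xml', fillPlaceholders($xml, $values));

    return $zip->close();
}

// Usage (hypothetical file names), followed by the unoconv call shown above:
// fillOdtTemplate('template.odt', 'filled.odt', ['name' => 'Pekka']);
// exec('unoconv -f pdf -o final.pdf ' . escapeshellarg('filled.odt'));
```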
Pekka,
I looked into this previously. I think you can use pdftk (a command line utility) to fill in a PDF form using FDF/XFDF data files, which you could easily generate from within PHP. That was the best option I've seen so far, though there may well be a native library.
pdftk is quite useful in general, worth having a look at.
Update: Have a look here: http://php.net/manual/en/book.fdf.php
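To illustrate the pdftk route without the FDF extension, here is a sketch that writes the XFDF file by hand and passes it to pdftk's fill_form operation. It assumes pdftk is installed and that the PDF already contains form fields whose names match the array keys. The function names are made up for illustration.

```php
<?php
// Sketch only: assumes the pdftk binary is installed and the source PDF
// contains form fields named like the keys of $fields.

/**
 * Build an XFDF document (XML form data) from field name => value pairs.
 */
function buildXfdf(array $fields): string
{
    $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
        . '<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">' . "\n"
        . "<fields>\n";

    foreach ($fields as $name => $value) {
        $xml .= sprintf(
            "<field name=\"%s\"><value>%s</value></field>\n",
            htmlspecialchars((string) $name, ENT_XML1 | ENT_QUOTES, 'UTF-8'),
            htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES, 'UTF-8')
        );
    }

    return $xml . "</fields>\n</xfdf>\n";
}

/**
 * Fill the form with pdftk; "flatten" makes the filled fields non-editable.
 */
function fillPdfForm(string $formPdf, string $outputPdf, array $fields): bool
{
    $xfdfPath = tempnam(sys_get_temp_dir(), 'xfdf');
    if ($xfdfPath === false) {
        return false;
    }
    file_put_contents($xfdfPath, buildXfdf($fields));

    $cmd = sprintf(
        'pdftk %s fill_form %s output %s flatten',
        escapeshellarg($formPdf),
        escapeshellarg($xfdfPath),
        escapeshellarg($outputPdf)
    );
    exec($cmd, $output, $exitCode);
    unlink($xfdfPath);

    return $exitCode === 0;
}

// Usage (hypothetical file and field names):
// fillPdfForm('form.pdf', 'filled.pdf', ['customer' => 'Pekka', 'total' => '42.00']);
```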
Have you considered using something like XSL Formatting Objects (XSL-FO)? Basically they're XML documents that are processed and turned into PDFs. Doing string (or better, DOM) replacements within them should be pretty simple. XSL-FO supports embedding images, links, annotations, etc.
It's not PHP, but there are a number of PHP wrappers for it, along with ways of using it via exec, etc. Not ideal, but it takes care of the template portion completely. For some more info: http://techportal.inviqa.com/2009/12/16/transforming-xml-with-php-and-xsl/
There's an implementation available as an Apache project - http://xmlgraphics.apache.org/fop/
FPDF, and there is another extension on top of it, whose name I can't remember, that allows you to import templates.
Your best bet would be to generate the entire document on the fly, with the template defined programmatically using FPDF or something similar. That way, your text will not be cut off by paragraphs or anything like that, and you can easily position images and other elements as required.
Late, but you can use the open source template designer https://github.com/applicius/dhek/releases to define placeholders/areas over any existing PDF, then load the result in PHP (it's in JSON format) and write onto the original PDF accordingly using the FPDF library, to generate a custom PDF with dynamic data written on it.
Although it's not exactly what you asked for, you might consider doing it in two steps: use a PHP templating system (Smarty, Dwoo) to generate an HTML page, then convert it to PDF with a tool like Html2Pdf. I am using it, and the results are good (no problems with page layout etc.).
Of course, it depends on your input documents (can you use HTML instead of PDF/ODT as the source?) and the complexity of their layout.
OK, I'll try to help you solve the problem a little.
First, the answers to a couple of your questions.
Q - Am I overlooking something?
A - No. There is a PHP PDF library that can replace placeholder variables in an existing PDF and generate a PDF file as the end result, without screwing up the layout
Q - Is there a way to do this using PDF forms?
A - Yes, absolutely. The trick is to use PDF forms.
For both, you can use Justin Koivisto's PHP library for filling PDF form fields.
For more detail, go to http://koivi.com/fill-pdf-form-fields/tutorial.php.
Take a look there for additional information.
Credit to Justin Koivisto for his work
P.S.
As a workaround for displaying table-like output from a PDF form,
consider reading the Oracle Business Intelligence Publisher User's Guide - Creating a PDF Template
I'll add this new answer since the FDF PHP extension is now dead.
I've just followed these instructions and ended up executing one Perl script and then the pdftk command.
I'm well aware it's far from being a real PHP solution, but it's reliable and fairly easy to implement on any *nix platform.
The tools described there are also available on Debian, just in case you were wondering.
It's a little bit late, but have a look at the PDFTemplate Library; it does exactly what you want. You can create OpenDocument files (ODT) and add placeholders to them. The PDFTemplate library can fill in these placeholders (even with images) and create a PDF file.
ODT Files with placeholders to PDF

PHP HTML to PDF free convertor Resources

What is the best free PHP HTML to PDF converter around, not just in terms of functionality but also in terms of resource usage and speed?
Thanks
Have a look at the open-source FPDF library.
Check dompdf, an HTML to PDF converter written in PHP. No external dependencies, it supports complex tables, images and even external style sheets.
http://www.digitaljunkies.ca/dompdf/
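A minimal dompdf sketch, assuming a current release installed with Composer (the \Dompdf\Dompdf class); older releases from the URL above were loaded differently. The helper names are made up for illustration.

```php
<?php
// Sketch only: assumes a Composer-installed dompdf and that
// vendor/autoload.php has been required before htmlToPdfString() is called.

/**
 * Wrap an HTML fragment in a full document with an explicit UTF-8
 * charset, so non-ASCII characters are not misread by the converter.
 */
function wrapHtmlDocument(string $bodyHtml, string $css = ''): string
{
    return '<!DOCTYPE html><html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<style>' . $css . '</style></head><body>' . $bodyHtml . '</body></html>';
}

/**
 * Convert an HTML document to PDF and return the PDF bytes.
 */
function htmlToPdfString(string $html): string
{
    $dompdf = new \Dompdf\Dompdf();
    $dompdf->loadHtml($html);
    $dompdf->setPaper('A4', 'portrait');
    $dompdf->render();

    return $dompdf->output();
}

// Usage (hypothetical content and output file):
// $pdf = htmlToPdfString(wrapHtmlDocument('<p>Hello</p>', 'p { color: red; }'));
// file_put_contents('hello.pdf', $pdf);
```

Since the conversion runs entirely in PHP, memory and time grow with document size, so it is worth measuring with a realistic page before committing to it.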
If you want to be really clever about it, you could programmatically create a new Google doc containing your HTML and CSS, then programmatically export it as a PDF. No resource usage on your part, and it works very well.
Start here:
http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=tcpdf
We recently used this on a project with quite a bit of luck. I don't know that it will go straight from HTML to PDF like you are looking for, but it is a good set of tools.

Direct Print webpage in PDF file

On my site I'm fetching my MySQL data using PHP. I want to open that data in a PDF file when I click a PDF print button. Is this possible?
First of all, if you want a high-quality professional product to do that, you want Prince XML.
If you are looking for an open source tool to achieve something similar, you can look at this SO question.
You could prepare a static PDF form file, then just fill it in with values using PHP's FDF module.
It depends on which platform you are using. This would be an easy job if you are using Groovy on Grails. There are plugins which facilitate PDF reporting, like the jasper-plugin.
Luis
Check out jsPDF, an open-source library for generating PDF documents using nothing but JavaScript.
You can process the data with Apache FOP after transforming it to XML. (http://xmlgraphics.apache.org/fop/).
If your page is template based, you can create a template which produces XML output and process that. You'll have very fine control over the PDF construction. The tradeoff is that it is not a "plug this in and it will work" solution, but I've done it, and once it's set up it works like a charm.
I've used TCPDF in the past, it's a little kludgy but can definitely get the job done. (http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=tcpdf)
The FPDF module in PHP is simple enough to get the data together. It is a safe option since you know what data you are passing out to the PDF engine. There are some streaming pdf options which can take in a bunch of html and then output that to pdf however they can get it quite wrong without you knowing.
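For the FPDF route, a rough sketch of turning database rows into a simple table. It assumes FPDF 1.8 or later (for the Output('F', ...) argument order) and that fpdf.php has already been loaded. The function names are made up, and $rows stands in for whatever the MySQL query returns.

```php
<?php
// Sketch only: assumes FPDF 1.8+ and that fpdf.php has been required
// before rowsToPdf() is called.

/**
 * Split the usable page width evenly across the widest row.
 * 190 mm is A4 width (210 mm) minus FPDF's default 10 mm side margins.
 */
function equalColumnWidth(array $rows, float $pageWidth = 190.0): float
{
    if ($rows === []) {
        return $pageWidth;
    }
    $columns = max(array_map('count', $rows));

    return $columns > 0 ? $pageWidth / $columns : $pageWidth;
}

/**
 * Write the rows as a bordered table and save the PDF to a file.
 * Note: FPDF's core fonts are not UTF-8 aware, so non-ASCII text may
 * need converting before it is passed to Cell().
 */
function rowsToPdf(array $rows, string $outputPath): void
{
    $width = equalColumnWidth($rows);

    $pdf = new FPDF();
    $pdf->AddPage();
    $pdf->SetFont('Arial', '', 12);

    foreach ($rows as $row) {
        foreach ($row as $cell) {
            $pdf->Cell($width, 8, (string) $cell, 1); // 1 = draw a border
        }
        $pdf->Ln(); // move to the start of the next line
    }

    $pdf->Output('F', $outputPath);
}

// Usage (hypothetical rows, e.g. fetched with PDO or mysqli):
// rowsToPdf([['Name', 'Score'], ['Ann', 90], ['Bob', 85]], 'report.pdf');
```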
I have used wkhtmltoimage/wkhtmltopdf on Linux machines a number of times, on many projects. It works like a charm and is easy to use: just a script that you run.
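Calling it from PHP could be sketched as follows, assuming the wkhtmltopdf binary is installed and on the PATH. The function names are made up for illustration; the source can be a local HTML file or a URL.

```php
<?php
// Sketch only: assumes the wkhtmltopdf binary is installed on the server.

/**
 * Build the shell command; both arguments are escaped so that file
 * names or URLs with spaces or shell metacharacters are safe.
 */
function buildWkhtmltopdfCommand(string $source, string $outputPdf): string
{
    return 'wkhtmltopdf --quiet ' . escapeshellarg($source) . ' ' . escapeshellarg($outputPdf);
}

/**
 * Run the conversion; returns true if wkhtmltopdf exited successfully.
 */
function convertToPdf(string $source, string $outputPdf): bool
{
    exec(buildWkhtmltopdfCommand($source, $outputPdf), $output, $exitCode);

    return $exitCode === 0;
}

// Usage (hypothetical URL and output file):
// convertToPdf('http://localhost/report.php?id=7', '/tmp/report.pdf');
```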
