Situation
I have a website written in PHP.
In PHP, I can extract the text inside a pdf file uploaded to the same website and so on.
I found the tabula-java github repo.
So what's the issue?
I have tried the mac app for tabula. I noticed that I needed to highlight a certain section of the pdf before the table data can be converted.
However, that's not what I want to accomplish. I want to run tabula in the background and on demand. When my website receives a file upload and certain conditions are satisfied, I want to call tabula as a service somehow, feed it the unstructured data, and get back the tabulated data.
How do I go about doing this?
One way is to wrap the tabula-extractor command-line tool: run it from your application and read its output back in.
For example, in R, the tabulizer package works this way.
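A minimal sketch of that wrapping approach in PHP, assuming a tabula-java jar has been downloaded to the server and `java` is on the PATH. The jar path and both function names are hypothetical. The `-p` (pages), `-f` (output format) and `-g` (guess the table area) options are taken from tabula-java's command-line help; `-g` stands in for highlighting the region by hand, and it is worth checking the flags against the version you install.

```php
<?php
// Hypothetical location of the tabula-java jar; adjust to your install.
const TABULA_JAR = '/opt/tabula/tabula.jar';

/** Build the shell command that asks tabula-java to print CSV on stdout. */
function buildTabulaCommand(string $pdfPath, string $pages = 'all'): string
{
    return sprintf(
        'java -jar %s -p %s -f CSV -g %s',
        escapeshellarg(TABULA_JAR),
        escapeshellarg($pages),
        escapeshellarg($pdfPath)
    );
}

/**
 * Run tabula-java and return the extracted rows as arrays of cells.
 * Assumption: no cell contains a line break, so each output line is one row.
 */
function extractTables(string $pdfPath): array
{
    exec(buildTabulaCommand($pdfPath), $lines, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("tabula-java failed with exit code $exitCode");
    }
    // Parse each CSV line; the empty escape character disables PHP's
    // non-standard backslash escaping.
    return array_map(
        fn (string $line): array => str_getcsv($line, ',', '"', ''),
        $lines
    );
}
```

An upload handler could then call `extractTables($uploadedPath)` once its conditions are met and store the returned rows.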
This is just a speculative idea for a client who has a lot of PDF files.
Algolia say in their FAQs that to search PDF files you first need to extract the text from the file. How would you go about this?
The way I envisage the system working would be:
Client uploads PDF via CMS
CMS calls some service / program to extract the text
Algolia indexes the extracted text and it's somehow linked to the original PDF
It would need to be an automated system as the client shouldn't have to tell it to index.
It would be built in PHP, probably Laravel running on Ubuntu.
What software / service could do the text extraction from the PDFs and is any magic needed to 'link' this with the PDF file?
I'm also happy to have suggestions on other search services which may handle this.
Fortunately, text extraction from PDFs is a subject that has been covered multiple times. On the command line, you could use pdftotext (available on Linux or Mac), or in your code a library such as Apache Tika (for which you can find a PHP wrapper).
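As a rough sketch of the command-line route, pdftotext can be called from PHP like this, assuming the binary is installed and on the PATH. The function names are hypothetical; `-layout` and the `-` output name (write to stdout) are standard pdftotext usage.

```php
<?php
/** Build the pdftotext command; "-" as the output file means stdout. */
function buildPdfToTextCommand(string $pdfPath): string
{
    return sprintf('pdftotext -layout %s -', escapeshellarg($pdfPath));
}

/** Return the text of a PDF, or throw if pdftotext produced nothing. */
function pdfToText(string $pdfPath): string
{
    $output = shell_exec(buildPdfToTextCommand($pdfPath));
    if (!is_string($output)) {
        throw new RuntimeException("pdftotext returned no output for $pdfPath");
    }
    return $output;
}
```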
To avoid having too much noise in your records, I'd recommend then splitting the text and creating one record per paragraph. You can then use Algolia's distinct feature to deduplicate the results.
You should already have the links to your files somewhere; just store them in your records. In your front-end, you'll then easily be able to create links to them using, for instance, autocomplete.js or instantsearch.js.
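The record-per-paragraph idea could look roughly like the sketch below. The function name and the attribute names `file_id`, `url` and `content` are made up for illustration; only `objectID` is an attribute Algolia itself defines. The sketch assumes paragraphs in the extracted text are separated by blank lines.

```php
<?php
/**
 * Split extracted text into paragraphs and build one record per paragraph.
 * Every record carries the file's id and URL so results can link back to it.
 */
function buildRecords(string $fileId, string $fileUrl, string $text): array
{
    // Assumption: a blank line marks a paragraph boundary.
    $paragraphs = preg_split('/\R\s*\R/', trim($text));
    $records = [];
    foreach ($paragraphs as $paragraph) {
        // Collapse line breaks and runs of spaces inside the paragraph.
        $paragraph = trim(preg_replace('/\s+/', ' ', $paragraph));
        if ($paragraph === '') {
            continue;
        }
        $records[] = [
            'objectID' => $fileId . '-' . count($records),
            'file_id'  => $fileId,   // candidate for attributeForDistinct
            'url'      => $fileUrl,  // link back to the original PDF
            'content'  => $paragraph,
        ];
    }
    return $records;
}
```

With `attributeForDistinct` set to `file_id` and `distinct` enabled in the index settings, a search should return at most one hit per PDF, and the stored `url` gives the front-end its link.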
For anyone still looking for a solution, I put together a GitHub repository that does exactly that: https://github.com/PDFTron/pdftron-document-search.
The text extraction happens client-side as the user uploads the document using React + Firebase + Algolia.
You can check out a quick video walking you through the sample app: https://youtu.be/IQATnzHTp7Q.
Let me know if you have any questions.
Is there any way to avoid losing content when inserting an image into a filled PDF? I am using the fpdm.php script from here, and it works pretty well, I might add. I pass the PDFs I am using through pdftk, as in pdftk.exe insert.pdf output output.pdf, so they can be filled via PHP without throwing errors.
My problem is this: I have a PDF template which I fill with an array passed from PHP and output to the browser or server, and that works fine. But when I try to insert an image into it, the image is inserted and all the filled data is lost. I need to retain that data. I can't use pdftk because I'm on a GoDaddy shared hosting plan. I know the Setasign scripts work, but I am trying to find a way without buying anything yet.
I found this stamper, which stamps fine but loses the PDF data (all boxes get blanked), and also this one, which places the image and loses all data too. Setasign is doing some magic stuff right there.
All the mentioned scripts use FPDI in the background, which does not modify the original document. Instead, it lets you create a completely new PDF document by importing another one page by page into reusable structures (XObjects). Because form fields and other dynamic content such as links or any other annotation type are not part of a page's content stream, they get lost.
The mentioned "magic" of the SetaPDF products is that they modify the original document. Because of this, all content is retained.
I am writing a system at the moment where a user can select 1 item from each of 8 arrays (creating 8 options). They then have the option to print out their selections as a PDF.
The PDF is generated using DOMPDF, and the data is sent to it via POST. I obviously need to test that each combination of options prints correctly (to my mind this means a hell of a lot of manual testing).
Is there anything I can do with a bash script to automate the testing process? As the content of the arrays will never change, would it be possible to write a testing script of some kind that I can fire through the browser or the terminal?
If the PDF-Files for one combination of options are always identical, I would do the following:
I would create a "Master-PDF" for each option-combination. Then I would create the PDF-Files on the local filesystem, convert them to images and subtract them from the images of the appropriate Master-PDF (You only need to create them once).
When the resulting image is blank (every pixel is pure black), the two are identical, so the generated PDF looks the same as the master. If it is not blank, something differs from the master and the test fails.
In combination with PHPUnit, you should be able to automate the testing rather easily.
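To drive such a test over every option combination, the combinations first have to be enumerated. The sketch below shows only that step (the image comparison itself is not covered); the function name and the sample option groups are hypothetical.

```php
<?php
/**
 * Return every way of picking exactly one choice from each option group
 * (the Cartesian product), keyed by group name.
 */
function optionCombinations(array $optionGroups): array
{
    $combinations = [[]];
    foreach ($optionGroups as $group => $choices) {
        $extended = [];
        foreach ($combinations as $partial) {
            foreach ($choices as $choice) {
                // Extend each partial selection with one choice from this group.
                $extended[] = $partial + [$group => $choice];
            }
        }
        $combinations = $extended;
    }
    return $combinations;
}
```

Each returned array could serve as the POST payload for one generated PDF, for example through a PHPUnit data provider, with the result compared against the master for that combination.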
We currently use PDForm to grab a blank pdf file (no values, just form fields and text) and list the form fields. We then query our database for the values which match those field names and create a pdf file with the newly populated data which the user can download from our site. The thing is PDForm is about $5,000 per machine and we are migrating servers. We want an alternative which is actively supported and recommended by the community.
I know Zend is working on a PDF manipulation extension, but we need something quick. I have done testing with PDFtk but the last update for that project was in 2006 and it now seems dead. It would be fine as it is open source, however it seems to be causing errors with certain files that seem to be generated with PDFPenPro (our pdf form creator).
Another solution I thought up was why not just use iText and write a java wrapper which accepts command line input, so that PHP can call it with passthru() or exec(). There are other applications that will work should we completely rewrite our code but we do not want to do that.
What we need.
The ability for PHP to receive the PDF form field names.
PHP to then either create an FDF file (then merge it with the PDF) or send a string to a command line application which will populate the fields with values from our database.
The user can then download the newly created PDF file with the populated form fields.
Am I moving in the right direction by creating a Java command line application that will use iText to parse and create the PDF files specified by PHP, or does anyone know of any cost-effective alternatives?
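For the FDF route mentioned in the requirements, a minimal FDF file can be generated from PHP roughly as follows. This is a sketch under assumptions: the function names are hypothetical, values are treated as plain ASCII text fields (other encodings and field types need more care), and the field names must match those in the PDF form.

```php
<?php
/** Escape backslashes and parentheses for a PDF literal string. */
function fdfEscape(string $value): string
{
    return strtr($value, ['\\' => '\\\\', '(' => '\\(', ')' => '\\)']);
}

/** Build a minimal FDF document from a field-name => value map. */
function buildFdf(array $fields): string
{
    $entries = '';
    foreach ($fields as $name => $value) {
        $entries .= '<< /T (' . fdfEscape((string) $name) . ')'
                  . ' /V (' . fdfEscape((string) $value) . ") >>\n";
    }
    return "%FDF-1.2\n"
         . "1 0 obj\n"
         . "<< /FDF << /Fields [\n" . $entries . "] >> >>\n"
         . "endobj\n"
         . "trailer\n"
         . "<< /Root 1 0 R >>\n"
         . "%%EOF\n";
}
```

The resulting file can then be merged into the blank form by a separate tool; with PDFtk, for instance, that is `pdftk form.pdf fill_form data.fdf output filled.pdf`.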
TCPDF seems to have the most robust feature set that I have seen so far.
Thanks, d2burke, for the tip on TCPDF. I'm not trying to do quite as much as the OP, but the software packages available to accomplish any kind of pdf generation are in the $2k to $3k range. TCPDF is php based, open source and the guy developing it is very supportive.
Always donate to these guys! Where in the world would web development be without it?
So, since none of the above solutions would work (TCPDF doesn't handle forms the way we want, and PDFlib converts the form fields to blocks), we decided to create a command line wrapper for iText which grabs the form field names from the PDF and then populates them based on the database values.
I don't know if another product whose license costs range from $1k to $3k could be considered "cost effective", but PDFlib works quite nicely. And if you don't need the PPS functionality, it does get cheaper.
I have to make an image of a dynamic page, i.e. the page changes every 5 minutes.
I want to make images of that page each time it changes so that I can keep its records saved in the form of images.
How can I do that using PHP?
I have no idea how to approach this, so a little elaboration in the answers would be highly appreciated!
Two steps:
1: Create a script that captures the current data in image form.
If you provide more information about what you mean when you say "create an image of dynamic data", I can probably point you to some resources you can use. For now, just have a look at the GD library.
2: Set up a job that runs the script every 5 minutes
This can be done via Cron. I would suggest investigating if you can run the script when the data changes, instead of at specific intervals.
http://www.devarticles.com/c/a/PHP/Generating-Images-on-the-Fly-With-PHP/
http://www.thesitewizard.com/php/create-image.shtml
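The two steps above might be sketched as follows, assuming the GD extension is enabled and the "dynamic data" is available to the script as lines of text. This draws the data into a PNG; it is not a browser screenshot of the rendered page. The function names, the file naming scheme and the script path in the crontab comment are all hypothetical.

```php
<?php
// Step 2, as a crontab entry that runs the script every 5 minutes:
//   */5 * * * * php /var/www/tools/snapshot.php

/** Name the image after the capture time, e.g. snapshot-20240102-0304.png. */
function snapshotFilename(DateTimeInterface $time): string
{
    return 'snapshot-' . $time->format('Ymd-Hi') . '.png';
}

/** Step 1: draw the current data, one line of text per row, into a PNG. */
function renderSnapshot(array $lines, string $path): void
{
    $lineHeight = 16;
    $height = max(1, count($lines)) * $lineHeight + 8;
    $image = imagecreatetruecolor(640, $height);
    $white = imagecolorallocate($image, 255, 255, 255);
    $black = imagecolorallocate($image, 0, 0, 0);
    imagefill($image, 0, 0, $white);
    foreach (array_values($lines) as $i => $line) {
        // Built-in GD font 3, with a 4-pixel margin.
        imagestring($image, 3, 4, 4 + $i * $lineHeight, (string) $line, $black);
    }
    imagepng($image, $path);
}
```

The script run by cron would fetch the current data and then call `renderSnapshot($lines, snapshotFilename(new DateTimeImmutable()))`.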
Getting a screenshot of a web page isn't an easy task.
You can choose one of the online services that do that for you and you can download the images from there.
Otherwise, I have found a solution using WebKit and Python, but you will need full access to your Linux server in order to install the necessary packages. You will then be able to call that script from PHP and get your screenshots.