PDF text extractor class in PHP

Is there a PHP class that extracts all the text from a PDF file so that I can store it in a MySQL database? My PDF has many elements, such as images, tables, plain text, form elements and charts.
Over the last two days I have looked at many classes that extract text, but none of them extracts the complete text of the PDF.
I want to extract all the text from a given PDF file, even text that sits inside tables.
Does anyone know of one? :)
Thanks a lot. Have a nice day :)

See this related question:
Reading the clean text from PDF with PHP

If you are running this on a Linux server, you could try a pdf2text tool: call it via exec, then grab the contents of the output file.
Note that there are a few PDF-to-text scripts around, and your mileage will vary with each.
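A minimal sketch of the exec approach, assuming the converter is pdftotext (from Xpdf or Poppler) and that it is on the server's PATH. The function names are made up for illustration, and the quoting produced by escapeshellarg() depends on the platform.

```php
<?php
// Build the shell command; escapeshellarg() guards against odd characters in file names.
function buildPdfToTextCommand(string $pdfPath, string $txtPath): string
{
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($txtPath);
}

// Run the converter and return the contents of the output file, or null on failure.
function extractTextViaExec(string $pdfPath): ?string
{
    $txtPath = tempnam(sys_get_temp_dir(), 'pdf');
    if ($txtPath === false) {
        return null;
    }
    exec(buildPdfToTextCommand($pdfPath, $txtPath), $output, $exitCode);
    $text = ($exitCode === 0) ? file_get_contents($txtPath) : null;
    unlink($txtPath); // the temp file is only needed until its contents are read
    return $text === false ? null : $text;
}
```

Calling extractTextViaExec('upload.pdf') would then give you a string you can insert into the database, or null if the tool failed.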

I've tested many command-line programs, but none gives a 100% result.
So I've started my own library in PHP:
https://github.com/smalot/pdfparser
Currently it's text-oriented, but image support is planned.
If you encounter issues, please send me your PDF and, if possible, tell me how you made it.
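For reference, a short sketch of how the library is typically called, assuming it was installed with Composer (composer require smalot/pdfparser) and that vendor/autoload.php is already loaded. normalizeExtractedText() is a hypothetical helper of my own, not part of the library.

```php
<?php
// Parse a PDF file with smalot/pdfparser and return all of its text.
// Assumes the Composer autoloader has been loaded by the calling script.
function extractPdfText(string $pdfPath): string
{
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($pdfPath);
    return $pdf->getText();
}

// Hypothetical helper: tidy up whitespace before storing the text in MySQL.
function normalizeExtractedText(string $text): string
{
    $text = preg_replace('/[ \t]+/', ' ', $text);      // collapse runs of spaces and tabs
    $text = preg_replace('/\s*\n\s*/', "\n", $text);   // collapse blank lines and trim line edges
    return trim($text);
}
```

How well text inside tables comes out will depend on how the PDF was produced, so it is worth testing with your own files.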

Related

Extracting text from PDFs in PHP

I'm creating a php based web application which allows the user to upload a PDF file. This file will then be read and checked for certain data (text).
The problem is I can't figure out how to even open a PDF file in PHP. There are some PDF libraries, mainly for creating PDFs, but they don't seem to be very good at reading them.
An alternative solution would be to use an already available solution in Python or something else (as described in other threads on this site) but I'd really like to stay as much as possible in PHP as I intend to later export the data to mysql, etc.
Any input on how to read a PDF and extract data from it would be much appreciated.
I personally haven't tried this out, but it looks like this one works: http://www.pdfparser.org/documentation
It's just a matter of downloading and telling your code to include it, just like the documentation shows.
Or you could try the class.pdf2text.php found in http://www.phpclasses.org/browse/file/31030.html

Submit HTML form to PDF

We have a high-resolution PDF (for printing) which has some form fields on it. We would like to have an HTML form which submits to the PDF, which is then placed into the respective fields.
I found a solution on google: http://koivi.com/fill-pdf-form-fields/
However, with that solution you only get an FDF file... And the demo does not work for me, opening the FDF file simply downloads another FDF file.
Since this PDF will be available to the public we would like to keep it as simple as possible. If we must open our original PDF and import this FDF file, we need a different solution (which I'm not sure is what the FDF file is for, since it didn't work).
A related post talking about .net framework had the same idea, but there were only paid commercial solutions: From HTML form to PDF
The PHP solutions I have found so far are for creating a new PDF, which is not what I need. Our PDF is created with Adobe Illustrator (or a similar adobe product) and is high-res with embedded fonts, svg and image content.
The form elements are in place, we just need to get the data to there.
Update April 11, 2013:
Since posting this question I have been utilizing FPDF on multiple projects where I needed to accomplish this goal. Although it cannot seem to "merge" template PDFs with the provided data, it can create the PDF from scratch.
In one example, I had a high-resolution PNG for printing (similar to the initial question) on which we had to write the customer's name and today's date clearly in the center. I simply made the background of the PDF using FPDF->Image() and wrote the text afterwards using FPDF->Text().
It was very simple after all: you need to look up the paper sizes to determine the X, Y, W, H of the image and then position your text fields relative to those numbers.
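Roughly what that looks like in code. This is a sketch that assumes an A4 portrait page (210 x 297 mm) and that fpdf.php has already been included; the function names, font choices and offsets are placeholders.

```php
<?php
// X coordinate that horizontally centers text of a given width on the page.
function centeredX(float $pageWidth, float $textWidth): float
{
    return ($pageWidth - $textWidth) / 2;
}

// Draw a full-page background image, then the name and today's date on top of it.
// Assumes the FPDF class from fpdf.org has been loaded by the caller.
function buildCertificate(string $backgroundPng, string $name, string $outPath): void
{
    $pageW = 210; // A4 width in mm
    $pageH = 297; // A4 height in mm
    $pdf = new FPDF('P', 'mm', 'A4');
    $pdf->AddPage();
    $pdf->Image($backgroundPng, 0, 0, $pageW, $pageH); // stretch the image over the page
    $pdf->SetFont('Arial', 'B', 24);
    $pdf->Text(centeredX($pageW, $pdf->GetStringWidth($name)), $pageH / 2, $name);
    $pdf->SetFont('Arial', '', 12);
    $date = date('Y-m-d');
    $pdf->Text(centeredX($pageW, $pdf->GetStringWidth($date)), $pageH / 2 + 10, $date);
    $pdf->Output('F', $outPath); // 'F' writes the PDF to a local file
}
```

GetStringWidth() has to be called after SetFont(), since the width depends on the current font and size.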
There was even a Form Filling extension, but I couldn't get it to work.
It seems as though I should answer my own question, although Vision's answer may be better (it seems to have been deleted?). I used Vasiliy Faronov's link, which was posted as a comment on my main question: https://stackoverflow.com/a/1890835/200445
Here I found how to install pdftk and run a command to merge (flatten) my FDF and PDF files. I still used the "hacky" way to generate an FDF using Koivi's FDF Generator but it works for the most part.
One caveat is that some characters, such as single and double quotes, are not inserted correctly. It may be an issue with escaping the field values, but I could not find an answer.
Regardless, my PDF form generator is working, but anyone with a similar issue should look for a better solution.
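For anyone following the same route, here is a sketch of generating the FDF by hand and building the pdftk command. The function names are hypothetical and this is not Koivi's generator. It escapes backslashes and parentheses, which are the special characters inside PDF literal strings; whether that resolves the quote problem described above is untested, and non-ASCII characters such as typographic quotes would need extra encoding work that is not shown here.

```php
<?php
// Escape a value for a PDF literal string: backslash and parentheses are special.
function escapeFdfString(string $value): string
{
    return strtr($value, ['\\' => '\\\\', '(' => '\\(', ')' => '\\)']);
}

// Build a minimal FDF document from field name => value pairs.
function buildFdf(array $fields): string
{
    $entries = '';
    foreach ($fields as $name => $value) {
        $entries .= '<< /T (' . escapeFdfString((string) $name) . ') /V ('
            . escapeFdfString((string) $value) . ") >>\n";
    }
    return "%FDF-1.2\n1 0 obj\n<< /FDF << /Fields [\n" . $entries
        . "] >> >>\nendobj\ntrailer\n<< /Root 1 0 R >>\n%%EOF\n";
}

// Command that merges the FDF into the form PDF and flattens the result.
function buildPdftkCommand(string $formPdf, string $fdfFile, string $outPdf): string
{
    return 'pdftk ' . escapeshellarg($formPdf) . ' fill_form ' . escapeshellarg($fdfFile)
        . ' output ' . escapeshellarg($outPdf) . ' flatten';
}
```

You would write the result of buildFdf() to a file with file_put_contents() and then pass the command to exec(). The field names must match the names of the form fields in the PDF.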
There are a number of free tools, such as iTextSharp. Try the following: https://web.archive.org/web/20211020001747/https://www.4guysfromrolla.com/articles/030211-1.aspx I hope this code helps; I have tried it and it worked for me. If you can pay, there are also a number of paid tools that convert HTML to PDF, such as ABCpdf. The example is in ASP.NET, but I am sure that if you port it to PHP it will work for you too.

create an image from table

I have the following problem. I have a txt file here: http://ch1zra.com/d2/runes.txt
I use PHP to loop through the file and generate this table: http://ch1zra.com/d2/runes.php
Table uses some basic styles and I like it that way.
The txt file is generated and uploaded via Python. I would like to create an image that looks like that table. Is there any way to do so using Python or PHP?
Any image format that is acceptable on the web is good, PNG being even quite welcome.
I've read somewhere that Python's reportlab can make styled tables with alignments and so on, so that could be a good start, but reportlab generates PDF. Of course, if that is just an intermediate step it is also acceptable (if I could do the PDF > img conversion on my machine). Also, IIRC every PDF contains a "screenshot" of each page for fast browsing, so that would also be cool.
All in all, I have this txt file and this HTML table that I want as an image. If anyone can help, that would be great :)
Thanks in advance!
I can only speak for PHP.
You could try to build the image by hand with PHP's image functions http://www.php.net/manual/en/book.image.php
Or you could try executing a external script like: http://marginalhacks.com/Hacks/html2jpg/
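To give an idea of the first option, here is a rough sketch that draws rows of text with GD's built-in font. It assumes the GD extension is enabled and that $rows is a non-empty array of arrays that you have already parsed from the txt file. The function names and the padding values are arbitrary, and it does not reproduce the table's CSS styling.

```php
<?php
// Width of each column in characters: the longest cell in that column.
function columnWidths(array $rows): array
{
    $widths = [];
    foreach ($rows as $row) {
        foreach (array_values($row) as $i => $cell) {
            $widths[$i] = max($widths[$i] ?? 0, strlen((string) $cell));
        }
    }
    return $widths;
}

// Draw the rows as a grid of text using GD's built-in font 5 and save a PNG.
function renderTablePng(array $rows, string $outPath): void
{
    $font = 5;
    $charW = imagefontwidth($font);
    $rowH = imagefontheight($font) + 8; // 8px of vertical padding per row
    $pad = 6;                           // horizontal padding on each side of a cell
    $widths = columnWidths($rows);
    $imgW = array_sum($widths) * $charW + count($widths) * 2 * $pad;
    $img = imagecreatetruecolor($imgW, $rowH * count($rows));
    $bg = imagecolorallocate($img, 255, 255, 255);
    $fg = imagecolorallocate($img, 0, 0, 0);
    imagefill($img, 0, 0, $bg);
    foreach (array_values($rows) as $r => $row) {
        $x = $pad;
        foreach (array_values($row) as $c => $cell) {
            imagestring($img, $font, $x, $r * $rowH + 4, (string) $cell, $fg);
            $x += $widths[$c] * $charW + 2 * $pad;
        }
    }
    imagepng($img, $outPath);
}
```

For nicer fonts you could swap imagestring() for imagettftext() with a TrueType font file, at the cost of measuring text widths with imagettfbbox().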

Generate PDF from HTML PHP

I want to generate PDF from a PHP file that includes HTML controls like textbox, and textarea. I attached CSS in the same. I tried FPDF, DOMPDF and TCPDF, but still I don't get exactly what I want. How do I pass HTML controls with PHP variables and CSS to these libraries?
mPDF is another option that you could try.
EDIT:
Found another solution for it: TCPDF is a FLOSS PHP class for generating PDF documents. It looks like the more dominant library.
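On the "how do I pass PHP variables and CSS" part of the question, a sketch with mPDF. It assumes mPDF was installed with Composer and that vendor/autoload.php is loaded. The field names and CSS are invented, and the form values are rendered as styled boxes rather than real input elements, which is my own choice for the example.

```php
<?php
// Build the HTML with PHP variables; htmlspecialchars() keeps user input from breaking the markup.
function buildFormHtml(string $name, string $comment): string
{
    $css = 'label { font-weight: bold; } .box { border: 1px solid #333; padding: 4px; }';
    return '<style>' . $css . '</style>'
        . '<label>Name</label><div class="box">' . htmlspecialchars($name) . '</div>'
        . '<label>Comment</label><div class="box">' . nl2br(htmlspecialchars($comment)) . '</div>';
}

// Hand the finished HTML string to mPDF and save the PDF to disk.
// Assumes the Composer autoloader has been loaded by the calling script.
function writePdfWithMpdf(string $html, string $outPath): void
{
    $mpdf = new \Mpdf\Mpdf();
    $mpdf->WriteHTML($html);
    $mpdf->Output($outPath, 'F'); // 'F' saves to a file instead of sending to the browser
}
```

The same HTML string could be passed to the other libraries; only the last function would change.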
"PRINCEXML" is a good library (not completely free now).
Others:
If you mean creating a PDF file from PHP, pdflib will help you (as some others have suggested).
Otherwise, if you want to convert an HTML page to PDF via PHP, you're in for a little trouble. For three years I have been trying to do it as best as I can.
So, the options I know are:
HTML2PS: similar to DOMPDF, but this one converts to .ps (Ghostscript) first and then to whatever format you need (PDF, JPEG, PNG). For me it is a little better than DOMPDF, but it has the same speed problem. Oh, and it has better CSS compatibility.
Those two are PHP classes, but if you can install some software on the server and access it through passthru() or system(), have a look at these too:
wkhtmltopdf: based on WebKit (Safari's rendering engine), it is really fast and powerful. It seems to be the best one (at the moment) for converting HTML pages to PDF on the fly, taking only two seconds for a three-page XHTML document with CSS 2. It is a recent project, and the Google Code page is updated often.
htmldoc: this one is a tank, it never stops or crashes. The project seems to have died in 2007, but if you don't need CSS compatibility it can be a good fit.
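A small sketch of calling wkhtmltopdf from PHP, assuming the binary is installed and on the PATH. I use exec() here instead of passthru() or system() so that the exit code can be checked; the function names are placeholders.

```php
<?php
// Build the wkhtmltopdf command line; both arguments are shell-escaped.
function buildWkhtmltopdfCommand(string $htmlSource, string $pdfPath): string
{
    return 'wkhtmltopdf ' . escapeshellarg($htmlSource) . ' ' . escapeshellarg($pdfPath);
}

// Convert a URL or local HTML file to PDF; returns true when the tool exits with 0
// and the output file exists.
function htmlToPdf(string $htmlSource, string $pdfPath): bool
{
    exec(buildWkhtmltopdfCommand($htmlSource, $pdfPath) . ' 2>&1', $output, $exitCode);
    return $exitCode === 0 && is_file($pdfPath);
}
```

$htmlSource can be a URL or a path to a local HTML file, so you can render the page with PHP first and point the tool at it.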
** Thumbs Up For Strae.
If I understand your needs correctly I don't think any PHP-PDF class would do that.
For the most part you can only insert text and images into a PDF file, so if you want something that looks like an HTML element you would need to insert it as an image.
Simply feeding in HTML doesn't mean all your elements will stay intact in the PDF. (Different world, after all.)
http://www.fpdf.org/ has a great HTML-to-PDF class that works well. I am using it, but you have to study how it works before you start.

extracting content from pdf using PHP

Could you please tell me how to extract content from a PDF document using PHP? Formatting is the main problem I'm facing here. So let me know if there are ways to extract the content with the same formatting and display it in an online text editor.
Thanks
Have a look at XPDF
I suppose you could do
$text = shell_exec('pdftotext ' . escapeshellarg($pdffile) . ' -'); // "-" sends the text to stdout instead of a .txt file
As for displaying it in an editor? Well, which editor?
To retain some formatting information, and assuming that by web editor you mean an HTML editor, you can convert it to HTML. Perhaps there are other tools available, but since I use xpdf I came across this converter, which is based on xpdf.
Basic usage
pdftohtml -noframes -c test.pdf test.html
To get it into your favorite editor
echo file_get_contents('test.html');
You may need to wrap things inside PHP functions/classes. And you may want to add security measures and whatnot.
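As an illustration of that last point, a sketch of a wrapper with a couple of basic checks. It assumes pdftohtml is installed and writes its result to the path you pass in. The function names are hypothetical, and the magic-byte check is only a sanity check, not a full validation of an uploaded file.

```php
<?php
// Accept only existing files that start with the PDF magic bytes.
function looksLikePdf(string $path): bool
{
    if (!is_file($path)) {
        return false;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $header = fread($handle, 5);
    fclose($handle);
    return $header === '%PDF-';
}

// Run pdftohtml with the flags shown above and return the generated HTML, or null on failure.
function pdfToHtml(string $pdfPath, string $htmlPath): ?string
{
    if (!looksLikePdf($pdfPath)) {
        return null;
    }
    $cmd = 'pdftohtml -noframes -c ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($htmlPath);
    exec($cmd, $output, $exitCode);
    if ($exitCode !== 0 || !is_file($htmlPath)) {
        return null;
    }
    $html = file_get_contents($htmlPath);
    return $html === false ? null : $html;
}
```

Before echoing the result into an editor you would probably also want to sanitize the HTML, since it comes from a user-supplied file.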
As far as I can see, it is not possible to convert a PDF to editable HTML using PHP on the fly while preserving formatting. There are a number of desktop apps around that all try to extract data from PDFs, with sometimes more, sometimes less reliable results. I would say this is not realistically possible at the moment, and all you can do is extract plain text using XPDF or other command-line tools.
It may be different with that new XML-based PDF format, but I don't really know anything about that yet.
Feel free to prove me wrong, of course - I'd be very interested myself if there were a solution.
