Extracting content from PDF using PHP

Could you please tell me how to extract content from a PDF document using PHP? Formatting is the main problem I'm facing here. Please let me know if there is a way to extract the content with its formatting intact and display it in an online text editor.
Thanks

Have a look at XPDF
I suppose you could do
$text = shell_exec('pdftotext ' . escapeshellarg($pdffile) . ' -'); // "-" sends the text to stdout instead of a .txt file
As for displaying it in an editor? Well, which editor?
To retain some formatting information, and assuming that by web editor you mean an HTML editor, you can convert the PDF to HTML. Perhaps there are other tools available, but since I use xpdf I came across pdftohtml, a converter that is based on xpdf.
Basic usage
pdftohtml -noframes -c test.pdf test.html
To get it into your favorite editor
echo file_get_contents('test.html');
You may need to wrap things inside PHP functions/classes. And you may want to add security measures and whatnot.
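As a rough illustration of the wrapping and security measures mentioned above, here is a minimal sketch. The function names buildPdfToHtmlCommand and convertPdfToHtml are hypothetical, and the sketch assumes pdftohtml is installed and on the PATH of a Unix-like server.

```php
<?php
// Hypothetical helper: builds the pdftohtml command line shown above.
function buildPdfToHtmlCommand(string $pdfPath, string $htmlPath): string
{
    // escapeshellarg() quotes each path so spaces and shell
    // metacharacters in a file name are not interpreted by the shell.
    return 'pdftohtml -noframes -c '
        . escapeshellarg($pdfPath) . ' '
        . escapeshellarg($htmlPath);
}

// Hypothetical helper: runs the conversion and returns the generated HTML.
function convertPdfToHtml(string $pdfPath, string $htmlPath): string
{
    if (!is_file($pdfPath)) {
        throw new InvalidArgumentException("No such file: $pdfPath");
    }
    exec(buildPdfToHtmlCommand($pdfPath, $htmlPath), $output, $exitCode);
    if ($exitCode !== 0 || !is_file($htmlPath)) {
        throw new RuntimeException("pdftohtml failed with exit code $exitCode");
    }
    return file_get_contents($htmlPath);
}
```

Calling echo convertPdfToHtml('test.pdf', 'test.html'); would then replace the two manual steps above.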

As far as I can see, it is not possible to convert a PDF to editable HTML using PHP on the fly, while preserving formatting. There are a number of Desktop apps around that all try to extract data from PDFs with sometimes more, sometimes less reliable results. I would say this is not realistically possible at the moment and all you can do is to extract plain text using XPDF or other command line tools.
It may be different with that new XML-Based PDF format but I don't really know anything about that yet.
Feel free to prove me wrong, of course - I'd be very interested myself if there were a solution.

Related

Generate PDF from HTML PHP

I want to generate a PDF from a PHP file that includes HTML controls such as textboxes and textareas, with CSS attached to the same file. I tried FPDF, DOMPDF and TCPDF, but I still don't get exactly what I want. How do I pass HTML controls with PHP variables and CSS to these libraries?
mpdf is another option that you could try.
EDIT :
Found another solution: TCPDF is a FLOSS PHP class for generating PDF documents. It looks like the more dominant library.
"PRINCEXML" is a good library (not completely free now).
Others:
If you mean creating a PDF file from PHP, pdflib will help you (as some others suggested).
Otherwise, if you want to convert an HTML page to PDF via PHP, you'll run into a little trouble. For three years I have been trying to do it as best as I can.
So, the options I know are:
HTML2PS: similar to DOMPDF, but this one converts first to .ps (Ghostscript), then to whatever format you need (PDF, JPEG, PNG). For me it is a little better than dompdf, but I have the same speed problem. Oh, and it has better compatibility with CSS.
Those two are PHP classes, but if you can install some software on the server and access it through passthru() or system(), have a look at these too:
wkhtmltopdf: based on WebKit (Safari's rendering engine), it is really fast and powerful. It seems to be the best one (at the moment) for converting HTML pages to PDF on the fly, taking only two seconds for a three-page XHTML document with CSS 2. It is a recent project, and the Google Code page is updated often.
htmldoc: this one is a tank, it never really stops or crashes. The project seems to have died in 2007, but if you don't need CSS compatibility it can be a good fit.
** Thumbs Up For Strae.
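For the wkhtmltopdf option described above, a minimal sketch might look like the following. The helper names are hypothetical, the sketch assumes the wkhtmltopdf binary is installed on a Unix-like server, and it uses exec() rather than passthru() or system() only because exec() makes the exit code easy to check.

```php
<?php
// Hypothetical helper: wkhtmltopdf takes the input page first, then the output PDF.
function buildWkhtmltopdfCommand(string $htmlSource, string $pdfPath): string
{
    return 'wkhtmltopdf ' . escapeshellarg($htmlSource) . ' ' . escapeshellarg($pdfPath);
}

// Hypothetical helper: renders an HTML string to a PDF file on disk.
function htmlStringToPdf(string $html, string $pdfPath): void
{
    // Write the markup to a temporary file with an .html extension
    // so the tool treats it as an HTML page.
    $base = tempnam(sys_get_temp_dir(), 'wk');
    $htmlFile = $base . '.html';
    file_put_contents($htmlFile, $html);
    try {
        exec(buildWkhtmltopdfCommand($htmlFile, $pdfPath), $output, $exitCode);
        if ($exitCode !== 0) {
            throw new RuntimeException("wkhtmltopdf failed with exit code $exitCode");
        }
    } finally {
        // Clean up both the placeholder created by tempnam() and the .html copy.
        unlink($htmlFile);
        unlink($base);
    }
}
```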
If I understand your needs correctly I don't think any PHP-PDF class would do that.
Mostly you can insert only text and images into a PDF file, so if you want something that looks like an HTML element you would need to insert it as an image.
Simply feeding in HTML doesn't mean all your elements will stay intact in the PDF. (Different world, after all.)
http://www.fpdf.org/ hosts a great HTML-to-PDF class which works well. I am using it, but you have to study its functionality first and then start.

PDF text extractor class in PHP

Is there a class in PHP that extracts all the text from a PDF file so I can store it in a MySQL database? My PDF has many elements such as images, tables, plain text, form elements, charts, etc.
I have looked at many classes over the last two days that extract text, but none of them extracts the complete text from the PDF.
I want to extract all the text from a given PDF file, even if the text is inside a table, etc.
Does anyone know about this? :)
Thanks a lot. Have a nice day :)
See the following link:
Reading the clean text from PDF with PHP
If you are running this on a Linux server, you could try using apdf2text, calling it via exec and then grabbing the contents of the output file.
Note that a few PDF-to-text scripts are around, and you'll get different mileage from each.
I've tested many command line programs, but none gives a 100% result.
So I've started my own library in PHP :
https://github.com/smalot/pdfparser
Currently it's text oriented, but image support is planned.
If you encounter issues, please send me your PDF and, if possible, tell me how it was created.
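A minimal sketch of how this library could be combined with the database storage the question asks for. It assumes the package was installed with Composer (composer require smalot/pdfparser) next to the script; the table pdf_texts and the function names are hypothetical.

```php
<?php
// Extracts the text of a PDF with smalot/pdfparser.
// Assumption: vendor/autoload.php exists next to this script (Composer install).
function extractPdfText(string $pdfPath): string
{
    require_once __DIR__ . '/vendor/autoload.php';
    $parser = new \Smalot\PdfParser\Parser();
    return $parser->parseFile($pdfPath)->getText();
}

// Stores extracted text using a prepared statement.
// Hypothetical table: pdf_texts(filename, content).
function storePdfText(PDO $db, string $filename, string $text): void
{
    $stmt = $db->prepare('INSERT INTO pdf_texts (filename, content) VALUES (?, ?)');
    $stmt->execute([$filename, $text]);
}

// Usage (connection details are placeholders):
// $db = new PDO('mysql:host=localhost;dbname=docs;charset=utf8mb4', $user, $pass);
// storePdfText($db, 'report.pdf', extractPdfText('report.pdf'));
```

How complete the extracted text is for tables, forms and charts depends on how the PDF was produced, so it is worth testing with representative files.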

How do I convert a PDF file to HTML in PHP?

How do I convert a PDF file to HTML in PHP? Is there any library or web service? I mean a free one, thanks!
Google pdf2html; pdftohtml looks to be the only viable one, and it's based on a command line program, not PHP, so it may not be useful to you. Google is capable of converting, so there may be a way to do it with GDocs as well, though I'm not sure of that. At any rate, I hope this gets you on the proper path at least.
I've tried Poppler's pdftohtml command to convert PDF files to HTML files. The HTML output of Poppler is lighter, but it is not very accurate.
If you want accurate output you should use pdf2htmlEX. I've converted complicated PDF files with it and got the best HTML output.
You can't.
PDFs are complex documents containing embedded fonts, vector graphics and layout information that cannot be represented in HTML in an automated way. You may be able to extract the TEXT of the document, but that's about it.

How can I convert a PHP page into a .doc file with PHP?

Recently I worked on a project in which I need to convert a page into a Microsoft Word document (.doc file) and offer the document for download, all using PHP. But I can't solve this problem.
Please help me. Thank you very much, Arif
This is not easy to solve.
First off, if you want to write real Word documents, you will have to do it on Windows. You can use COM to talk to Word, and this is how you get good results. I've tried all the Unix/Linux based solutions and the results were not so great.
Otherwise, I'd suggest you write RTF, which is just as good. In the end you can give the .rtf file a .doc extension and no one will notice. RTF has a couple of limitations (formatting), but on the flip side it's all ASCII, and the RTF standard is pretty comprehensive and well documented.
There's a class which does it pretty nicely, phpLiveDocx (this is a great introduction). This class also claims to write PDF and DOC, but I haven't tried those yet. I use another solution for PDF.
I would recommend using the RTF format instead of .doc: it's much simpler to write, and all text editors understand it. I'd make a similar recommendation for .csv when you want to output an Excel file.
Perhaps not the answer you seek, but still interesting to note: there is an open source word processor called AbiWord that has a CLI (command line interface). You can use it to easily convert between document formats. I know that at least one website uses it to convert text files into various formats.
It is actively developed and could easily be used as a third-party black box solution for converting documents server side.
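To make the RTF suggestion concrete, here is a minimal sketch that wraps plain ASCII text in an RTF document and offers it for download under a .doc name. All function names are hypothetical, the font and size are arbitrary choices, and non-ASCII characters are not handled.

```php
<?php
// Escapes the three characters that have special meaning in RTF
// and turns newlines into paragraph breaks.
function rtfEscape(string $text): string
{
    return strtr($text, [
        '\\' => '\\\\',
        '{'  => '\\{',
        '}'  => '\\}',
        "\n" => '\\par ',
    ]);
}

// Wraps escaped text in a minimal RTF document (Arial, 12pt).
function buildRtf(string $text): string
{
    return '{\rtf1\ansi\deff0{\fonttbl{\f0 Arial;}}\f0\fs24 ' . rtfEscape($text) . '}';
}

// Sends the RTF to the browser as a download named like a Word document.
function sendRtfAsDoc(string $text, string $downloadName = 'page.doc'): void
{
    header('Content-Type: application/rtf');
    header('Content-Disposition: attachment; filename="' . $downloadName . '"');
    echo buildRtf($text);
}
```

Word opens such a file regardless of the extension, which is what makes the renaming trick described above work.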
Here is a blog from one of the developers on how to integrate it with PHP
Server-Side AbiWord
abiword home page

Using PHP to create a PDF from a mix of plain text and HTML text [duplicate]

Possible Duplicate:
Convert HTML + CSS to PDF with PHP?
I have a form with 5 fields, two of which are textboxes extended with tinyMCE, the rest are simple inputs of type text.
I need to generate a PDF from this input. I understand that I can use Zend_Pdf to generate the PDF and include the plain text data. But how, for example, can I include a bulleted list from the tinyMCE fields?
Would the best way be to create an HTML file, and then use for example DOMPDF or HTML2PDF? Ideally, I'd prefer to just use the zend framework to create the document, position and insert the fields, and save.
Thanks in advance.
More info in Convert HTML + CSS to PDF with PHP?.
In my experience, Prince XML was the Rolls Royce of such technologies, so far above any of the other ones that it's not even funny. It's expensive though. But I had all sorts of problems with all the others.
Some time ago I tried to use HTML to PDF conversion programs to convert... HTML to PDF, but in the end I gave up with that approach and just created the PDFs directly in code. I use fpdf (http://www.fpdf.org/) as a base and added supporting code for lists and grids etc.
I am using Prince XML, mentioned by cletus. Results are very good, even with CSS-styled HTML with floats etc. It's expensive, but it just works and saves a lot of time.
FPDF is a very old library for PHP4. It probably won't even work nowadays. I'd recommend DOMPDF or TCPDF. Both are for PHP5+ and can handle HTML and CSS to some degree.
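A minimal sketch of the HTML-then-DOMPDF route for the form described in the question: plain inputs are escaped, rich-text (tinyMCE) fields are inserted as HTML, and the result is rendered with Dompdf. The function names are hypothetical, and the sketch assumes Dompdf was installed with Composer and that the rich-text HTML has already been sanitized.

```php
<?php
// Hypothetical helper: merges plain-text and rich-text form fields into one HTML page.
function buildFormHtml(array $plainFields, array $richFields): string
{
    $body = '';
    foreach ($plainFields as $label => $value) {
        // Plain inputs are escaped so they render as literal text.
        $body .= '<p><strong>' . htmlspecialchars($label) . ':</strong> '
            . htmlspecialchars($value) . '</p>';
    }
    foreach ($richFields as $label => $html) {
        // Rich-text fields keep their markup (for example bulleted lists).
        // Assumption: this HTML was sanitized before reaching this function.
        $body .= '<h3>' . htmlspecialchars($label) . '</h3>' . $html;
    }
    return '<html><body>' . $body . '</body></html>';
}

// Renders HTML to PDF bytes with Dompdf.
// Assumption: vendor/autoload.php exists next to this script (Composer install).
function renderPdfWithDompdf(string $html): string
{
    require_once __DIR__ . '/vendor/autoload.php';
    $dompdf = new \Dompdf\Dompdf();
    $dompdf->loadHtml($html);
    $dompdf->render();
    return $dompdf->output();
}
```

The returned bytes can be written with file_put_contents() or sent to the browser with a Content-Type: application/pdf header.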
You could convert it all to HTML and then use openoffice or some other tool (pandoc is quite nifty too) to convert from HTML to PDF.
Alternatively, you could take a look at LiveDocx, which has php-bindings too. It's a hosted service, but you can use it without charge.
I personally recommend a command line application instead of any PHP library.
Reasons:
PHP libraries need more time and memory (cache) for the conversion process.
They need well-formed HTML pages, otherwise they throw errors or warnings.
They do not support external style sheets.
Command Line Tool:
If you run your script on a Linux server, then I suggest a command line tool.
Reasons:
They are extremely fast compared to PHP libraries.
They support CSS.
They accept HTML that is not well formed.
Which command line tool to use?
wkhtmltopdf
htmltopdf
html2pdf
For more information, refer to Converting HTML to PDF (not PDF to HTML) using PHP.
