Extract TOC of PDF? - php

I am extracting a pdf into images / swf and text with the help of SWFTools and XPDF.. I am running these in a PDF script.
But now I am trying to go one step further and try to get the TOC from the PDF is it possible to extract this information?

I found this with a little bit of searching. It looks rather promising.
PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/index.html
Note: The tool is Python based, but you should be able to use the tool via shell access. Alternatively, you may be able to glean some useful info from the source code itself, as the project is open source.
From the Site:
dumppdf.py
dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (such as images).
Examples:
$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)
$ dumppdf.py -T foo.pdf
(dump the table of contents)
$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)

I tried dump.pdf -T, but it did not work on some PDF files.
There is another tool from MuPDF named mutool, which I just found. I don't know if this is better than dump.pdf but worked on a PDF file dump.pdf throws an error.
Here's how to extract TOC with mutool
mutool show {your-pdf-file} outline
MuPDF

Alternatively, you can use MuPDF which is a pretty lightweight but complete PDF implementation written C. In the apps/ subdirectory you will find some tools which can view, dump and extract information from PDF files. I'd prefer MuPDF over xpdf because it is actively maintained and has better PDF support.
Otherwise, there's always Poppler which is actually based upon xpdf. The developers ported its code to C++. Hence, it's performs worse than its predecessor. Compared to MuPDF, Poppler seems to have slightly more features, but in return the code is much more complex.
For your purposes MuPDF should be sufficient though. You could hack together a simple application from the example code provided in apps/ that extracts all the information you need without relying on external applications.

I think looking at PHP's PDFLib would be a very good place to start. If you scroll down, you will see plenty of user-posted solutions for converting PDF to HTML or PDF to Text. After conversion, a relatively simple match function could extract the tagged TOC items and throw them into an array for example, which you can then manipulate as you please.
This StackOverflow post also has some more solutions.
Hope this helps.

Related

extract images from PDF with PHP

The thing is that the client wants to upload a pdf with images as a way of batch processing multiple images at once.
I already looked around and out of the box PHP can't read PDF's.
What are my alternatives?
I already know the host has not installed imageMagick or any pdf library and the exec function is disabled. That's basicly leaving me with nothing to work with, I guess?
Does anyone know if there is an online service that can do this, with an api of sorts?
thanks in adv
AFAIK, there is no PHP module to do it. There is a command line tool, pdfimages (part of xpdf). For reference, here's how that works:
pdfimages -j source.pdf image
Which will extract all images from source.pdf as image-000.jpg, image-001.jpg, etc. Note the output format is always Jpeg.
Possible Options
Being a command line tool, you need exec (or system, passthru, any of the command executing functions built into PHP). As your environment doesn't have that, I see four options:
Beg that exec be turned on for you (your hosting provider can limit what you can exec to a single command)
Change the design -- how about a ZIP upload?
Roll your own, using the source code of pdfimages as a model
Let pdfimages do the heavy lifting, by running it on a remote host you do control
Regarding #3, rolling your own, I don't think rolling your own, to solve a very narrow definition of requirements, would be too difficult. I seem to recall that the image boundaries in PDF are well defined: just read in the file to a boundary, cut to the end of the boundary, base64_decode, and write to a file -- repeat. However, that may be too much...
If rolling your own is too complicated, then option #4 is kind of like what Joel Spolsky describes for working with complicated Excel objects (see the numbered list under the bold heading "Let Office do the heavy work for you").
Find a cheap hosting environment (eg Amazon EC2) that let's you exec and curl
Install pdfimages
Write a PHP script that takes a URL to a PDF, curl opens that PDF, writes it to disk, passes it to pdfimages, then returns the URL to the resulting images.
An example exchange could look like this:
GET http://www.cheaphost.com/pdfimages.php?extract=http://www.limitedhost.com/path/to/uploaded.pdf
Content-type: text/html
<html>
<body>
<ul>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-000.jpg</li>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-001.jpg</li>
</ul>
</body>
</html>
So your single pdfimages.php script (running on the host with the exec functionality) can both extract images, and give you access to the extracted images. When extracting, it reads a PDF you tell it, runs pdfimages on it, and gives you back a list of URL to call to retrieve the extracted images. When retrieving, it just gives you back a straight image.
You would need to deal with cleanup, perhaps the thing to do would be to delete the image after retrieval. You would also need to handle security -- don't know what's in these images, but the content might need to be wrapped in SSL and other precautions taken.
You can use pdfimages and install it this way:
apt install poppler-utils
Then use it this way to get all the images as PNG files:
pdfimages -j mypdf.pdf image -png
Images will be placed in the same folder under image-000.png, image-001.png, etc.
There are many options available, including some to change the output format, more information here.
I hope this helps!

Displaying word documents, excel sheet, power point in browser

Is there is any way to displaying world document,excel sheet and power point in browser with out downloading.
I assume that you are going to use php for this, so you can try checking some libraries such as PHPWord by Microsoft for example.
If you wish to only display the document content, it is possible to do using some scripting language such as php. Basically office 2007+ formats are zipped XML documents with changed extension. Make a simple word 2007+ document, save it and change extension from .docx to .zip, than you can extract it and see what it's made of. You can find a lot of details here. Now displaying content may be a little tricky. As mentioned, there are libraries out there to handle this, but how will they handle the documents, I am not really sure. Most of them are abandoned, PHPword is in beta since 2011.
There are some indications that Apache is working on cloud version of Open office, but there is no release date yet. Once done, you will have a full featured office suite web app.
If you feel really creative you could use cron job (or scheduled task if you like Windows) to open a document, take a screenshot and basically make .jpg or .png version of the document (works fine with short documents, longer ones may be problematic), displaying it in a browser without much complication. It is also possible to schedule export to .pdf - all browsers do have Adobe PDF plugins.
To sum up, using php for parsing simple documents should be fine, but getting complex docs to display properly, may be much more difficult task and possibly not worth your time. I would go for cron export to pdf, to preserve most if not all of the document's structure.

generate docx file with basic styling in php

I'm creating a docx file from an HTML page using pandoc, but for the life of me I can't seem to get it to take on any kind of styling or successfully use a dotx template. I don't know if it's because you can't style docx files or I'm doing something wrong - documentation isn't all that verbose for pandoc.
I've also tried just echoing the html out and setting headers so the client will open the file as a doc, but this has some problems when you save it (it will try to save as an html file and converting to a doc isn't all that easy).
What I want to do is create an editable document which is styled and contains a logo image - just font types, colours and sizes would be enough, maybe some basic positioning would be nice.
Does anyone know how to acheive this on a LAMP - like system?
I stumbled on using Libreoffice on the CLI to do the conversion, with a much greater degree of success. It's still not perfect but alot better than what I was getting, and seems to take onboard font types, sizes and colours alot better.
Steps to install and use (CentOS / Redhat here):
sudo yum install libreoffice libreoffice-headless
You may need some X11 / Xorg libs, easiest to just install Xorg if it won't run.
libreoffice --headless --convert-to docx --outdir ./ myfile.html
Worked for me, I ended up with a serviceable .docx file which could be read by MS Word 2008 and LibreOffice 3.5.6.2.
Other tools that might also be worth examining are JODReports and Docmosis which are focussed on generation from templates (mail-merge) rather than just format conversion. JODReports is free/Open Source, Docmosis is not. Both can be ivoked from PHP in various ways and Docmosis has a cloud-service which means a zero-install footprint if your application is allowed to reach the cloud. Please note I work for the company that created Docmosis.
I think they both can work from docx/dotx templates and produce a variety of output formats including DocX
Hope that helps.

Best way to convert files into pdf files using php

What is, according to you, the best way to convert uploaded files of any kind (.doc, .docx,...) into a pdf-file using nothing but php. Is it even possible to do so?
I looked at FPDF, but this creates the pdf files from text.
An other solution previously given was to use the PDFlib library on your server, but unfortunately, my server doesn't support this library...
What is the best way to convert to files my users upload on my site to pdf files?
A simpler approach would be to restrict uploads to .PDF format programmatically and require your users to only upload .pdf files. Provide a link on the upload page to a free and open source pdf printer (e.g. Cuteftp) that the user can install to create .pdf documents from any file that can be printed.
Trying to do it through PHP will be problematic because the uploads could be generated from many different programs that would be impossible to cater for in their entirety. e.g. How would it handle Scribus or ABC Flowcharter or any other 'non-standard' application someone used to create a document?
Much better to filter the upload upfront.
The best server-side PDF generator from those I tried was, so far, wkhtmltopdf, a WebKit-based, self-contained invisible browser that can render any HTML+CSS and generate a PDF from it. Reasonably fast and fairly reliable, has some useful PDF options, such as page size, orientation, etc.
The second part of the job in your case is to convert documents to HTML prior to feeding them to wkhtmltopdf. If possible, have your users upload the docs in HTML (Word and Co. can export (crappy) HTML). If this is not an option, you will have to find a tool just for that, which, in my opinion, is much easier than finding a tool that converts Word docs directly into PDF.
Good thing about wkhtmltopdf is also that you can feed the output of your PHP script to it using the ob_xxx() functions.
PHP Excel best simple way to create doc, docx, xls, xlsx, pdf files with PHP. Its lot easier with clear documentation.
Use Microsoft Office to render Microsoft Office documents, if you care about accuracy at all. This is easily done by invoking Office over COM.
Get access to your server, and install what you need. Doing so would be far easier than monkeying around with sub-par solutions.
Well... I can think of one way of doing it quite easily, but it doesn't involve using PHP.
Upload your documents to a folder on your server, that are browsable by your users.
EG: http://mysite.com/docs/
Then get your users to install a virtual printer driver such as Primo PDF
http://www.primopdf.com/index.aspx
then they can load the document into their browser, and print to PDF for offline browsing.
If this is not an option, and your dealing with office documents that conform to the openXML standard, you could attempt to parse the XML doc into a PHP page for display in the browser, then use JavaScript to trigger a print.
Unfortunately, it does still depend on your user having a PDF printer installed.
Alternatively, you could just load the docs natively, and print to your own PDF printer, then upload the PDF's to the web server for download.
I can't think of any easy way of doing this otherwise, without installing all sorts of different document parser tool-kits and doing a huge amount of behind the scenes work.

Converting doc, docx, pdf to HTML using PHP linux

i run a job search site, and i need to convert doc, docx and pdf files into HTML on linux CentOS server running php. People submit these files as resumes. So far, I found PHPDocx to be great at converting docx to html. But I am stuck at doc/pdf. PDFTOHTML gives error "bad color" when i run tests. As far as doc, i only found wvwave, which seems complex and bulky to install.
does anyone have any ideas on how to easily convert doc/pdf to HTML?
The only thing i can think of is FPDF.
It is intended for creating PDF files in PHP but it can also open PDF files.
Maybe you can use that as a base and develop some sort of toHTML function for it.
It is completely free to use and it has some extensions already.
It MIGHT help you.
http://www.fpdf.org
EDIT:
Thanks for the addition to my post in the comments to Pierre:
You can use fpdi: http://www.setasign.de/products/pdf-php-solutions/fpdi but the input pdf is just like an image.
I havent taken a look at it myself so far but this might help.
As far as .doc files go how about trying OpenOffice/LibreOffice, something like:
lowriter -convert-to html doc_file.doc –
As far as PDF goes, if the PDF is a graphical representation of text then you're out of luck, best you can do is try convert it to an image with ImageMagick, if it is a proper text it should easily convert.
There are various tools out there already to do this, such as http://dag.wieers.com/home-made/unoconv/, http://www.phpdocx.com/ (which you've already tried)
http://www.phplivedocx.org/2009/08/13/convert-docx-doc-rtf-to-html-in-php/ looks promising.
Or, you could install a portable version of libreoffice on your server which allows command line conversion
https://help.libreoffice.org/Common/Starting_the_Software_With_Parameters
I'm sure there'll be tutorials out there (on libreoffice support area)
To easily convert pdf to html, I would suggest pdf2htmlEX which produces outstanding HTML and is fast enough for runtime converting. You should first put some effort to optimize and build it for your system. There is simple build howto included on the project link.

Categories