i run a job search site, and i need to convert doc, docx and pdf files into HTML on linux CentOS server running php. People submit these files as resumes. So far, I found PHPDocx to be great at converting docx to html. But I am stuck at doc/pdf. PDFTOHTML gives error "bad color" when i run tests. As far as doc, i only found wvwave, which seems complex and bulky to install.
does anyone have any ideas on how to easily convert doc/pdf to HTML?
The only thing i can think of is FPDF.
It is intended for creating PDF files in PHP but it can also open PDF files.
Maybe you can use that as a base and develop some sort of toHTML function for it.
It is completely free to use and it has some extensions already.
It MIGHT help you.
http://www.fpdf.org
EDIT:
Thanks for the addition to my post in the comments to Pierre:
You can use fpdi: http://www.setasign.de/products/pdf-php-solutions/fpdi but the input pdf is just like an image.
I havent taken a look at it myself so far but this might help.
As far as .doc files go how about trying OpenOffice/LibreOffice, something like:
lowriter -convert-to html doc_file.doc –
As far as PDF goes, if the PDF is a graphical representation of text then you're out of luck, best you can do is try convert it to an image with ImageMagick, if it is a proper text it should easily convert.
There are various tools out there already to do this, such as http://dag.wieers.com/home-made/unoconv/, http://www.phpdocx.com/ (which you've already tried)
http://www.phplivedocx.org/2009/08/13/convert-docx-doc-rtf-to-html-in-php/ looks promising.
Or, you could install a portable version of libreoffice on your server which allows command line conversion
https://help.libreoffice.org/Common/Starting_the_Software_With_Parameters
I'm sure there'll be tutorials out there (on libreoffice support area)
To easily convert pdf to html, I would suggest pdf2htmlEX which produces outstanding HTML and is fast enough for runtime converting. You should first put some effort to optimize and build it for your system. There is simple build howto included on the project link.
Related
I am developing an API, in PHP, hosted on a linux server, that requires me to make jpeg previews for a .pptx powerpoint presentation.
I first convert the file to pdf and then convert the pdf to jpegs.
The second step is easy, with ghostscript, it's the first part that's proving difficult.
I have tried using the libreoffice executable, but pptx isn't completely compatible. Certain backgrounds become invisible.
I have the same problem with many 3rd party APIs (which I suspect also use libreoffice); the ones that do work, are ridiculously expensive.
Installing office on a Linux server and using COM functions seems impossible, or very tedious at best.
I have looked at Aspose.Slides, which also seems rather expensive, and their documentation is filled with errors.
I could use suggestions on how to tackle this problem.
I have tried to find the underlying problem of why LibreOffice and online conversion tools have a problem with the backgrounds of the presentations I need to convert.
The background is a .emf file, which has bad support.
My solution
I've unzipped the presentation, converted the .emf files to png (using ghostscript), changed all mentions of .emf to .png in the XML, and rezipped the altered presentation.
When I now use the LibreOffice headless to convert to pdf, the background shows up.
It might be a bit hacky, but it works for the intent of my program.
ps. I see that my question has gathered a few downvotes. In my opinion it was a valid question, and listed the various solutions that had worked for others, but not for me. If anyone has insights or ways to improve it, feel free to comment.
I'm creating a docx file from an HTML page using pandoc, but for the life of me I can't seem to get it to take on any kind of styling or successfully use a dotx template. I don't know if it's because you can't style docx files or I'm doing something wrong - documentation isn't all that verbose for pandoc.
I've also tried just echoing the html out and setting headers so the client will open the file as a doc, but this has some problems when you save it (it will try to save as an html file and converting to a doc isn't all that easy).
What I want to do is create an editable document which is styled and contains a logo image - just font types, colours and sizes would be enough, maybe some basic positioning would be nice.
Does anyone know how to acheive this on a LAMP - like system?
I stumbled on using Libreoffice on the CLI to do the conversion, with a much greater degree of success. It's still not perfect but alot better than what I was getting, and seems to take onboard font types, sizes and colours alot better.
Steps to install and use (CentOS / Redhat here):
sudo yum install libreoffice libreoffice-headless
You may need some X11 / Xorg libs, easiest to just install Xorg if it won't run.
libreoffice --headless --convert-to docx --outdir ./ myfile.html
Worked for me, I ended up with a serviceable .docx file which could be read by MS Word 2008 and LibreOffice 3.5.6.2.
Other tools that might also be worth examining are JODReports and Docmosis which are focussed on generation from templates (mail-merge) rather than just format conversion. JODReports is free/Open Source, Docmosis is not. Both can be ivoked from PHP in various ways and Docmosis has a cloud-service which means a zero-install footprint if your application is allowed to reach the cloud. Please note I work for the company that created Docmosis.
I think they both can work from docx/dotx templates and produce a variety of output formats including DocX
Hope that helps.
I have relatively sensitive data in .docx, .xlsx and PDF files that all need to be converted to a single PDF file locally. Sending these files off to phpdocx or Google Docs or anything like this is not an option.
The only other option I am seeing is OpenOffice / LibreOffice but I am not satisfied with how they are converting the documents.
Is there any other alternative anyone is aware of? Thanks!
Definitely a difficult task. The very recent release of LibreOffice 3.6 has fixes to it's docx processing if that might help, but you haven't specified what the actual problems you encountered when you tried OpenOffice.
If you have time to experiment (and bring in any tools/languages you need to get the job done) you could try LibreOffice to produce PDFS, then use one of the many PDF libs to stitch the PDFs into the single file you require.
You could also look at ODFConverter which has traditionally been much better with DOCX than either OpenOffice or LibreOffice. This would allow you docx -> odt -> pdf. I think it can do the xlsx also. Then do the PDF stitching again.
I suggest testing the stages manually at first and if promising, try something like JODConverter (requires Java) to allow you to automate the process via scripts.
Good luck.
What is, according to you, the best way to convert uploaded files of any kind (.doc, .docx,...) into a pdf-file using nothing but php. Is it even possible to do so?
I looked at FPDF, but this creates the pdf files from text.
An other solution previously given was to use the PDFlib library on your server, but unfortunately, my server doesn't support this library...
What is the best way to convert to files my users upload on my site to pdf files?
A simpler approach would be to restrict uploads to .PDF format programmatically and require your users to only upload .pdf files. Provide a link on the upload page to a free and open source pdf printer (e.g. Cuteftp) that the user can install to create .pdf documents from any file that can be printed.
Trying to do it through PHP will be problematic because the uploads could be generated from many different programs that would be impossible to cater for in their entirety. e.g. How would it handle Scribus or ABC Flowcharter or any other 'non-standard' application someone used to create a document?
Much better to filter the upload upfront.
The best server-side PDF generator from those I tried was, so far, wkhtmltopdf, a WebKit-based, self-contained invisible browser that can render any HTML+CSS and generate a PDF from it. Reasonably fast and fairly reliable, has some useful PDF options, such as page size, orientation, etc.
The second part of the job in your case is to convert documents to HTML prior to feeding them to wkhtmltopdf. If possible, have your users upload the docs in HTML (Word and Co. can export (crappy) HTML). If this is not an option, you will have to find a tool just for that, which, in my opinion, is much easier than finding a tool that converts Word docs directly into PDF.
Good thing about wkhtmltopdf is also that you can feed the output of your PHP script to it using the ob_xxx() functions.
PHP Excel best simple way to create doc, docx, xls, xlsx, pdf files with PHP. Its lot easier with clear documentation.
Use Microsoft Office to render Microsoft Office documents, if you care about accuracy at all. This is easily done by invoking Office over COM.
Get access to your server, and install what you need. Doing so would be far easier than monkeying around with sub-par solutions.
Well... I can think of one way of doing it quite easily, but it doesn't involve using PHP.
Upload your documents to a folder on your server, that are browsable by your users.
EG: http://mysite.com/docs/
Then get your users to install a virtual printer driver such as Primo PDF
http://www.primopdf.com/index.aspx
then they can load the document into their browser, and print to PDF for offline browsing.
If this is not an option, and your dealing with office documents that conform to the openXML standard, you could attempt to parse the XML doc into a PHP page for display in the browser, then use JavaScript to trigger a print.
Unfortunately, it does still depend on your user having a PDF printer installed.
Alternatively, you could just load the docs natively, and print to your own PDF printer, then upload the PDF's to the web server for download.
I can't think of any easy way of doing this otherwise, without installing all sorts of different document parser tool-kits and doing a huge amount of behind the scenes work.
I am extracting a pdf into images / swf and text with the help of SWFTools and XPDF.. I am running these in a PDF script.
But now I am trying to go one step further and try to get the TOC from the PDF is it possible to extract this information?
I found this with a little bit of searching. It looks rather promising.
PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/index.html
Note: The tool is Python based, but you should be able to use the tool via shell access. Alternatively, you may be able to glean some useful info from the source code itself, as the project is open source.
From the Site:
dumppdf.py
dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (such as images).
Examples:
$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)
$ dumppdf.py -T foo.pdf
(dump the table of contents)
$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)
I tried dump.pdf -T, but it did not work on some PDF files.
There is another tool from MuPDF named mutool, which I just found. I don't know if this is better than dump.pdf but worked on a PDF file dump.pdf throws an error.
Here's how to extract TOC with mutool
mutool show {your-pdf-file} outline
MuPDF
Alternatively, you can use MuPDF which is a pretty lightweight but complete PDF implementation written C. In the apps/ subdirectory you will find some tools which can view, dump and extract information from PDF files. I'd prefer MuPDF over xpdf because it is actively maintained and has better PDF support.
Otherwise, there's always Poppler which is actually based upon xpdf. The developers ported its code to C++. Hence, it's performs worse than its predecessor. Compared to MuPDF, Poppler seems to have slightly more features, but in return the code is much more complex.
For your purposes MuPDF should be sufficient though. You could hack together a simple application from the example code provided in apps/ that extracts all the information you need without relying on external applications.
I think looking at PHP's PDFLib would be a very good place to start. If you scroll down, you will see plenty of user-posted solutions for converting PDF to HTML or PDF to Text. After conversion, a relatively simple match function could extract the tagged TOC items and throw them into an array for example, which you can then manipulate as you please.
This StackOverflow post also has some more solutions.
Hope this helps.