Parsing/editing a docx file with PHP - php

I've been asked to write a PHP script that reads/parses a docx file and performs some operations on it, such as duplicating a specific paragraph/table and filling in some variables (#myvar or $myvar) with values.
What do you recommend: working on the word/document.xml file directly, or converting the whole document to an HTML file and then parsing it using DOM (I don't like this solution :( )?
The structure of the docx to parse is not defined yet; it's my job to do that! And it has to be as general as possible.
To give a clear idea of what I'm doing: the docx file is a CV template that I have to fill in with data from a DB.
P.S.: I don't know how to efficiently parse/modify the XML file using XQuery, since the only option I have is to use variables (plain text with $ or #...) inside that docx.
Thanks for your help :)

There are two major PHP libraries able to create Word documents. Here's a description of features from both that might help you solve your problem:
PHPWord (open source) - allows you to load template documents and replace values... take a look at this example in the library's source code; maybe you can define a CV template and use this to work out a solution;
PHPDocX (free with basic features, paid for more advanced features) - allows templates and search and replace of content in documents (probably only in paid versions though).

This is an old question, but I thought I'd give some pointers, as I have been struggling with this for some time and have ended up writing my own package on GitHub: wrklst/docxmustache.
Here are some solutions I know of:
Free solutions:
https://github.com/PHPOffice/PHPWord (as mentioned above, cumbersome and not very capable)
http://www.tinybutstrong.com/opentbs.php (works but is highly cumbersome, also introduces a lot of security issues if you plan to allow user supplied templates)
Partially Free and Paid:
https://www.phpdocx.com
http://www.docxpresso.com (looks like one of the more complete solutions to me; at 199 EUR for a server license it's not too expensive either)
https://modules.docxtemplater.com
I worked with OpenTBS quite a bit, but I am not happy with it, and I am currently evaluating whether to write my own solution that is more geared to my specific needs. Generally you need:
- A zip class to unzip/rezip the docx file
- A template engine to replace values; I am using mustache (https://github.com/bobthecow/mustache.php)
- If you are planning to replace images as well, you need more advanced file, reference and XML handling. PHP's SimpleXMLElement should be sufficient to handle all the XML manipulation.
Of course you can always convert the docx into a more accessible format, but that will greatly mess with any styling. If that's not an issue, I recommend using LibreOffice to convert your docx into any format that LibreOffice supports. On Linux-based servers you can easily access it via the command line; here is an example using Symfony's Process component for command execution:
use Symfony\Component\Process\Process;
use Symfony\Component\Process\Exception\ProcessFailedException;

// Pass the command as an argument array so that paths are not interpreted by a shell.
$process = new Process(['soffice', '--headless', '--convert-to', 'html', $inputfile, '--outdir', $outputfile.'/']);
$process->run(); // blocks until the command finishes
if (!$process->isSuccessful()) {
    throw new ProcessFailedException($process);
}
Check out my package wrklst/docxmustache if you want to see this in context.
good luck!


Create Word Document from PHP Documentation

To document my code I thought it would be best practice to use phpDoc syntax, because there are several parsers out there and some IDEs create IntelliSense out of it.
Now I need to put the documentation (API) into a Word file, but I don't know which parser is able to output .doc or similar.
I tried Doxygen, which outputs .rtf, and phpDocumentor2, which can only export to .html and .xml (?).
Is there a way to generate a .doc(x) file from phpDoc? Or a simple way to get a document which can be imported into Word?
I would appreciate it if I didn't have to change the phpDoc syntax, because my documentation is very long.
Edit: The preferred parser would be phpDocumentor2, because it supports PHP 5.3 features and is faster than Doxygen; however, phpDocumentor2 offers fewer output formats than the original phpDocumentor, which is no longer maintained.
Edit: I tried to copy content from the .rtf file into the .docx file, but when I select 'Use Destination Styles', both Word instances hang and do not respond.
Presumably you want one large Word doc that contains all the info for your project in the one doc/file... therefore just opening the phpDoc2 HTML output into Word in order to convert it to docx will not meet your need, since that would be one docx per phpdoc2 HTML page.
You might try altering your searches to be for a tool that can spider a given HTML page, recursively pick up all its target page hierarchy, and convert it all into a single docx. You might have more luck finding a tool that does this but produces a PDF... then you could just use Word to convert the PDF into docx.

PHP PDF template library with PDF output? [closed]

Is there any PHP PDF library that can replace placeholder variables in an existing PDF, ODT or DOCX document, and generate a PDF file as the end result, without screwing up the layout?
Requirements:
Needs no 3rd party web service
Ability to run on shared web hosting would be ideal (no binary installations / packages required)
Mind you, a library that is able to load an existing PDF file and insert text programmatically at a specific position is not enough for my use case.
As far as my research shows, there is no library that can do this:
TCPDF can only generate documents from scratch
FPDI can read existing PDF templates, but can only add contents programmatically (no template variable replacement)
There are various DOCX/ODT template libraries out there but they don't output PDF
PHPDocX claims to be able to do exactly what I need, but they don't offer a trial version and I'm not going to buy a cat in a bag, especially not when there seems to be no other product on the web that does this. I find it hard to believe they can do this without problems. If you have successfully done this using the product, please drop a line here.
Am I overlooking something?
Is there a way to do this using PDF forms? I am creating the source documents in OpenOffice 3.
I may be able to use standard Linux commands (pdftk is available for example, trying that out right now.)
Update: *Argh!* I was called out of the office and the bounty expired in the meantime. Starting a new bounty: As far as my testing shows, no solution works for me perfectly yet.
Update II: I will be looking into the pdftk approach soon, but I am also starting another bounty for one more round of collecting additional input. This question has now seen 1300 rep points in bounties, which must be some kind of a record :)
This is not very practical, but for completeness: If you already have an ODT template, then you might very well retain that as template. Modifying the OpenDocument content.xml and replacing placeholders therein is pretty simple. If so, you could use unoconv or pyodconverter to transform the ODT into a final PDF.
unoconv -f pdf -o final.pdf template.odt
Obviously this requires a full OpenOffice setup (UNO and Writer) on the web server, and not every web host would go along with that, even though it's simple on any Debian or Fedora setup. The execution speed would probably not be stellar either. But it might still be the cleanest approach, since OOo handles both formats far better than any PHP class ever could.
Pekka,
I looked into this previously. I think you can use pdftk (a command-line utility) to fill in a PDF form using FDF/XFDF data files, which you could easily generate from within PHP. That was the best option I've seen so far, though there may well be a native library.
pdftk is quite useful in general, worth having a look at.
Update: Have a look here: http://php.net/manual/en/book.fdf.php
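To make the pdftk suggestion above more concrete, here is a minimal sketch of generating an XFDF data file from PHP. The helper name buildXfdf() is hypothetical and not part of any library, and the field names you pass in are assumed to match the form field names defined in your PDF.

```php
<?php
// Sketch: build an XFDF document that pdftk's fill_form operation can read.
// buildXfdf() is a hypothetical helper written for this example.
function buildXfdf(array $fields): string
{
    $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
         . '<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">' . "\n"
         . "  <fields>\n";
    foreach ($fields as $name => $value) {
        // Escape names and values so characters like & and < stay valid XML.
        $xml .= sprintf(
            "    <field name=\"%s\"><value>%s</value></field>\n",
            htmlspecialchars((string) $name, ENT_XML1 | ENT_QUOTES, 'UTF-8'),
            htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES, 'UTF-8')
        );
    }
    return $xml . "  </fields>\n</xfdf>\n";
}
```

You would then write the result to a file, for example with file_put_contents('data.xfdf', buildXfdf(['name' => 'Jane Doe'])), and run something like: pdftk form.pdf fill_form data.xfdf output filled.pdf flatten. The file names here are placeholders.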
Have you considered using something like XSL Formatting Objects (XSL-FO)? Basically they're XML documents that are processed and turned into PDFs. Doing string (or better, DOM) replacements within them should be pretty simple. It supports embedding images, links, annotations, etc.
It's not PHP, but there are a number of PHP wrappers for it, along with ways of using it via exec, etc. Not ideal, but it takes care of the template portion completely. For some more info: http://techportal.inviqa.com/2009/12/16/transforming-xml-with-php-and-xsl/
There's an implementation available as an Apache project - http://xmlgraphics.apache.org/fop/
FPDF, and there is another extension on top of it, whose name I can't remember, which allows you to import templates.
Your best bet would be to generate the entire document on the fly, with the template defined programmatically using FPDF or something similar. That way, your text will not be cut off by paragraphs or anything like that, and you can easily position images/other elements as required.
Late, but you can use the open-source template designer https://github.com/applicius/dhek/releases to define placeholders/areas over any existing PDF, then load the result in PHP (it's in JSON format) and write onto the original PDF accordingly using the FPDF library, to generate a custom PDF with dynamic data written on it.
Although not exactly what you asked for, you may consider doing it in two steps: use a PHP templating system (Smarty, Dwoo) to generate an HTML page, and then use a tool like Html2Pdf to convert it to PDF. I am using this approach, and the results are good (no problems with page layout etc.).
Of course it depends on your input documents (can you use HTML instead of PDF/ODT as the source?) and the complexity of their layout.
OK, I'll try to help you solve the problem a little.
First, the answers to a couple of your questions.
Q - Am I overlooking something?
A - No. There is a PHP PDF library that can replace placeholder variables in an existing PDF and generate a PDF file as the end result, without screwing up the layout
Q - Is there a way to do this using PDF forms?
A - Yes, absolutely. The trick to doing this is to use PDF forms.
For both, you can use Justin Koivisto's fill-PDF-form-fields PHP library.
For more detail, please go to http://koivi.com/fill-pdf-form-fields/tutorial.php.
Take a look there for additional information.
Credit to Justin Koivisto for his work.
P.S.
For a workaround for displaying table-like output from a PDF form,
please consider reading the Oracle Business Intelligence Publisher User's Guide - Creating a PDF Template.
I'll add this new answer since the FDF PHP extension is now dead.
I've just followed these instructions and ended up executing one Perl script and then the pdftk command.
I'm well aware it's far from being a real PHP solution, but it's reliable and fairly easy to implement on any *nix platform.
The tools described there are also available on Debian, just in case you were wondering.
It's a little bit late, but have a look at the PDFTemplate library; it does exactly what you want. You can create OpenDocument files (ODT) and add placeholders to them. The PDFTemplate library can fill in these placeholders (even with images) and create a PDF file.
ODT Files with placeholders to PDF

Manipulating Microsoft Word Office 2007 .docx document from PHP

I need a way to manipulate .docx (Microsoft Office 2007) documents from within PHP.
I need to:
Read the internal text
Convert it to .html
View it inside a browser
Replace text
I know I can use Word Automation, creating a COM object of Microsoft Word, but it's too slow, unstable and I have to have it installed on the server.
Is there any library or code that can do it from PHP?
There is PHPWord for that by the authors of PHPExcel.
Docx is just a ZIP file containing multiple XML files and embedded media files like images. Because of this, you can read and edit the document with ease. Just unzip it, open word/document.xml, do reading & writing, and repack the files.
Converting to HTML may be difficult. But you'll find a thumbnail of the first page in docProps/thumbnail.jpeg.
Note that you'll have to familiarize yourself with the XML structure to do any complex edits. There's a summary XML file, docProps/app.xml, which has some metadata for the file, so don't forget to update it. Read more on Wikipedia: http://en.wikipedia.org/wiki/Office_Open_XML
You may have a look at PHPDocX; I believe it does all you are asking for.
You may replace variables in a template or just plain text from a preexisting Word document.
It offers quite a few conversion options.
You can also extract the text.
You can work with the internal format directly.
DOCX is just a zip file, and inside that there's word/document.xml containing the actual document.
It's quite trivial to unzip the file, read document.xml, str_replace() what you're looking for, save it and re-zip the directory; this makes for a lightweight, quick and easy mail-merge capability for Word documents. This also works for other Office formats.
Here's the official docs on the internal structure for more information.
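The unzip / replace / re-zip approach described above can be sketched with PHP's ZipArchive class, which can edit the archive in place so no temporary directory is needed. This is a minimal sketch, not a complete mail merge: the function name fillDocxTemplate() and the "$name" placeholder style are assumptions made for this example.

```php
<?php
// Sketch: copy a .docx template and replace placeholders in word/document.xml.
// fillDocxTemplate() and the placeholder syntax are illustrative assumptions.
function fillDocxTemplate(string $template, string $output, array $values): void
{
    if (!copy($template, $output)) {
        throw new RuntimeException("Cannot copy $template to $output");
    }
    $zip = new ZipArchive();
    if ($zip->open($output) !== true) {
        throw new RuntimeException("Cannot open $output as a zip archive");
    }
    $xml = $zip->getFromName('word/document.xml');
    if ($xml === false) {
        $zip->close();
        throw new RuntimeException('word/document.xml not found');
    }
    // Escape values so characters like & and < do not break the XML.
    $replacements = [];
    foreach ($values as $placeholder => $value) {
        $replacements[$placeholder] = htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    }
    // Writing an entry under an existing name replaces that entry.
    $zip->addFromString('word/document.xml', strtr($xml, $replacements));
    $zip->close();
}
```

A call might look like fillDocxTemplate('cv-template.docx', 'cv-jane.docx', ['$name' => 'Jane Doe']). One caveat: Word sometimes splits a piece of text across several runs in the XML, so a placeholder typed as one word may not appear as one contiguous string in document.xml, in which case a plain string replacement will miss it.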
There is also a PHP class for merging new content into an existing .docx file. It is available here: http://www.tinybutstrong.com/ . The documentation is pretty good as well as having many examples and it is all free and open source. It does require familiarity with the .docx concepts, though.

Generate ODT documents with dynamic images in PHP

I maintain a couple of web databases based on PHP and MySQL on a shared hosting package.
The databases have a mechanism for the user to upload OpenOffice documents with placeholders:
[person.name] [person.address] [person.postcode]
I then use this great PHP tool to run through the OpenOffice document and insert values from the database into it. The result is again, an OpenOffice document.
What it can't do is dynamic images.
Does anybody know a - preferably PHP-only - solution to insert images into OpenOffice documents?
I know PUNO. Can't use it in this context because it's shared hosting.
I know OpenOffice can be run as a daemon - ditto.
I know phpDocWriter. It was great for SXW files but is dead now.
I know OpenDocument is a collection of XML files in a ZIP file. I once tried to programmatically add a caption to every image in an ODT document. It drove me fricking crazy. I look with admiration upon developers who work with the format, but it's not for me.
I would really appreciate any hints on existing solutions.
I think odtPHP might be what you're looking for.
It seems to be able to insert images at a placeholder in the document, and simply reads from an array to see which image to place.
http://www.odtphp.com/index.php?i=tutorials&p=tutorial5
Now, if you do this as a post-processing step after your current code, or simply use it instead of TBS, you've got everything you need, IMHO.
Alternatively, you can include a default image with a certain filename in your document, and simply replace that imagefile in the archive.
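The last suggestion, replacing a default image file inside the archive, can be sketched with ZipArchive as follows. The function name replaceOdtImage() is hypothetical, and the entry name you pass in is an assumption: you would need to look inside your own template to find the actual name of the default image.

```php
<?php
// Sketch: swap an embedded image inside an ODT archive.
// replaceOdtImage() is an illustrative helper, not part of any library.
function replaceOdtImage(string $odtPath, string $entryName, string $newImagePath): void
{
    $data = file_get_contents($newImagePath);
    if ($data === false) {
        throw new RuntimeException("Cannot read $newImagePath");
    }
    $zip = new ZipArchive();
    if ($zip->open($odtPath) !== true) {
        throw new RuntimeException("Cannot open $odtPath");
    }
    // Only replace an entry that already exists; a new, unreferenced file would not be shown.
    if ($zip->locateName($entryName) === false) {
        $zip->close();
        throw new RuntimeException("$entryName not found in archive");
    }
    // Writing an entry under an existing name replaces that entry.
    $zip->addFromString($entryName, $data);
    $zip->close();
}
```

For example: replaceOdtImage('letter.odt', 'Pictures/placeholder.png', 'photo.png'), where all three names are placeholders. Since the document XML is not touched, the new image will most likely be drawn in the frame size of the default image, so it is safest to keep the same format and aspect ratio.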
There is a new version of TbsOOo, it's OpenTBS and it has a feature for inserting/changing a picture in the file.
http://www.tinybutstrong.com/opentbs.php
Did you try to use the AddFileToDoc method to add an image to the document?
The documentation on this method is here:
http://www.tinybutstrong.com/tbsooo.php#AddFileToDoc

How can I convert a PHP page into a .doc file with PHP

Recently I worked on a project in which I need to convert a page into a Microsoft Word document (.doc file) and offer the document for download, all using PHP. But I can't solve this problem.
Please help me. Thank You very much, Arif
This is not easy to solve.
First off, if you want to write real Word documents, you will have to do it on Windows. You can use COM to talk to Word, and this is how you get good results. I've tried all the Unix/Linux-based solutions and the results were not so great.
Otherwise, I'd suggest you write RTF, which is just as good. In the end, you can give the .rtf file a .doc extension and no one will notice. RTF has a couple of limitations (formatting), but on the flip side, it's all ASCII and the RTF standard is pretty comprehensive and well documented.
There's a class which does it pretty nicely: phpLiveDocx (this is a great introduction). This class also claims to write PDF and DOC, but I haven't tried those yet. I use another solution for PDF.
I would recommend using the RTF format instead of the .doc - it's much simpler to write to, and all text editors understand it. Similar recommendation for .csv when you want to output an Excel file.
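To illustrate the RTF suggestion, here is a minimal sketch that turns plain-text paragraphs into an RTF document. The helper names rtfEscape() and buildRtf() are made up for this example, the font and size are arbitrary, and the sketch assumes ASCII input: non-ASCII characters would need additional RTF escapes that are not handled here.

```php
<?php
// Sketch: produce a minimal RTF document from plain-text paragraphs.
// rtfEscape() and buildRtf() are hypothetical helper names. ASCII input is assumed.
function rtfEscape(string $text): string
{
    // Backslash and braces are RTF control characters and must be escaped.
    return strtr($text, ['\\' => '\\\\', '{' => '\\{', '}' => '\\}']);
}

function buildRtf(array $paragraphs): string
{
    $body = '';
    foreach ($paragraphs as $paragraph) {
        // \par ends a paragraph; the newline only keeps the file readable.
        $body .= rtfEscape($paragraph) . "\\par\n";
    }
    // Header: RTF version 1, ANSI charset, one font (Arial), 12pt (\fs is in half-points).
    return "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0\\fs24 " . $body . "}";
}
```

To offer the result for download, you could send headers such as Content-Type: application/rtf and Content-Disposition: attachment; filename="page.rtf" with header() and then echo buildRtf([...]).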
Perhaps not the answer you seek, but still interesting to note: there is an open-source word processor called AbiWord that has a CLI (command-line interface). You can use it to easily convert between document formats. I know that at least one website uses it to convert text files into various formats.
It is actively developed and could easily be used as a third-party black-box solution for converting documents server-side.
Here is a blog from one of the developers on how to integrate it with PHP
Server-Side AbiWord
abiword home page
