I am trying to automate some stuff and need to then replace some text in a PDF document generated from InDesign. It's on a linux server so I am looking for a PHP script to replace the text - either in the Indesign document or even better the PDF file - if that is even possible?
Why do you rather replace your text in PDF instead of the Indesign file? Isn't is necessary to have your sourcefiles up-to-date?
You probably want to look at IDML which is de XML-based format of Indesign (indd) files. You can probably convert your indd files serverside to idml files.
There are several PHP libraries to read and write within idml files (not tested by me, not a PHP guy).
https://github.com/jorisros/IDMLlib
https://github.com/deathlyfrantic/php-idml
Related
Please see the screenshot below. I have been sent this file from a client. The file is exported from another piece of software and it loads up in word as a fully styled and presented document, but the filename as "xml" extension.
This is a bit confusing, the document starts in the source as XML but then has a load of what looks like encrypted garbage after it, so it's not XML at all is it? So I don't quite understand why their software marks it as XML when it's not.
My second question is: Is this actually a word document? Normally it would be .doc or .docx, I have never seen a word document saved with .xml before.
Final question: How would we now then use php to convert this to a PDF? I don't mind using third party software installed on the server and run via command line, I just need to figure out how to convert this. I could do it in C# but it's preferred in PHP so that our backend server can do the work rather than the desktop software we have that links with it.
I just want to ask if there are any free/opensource or are there even any existing PHP class library that can convert HTML to PHP using the tags as its image source?
Because I have created a web app that contains Google Maps in it and now I want to save the generated map, instead of saving it using printscrn I would like to convert it to PDF. Is it even possible?
Can you give me guys a link to the library if ever there are existing ones?
I have already looked for fpdf and tcpdf but on their examples all images is contained on the file directory.
Thanks
There is the library HTML2PDF (on github) licensed under LGPL. It is based on TCPDF and very easy to use.
I have tried and successfully converted html to pdf using wkhtmltopdf[1] binary sometime back (including images).
Download wkhtmltopdf binary which is compatible to your base server o/s and use a php wrapper class similar to this[2] (i haven't tested this wrapper class[2] but it should serve your purpose)
There are plenty of php wrappers available for wkhtmltopdf binary, so you can find one easily if this[2] didn't work well for your purpose.
[1] http://wkhtmltopdf.org/
[2] http://mikehaertl.github.io/phpwkhtmltopdf/
I have a module which merges a document from database records and .docx or .odt document model.
I have to output .docx, .odt or .pdf. For outputting to Microsoft and Open formats, there is no problem, all works properly.
But what I want to know is, can I output to a format (like XML or HTML) which I can use to subsequently build a PDF document?
If I can't, are there any libraries which provide a merge document capability like:
DOCX (or ODT) + database record => PDF
And I don't want to use phplivedocx.
I successfully put a portable version of libreoffice on my host's webserver, which I call with PHP to do a commandline conversion from .docx, etc. to pdf. on the fly. I do not have admin rights on my host's webserver. Here is my blog post of what I did:
http://geekswithblogs.net/robertphyatt/archive/2011/11/19/converting-.docx-to-pdf-or-.doc-to-pdf-or-.doc.aspx
Yay! Convert directly from .docx or .odt to .pdf using PHP with LibreOffice (OpenOffice's successor)!
I don't know any PHP library that does DOCX => PDF. In fact, the DOCX conversion to something else in PHP is an opened problem today. This is independent from how you made the DOCX.
But as you said, they are PHP libraries for HTML => PDF.
Html2Pdf is a well reputed PHP library that does HTML => PDF.
There is also DomPdf.
So if you can found a PHP library for DOCX => HTML, then it would work.
Of course it has some limitations because even if both PDF and DOCX are opened format, they have very specific features, they need huge rendering process, and the editors keep some good tips for them.
Converting DOCX to HTML is theoretically possible. There is a Windows software that does it by EpingSoft. If you need to do it in PHP, some web articles tell you how to make it, but since I cannot found any PHP code doing this, I guess it is more theoretical than practical.
http://www.quepublishing.com/articles/article.aspx?p=691502
How complicated that process would be
depends on how much of Word's native
formatting you need to preserve during
the conversion.
If you want to try this way, it's good to know that OpenTBS enables you to read the XML before and after the merge. It is based on a PHP class names TbsZip that can read any XML file in the DOCX since it's in fact a zip archive.
There is also posible to use PDF files directly in TBS after decompressing:
qpdf --qdf --object-streams=disable in.pdf out.pdf
I need a library to extract text from documents(doc, doxc, pdf, html, rtf, odt.....). Is there one library(for all document types) for this purpose?
Do batch conversions of the files to one format, using either
odtphp http://www.odtphp.com/index.php?i=tutorials&p=tutorial1
or
PyODConverter (run this using the PHP command line executable tool to make it 'work with' php) http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html
Then run that last result through any generic pdf2txt library, or an phpOCR.
A safer bet would be to convert your documents to plain text first, and then parse the contents of the plain text version to do whatever you want. There's a lot of command line converters around that allow you to convert from different formats to plain text (Word to txt, PDF to txt, etc.), on ANY operating system.
BTW Regarding PDFs : not all of them actually contain plain text, some are just a collection of scanned images, so in that case you'll be out of luck (unless you would use OCR on them).
OpenTBS is a PHP tool that can read an modify the contents of any OpenDocument files (ODT, ODS, ODG, ODF, ODM, ODP, OTT, OTS, OTG, OTP). But also OpenXML files (DOCX, XLSX, PPTX).
If you can convert files having an unsupported format you need to one of those supported by OpenTBS, then it's done.
On systems other than Windows, there is no such library to do this for you, and there is a high probability there won't be as such in the future. Main reason is that the document formats you specified are continuously updated from time to time.
On Windows however, if you have php installed, you can definitely use activex extensions to read all of these formats with ease, and you will only need the proper office application to be installed apart from php on the machine to get this to work. This will also make sure future versions of documents continue to work in your php code, as long as your office applications can read those document. Look for 'php win32' libraries in php library collections and you should find some nice one there
I have a web project where I must import text and images from a user-supplied document, and one of the possible formats is Microsoft Office 2007. There's also a need to generate documents in this format.
The server runs CentOS 5.2 and has PHP/Perl/Python installed. I can execute local binaries and shell scripts if I must. We use Apache 2.2 but will be switching over to Nginx once it goes live.
What are my options? Anyone had experience with this?
The Office 2007 file formats are open and well documented. Roughly speaking, all of the new file formats ending in "x" are zip compressed XML documents. For example:
To open a Word 2007 XML file Create a
temporary folder in which to store the
file and its parts.
Save a Word 2007 document, containing
text, pictures, and other elements, as
a .docx file.
Add a .zip extension to the end of the
file name.
Double-click the file. It will open in
the ZIP application. You can see the
parts that comprise the file.
Extract the parts to the folder that
you created previously.
The other file formats are roughly similar. I don't know of any open source libraries for interacting with them as yet - but depending on your exact requirements, it doesn't look too difficult to read and write simple documents. Certainly it should be a lot easier than with the older formats.
If you need to read the older formats, OpenOffice has an API and can read and write Office 2003 and older documents with more or less success.
The python docx module can generate formatted Microsoft office docx files from pure Python. Out of the box, it does headers, paragraphs, tables, and bullets, but the makeelement() module can be extended to do arbitrary elements like images.
from docx import *
document = newdocument()
# This location is where most document content lives
docbody = document.xpath('/w:document/w:body',namespaces=wordnamespaces)[0]
# Append two headings
docbody.append(heading('Heading',1) )
docbody.append(heading('Subheading',2))
docbody.append(paragraph('Some text')
I have successfully used the OpenXML Format SDK in a project to modify an Excel spreadsheet via code. This would require .NET and I'm not sure about how well it would work under Mono.
You can probably check the code for Sphider. They docs and pdfs, so I'm sure they can read them. Might also lead you in the right direction for other Office formats.