Automate XML parsing and converting docx to pdf

I have not programmed for many years, but I need to automate the following process.
A government medicine authority publishes an XML file on its website.
I need to download it, parse it, and pick out one of the fields, which contains a URL to a docx file.
I then need to store that document on our local filesystem as a PDF.
The process needs to repeat every n days.
I used to know PHP quite well, but would it be OK for this task? Would Python be better?
I don't have a server at work, so I was thinking of getting a Raspberry Pi.
How would you suggest I go about this?
I have a few ideas: use wget or curl from a cron job to get the XML file, then use PHP, Python, or bash to parse it, fetch the docx with wget or curl, and then use a command-line PDF tool. If the results end up on a website, should I load them into an SQL database or just list them as files in a directory?
I would appreciate any ideas.
Martin

Personally, I would go with node.js. It is easy to set up a node server on a Raspberry Pi, and node.js has a library for just about anything. There are plenty of simple setup tutorials out there, and SO has a lot of info on topics like XML parsing in node. JavaScript is pretty easy to code in.
For example, if you need a docx converter, here is one: mammoth.js (note that it converts docx to HTML, so a further HTML-to-PDF step would still be needed).
Good luck!
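Since the question also asks about Python, here is a minimal sketch of the whole pipeline in that language. It is only an illustration under stated assumptions: the name of the XML field holding the link is not given in the question, so the sketch simply scans every element for text that looks like a .docx URL, and it assumes LibreOffice is installed so that the soffice command is available for the PDF conversion. The function names and the feed URL passed to run() are hypothetical.

```python
import subprocess
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path


def find_docx_urls(xml_text):
    """Return the text of every element whose content looks like a .docx URL."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        text = (element.text or "").strip()
        lowered = text.lower()
        if lowered.startswith(("http://", "https://")) and lowered.endswith(".docx"):
            urls.append(text)
    return urls


def convert_command(docx_path, out_dir):
    """Build the headless LibreOffice call that writes a PDF into out_dir."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(docx_path)]


def run(feed_url, out_dir):
    """Download the XML feed, fetch each linked docx, and convert it to PDF."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(feed_url) as response:
        xml_text = response.read()
    for url in find_docx_urls(xml_text):
        # Use the last path segment of the URL as the local file name.
        docx_path = out_dir / url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(url, docx_path)
        subprocess.run(convert_command(docx_path, out_dir), check=True)
```

Saved as a script that ends with a call to run(...), this could be started from a cron entry to repeat every n days. Listing the resulting PDFs as files in a directory is probably enough to begin with; a database can be added later if the results need to be searched.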

Related

Convert docx to pdf in PHP

Right, first a little background that will help put this all in focus.
I have several indd (InDesign) files. I can convert these to PDF and then to docx.
Using the PHPWord library I can then effectively do a mail merge, replacing several areas of my document with text and one image.
I then want to convert the result to a PDF, so that I can stitch several PDFs together with Ghostscript for printing.
I have a Word macro that I can execute just fine from a standard command line. If I try the same command line from PHP, it just hangs.
I've tried various forms of that, using system, exec, and passthru, and using PsExec; they all either hang and then time out, or don't work and skip through.
I've seen other examples that use COM objects, something like this:
http://www.sitepoint.com/make-microsoft-word-documents-php/
These all either hang or give me problems with the COM object that I'm trying to create.
Am I trying for the impossible, or is there perhaps another way?
I've also given e-PDF Document Converter v2.1 a go, but without success.
Currently I think there is some permissions issue going on, but I'm really at a loss as to how to get around it or what to do.
I would maybe like to use either LibreOffice or OpenOffice, as they both seem to have command-line tools, but when I open the resulting PDF or doc file it displays very poorly.
Any help?
Thanks
Richard
Update
I'm now thinking maybe I'll stitch the Word documents together and just allow the user to download the result and print it themselves.
Job done, easy!
But if there is a better way, I'm open to it.
Update 2
This is on a Windows platform.

Maybe something like this?
sudo apt-get install unoconv
doc2pdf respondus-docx-sample-file.docx
In PHP:
exec("doc2pdf " . escapeshellarg($yourDocxFile));

extract images from PDF with PHP

The client wants to upload a PDF containing images as a way of batch processing multiple images at once.
I have already looked around, and out of the box PHP can't read PDFs.
What are my alternatives?
I already know the host has not installed ImageMagick or any PDF library, and the exec function is disabled. That basically leaves me with nothing to work with, I guess?
Does anyone know of an online service that can do this, with an API of sorts?
Thanks in advance.

AFAIK, there is no PHP module to do it. There is a command line tool, pdfimages (part of xpdf). For reference, here's how that works:
pdfimages -j source.pdf image
This extracts all images from source.pdf as image-000.jpg, image-001.jpg, etc. Note that -j writes JPEG files only for images stored as JPEG (DCT) inside the PDF; other images are written as PPM or PBM files.
Possible Options
Being a command line tool, it needs exec (or system, passthru, or any of the other command-executing functions built into PHP). As your environment doesn't have that, I see four options:
1. Beg for exec to be turned on for you (your hosting provider can limit what you can exec to a single command)
2. Change the design: how about a ZIP upload?
3. Roll your own, using the source code of pdfimages as a model
4. Let pdfimages do the heavy lifting, by running it on a remote host you do control
Regarding #3, I don't think rolling your own would be too difficult if you only need to solve a very narrow set of requirements. I seem to recall that the image boundaries in a PDF are well defined: read the file up to a boundary, cut to the end of the boundary, decode the data according to the stream's filter (for JPEG images, stored with the DCTDecode filter, the bytes can be written out as they are), and write it to a file, then repeat. However, that may be too much...
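To make the "narrow requirements" idea concrete, here is a rough sketch in Python (used purely for illustration; the same byte-level scan could be ported to PHP string functions). It is not how pdfimages works and it is far from a real PDF parser: it assumes the images are stored as plain DCTDecode (JPEG) streams, that the PDF does not use encryption or further filters on those streams, and that the regular expression is good enough to find the stream boundaries. The function name is hypothetical.

```python
import re

# A stream object looks like: "<< ...dictionary... >> stream <bytes> endstream".
# The first group may also pick up text from earlier non-stream objects,
# which is tolerated here because it is only searched for "/DCTDecode".
STREAM_RE = re.compile(rb"<<(.*?)>>\s*stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def extract_jpeg_streams(pdf_bytes):
    """Return the raw bytes of every stream flagged as DCTDecode (JPEG)."""
    images = []
    for match in STREAM_RE.finditer(pdf_bytes):
        dictionary, data = match.group(1), match.group(2)
        # JPEG data starts with the SOI marker FF D8.
        if b"/DCTDecode" in dictionary and data.startswith(b"\xff\xd8"):
            images.append(data)
    return images
```

Each returned item can be written straight to a .jpg file. Images stored with other filters (for example FlateDecode) would need real decoding, which is where this approach stops being simple.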
If rolling your own is too complicated, then option #4 is kind of like what Joel Spolsky describes for working with complicated Excel objects (see the numbered list under the bold heading "Let Office do the heavy work for you").
1. Find a cheap hosting environment (e.g. Amazon EC2) that lets you use exec and curl
2. Install pdfimages
3. Write a PHP script that takes the URL of a PDF, fetches that PDF with curl, writes it to disk, passes it to pdfimages, and then returns the URLs of the resulting images.
An example exchange could look like this:
GET http://www.cheaphost.com/pdfimages.php?extract=http://www.limitedhost.com/path/to/uploaded.pdf
Content-type: text/html
<html>
<body>
<ul>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-000.jpg</li>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-001.jpg</li>
</ul>
</body>
</html>
So your single pdfimages.php script (running on the host with exec available) can both extract images and give you access to the extracted images. When extracting, it reads the PDF you point it to, runs pdfimages on it, and gives you back a list of URLs to call to retrieve the extracted images. When retrieving, it just gives you back the image itself.
You would need to deal with cleanup; perhaps the thing to do would be to delete each image after retrieval. You would also need to handle security. I don't know what's in these images, but the content might need to be protected with SSL and other precautions.
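The extraction half of such a remote helper could look roughly like the following Python sketch (the answer describes a PHP script; Python is used here only to illustrate the flow). It assumes pdfimages is installed on the remote host and that the PDF has already been downloaded to disk. The function names, the per-job directory layout, and the job id are hypothetical; the URL shape follows the example exchange above.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    """Build the pdfimages call; -j keeps JPEG-encoded images as .jpg files."""
    return ["pdfimages", "-j", str(pdf_path), str(prefix)]


def retrieval_urls(image_names, base_url, job_id):
    """Map extracted file names to the retrieve URLs handed back to the caller."""
    return [f"{base_url}?retrieve={job_id}/{name}" for name in sorted(image_names)]


def extract(pdf_path, work_dir, base_url, job_id):
    """Run pdfimages inside a per-job directory and return the retrieve URLs."""
    job_dir = Path(work_dir) / job_id
    job_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, job_dir / "image"), check=True)
    return retrieval_urls([p.name for p in job_dir.iterdir()], base_url, job_id)
```

Keeping each upload in its own job directory makes the cleanup step simple (delete the directory once the images have been retrieved). The job id and file name in a retrieve request should be validated before being used as a path.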

You can use pdfimages, which can be installed this way:
apt install poppler-utils
Then use it this way to get all the images as PNG files:
pdfimages -png mypdf.pdf image
The images will be written to the current folder as image-000.png, image-001.png, etc.
There are many options available, including some to change the output format; more information here.
I hope this helps!

Word to xml conversion

Here is my problem: my organization wants users to upload Word documents to the server. On the server side, each Word document (with enforced styles) needs to be converted to XML files. Next, I need to use PHP to parse the Open XML format files and put the content into the database. Does anyone know how to convert Word to XML on the server side automatically? Is there any API or sample code for PHP to parse Open XML formats? Your suggestions are appreciated.

Have you looked at using VBA?
I have had to do similar work, and I used VBA within a WSF or VBS file. If your server is a Windows environment, it will run right from the OS. You can execute this from PHP (not recommended) or drop the docx file into a hot folder outside of the web server environment. I recommend the latter, since the web server environment can introduce security issues.
Another note: if you want to separate content from styling, you're going to need to perform some post-processing on the output markup. Word is a word processor, so styling is what it is designed to do. If this is a requirement, I would suggest moving to a structured, XML-based authoring tool instead.
Hope this helps!
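If the uploads are .docx files (rather than the older binary .doc format), no conversion step may be needed at all: a .docx file is already a ZIP package of Open XML parts, with the main text in word/document.xml. The following is a small Python sketch of reading paragraphs and their style ids from that part; PHP's ZipArchive and XML extensions could be used along the same lines. The function name is hypothetical, and the sketch ignores headers, footnotes, and other parts of the package.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_paragraphs(docx_file):
    """Return (style_id, text) pairs for each paragraph in word/document.xml."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # A paragraph's text is split across runs; join all w:t pieces.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        paragraphs.append((style_id, text))
    return paragraphs
```

Because the documents are "enforced with styles", the style id of each paragraph (for example a heading style) can be used to decide which database field the text belongs to. Paragraphs without an explicit style come back with None.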

I want to write multiple images to an odt file either in python or php

I want to write multiple image files to an odt file. I will specify a directory, and the script will take it from there through a loop. But where do I start? I have never done anything like this before!
I found this Python code, which can convert HTML to ODT, so we could build an HTML file first and then call this code. But there is no reference on how to use it.
html2odt code

At last I found a PHP way to write odt directly! It's well documented.
http://www.odtphp.com/
I have also written a complete practical solution in PHP. You can upload multiple images and get the odt document generated.
The code is hosted at http://code.google.com/p/images2odt/
The first post is done here.

Anyone wanting to use the Python code will need a Python interpreter, version 2.6. It might also work with version 2.7. It's mainly used on Linux, but there are Windows and Mac versions as well. You will also need the files listed in the from and import statements. These files are in some of the other folders. It looks like it is part of a much bigger Linux package. One last thing: Python scripts usually take their arguments from the command line.
Additional info:
I looked over the setup.py file, and it told me that this is an API library for OpenDocument files called odfpy. The version is 0.9.2. The link it has for the documentation is broken. A Google search for odfpy came up with a place to download a more recent version (0.9.4) as a tarball here:
http://pypi.python.org/pypi/odfpy
The documentation can be found here in an Open Office document:
https://joinup.ec.europa.eu/software/odfpy/document/api-odfpyodt
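For the original task (a directory of images into one odt file), odfpy can also be used directly, without going through HTML. The following is only a sketch under stated assumptions: it assumes a current odfpy release installed for Python 3 (not the 0.9.x versions mentioned above), the helper names are hypothetical, and every image is given the same fixed frame size, so aspect ratios are not preserved.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def list_images(directory):
    """Return the image files in a directory, sorted by name."""
    return sorted(p for p in Path(directory).iterdir()
                  if p.suffix.lower() in IMAGE_SUFFIXES)


def write_odt(image_paths, out_path, width="12cm", height="9cm"):
    """Write one paragraph per image into a new ODT file using odfpy."""
    # Imported here so list_images() stays usable without odfpy installed.
    from odf.opendocument import OpenDocumentText
    from odf.draw import Frame, Image
    from odf.text import P

    doc = OpenDocumentText()
    for path in image_paths:
        href = doc.addPicture(str(path))  # copies the file into the package
        frame = Frame(width=width, height=height, anchortype="as-char")
        frame.addElement(Image(href=href))
        paragraph = P()
        paragraph.addElement(frame)
        doc.text.addElement(paragraph)
    doc.save(str(out_path))
```

A call such as write_odt(list_images("photos"), "photos.odt") would then loop over the directory as described in the question; "photos" is just a placeholder name.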

openoffice document (odt) to PDF with command line on Linux?

We are building a PHP script that we need at work to create reports as PDFs.
The reports will be created using templates from PostgreSQL.
So far I have found that it can be done with PHP and odt (OpenOffice) files [http://www.odtphp.com/] (do you have any other suggestions?).
Now, how can I convert the results to PDF so that teachers get the final reports as PDFs?
Any tips? The server has no GUI and I want to make it as simple as possible.
We tried going from PHP to PDF directly with FPDF [http://www.fpdf.org/], but it is really a CPU killer!

http://www.artofsolving.com/opensource/pyodconverter
This may help you. It needs OpenOffice to be started as a service, and the Python script merely uses its API; maybe you can write one in PHP too.
