get data from documents

I want to make a web app that can get the values from a commonly used file type (such as xls or ppt) so that I can convert it into a custom format (like Google Drive does). With an xls (Excel document) file, for example, I want to be able to get the value of each cell. I would be fine with getting HTML for a file (like the HTML code that would display a Word document), because the values can be extracted out of that. I would like to do it on the client side, but I am okay with doing it on the server side with PHP.

Another approach would be to import the file as XML. PHP has great support for XML and could make short work of this. If you can get the files uploaded in OpenDocument Format, you can parse just about any of the types you listed (XLS, PPT, DOC, etc.).
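To make the XML approach concrete, here is a minimal sketch, assuming the upload is an OpenDocument spreadsheet (.ods). Such a file is a ZIP archive whose cell data lives in content.xml. The function names odsCellValues and odsCellValuesFromFile are made up for this example, and the second one needs PHP's zip extension. Repeated cells (table:number-columns-repeated) are not expanded here, so treat it as a starting point rather than a complete reader.

```php
<?php
// Sketch only: reads cell text from the content.xml of an .ods file.
// Function names are hypothetical; the namespaces are the standard ODF ones.

const NS_TABLE = 'urn:oasis:names:tc:opendocument:xmlns:table:1.0';
const NS_TEXT  = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0';

// Returns an array of rows, each row being an array of cell strings.
function odsCellValues(string $contentXml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($contentXml)) {
        throw new InvalidArgumentException('content.xml is not well-formed XML');
    }

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('table', NS_TABLE);
    $xpath->registerNamespace('text', NS_TEXT);

    $rows = [];
    foreach ($xpath->query('//table:table-row') as $rowNode) {
        $row = [];
        foreach ($xpath->query('table:table-cell', $rowNode) as $cellNode) {
            // A cell may hold several paragraphs; join them with newlines.
            $parts = [];
            foreach ($xpath->query('text:p', $cellNode) as $paragraph) {
                $parts[] = $paragraph->textContent;
            }
            $row[] = implode("\n", $parts);
        }
        $rows[] = $row;
    }
    return $rows;
}

// Convenience wrapper: pulls content.xml out of the uploaded .ods archive.
function odsCellValuesFromFile(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a ZIP archive");
    }
    $xml = $zip->getFromName('content.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('No content.xml found in the archive');
    }
    return odsCellValues($xml);
}
```

With an uploaded file you would call odsCellValuesFromFile($_FILES['doc']['tmp_name']) (the form field name 'doc' is an assumption) and loop over the returned rows.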

A pretty easy way to get data out of an Excel sheet online is to use a Google Apps Script. The process would be a lot to explain here, but with a bit of searching you can find all your answers.
As for a PPT, I can't think of an easy way.
As for documents (e.g. pdf, doc, docx), you can use Google Apps Script as well.
Although, if you're making your own tool for this, you may want to just research how the data is stored in the file and work from there.

Related

Extracting text from PDFs in PHP

I'm creating a PHP-based web application which allows the user to upload a PDF file. This file will then be read and checked for certain data (text).
The problem is I can't figure out how to even open a PDF file in PHP. There are some PDF libraries, mainly for creating PDFs, but they don't seem to be very good at reading them.
An alternative would be to use an already available solution in Python or something else (as described in other threads on this site), but I'd really like to stay in PHP as much as possible, as I intend to later export the data to MySQL, etc.
Any input on how to read a PDF and extract data from it would be much appreciated.

I personally haven't tried this out, but it looks like this one works: http://www.pdfparser.org/documentation
It's just a matter of downloading and telling your code to include it, just like the documentation shows.
Or you could try the class.pdf2text.php found in http://www.phpclasses.org/browse/file/31030.html

Getting google documents as downloadUrl

I am currently trying to get the downloadURL from a response sent via my server, but whenever $file->getdownloadUrl() is used it returns an empty value: ['downloadURL'] =>
My question is, is it possible to download Google Documents in the application/vnd.google-apps.document MIME Type?
My assumption is that these would contain a link to the online version of the document, but it would be good to be able to edit the document in the correct format, so that any formatting done would be retained when it is re-uploaded to Drive.
Regards,

Nope, you cannot download Google Documents in the application/vnd.google-apps.document MIME type. You can only export them to other formats.
Some workarounds:
Apps Script Document Services provide slightly better control over the document, but you won't be able to get full control over all formatting for now.
Export the file in a known format such as Microsoft Word and edit it. When you upload it back to Drive, you can request that it be converted back to Google Docs format, although you might lose or corrupt some formatting.

Website uploads converting of word and image files to html or pdf on the fly

A client has given me the task of creating a site with the ability to convert their file uploads into HTML or PDF for storage on the web server. I want them to be able to upload files (.doc, .tiff, .jpg, etc.) and have the site convert them on the fly, again, into either HTML or PDF.
I am open to software and APIs that do the trick, but the file MUST BE STORED ON THE CLIENT'S WEB SERVER after conversion. The client is using GoDaddy with SSL, if that helps. Any input is greatly appreciated, as I have been looking for a long-term solution to this problem that I will be able to use in future projects.
Things I have looked into but have had trouble using this way: Scribd, the OpenOffice API
Places I've found the most help so far here

Well, as for storing the file as HTML: when you upload images, all you need to do is store the file somewhere and then create an HTML file that looks something like this:
<html>
<body>
<img src="path/to/the/image/file.png" />
</body>
</html>
It might be worth converting the large files (especially TIFF) to another format. Converting .doc files might be a little more tricky. Have a look here: Convert .doc to html in php
Maybe also take a look at the Document zetaComponent, which is able to convert between different document types, although not all of those you mentioned are supported so far.
Creating a PDF should be almost as easy, as there are several libraries for PHP that can aid you. Just poke around on SO: Convert HTML + CSS to PDF with PHP?
Overall you will have to mix up a whole lot of stuff to get that job done. There is no "simple" solution to this.

Get PDF output from XML generated by a PHP file and translated with an XSLT

I've spent a couple of days thinking about the best practice for generating a PDF whose layout end users can customize themselves. The PDF output needs to be saved on the server, or sent back to the PHP file so the PHP file can save it, and the PHP file needs to know that it went OK.
I thought the best way to do this was to use XML, XSLT and Apache Cocoon. But I'm not sure if this is possible or if it's a good idea since I can't find any information of people doing anything similar. It cannot be an uncommon problem.
The idea came when I read about Cocoon converting XML through XSLT to PDF:
http://cocoon.apache.org/2.1/howto/howto-html-pdf-publishing.html
and being able to take in variables:
http://old.nabble.com/how-to-access-post-parameters-from-sitemap-td31478752.html
This is what I had in mind:
A PHP file gets called by a user, and the PHP file generates a source XML file with a specific name.
The PHP file then makes a request to Cocoon (on the same web server) to apply the user-defined XSLT to the XML file. A parameter will be needed here to know which XSLT to apply.
The response is handled by the PHP file and then saved as a PDF on the server, and can later be mailed away.
Will this work at all? Is there a better way to handle this?
The core problem is that the users need to be able to customize the layout on the PDFs themselves, and I need the server to save the PDF and to mail it later on. The users will use it for order confirmations, invoices, etc. And I wouldn't like to hard code the layout for each user.

I've had some good results in the past by setting up JasperReports Server and creating reports using iReport Designer. They're both available in F/OSS ("community") editions, though you can pay for support and value-adds if you need those things.
This was a good solution for us, since we could access it via the Java API for our Java system, and via SOAP for our PHP system. The GUI designer made tweaking reports very easy for non-technical business staff too.

I use webkithtml2pdf to generate my PDFs. Just create a document with HTML and CSS for printing like you would usually do, then run it through the converter.
It works great for generating things like invoices. You can use SVG for logos and illustrations, and they will look great in print since they are vector based. Even rounded corners with dotted outlines work perfectly.
A minor gotcha is that the input HTML must have the htm or html file name suffix, so you can't use the default tempfile functions.
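One way around the suffix gotcha, as a rough sketch: let tempnam() reserve a unique name, then rename that file so it ends in .html before writing the markup. The function name createHtmlTempFile and the 'pdf_' prefix are made up for this example; the converter call itself is left out because it depends on your setup.

```php
<?php
// Sketch only: creates a temporary file that ends in ".html" so a converter
// that insists on the suffix will accept it. Names are hypothetical.

function createHtmlTempFile(string $html): string
{
    // tempnam() creates an empty, uniquely named file without a suffix.
    $base = tempnam(sys_get_temp_dir(), 'pdf_');
    if ($base === false) {
        throw new RuntimeException('Could not create a temporary file');
    }

    // Give the file the suffix the converter expects.
    $path = $base . '.html';
    if (!rename($base, $path)) {
        unlink($base);
        throw new RuntimeException('Could not rename the temporary file');
    }

    if (file_put_contents($path, $html) === false) {
        unlink($path);
        throw new RuntimeException('Could not write the HTML to ' . $path);
    }

    return $path;
}
```

Pass the returned path to the converter and unlink() the file once the PDF has been written.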

Is it possible to extract Meta information from MS office files and/or PDFs with PHP?

So I have files....
.doc
.docx
.xls
.xlsx
and .pdf
that are on my server.
Is it possible (and if it is, how) to extract the meta data from those files using PHP?
I'm looking for things like Author, keywords, title, etc...
In office documents it's the information stored along with the document properties (File...Properties...Summary for 2003, Prepare...Properties for 2007).
In PDFs it's information found in Document Properties.
This is not on a Windows server.

I managed to extract a lot of meta information using XPDF on a Linux system a few years back. Nowadays, though, I would say Zend_PDF is your best bet. I haven't used it myself, but it looks good and promises everything you need. It seems to have no library dependencies, either.
For Word .DOCs, if you don't find a better way, plug into an OpenOffice server instance / command line and convert the files to ODT, which is XML and parseable. That is only needed if the meta data can't be extracted with a macro; it should be possible, but I don't know how much work it is. This OpenOffice Forum entry gives a ton of starting points for automated conversion.
The ...X formats (.docx, .xlsx) are ZIP archives of XML files, so it should be easily possible to fetch the meta data from them. Alternatively, you should be able to use OpenOffice's conversion filters here as well, if they transport the meta data.
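For the ...X formats, a minimal sketch of what that could look like: a .docx or .xlsx file is a ZIP archive, and Office normally writes the title, author and keywords to docProps/core.xml inside it. The function names below are made up for this example, the part path is the conventional one rather than a guarantee, and the wrapper needs PHP's zip extension.

```php
<?php
// Sketch only: reads title, author and keywords from the core properties
// part of a .docx/.xlsx file. Function names are hypothetical.

const NS_DC = 'http://purl.org/dc/elements/1.1/';
const NS_CP = 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties';

// Parses the XML of docProps/core.xml; missing properties come back as null.
function ooxmlCoreProperties(string $coreXml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($coreXml)) {
        throw new InvalidArgumentException('core.xml is not well-formed XML');
    }

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('dc', NS_DC);
    $xpath->registerNamespace('cp', NS_CP);

    // Returns the text of the first matching node, or null if there is none.
    $first = function (string $query) use ($xpath): ?string {
        $nodes = $xpath->query($query);
        return $nodes->length > 0 ? $nodes->item(0)->textContent : null;
    };

    return [
        'title'    => $first('/cp:coreProperties/dc:title'),
        'author'   => $first('/cp:coreProperties/dc:creator'),
        'keywords' => $first('/cp:coreProperties/cp:keywords'),
    ];
}

// Convenience wrapper: pulls docProps/core.xml out of the Office file.
function ooxmlCorePropertiesFromFile(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a ZIP archive");
    }
    $xml = $zip->getFromName('docProps/core.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('No docProps/core.xml found in the archive');
    }
    return ooxmlCoreProperties($xml);
}
```

The old binary .doc and .xls files are not ZIP archives, so this only covers .docx and .xlsx; the OpenOffice conversion route above still applies to the older formats.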
