I need a library to extract text from documents(doc, doxc, pdf, html, rtf, odt.....). Is there one library(for all document types) for this purpose?
Do batch conversions of the files to one format, using either
odtphp http://www.odtphp.com/index.php?i=tutorials&p=tutorial1
or
PyODConverter (run this using the PHP command line executable tool to make it 'work with' php) http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html
Then run that last result through any generic pdf2txt library, or an phpOCR.
A safer bet would be to convert your documents to plain text first, and then parse the contents of the plain text version to do whatever you want. There's a lot of command line converters around that allow you to convert from different formats to plain text (Word to txt, PDF to txt, etc.), on ANY operating system.
BTW Regarding PDFs : not all of them actually contain plain text, some are just a collection of scanned images, so in that case you'll be out of luck (unless you would use OCR on them).
OpenTBS is a PHP tool that can read an modify the contents of any OpenDocument files (ODT, ODS, ODG, ODF, ODM, ODP, OTT, OTS, OTG, OTP). But also OpenXML files (DOCX, XLSX, PPTX).
If you can convert files having an unsupported format you need to one of those supported by OpenTBS, then it's done.
On systems other than Windows, there is no such library to do this for you, and there is a high probability there won't be as such in the future. Main reason is that the document formats you specified are continuously updated from time to time.
On Windows however, if you have php installed, you can definitely use activex extensions to read all of these formats with ease, and you will only need the proper office application to be installed apart from php on the machine to get this to work. This will also make sure future versions of documents continue to work in your php code, as long as your office applications can read those document. Look for 'php win32' libraries in php library collections and you should find some nice one there
Related
I have relatively sensitive data in .docx, .xlsx and PDF files that all need to be converted to a single PDF file locally. Sending these files off to phpdocx or Google Docs or anything like this is not an option.
The only other option I am seeing is OpenOffice / LibreOffice but I am not satisfied with how they are converting the documents.
Is there any other alternative anyone is aware of? Thanks!
Definitely a difficult task. The very recent release of LibreOffice 3.6 has fixes to it's docx processing if that might help, but you haven't specified what the actual problems you encountered when you tried OpenOffice.
If you have time to experiment (and bring in any tools/languages you need to get the job done) you could try LibreOffice to produce PDFS, then use one of the many PDF libs to stitch the PDFs into the single file you require.
You could also look at ODFConverter which has traditionally been much better with DOCX than either OpenOffice or LibreOffice. This would allow you docx -> odt -> pdf. I think it can do the xlsx also. Then do the PDF stitching again.
I suggest testing the stages manually at first and if promising, try something like JODConverter (requires Java) to allow you to automate the process via scripts.
Good luck.
I have an 'Excel' file (with a .xls extension) which turns out to be a plain text HTML file masquerading as a spreadsheet (if I run 'file [filename]' I get 'HTML document text' as the type). The file comes from a third party supplier and I have no control over the format.
I want to convert the file into Excel 97-2003 format so that I can read it in a PHP library (PHPExcel). I can do this by opening the file in Excel, ignoring the warning message and then explicitly saving it as Excel 97-2003, but I want to automate the whole process from the initial file coming in to extracting the cell data and dumping it into a database.
Ideally I'd like to use a PHP library for the conversion, because that would integrate better with the rest of the codebase, but libraries written in Perl, Java or (at a pinch) C# would also work, provided they don't rely on the server running Windows and Office.
Is there a tool or library available which can provide this functionality?
PhpExcel http://phpexcel.codeplex.com/ is decent but you'll have issues with it gobbling up memory with large sheets. For large sheets or speed I'd recommend perl writeExcel http://search.cpan.org/~jmcnamara/Spreadsheet-WriteExcel-2.37/lib/Spreadsheet/WriteExcel.pm
The perl writeExcel library is faster and uses less memory than PhpExcel. I then use
<?php
echo passthru('perl filename.pl');
?>
to run the perl script through PHP.
It looks like for the moment the only answer is to manually process the file by opening it in Excel and re-saving it, which does work but doesn't allow for complete automation.
I'll take a look at the new version of PHPExcel with HTML support once it has been released though as that sounds promising.
What is, according to you, the best way to convert uploaded files of any kind (.doc, .docx,...) into a pdf-file using nothing but php. Is it even possible to do so?
I looked at FPDF, but this creates the pdf files from text.
An other solution previously given was to use the PDFlib library on your server, but unfortunately, my server doesn't support this library...
What is the best way to convert to files my users upload on my site to pdf files?
A simpler approach would be to restrict uploads to .PDF format programmatically and require your users to only upload .pdf files. Provide a link on the upload page to a free and open source pdf printer (e.g. Cuteftp) that the user can install to create .pdf documents from any file that can be printed.
Trying to do it through PHP will be problematic because the uploads could be generated from many different programs that would be impossible to cater for in their entirety. e.g. How would it handle Scribus or ABC Flowcharter or any other 'non-standard' application someone used to create a document?
Much better to filter the upload upfront.
The best server-side PDF generator from those I tried was, so far, wkhtmltopdf, a WebKit-based, self-contained invisible browser that can render any HTML+CSS and generate a PDF from it. Reasonably fast and fairly reliable, has some useful PDF options, such as page size, orientation, etc.
The second part of the job in your case is to convert documents to HTML prior to feeding them to wkhtmltopdf. If possible, have your users upload the docs in HTML (Word and Co. can export (crappy) HTML). If this is not an option, you will have to find a tool just for that, which, in my opinion, is much easier than finding a tool that converts Word docs directly into PDF.
Good thing about wkhtmltopdf is also that you can feed the output of your PHP script to it using the ob_xxx() functions.
PHP Excel best simple way to create doc, docx, xls, xlsx, pdf files with PHP. Its lot easier with clear documentation.
Use Microsoft Office to render Microsoft Office documents, if you care about accuracy at all. This is easily done by invoking Office over COM.
Get access to your server, and install what you need. Doing so would be far easier than monkeying around with sub-par solutions.
Well... I can think of one way of doing it quite easily, but it doesn't involve using PHP.
Upload your documents to a folder on your server, that are browsable by your users.
EG: http://mysite.com/docs/
Then get your users to install a virtual printer driver such as Primo PDF
http://www.primopdf.com/index.aspx
then they can load the document into their browser, and print to PDF for offline browsing.
If this is not an option, and your dealing with office documents that conform to the openXML standard, you could attempt to parse the XML doc into a PHP page for display in the browser, then use JavaScript to trigger a print.
Unfortunately, it does still depend on your user having a PDF printer installed.
Alternatively, you could just load the docs natively, and print to your own PDF printer, then upload the PDF's to the web server for download.
I can't think of any easy way of doing this otherwise, without installing all sorts of different document parser tool-kits and doing a huge amount of behind the scenes work.
I have a module which merges a document from database records and .docx or .odt document model.
I have to output .docx, .odt or .pdf. For outputting to Microsoft and Open formats, there is no problem, all works properly.
But what I want to know is, can I output to a format (like XML or HTML) which I can use to subsequently build a PDF document?
If I can't, are there any libraries which provide a merge document capability like:
DOCX (or ODT) + database record => PDF
And I don't want to use phplivedocx.
I successfully put a portable version of libreoffice on my host's webserver, which I call with PHP to do a commandline conversion from .docx, etc. to pdf. on the fly. I do not have admin rights on my host's webserver. Here is my blog post of what I did:
http://geekswithblogs.net/robertphyatt/archive/2011/11/19/converting-.docx-to-pdf-or-.doc-to-pdf-or-.doc.aspx
Yay! Convert directly from .docx or .odt to .pdf using PHP with LibreOffice (OpenOffice's successor)!
I don't know any PHP library that does DOCX => PDF. In fact, the DOCX conversion to something else in PHP is an opened problem today. This is independent from how you made the DOCX.
But as you said, they are PHP libraries for HTML => PDF.
Html2Pdf is a well reputed PHP library that does HTML => PDF.
There is also DomPdf.
So if you can found a PHP library for DOCX => HTML, then it would work.
Of course it has some limitations because even if both PDF and DOCX are opened format, they have very specific features, they need huge rendering process, and the editors keep some good tips for them.
Converting DOCX to HTML is theoretically possible. There is a Windows software that does it by EpingSoft. If you need to do it in PHP, some web articles tell you how to make it, but since I cannot found any PHP code doing this, I guess it is more theoretical than practical.
http://www.quepublishing.com/articles/article.aspx?p=691502
How complicated that process would be
depends on how much of Word's native
formatting you need to preserve during
the conversion.
If you want to try this way, it's good to know that OpenTBS enables you to read the XML before and after the merge. It is based on a PHP class names TbsZip that can read any XML file in the DOCX since it's in fact a zip archive.
There is also posible to use PDF files directly in TBS after decompressing:
qpdf --qdf --object-streams=disable in.pdf out.pdf
I have a web project where I must import text and images from a user-supplied document, and one of the possible formats is Microsoft Office 2007. There's also a need to generate documents in this format.
The server runs CentOS 5.2 and has PHP/Perl/Python installed. I can execute local binaries and shell scripts if I must. We use Apache 2.2 but will be switching over to Nginx once it goes live.
What are my options? Anyone had experience with this?
The Office 2007 file formats are open and well documented. Roughly speaking, all of the new file formats ending in "x" are zip compressed XML documents. For example:
To open a Word 2007 XML file Create a
temporary folder in which to store the
file and its parts.
Save a Word 2007 document, containing
text, pictures, and other elements, as
a .docx file.
Add a .zip extension to the end of the
file name.
Double-click the file. It will open in
the ZIP application. You can see the
parts that comprise the file.
Extract the parts to the folder that
you created previously.
The other file formats are roughly similar. I don't know of any open source libraries for interacting with them as yet - but depending on your exact requirements, it doesn't look too difficult to read and write simple documents. Certainly it should be a lot easier than with the older formats.
If you need to read the older formats, OpenOffice has an API and can read and write Office 2003 and older documents with more or less success.
The python docx module can generate formatted Microsoft office docx files from pure Python. Out of the box, it does headers, paragraphs, tables, and bullets, but the makeelement() module can be extended to do arbitrary elements like images.
from docx import *
document = newdocument()
# This location is where most document content lives
docbody = document.xpath('/w:document/w:body',namespaces=wordnamespaces)[0]
# Append two headings
docbody.append(heading('Heading',1) )
docbody.append(heading('Subheading',2))
docbody.append(paragraph('Some text')
I have successfully used the OpenXML Format SDK in a project to modify an Excel spreadsheet via code. This would require .NET and I'm not sure about how well it would work under Mono.
You can probably check the code for Sphider. They docs and pdfs, so I'm sure they can read them. Might also lead you in the right direction for other Office formats.