I have an 'Excel' file (with a .xls extension) which turns out to be a plain text HTML file masquerading as a spreadsheet (if I run 'file [filename]' I get 'HTML document text' as the type). The file comes from a third party supplier and I have no control over the format.
I want to convert the file into Excel 97-2003 format so that I can read it in a PHP library (PHPExcel). I can do this by opening the file in Excel, ignoring the warning message and then explicitly saving it as Excel 97-2003, but I want to automate the whole process from the initial file coming in to extracting the cell data and dumping it into a database.
Ideally I'd like to use a PHP library for the conversion, because that would integrate better with the rest of the codebase, but libraries written in Perl, Java or (at a pinch) C# would also work, provided they don't rely on the server running Windows and Office.
Is there a tool or library available which can provide this functionality?
PHPExcel http://phpexcel.codeplex.com/ is decent, but you'll have issues with it gobbling up memory on large sheets. For large sheets, or when speed matters, I'd recommend the Perl module Spreadsheet::WriteExcel http://search.cpan.org/~jmcnamara/Spreadsheet-WriteExcel-2.37/lib/Spreadsheet/WriteExcel.pm
Spreadsheet::WriteExcel is faster and uses less memory than PHPExcel. I then use
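<?php
// passthru() sends the script's output straight through and returns nothing useful, so there is no need to echo it
passthru('perl filename.pl');
?>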
to run the perl script through PHP.
It looks like for the moment the only answer is to manually process the file by opening it in Excel and re-saving it, which does work but doesn't allow for complete automation.
I'll take a look at the new version of PHPExcel with HTML support once it has been released though as that sounds promising.
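Since the file is really HTML, one way to skip the Excel round trip might be to read the cells directly with PHP's DOM extension. The following is only a sketch under assumptions: the function name htmlTableToRows is made up for illustration, the data is assumed to sit in the first table of the document, and merged cells (colspan/rowspan) and nested tables are not handled.

```php
<?php
// Sketch: pull the rows of the first <table> out of an HTML document
// (for example a supplier file saved with an .xls extension).
// Note: loadHTML() assumes ISO-8859-1 unless the document declares a charset.
function htmlTableToRows(string $html): array
{
    $doc = new DOMDocument();
    // Supplier HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    $table = $doc->getElementsByTagName('table')->item(0);
    if ($table === null) {
        return $rows; // no table found
    }
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $cell) {
            // Only direct <td>/<th> children count as cells.
            if ($cell instanceof DOMElement && in_array($cell->nodeName, ['td', 'th'], true)) {
                $cells[] = trim($cell->textContent);
            }
        }
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }
    return $rows;
}
```

Calling htmlTableToRows(file_get_contents($path)) would then give an array of rows that can be inserted into the database without going through PHPExcel at all.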
Related
At the moment I am doing a mass interface of files/data, and some files are in XLS format, which I need to normalize into CSV (so basically, convert XLS files to CSV files).
The problem is that PHPExcel (and similar libraries) load the entire sheet data at once thus exhausting memory.
So far I tried various libraries (in the meantime negotiating to have the data in csv though no luck so far)
I am running my tests on files of various large sizes, and my memory allocation is set properly before and after my script runs using ini_set etc.
Is there a way that I can read an xls line by line or in chunks (like fgetcsv or fread) please?
I am programming this so it can work with any filesize (even if it takes ages to run) as this is a fully automated system.
PS: I checked this post and various others already
Reading an Excel file in PHP
Possible ways...
Get help from other languages. e.g. find a Python excel library and use it. Then call Python from PHP.
Modify the source code of those Excel readers
Use a command line tool to convert excel to csv, e.g. Pandoc maybe, and use the csv in PHP
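If the file is actually .xlsx, it is just a zip archive of XML files, so it can be unzipped and the values read from the XML directly (this does not apply to the older binary .xls format, which is not a zip file)
First decompose one xls into many small xls files via a non-PHP solution, e.g. VBA in Excel, then read each of them.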
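The "convert to CSV on the command line, then read the CSV in PHP" option could be sketched roughly as below. This assumes Gnumeric's ssconvert is installed and on the PATH (any other converter would do); the function names convertToCsv and readCsvRows are made up for illustration. The point is that fgetcsv only ever holds one row in memory.

```php
<?php
// Sketch: convert a spreadsheet to CSV with an external tool,
// then stream the CSV one row at a time.
function convertToCsv(string $xlsPath, string $csvPath): void
{
    // Assumption: ssconvert (part of Gnumeric) is available on this machine.
    $cmd = sprintf('ssconvert %s %s 2>&1', escapeshellarg($xlsPath), escapeshellarg($csvPath));
    exec($cmd, $output, $status);
    if ($status !== 0) {
        throw new RuntimeException("ssconvert failed:\n" . implode("\n", $output));
    }
}

function readCsvRows(string $csvPath): Generator
{
    $handle = fopen($csvPath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $csvPath");
    }
    try {
        // Explicit delimiter, enclosure and (empty) escape character.
        while (($row = fgetcsv($handle, 0, ',', '"', '')) !== false) {
            yield $row; // only the current row is held in memory
        }
    } finally {
        fclose($handle);
    }
}
```

Usage would be along the lines of convertToCsv('in.xls', 'out.csv') followed by foreach (readCsvRows('out.csv') as $row) { ... }. Workbooks with several sheets may need extra converter options, which this sketch does not cover.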
I need to manipulate XLS files from PHP.
I found the great library PHPExcel, which does 99% of the job I need. Sadly I miss that 1%, which is writing the name of the "Creating application" (gnumeric sees it as meta:generator) in Excel 97-2003 files (Excel5, as the library calls the format).
Is there any other way to write this tag?
I can work around this by converting the XLS to gnumeric (using ssconvert), adding the tag to the XML file and converting back to XLS, but since the script has to run on a headless server I'd rather avoid installing gnumeric.
Any help is appreciated, even if another method exists (a command line tool, another PHP library).
I have relatively sensitive data in .docx, .xlsx and PDF files that all need to be converted to a single PDF file locally. Sending these files off to phpdocx or Google Docs or anything like this is not an option.
The only other option I am seeing is OpenOffice / LibreOffice but I am not satisfied with how they are converting the documents.
Is there any other alternative anyone is aware of? Thanks!
Definitely a difficult task. The very recent release of LibreOffice 3.6 has fixes to its docx processing, which might help, but you haven't specified what problems you actually encountered when you tried OpenOffice.
If you have time to experiment (and bring in any tools/languages you need to get the job done) you could try LibreOffice to produce PDFs, then use one of the many PDF libs to stitch the PDFs into the single file you require.
You could also look at ODFConverter, which has traditionally been much better with DOCX than either OpenOffice or LibreOffice. This would allow you to go docx -> odt -> pdf. I think it can do the xlsx also. Then do the PDF stitching again.
I suggest testing the stages manually at first and if promising, try something like JODConverter (requires Java) to allow you to automate the process via scripts.
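As a rough sketch of the "convert each file to PDF, then stitch" pipeline, something like the following could be scripted from PHP. It assumes LibreOffice's soffice binary and Poppler's pdfunite are installed locally; those tool choices and the function names are my own assumptions, not something the steps above depend on.

```php
<?php
// Sketch: build the shell commands for a local convert-then-merge pipeline.
function buildPdfConvertCommand(string $inputPath, string $outDir): string
{
    // Assumption: LibreOffice is installed and "soffice" is on the PATH.
    return sprintf(
        'soffice --headless --convert-to pdf --outdir %s %s',
        escapeshellarg($outDir),
        escapeshellarg($inputPath)
    );
}

function buildMergeCommand(array $pdfPaths, string $outputPath): string
{
    // Assumption: pdfunite (from Poppler) is installed; the output file goes last.
    $args = array_map('escapeshellarg', $pdfPaths);
    $args[] = escapeshellarg($outputPath);
    return 'pdfunite ' . implode(' ', $args);
}

function runCommand(string $cmd): void
{
    exec($cmd . ' 2>&1', $output, $status);
    if ($status !== 0) {
        throw new RuntimeException("Command failed: $cmd\n" . implode("\n", $output));
    }
}
```

Each .docx and .xlsx would go through runCommand(buildPdfConvertCommand(...)) first, and the resulting PDFs plus the original PDFs would then be passed to buildMergeCommand in the order they should appear. Nothing leaves the machine at any point.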
Good luck.
I need a library to extract text from documents (doc, docx, pdf, html, rtf, odt...). Is there one library (for all document types) for this purpose?
Do batch conversions of the files to one format, using either
odtphp http://www.odtphp.com/index.php?i=tutorials&p=tutorial1
or
PyODConverter (run this using the PHP command line executable tool to make it 'work with' php) http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html
Then run that last result through any generic pdf2txt library, or phpOCR.
A safer bet would be to convert your documents to plain text first, and then parse the contents of the plain text version to do whatever you want. There are a lot of command line converters around that allow you to convert from different formats to plain text (Word to txt, PDF to txt, etc.), on ANY operating system.
BTW, regarding PDFs: not all of them actually contain plain text; some are just a collection of scanned images, so in that case you'll be out of luck (unless you use OCR on them).
OpenTBS is a PHP tool that can read and modify the contents of any OpenDocument file (ODT, ODS, ODG, ODF, ODM, ODP, OTT, OTS, OTG, OTP), and also OpenXML files (DOCX, XLSX, PPTX).
If you can convert the files that are in an unsupported format to one of the formats OpenTBS supports, then you're done.
On systems other than Windows, there is no single library that does this for you, and there is a high probability there won't be one in the future. The main reason is that the document formats you listed are updated regularly.
On Windows, however, if you have PHP installed, you can use the COM/ActiveX extensions to read all of these formats with ease; apart from PHP, you only need the proper Office application installed on the machine to get this to work. This also means future versions of the documents will continue to work in your PHP code, as long as your Office applications can read them. Look for 'php win32' libraries in PHP library collections and you should find some nice ones there.
My application generates some .xls files, and until now I was using the PHPExcel lib. Someone on SO recommended that I use this approach. The problem is that I have to use some .xls templates and append some data to them.
Can anyone give me some pointers? I don't get how xlsBOF() and xlsEOF() work, or how they would have to work in my case.
If the approach you use right now works for you, don't bother with anything else.
PHPExcel writes XML files (or more accurately zip files containing XML files), in the new Excel 2007 format. For this reason, it's not compatible with older office versions (unless you install the compatibility plugin in the older office).
What this code does is write a binary XLS file in Excel 97 (BIFF8) format. It's a bit of a hack though. This won't deal correctly with unicode issues and so on. xlsBOF writes the binary header of the XLS file, and xlsEOF the footer.
If you want to write binary XLS files, you're better off using PEAR Excel Writer. I have mixed experiences with that. It gets the job done, but to use it with unicode you have to look through the bug list for a few patches that fix BIFF8 format bugs (the package is poorly maintained). It's still better than the code you linked to though.
Update: PHPExcel also supports export as Excel 97. I remember that it used to be limited to the Office 2007 file format, but apparently that is no longer the case. So I would recommend using PHPExcel.
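For the template case, a sketch with PHPExcel might look like this. It assumes PHPExcel is on the include path; the function name appendRowsToTemplate and the file names are made up for illustration, and it is worth checking that the template's formatting survives the load/save round trip.

```php
<?php
// Sketch: load an .xls template, append rows below the existing data,
// and save the result as a binary Excel 97-2003 file.
function appendRowsToTemplate(string $templatePath, array $rows, string $outputPath): void
{
    // Assumption: PHPExcel is installed and reachable via the include path.
    require_once 'PHPExcel/IOFactory.php';

    // The reader is picked automatically from the file type.
    $workbook = PHPExcel_IOFactory::load($templatePath);
    $sheet = $workbook->getActiveSheet();

    // Start writing on the first row after the template's existing content.
    $startRow = $sheet->getHighestRow() + 1;
    $sheet->fromArray($rows, null, 'A' . $startRow);

    // 'Excel5' is PHPExcel's name for the Excel 97-2003 (.xls) writer.
    $writer = PHPExcel_IOFactory::createWriter($workbook, 'Excel5');
    $writer->save($outputPath);
}
```

A call such as appendRowsToTemplate('template.xls', [['Widget', 3, 9.99]], 'report.xls') would then replace the xlsBOF()/xlsEOF() code entirely.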