I want to add a Word import function to our CMS. The only problem is that I cannot seem to find a good library for reading docx files (Word 2007).
Does anyone have recommendations? The library should be able to extract the content of the document and basic styling such as italic, bold and superscript.
Thanks for your help.
docx files are actually just zip containers for the document's XML. You should be able to unzip the docx file, go to the word folder inside, and open document.xml, which holds the actual text. Things like fonts and styles live in other XML files in the docx container, so you'll probably want to experiment a bit to figure out what is what and how to match it up (start by using namespaces, I bet).
But yeah, unzip the file, then use SimpleXML to convert it into something you can actually work with.
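As a rough illustration of the unzip-then-parse approach, here is a minimal sketch that pulls the text runs out of word/document.xml and reports whether each run carries direct bold, italic or superscript formatting (direct formatting sits in the run properties inside document.xml itself). It assumes PHP's zip and SimpleXML extensions are available. The function names extractRuns and readDocxRuns are made up for this example, and formatting inherited from named styles in styles.xml is not resolved.

```php
<?php
// Namespace URI used by WordprocessingML elements such as w:r and w:t.
const WORD_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: turn the XML of word/document.xml into a list of runs.
function extractRuns(string $documentXml): array
{
    $doc = simplexml_load_string($documentXml);
    if ($doc === false) {
        throw new RuntimeException('Could not parse document.xml');
    }
    $doc->registerXPathNamespace('w', WORD_NS);

    // A toggle such as <w:b/> is on unless its w:val attribute switches it off.
    $on = '[not(@w:val) or (@w:val != "0" and @w:val != "false")]';

    $runs = [];
    foreach ($doc->xpath('//w:r') as $run) {
        // Prefix registrations are not inherited by elements returned from xpath().
        $run->registerXPathNamespace('w', WORD_NS);

        $text = '';
        foreach ($run->xpath('w:t') as $t) {
            $text .= (string) $t;
        }
        if ($text === '') {
            continue; // skip runs that only hold breaks, images, etc.
        }

        $runs[] = [
            'text'        => $text,
            'bold'        => count($run->xpath('w:rPr/w:b' . $on)) > 0,
            'italic'      => count($run->xpath('w:rPr/w:i' . $on)) > 0,
            'superscript' => count($run->xpath('w:rPr/w:vertAlign[@w:val="superscript"]')) > 0,
        ];
    }
    return $runs;
}

// Hypothetical helper: read document.xml straight out of the .docx zip container.
function readDocxRuns(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Could not open $path as a zip archive");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('word/document.xml not found in archive');
    }
    return extractRuns($xml);
}
```

Calling readDocxRuns() on an uploaded file returns one array entry per run, which you could then map to b, i and sup tags in the CMS.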
PHPDocX PRO includes a TransformDoc class that can read .docx (zip) files and generate XHTML (or PDF) from them:
...
require_once 'phpdocx_pro/classes/TransformDoc.inc';
$doc = new TransformDoc();
$doc->setStrFile($file->filepath);
$doc->generateXHTML();
$html = $doc->getStrXHTML();
There is a library that does this, but it works with Zend Framework. Maybe it will help you.
It is called phpLiveDocx: http://www.phplivedocx.org/downloads/
The library is licensed under the New BSD license.
I have just found a library that supports both reading and writing. Check it out on CodePlex: http://openxmlapi.codeplex.com. It is licensed under GPLv2.
Or, since you requested a library, you may want to look into something like Docvert. I was just looking around based on your question, and it's my favorite so far for PHP. You input the word file location, it transforms it into something simple with the attributes and all that good stuff.
Convert the docx document to ODT using OpenOffice, then use eZ Components to do the parsing and import. They actually use this import in their CMS, eZ Publish.
Here is a simple working solution I found
http://webcheatsheet.com/php/reading_the_clean_text_from_docx_odt.php
I am trying to convert a DOCX file to PDF with PHPWord. When I execute the script, it looks like some style elements are not converted. In the DOCX file I have one image, two tables with 1px borders and hidden borders, and I am using tabs.
When I execute the script I get a PDF file without the image, all the tabs are replaced with spaces, and all the tables have a 3px border.
Does someone know why I am missing these styles?
Here is my script:
while ($data2 = mysql_fetch_array($rsSql)) {
    $countLines = $countLines + 1;
    $templateProcessor->setValue('quantity#'.$countLines, $data2['quantity']);
    $templateProcessor->setValue('name#'.$countLines, $data2['name']);
    $templateProcessor->setValue('price#'.$countLines, "€ " .$data2['price'] ."");
}
\PhpOffice\PhpWord\Settings::setPdfRenderer('./dompdf');
\PhpOffice\PhpWord\Settings::setPdfRendererPath('./dompdf');
\PhpOffice\PhpWord\Settings::setPdfRendererName('DOMPDF');
$temp_file = tempnam(sys_get_temp_dir(), 'Word');
$templateProcessor->saveAs($temp_file);
$phpWord = \PhpOffice\PhpWord\IOFactory::load($temp_file);
$xmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord , 'PDF');
$xmlWriter->save('result.pdf');
header("Content-type:application/pdf");
header("Content-Disposition:attachment;filename='result.pdf'");
readfile("result.pdf");
After a look at the source code, it seems that PHPWord first converts the document into an HTML representation and then hands it to dompdf, a separate converter, to save it as PDF.
The open issue #1139 confirms this, and it also deals with missing styles:
The PDF writers being used are taking in the HTML output, which also lacks the styling. The classes are being defined in the <style> tag, but they are just not being used.
Also the last message adds:
This still seems to be an issue. html and pdf outputs do not replicate the some styles in docx (header / footers).
Concerning your border problem, another SO question shows a similar issue in an HTML -> PDF conversion. The solution there was to edit the CSS, which you obviously cannot do in your sample code unless you first convert to HTML yourself.
In conclusion, you may not be able to solve your problem in the short term. If you are not part of the dev team, you could submit bug reports to them (and not to dompdf, since it is only an HTML-to-PDF converter and this is outside its scope). GitHub lets you attach DOCX files to the issue report.
Alternatives
You could check out SO question 204860 about server-side PDF editing libraries. Below are two alternatives: one is free software, the other is closed source and paid.
LibreOffice
Another way is to use LibreOffice in headless mode (command line execution without interface):
libreoffice --headless --convert-to pdf <filename_to_convert>
A PHP wrapper for LibreOffice, Office Converter, is also available if you don't want to bother calling libreoffice through exec().
Check whether LibreOffice conversion suits your needs (it may not cover all cases, but it may be good enough for your scope).
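If you do go through exec(), a minimal sketch could look like the following. It assumes the binary is on the PATH under the name libreoffice (on some systems it is soffice) and uses LibreOffice's --outdir option to choose where the PDF lands. The function names libreOfficeCommand and convertToPdf are made up for this example.

```php
<?php
// Hypothetical helper: build the shell command with safely quoted arguments.
function libreOfficeCommand(string $inputFile, string $outputDir): string
{
    return 'libreoffice --headless --convert-to pdf --outdir '
        . escapeshellarg($outputDir) . ' ' . escapeshellarg($inputFile);
}

// Hypothetical helper: run the conversion and return the path of the PDF.
function convertToPdf(string $inputFile, string $outputDir): string
{
    exec(libreOfficeCommand($inputFile, $outputDir) . ' 2>&1', $output, $exitCode);

    // LibreOffice names the result after the input file, with a .pdf extension.
    $pdf = rtrim($outputDir, '/') . '/' . pathinfo($inputFile, PATHINFO_FILENAME) . '.pdf';

    // Check the file as well as the exit code before trusting the result.
    if ($exitCode !== 0 || !is_file($pdf)) {
        throw new RuntimeException("LibreOffice conversion failed:\n" . implode("\n", $output));
    }
    return $pdf;
}
```

Quoting both paths with escapeshellarg() matters as soon as file names come from user uploads or contain spaces.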
Aspose
The best converter I have ever used at work is Aspose, an API covering documents with the Aspose.Words package, worksheets with Aspose.Cells, presentations with Aspose.Slides and so on. But it's closed source and pretty expensive (and you'll pay for updates if you want them after your license expires).
There is a way to use it in PHP through Java (Aspose.Words and Aspose.Cells) or .NET (Aspose.Words; the same seems to go for Aspose.Cells).
I want to convert any pdf, docx or doc file into HTML code using PHP, with the same styling as in the PDF. I have not found a proper solution.
Config::set('pdftohtml.bin', 'C:/poppler-0.37/bin/pdftohtml.exe');
// change pdfinfo bin location
Config::set('pdfinfo.bin', 'C:/poppler-0.37/bin/pdfinfo.exe');
// initiate
$pdf = new Gufy\PdfToHtml\Pdf($item);
// convert to html and return it as [Dom Object](https://github.com/paquettg/php-html-parser)
$html = $pdf->html();
Not working for me.
I had a similar problem and found a GitHub project that I used with Word docs. It worked fairly well at the time, but I haven't tested it lately. Try it.
https://github.com/benbalter/Convert-Word-Documents-to-HTML
I think this post could help you as a first step. With it, you'll be able to convert any PDF into HTML code using PHP.
After that, you can use the help provided by this post to convert .doc and .docx to PDF using PHP.
You can then build a function for each document extension that you want to convert into HTML.
Good luck.
I've come across a web service which presents an API for converting documents. I haven't tested it very thoroughly but it does seem to produce decent results at converting Word to HTML:
https://cloudconvert.org/
I have been trying to find a simple way to create OpenOffice calc files with no success.
I have tried:
openTBS - Seems to work by writing an XML file and a template file, but I can't find anything about the XML file format.
Ods php generator - I tried this one as it provides clear examples, but when I copy the files to my server I always get corrupted files.
Php doc writer - Tried an example and got an sxw file. I don't even know what that is.
ODS-PHP - No documentation, only one example for creating 4 cells.
Everything looks old, stalled and undocumented. Any suggestions?
I have used OpenTBS successfully.
You can generate both Excel and Calc files. It is also nice that you can "reuse" your HTML implementation, so to speak.
Maybe this thread could get you going: http://www.tinybutstrong.com/forum.php?thr=3069
Do the HTML version first, then edit it for Calc/Excel.
Spout from Box works well enough for me. There are some missing features but it is simple to use, has a fluent API, and has no dependencies (it supports composer but you can use it standalone and its dependency graph has zero depth 😉 ).
Here's my "array of objects to ODS" pipeline, using Spout:
(I'm not using their recommended use import because all this code fits in a much larger file that I didn't want to contaminate and the $factory pattern looks cleaner to me anyway)
$factory = 'Box\Spout\Writer\Common\Creator\WriterEntityFactory';
$factory::createODSWriter()
    ->openToBrowser('filename.ods')
    ->addRow($factory::createRow([
        $factory::createCell(__('Heading 1')),
        $factory::createCell(__('Heading 2')),
        $factory::createCell(__('Heading 3')),
    ]))
    ->addRows(array_map(function($row) use ($factory) {
        return $factory::createRow([
            $factory::createCell($row->first_val),
            $factory::createCell($row->second_val),
            $factory::createCell($row->third_val),
        ]);
    }, loadDataFromSomewhere()))
    ->close();
I would like to merge multiple doc or rtf files into a single file in the same format as the source files.
What I mean is that if a user selects multiple rtf template files from a list box and clicks a button on a web page, the output should be a single rtf file that combines the selected templates. I need to use PHP for this.
I haven't decided on the format of the template files, but it will be either rtf or doc, and I also assume that the template files contain some images.
I have spent many hours researching a library for this, but still can't find one.
Please help me out here!! :(
Thanks in advance.
If you are looking for a solution that handles RTF documents only, you can find a PHP package for merging multiple RTF documents here:
www.rtftools.com
Here is a short example on how to merge multiple documents together :
include ( 'path/to/RtfMerger.phpclass' ) ;
$merger = new RtfMerger ( 'sample1.rtf', 'sample2.rtf' ) ; // You can specify docs to be merged to the class constructor...
$merger -> Add ( 'sample3.rtf' ) ; // or by using the Add() method
$merger [] = 'sample4.rtf' ; // or by using the array access methods
$merger -> SaveTo ( 'output.rtf' ) ; // Will save files 'sample1' to 'sample4' into 'output.rtf'
This package allows you to handle documents that are bigger than the available memory.
I've been working on a similar project and haven't managed to find any PHP (or other open source) libraries for manipulating MS Word files. The way I approach it is kind of complicated, but it works. Here's how I would do it (assuming you have a Linux server):
Setup:
Install JODConverter and OpenOffice
Start OpenOffice as a server (see http://www.artofsolving.com/node/10)
Approach (ie. what to do in your PHP code):
Convert your MSWord or RTF files into ODT format by calling JODConverter via backticks or exec()
Unzip each file into a temporary directory of its own
Read the content.xml file from each unzipped document using a DOM parser
Extract the <office:text> contents from each, and concatenate
Put this concatenated xml back into the right spot in one of the content.xml files
Re-zip the contents of that temporary directory and give it an .odt extension
Use JODConverter to convert this file back to MSWord again
As I said, it's not pretty, but it does the job.
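The extract-and-concatenate steps in the middle of that list can be sketched with PHP's DOM extension. This is only an illustration under some assumptions: it works on the content.xml strings you have already read from the unzipped ODT files, the function names officeTextElement and mergeContentXml are made up, and it does not carry over automatic styles or images from the appended documents, so formatting may need extra work.

```php
<?php
// Namespace URI of the office: prefix in OpenDocument content.xml files.
const OFFICE_NS = 'urn:oasis:names:tc:opendocument:xmlns:office:1.0';

// Hypothetical helper: find the <office:text> element that holds the document body.
function officeTextElement(DOMDocument $doc): DOMElement
{
    $node = $doc->getElementsByTagNameNS(OFFICE_NS, 'text')->item(0);
    if (!$node instanceof DOMElement) {
        throw new RuntimeException('No <office:text> element found');
    }
    return $node;
}

// Hypothetical helper: append the body of each other document to the base document.
function mergeContentXml(string $baseXml, string ...$otherXmls): string
{
    $base = new DOMDocument();
    $base->loadXML($baseXml);
    $target = officeTextElement($base);

    foreach ($otherXmls as $xml) {
        $other = new DOMDocument();
        $other->loadXML($xml);
        foreach (officeTextElement($other)->childNodes as $child) {
            // The base document already declares its sequences; skip repeats.
            if ($child->localName === 'sequence-decls') {
                continue;
            }
            // importNode() makes a deep copy that belongs to the base document.
            $target->appendChild($base->importNode($child, true));
        }
    }
    return $base->saveXML();
}
```

The returned string is what goes back into content.xml of the first document before re-zipping it.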
If you're looking to go down the RTF route, this question may also help: Concatenate RTF files in PHP (REGEX)
I guess no one has been lucky enough to find the best solution for handling reports in PHP, especially when it's a .doc/.docx report or file. I searched for some time and then found phpdocx.com. It looks like an amazing PHP script, but it just doesn't work for me, and I don't know exactly where to find the output file. Unfortunately the documentation doesn't help at all.
Now I need to know how this script works: how the results are produced and become usable, and what the script needs in order to run, because it simply doesn't work on my localhost. I am using Apache 2 and PHP 5.2.6.
I don't actually need more than writing HTML into a real .doc file (not renaming an HTML file to .doc!). So if there is any solution (without the COM library, since I am not on a Windows server) to generate a real .doc file from HTML, please put it here.
Thanks very much in advance :)
I guess no one has been lucky enough to find the best solution for handling
reports in PHP, especially when it's a .doc/.docx report or file
This is not the question corresponding to the title, but you should try OpenTBS.
It's an open source PHP library which builds DOCX with the technique of templates.
No temp directory and no extra executable are needed. First create your DOCX, XLSX or PPTX with MS Office (ODT, ODS and ODP, the OpenOffice formats, are also supported). Then use OpenTBS to load the template and change the content using the template engine (easy, see the demo). At the end, you save the result where you need it: a new file, a download stream, or a PHP binary string.
OpenTBS can also change pictures and charts in a document.
Demo page
Documentation
The documentation of PHPDocX has been greatly improved.
Have you tried to look at the PHPDocX tutorial?
You may also have a look at the Forum.
require_once "Path of phpdocx library/CreateDocx.inc";
$docx = new CreateDocx();
// The HTML you want to embed in the document
$html = 'your HTML goes here';
$docx->embedHTML(
    $html,
    array(
        'parseDivsAsPs' => true,
        'downloadImages' => true,
        'WordStyles' => array(
            '<table>' => 'MediumGrid3-accent5PHPDOCX'
        ),
        'tableStyle' => 'NormalTablePHPDOCX'
    )
);
// Location where the docx file is generated (it is stored inside the word_export_file folder)
$docx->createDocx($varPublicPath.'/word_export_file/example1_'.time());