No styling when converting DOCX into PDF with PHPWord

I am trying to convert a DOCX file to PDF with PHPWord. When I execute the script, it looks like some style elements are not converted. The DOCX file contains one image and two tables, with 1px borders and hidden borders, and I am using tabs.
The resulting PDF file has no image, all the tabs are replaced with spaces, and all the tables have a 3px border.
Does anyone know why these styles are missing?
Here is my script:
while ($data2 = mysql_fetch_array($rsSql)){
$countLines=$countLines+1;
$templateProcessor->setValue('quantity#'.$countLines, $data2['quantity']);
$templateProcessor->setValue('name#'.$countLines, $data2['name']);
$templateProcessor->setValue('price#'.$countLines, "€ " .$data2['price'] ."");
}
\PhpOffice\PhpWord\Settings::setPdfRenderer('./dompdf');
\PhpOffice\PhpWord\Settings::setPdfRendererPath('./dompdf');
\PhpOffice\PhpWord\Settings::setPdfRendererName('DOMPDF');
$temp_file = tempnam(sys_get_temp_dir(), 'Word');
$templateProcessor->saveAs($temp_file);
$phpWord = \PhpOffice\PhpWord\IOFactory::load($temp_file);
$xmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord , 'PDF');
$xmlWriter->save('result.pdf');
header("Content-type:application/pdf");
header("Content-Disposition:attachment;filename='result.pdf'");
readfile("result.pdf");

After a look at the source code, it seems that PHPWord first converts the document into an HTML representation and then hands that HTML to dompdf, a separate converter, to save it as PDF.
The open issue #1139 confirms this, and it also deals with missing styles:
The PDF writers being used are taking in the HTML output, which also lacks the styling. The classes are being defined in the <style> tag, but they are just not being used.
Also the last message adds:
This still seems to be an issue. html and pdf outputs do not replicate the some styles in docx (header / footers).
Concerning your border problem, another SO question shows a similar issue in an HTML-to-PDF conversion. The solution there was to edit the CSS, which you cannot do in your sample code unless you convert to HTML yourself first.
In conclusion, you may not be able to solve your problem in the short term. If you are not part of the dev team, you could submit bug reports to the PHPWord project (and not to dompdf, since it is only an HTML-to-PDF converter and this is outside its scope). GitHub lets you attach DOCX files to an issue report.
Alternatives
You could check out SO question 204860 about server-side PDF editing libraries. Below are two alternatives: one is free software, the other is closed source and paid.
LibreOffice
Another way is to use LibreOffice in headless mode (command line execution without interface):
libreoffice --headless --convert-to pdf <filename_to_convert>
A PHP wrapper for LibreOffice, Office Converter, is also available if you don't want to bother calling libreoffice through exec().
Check whether LibreOffice conversion suits your needs (it may not cover every case, but it may be good enough for your scope).
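If you do go through exec(), a minimal sketch could look like the following. The helper names buildLibreOfficeCommand and convertDocxToPdf are made up for illustration, and the binary name is an assumption (some installations expose it as soffice). The --outdir option tells LibreOffice where to write the converted file.

```php
<?php
// Hypothetical helper: builds the headless LibreOffice command line.
// The binary name ('libreoffice') is an assumption; some installs use 'soffice'.
function buildLibreOfficeCommand(string $inputFile, string $outputDir, string $binary = 'libreoffice'): string
{
    return sprintf(
        '%s --headless --convert-to pdf --outdir %s %s',
        escapeshellcmd($binary),
        escapeshellarg($outputDir),
        escapeshellarg($inputFile)
    );
}

// Hypothetical wrapper: runs the conversion and returns the expected PDF path.
function convertDocxToPdf(string $inputFile, string $outputDir): string
{
    exec(buildLibreOfficeCommand($inputFile, $outputDir) . ' 2>&1', $lines, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("LibreOffice failed:\n" . implode("\n", $lines));
    }
    // LibreOffice keeps the base name and swaps the extension.
    return rtrim($outputDir, '/') . '/' . pathinfo($inputFile, PATHINFO_FILENAME) . '.pdf';
}
```

Escaping both paths with escapeshellarg() keeps file names with spaces or quotes from breaking the command line.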
Aspose
The best converter I ever used at work is Aspose, an API covering Documents with Aspose.Words package, Worksheets with Aspose.Cells, Presentations with Aspose.Slides and so on. But it's closed-source and pretty expensive (and you'll pay for updates if you want them after your license expiration).
It can be used from PHP through Java (Aspose.Words and Aspose.Cells) or through .NET (Aspose.Words; the same seems to go for Aspose.Cells).

Related

PDF manipulation - images are distorted after few consecutive operations on PDF file

I've run into this weird issue with PDF file handling. Not sure if SO is the right place to ask this, but I couldn't find any specific sites for this. I hope that someone can shed some light on the issue.
This happens with the following specific process. If any of the steps is omitted, the issue is not observed.
I have a PHP application that serves PDF files to users. These files are created by authors in MS Word 2007, then printed to protected PDF (using pdf995, most likely; I can confirm if needed).
I'll call this initial PDF file the 'source' from here on.
Upon request, the source file is processed in PHP the following way:
We decrypt it using qpdf:
qpdf --decrypt "source.pdf" "tmp_output.pdf"
Then we add a security label / watermark to it, encrypt it, and output it to the browser using mPDF 6.0:
$mpdf = new mPDF();
$mpdf->SetImportUse();
$pagecount = $mpdf->SetSourceFile($fpath);
if ($pagecount) {
    for ($i = 1; $i <= $pagecount; $i++) {
        $tplId = $mpdf->ImportPage($i);
        $mpdf->UseTemplate($tplId);
        $html = '[security label / watermark contents...]';
        $mpdf->WriteHTML($html);
    }
}
$mpdf->SetProtection(array('copy','print'), '', 'password',128);
$mpdf->Output('final_output.pdf','I');
With the exact steps described above, images in the output that were pasted in the Word doc appear as follows:
In the source PDF, tmp_output (qpdf decrypted file) the pasted images look correct:
The distortion doesn't take place if any of the following occurs:
Word doc printed to PDF without protection
mPDF output is not protected.
As you can see, there are too many factors, so I don't know where to look for the bug.
Each component works correctly on its own and I cannot find any info on the issue. Any insights are greatly appreciated.
EDIT 1
After some more testing, it appears that this only happens to screenshots taken from a web browser, Windows Explorer, or MS Word. I cannot reproduce it with screenshots from Gimp.
It appears that something along the way attempts to convert white to alpha and fails.

The current version (6.1) of mPDF has a bug: escaped PDF strings (imported via FPDI) are not handled correctly when they have to be encrypted.
A pull request that fixes this issue is available here.

How to convert a PDF into an image exactly matching the PDF with PHP/ImageMagick/Ghostscript

I'm generating PDF documents with PHP (TCPDF is the library behind it). To display them, I convert them to images using Ghostscript and show the previews, but the preview doesn't actually look like the PDF document.
The code I'm using for the conversion is here:
$pdf = 'my_report.pdf';
$output = 'my_preview.jpg';
$quality=90;
$res='300x300';
$exportPath=$output;
set_time_limit(900);
exec("'gs' '-dNOPAUSE' '-sDEVICE=jpeg' '-dUseCIEColor' '-dTextAlphaBits=4' '-dGraphicsAlphaBits=4' '-o$exportPath' '-r$res' '-dJPEGQ=$quality' '$pdf'",$output);
The preview generated by this code for the document is right below,
whereas my actual PDF file looks like this.
You can see a lot of differences between the two. I need a way to produce an exact copy of the PDF.
I'm sure there is nothing wrong with the PDF report: I uploaded it to Google Mail, which rendered a perfect image, and I also converted the PDF to JPEG here:
http://pdf2jpg.net/
That too gave a perfect copy of the document; only ImageMagick/Ghostscript is unable to generate an exact one.
Any help would be appreciated.

What are you using to view the 'correct' display of the PDF? Does Ghostscript issue any warnings when rendering?
It looks to me like there may be fonts missing from your original PDF file, which would lead to font substitution.
Why are you using -dUseCIEColor? This will almost certainly lead to colour shifts, which I also see in your images. If you have a good reason for using it, what is it? If you don't have a good reason, don't use it.
Is the second image a JPEG? The first clearly is, and JPEG is a lossy compression; have you tried using TIFF instead?
It is always useful with these sorts of questions to post a link to the original PDF file so that some investigation can be done. Without that, this is all guesswork, I'm afraid.
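Putting those suggestions together, a sketch of the call without -dUseCIEColor and with a lossless output device might look like this. The helper name buildGhostscriptCommand and the file names are placeholders, not part of the original code. png16m and tiff24nc are standard Ghostscript devices, but whether the result matches the PDF still depends on the file itself (missing fonts would still be substituted).

```php
<?php
// Hypothetical helper: builds a Ghostscript command that renders one image per page.
// -dUseCIEColor is deliberately left out, and a lossless device replaces jpeg.
function buildGhostscriptCommand(string $pdf, string $outputPattern, int $dpi = 300, string $device = 'png16m'): string
{
    $args = [
        'gs',
        '-dSAFER',
        '-sDEVICE=' . $device,   // e.g. png16m or tiff24nc
        '-dTextAlphaBits=4',
        '-dGraphicsAlphaBits=4',
        '-r' . $dpi,
        '-o', $outputPattern,    // -o also implies -dBATCH -dNOPAUSE
        $pdf,
    ];
    return implode(' ', array_map('escapeshellarg', $args));
}

$command = buildGhostscriptCommand('my_report.pdf', 'my_preview.png');
// exec($command . ' 2>&1', $messages, $exitCode);
// $messages would then hold any warnings Ghostscript printed, e.g. about substituted fonts.
```

Capturing stderr with 2>&1 is what makes the warnings asked about above visible from PHP.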

How to add text on pdf using PDFJam

I am using PDFjam to manipulate PDFs. I need to add a line of text at the bottom of the generated file. I tried, but was not able to get it to work.
Can anybody guide me on how to do it?
In my PHP code I did:
$command = '-----------------';
exec($command);

As you know, PDFjam is for manipulating PDFs. It is a small collection of shell scripts which provide a simple interface to much of the functionality of the excellent pdfpages package. See the Ubuntu Manual:
pdfjam - A shell script for manipulating PDF files
You should create your sheet as you are doing now (5x6), create a separate sheet of minimal page size with the required information, and then merge both files into one.
Alternatively, create your sheet in a first step and use pdflib to add the text in a second step. It is a very good tool. I hope this solves your problem.

I love pdftk and so wanted to find a solution using that. The following worked for me.
pdfjam --preamble '\usepackage{fancyhdr} \topmargin 85pt \oddsidemargin 140pt \cfoot{\thepage}' --pagecommand '\thispagestyle{plain}' --landscape --nup 2x1 --frame false --clip true --trim ".5in 0.5in 0.5in .65in" --delta '-0.25in 0' tmp.pdf
I cribbed it from: Page Numbering with the "{page} of {pages}", removing the "of pages" part.
Command converts pdf to 2x1, trims margins, and crops. Output is landscape.
\topmargin and \oddsidemargin seem to tell pdflatex where to put the numbers.

eps image (from inkscape) not showing up in tcpdf

Using php and TCPDF to generate a pdf file. Everything works great except when I try to write an EPS image to the pdf using ImageEPS(). Nothing shows up. No errors (it can definitely find the file). It just shows up as white space.
Raster images (like PNG/JPG) work just fine.
I'm using Inkscape to save the .eps file. When I open the file in any other program, it opens just fine. It's only in TCPDF that it doesn't show up.

I had to open my *.ai file in Adobe Illustrator and save it as the "Illustrator 3" version to overcome that issue. Any more recent version produced the results you describe (except "Illustrator 8," which gave me a B&W version of my *.ai file).

A bit late, but I had the same problem.
For me, the workaround was to export as PDF and reuse this PDF in TCPDF/FPDI with:
$num_pages = $pdf->setSourceFile($path_to_file); // $path_to_file: path to the exported PDF
$template_id = $pdf->importPage(1); // if the graphic is on page 1
$pdf->useTemplate($template_id,$x,$y,$width,$height);

The ImageEPS function in TCPDF (6.0.004) is not fully implemented and the documentation states the following:
/**
* Embed vector-based Adobe Illustrator (AI) or AI-compatible EPS files.
* NOTE: EPS is not yet fully implemented, use the
* setRasterizeVectorImages() method to enable/disable rasterization of
* vector images using ImageMagick library.
* ...
*/
public function ImageEps(...){/*...*/}
TCPDF (6.0.004) checks the EPS metadata for its creator. If the creator is Adobe Illustrator, a version check is made, and if the version is above 8 an error is generated.
Creators other than Adobe Illustrator are not checked and the function is allowed to continue. It does not seem that TCPDF parses the PS prolog, and this is probably one reason why not all AI versions are supported. Here is what the PostScript Language Reference says about the prolog section:
The prolog is a set of application-specific procedure definitions that an application may use in the execution of its script. It is included as the first part of every PostScript file generated by the application. It contains definitions that match the output functions of the application with the capabilities supported by the PostScript language.
Since the prolog is not parsed, it is difficult for TCPDF to interpret the file correctly.
Inkscape (0.48.3.1 r9886) creates EPS files with cairo, so no error occurs and the function continues. TCPDF will partly interpret the EPS, but since it does not output anything, the output is probably discarded by some error handling. That is just a guess, though.
I had more success with exporting my eps to a svg with
inkscape -D --file=filename.eps --export-plain-svg=filename.svg
and using ImageSVG instead. Note: this function is not fully implemented either, so I can't guarantee that it will work. I have only tested a pretty basic eps.
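From PHP, the two steps could be chained roughly as follows. This is only a sketch: buildEpsToSvgCommand is a made-up helper, the Inkscape options are the ones quoted above (from the 0.48 series, so newer releases may need different ones), and the ImageSVG position and size values are placeholders.

```php
<?php
// Hypothetical helper: builds the Inkscape command quoted in the answer.
// The option names come from Inkscape 0.48; newer releases may differ.
function buildEpsToSvgCommand(string $epsFile, string $svgFile): string
{
    return sprintf(
        'inkscape -D --file=%s --export-plain-svg=%s',
        escapeshellarg($epsFile),
        escapeshellarg($svgFile)
    );
}

$command = buildEpsToSvgCommand('filename.eps', 'filename.svg');
// exec($command, $lines, $exitCode);
// if ($exitCode === 0) {
//     // $pdf is assumed to be an existing TCPDF instance; x, y and width are placeholders.
//     $pdf->ImageSVG('filename.svg', 15, 30, 100);
// }
```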

reading docx (Office Open XML) in PHP

I want to add a Word import function to our CMS. The only problem is that I cannot seem to find a good library for reading docx files (Word 2007).
Does anyone have recommendations? The library should be able to extract the content of the document and basic styling like italic, bold, and superscript.
Thanks for your help

docx files are actually just containers for the document's XML. You should be able to unzip the docx file and then go to the word folder inside, then to the document.xml. This has the actual text. But things like the fonts and styles are in other xml files in the docx container, so you'll probably want to mess around a bit and figure out what is what and how to match it up (start by using namespaces, I bet).
But yea, unzip the file, then use simplexml to convert it into something you can actually mess around with.
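As a rough sketch of that approach, assuming the zip and SimpleXML extensions are available: the two function names below are made up for illustration, and only plain paragraph text is extracted.

```php
<?php
// WordprocessingML namespace used inside word/document.xml.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: returns the text of each paragraph, one per line.
function extractTextFromDocumentXml(string $xml): string
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        throw new RuntimeException('Could not parse document.xml');
    }
    $doc->registerXPathNamespace('w', W_NS);

    $paragraphs = [];
    foreach ($doc->xpath('//w:p') as $p) {
        $p->registerXPathNamespace('w', W_NS);
        $text = '';
        // Each w:t element holds a piece of the text of one run.
        foreach ($p->xpath('.//w:t') as $t) {
            $text .= (string) $t;
        }
        $paragraphs[] = $text;
    }
    return implode("\n", $paragraphs);
}

// Hypothetical helper: opens the .docx container and reads word/document.xml.
function readDocxText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Could not open $path");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('word/document.xml not found');
    }
    return extractTextFromDocumentXml($xml);
}
```

Basic styling lives in each run's w:rPr element (for example w:b for bold, w:i for italic, and w:vertAlign for superscript), so the same XPath approach can be extended to pick those up.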

PHPDocX PRO includes a TransformDoc class that can read .docx (zip) files and generate XHTML (or PDF) from it:
...
require_once 'phpdocx_pro/classes/TransformDoc.inc';
$doc = new TransformDoc();
$doc->setStrFile($file->filepath);
$doc->generateXHTML();
$html = $doc->getStrXHTML();

There is a library that does this, but it works with Zend Framework. Maybe it will help you.
It is called phpLiveDocx: http://www.phplivedocx.org/downloads/
The library is licensed under the New BSD license.

I have just found a library that has both reading and writing support. Check it out on CodePlex, http://openxmlapi.codeplex.com; it is licensed under GPLv2.

Or, since you requested a library, you may want to look into something like Docvert. I was just looking around based on your question, and it's my favorite so far for PHP. You input the word file location, it transforms it into something simple with the attributes and all that good stuff.

Convert the docx document to odt using OpenOffice, then use eZ Components to do the parsing and import. They actually use this import in their CMS, eZ Publish.

Here is a simple working solution I found
http://webcheatsheet.com/php/reading_the_clean_text_from_docx_odt.php
