Phpdocx word documents are corrupt when adding images

I'm using Phpdocx 2.5 to convert HTML to docx. I'm using the embedHTML method with the 'downloadImages' parameter set to true.
When the html doesn't contain any images, document is generated just fine. When images are added, the resulting document seems to be corrupt. The details of the errors show a standard "File is corrupt and cannot be opened" message. However, Word also displays a "Do you want to recover the content" message which if clicked, displays the content (with images) which look just fine.
Basically the content seems to be rendered correctly but there is an issue somewhere that causes Word to see the document as corrupted. Anyone had this problem or can you give me any ideas how to solve this?
Thanks in advance.

Related

Issue converting to .pdf a merged .docx file that opens fine in Word

So, I have the following scenario.
I am working on a system for academic papers. It has several inputs for things like author name, coauthors, title, type of paper, introduction, objectives and so on, and I store all that information in a database. The user has a Preview button which, when clicked, generates a Word file asynchronously and sends the file location back; that file is then shown to the user in an iframe using Google Doc Viewer.
There's a specific use case where the user/author of the paper can attach a .docx file with a table, or a .jpeg file for a figure. That table/figure has to be included inside the final .docx file.
For the .docx generation process I am using PHPWord.
So up until this point everything works fine, but my issues start when I try to mix everything and put together the .docx file.
Approach Number One
My first approach on doing this was to do everything with PHPWord. I create the file, add the texts where required and in the case of the image just insert the image and after that the figure caption below the image.
Things get tricky though, when I try doing the same thing with the .docx table file. My only option was to get the table XML using this. It did the trick, but the problem I ran into was that when I opened the resulting Word file, the table was there, but had lost all of its styling and had transparent borders. Because of those transparent borders, afterwards when converting it to PDF the borders were ignored and the table info is just scrambled text.
Approach Number Two (current one)
After fighting with Approach Number One and just complicating stuff more, I decided to do something different. Since I already generated one docx file with the main paper information and I needed to add another docx file, I decided to use the DocX Merge Library.
So, what I basically did was generate three Word files: one for the main paper information, one for the table and one for the table caption (that last one is mainly to avoid overcomplicating the order of information). Also, that data is not in the table .docx file.
Then I run this:
$dm->merge( [
'paper-info.docx',
'attached-table.docx',
'attached-table-caption.docx'
], 'complete-file.docx');
So, afterwards, I check and the Word file is generated just as I need it with the table maintaining its original styles and dimensions.
If I open it in LibreOffice though, I get this error message:
Then if I continue and open the file, the file opens correctly with all the data with the only exception that it no longer respects the fonts of the file as they appear in Word.
So, the problem comes in the next step. Since I need to present a preview of the file using Google Doc Viewer using this syntax:
<iframe src="https://docs.google.com/gview?embedded=true&hl=es_LA&url=https://usersite.net/complete-file.docx?pid=explorer&efh=false&a=v&chrome=false&embedded=true" width="100%" height="600" style="border: none;"></iframe>
The document gets loaded fine, but when I review it what I see is that it only shows the content of the first paper-info.docx file and ends right where the table and table caption should appear. I open the exact same file in Word and it shows the table and caption.
The other issue is when I try to convert the file to PDF.
If I use PHPWord's method of conversion in combination with DomPDF I get the exact same issue as with the Google Docs Viewer, I just have the content of the first file, using this code:
$phpWordPDF = \PhpOffice\PhpWord\IOFactory::load('complete-file.docx');
$xmlWriterPDF = \PhpOffice\PhpWord\IOFactory::createWriter($phpWordPDF, 'PDF');
$xmlWriterPDF->save('complete-file.pdf');
So my only other viable route was to use LibreOffice's command line using this command:
soffice --headless --convert-to pdf complete-file.docx
This converts the file correctly, but has the issue mentioned above when opening the .docx file in LibreOffice: the font styles are broken.
The other weird part is that if I try to run this in my PHP script:
shell_exec('soffice --headless --convert-to pdf complete-file.docx');
Nothing happens.
I am running Apache 2.4.25, PHP 7.4.11 on Windows 10 x64.
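A minimal sketch for narrowing down the silent shell_exec call, assuming the cause is not yet known: shell_exec returns only standard output, so any error text written to stderr is lost unless it is redirected, and the exit code is not available at all. The helper names buildConvertCommand and runCommand are hypothetical, and the 'soffice' default assumes the binary is on the PATH of the user Apache runs as, which may not hold.

```php
<?php
// Hypothetical helper: builds the conversion command from the question,
// with stderr redirected to stdout so error messages are not discarded.
function buildConvertCommand(string $inputFile, string $outputDir, string $soffice = 'soffice'): string
{
    return sprintf(
        '%s --headless --convert-to pdf --outdir %s %s 2>&1',
        escapeshellarg($soffice),    // may need to be the full path to soffice
        escapeshellarg($outputDir),
        escapeshellarg($inputFile)
    );
}

// Hypothetical helper: runs a command with exec() instead of shell_exec()
// so that both the output lines and the exit code can be inspected.
function runCommand(string $command): array
{
    $output = [];
    $exitCode = 0;
    exec($command, $output, $exitCode);

    return ['exitCode' => $exitCode, 'output' => implode("\n", $output)];
}
```

Logging the result of runCommand(buildConvertCommand('complete-file.docx', __DIR__)) should at least show whether the command was found and what it reported.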
Conclusion
Until now my best result was by merging the files, but it also caused this issue. So maybe the issue is coming from the merging process I am using. What would be ideal is to be able to just insert the table with styles and everything using PHPWord, but I haven't been able to and haven't found any examples on how to do that.
Another option that I've seen is this library, but the merge features is only in the license that's $599 USD, and since I am pretty close to solving this, I am not sure if it would solve my issue. If it does, I'd invest in it since I need to get this done ASAP, but I wanted to check with you guys what your recommendations would be for this case. Maybe another merging library or doing everything via PHPWord.
Help is appreciated!

After a lot of attempts to fix it, I wasn't able to achieve what I wanted with PHPWord and the merging library I mentioned.
Since I needed to fix this I decided to invest in the paid library I mentioned in my question. It was an expensive purchase, but for those who are interested, it does exactly what was required and it does it perfectly.
The two main functions I required were document merging and importing of content to a .docx file.
So I had to purchase the Premium package. Once there, the library literally does everything for you.
Example for docx files merge code:
require_once 'classes/MultiMerge.php';
$merge = new MultiMerge();
$merge->mergeDocx('document.docx', array('second.docx', 'other.docx'), 'output.docx', array());
Example of how to import a table from another docx file:
require_once 'classes/CreateDocx.php';
$docx = new CreateDocxFromTemplate('document.docx');
// import tables
$referenceNode = array(
'type' => 'table',
);
$docx->importContents('document_1.docx', $referenceNode);
$docx->createDocx('output');
As you can see it is pretty easy. This answer is by no means an ad for this library, but for those that have the same problem as me, this is a life saver.

Moodle: Files uploaded via File API get corrupted when viewed

So I am developing a new course format, in which a picture is associated with each activity in a course and presented visually. I created the course format, overrode the renderer etc. That all worked fine. However, the images are supposed to be custom generated, and since it has to work for all existing and future activities, I put some additional code into the general course module form, enabling an image upload.
After admittedly some struggle on my part to get the File API working, it now all works fine. Only in my course format, there is an additional heading, under which you can upload a single image. This gets saved to the database fine, it is not in draft, and it is perfectly viewable in my dataroot's filedir if I follow the contenthash in the database. It even gets loaded into the form as a default. However, if I try to work with the image, all tests run fine (.is_valid_img() etc.) and I even get offered a file to download. When I download it, though, it is corrupted and my file viewer says: "Critical Error: Not a png file". Needless to say, it is not displayed on my actual course site.
When I look at the file in filedir, it very clearly is a png. Please, I would be thankful for any help, since I have tried a lot and am at my wits' end.

It sounds to me like you are getting some sort of output on the page before the PNG file is sent - that would be added to the start of the file and cause it not to work as a PNG file.
I would suggest you open the file in a hex editor and check the start of the file - it should look like https://en.wikipedia.org/wiki/Portable_Network_Graphics#File_header, so look for extra characters before that.
As for where the extra characters come from - they may be an obvious warning / error message (which should be easy to track down and fix). Alternatively, you may have some stray 'echo' statements (again, fairly easy to track down). The worst problems to find are extra characters before the opening '<?php' tag of a file somewhere in your install, or after the closing '?>' tag at the end of a file (which is why you should never use closing PHP tags). Finding these will come down to searching through all your customised code files to locate them.
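As a rough alternative to the hex editor, the check can be scripted. This sketch looks for the 8-byte PNG signature and reports how many bytes come before it; the function name pngSignatureOffset and the file name in the usage note are made up for illustration.

```php
<?php
// The fixed 8-byte signature every PNG file starts with.
const PNG_SIGNATURE = "\x89PNG\r\n\x1a\n";

// Returns the number of bytes found before the PNG signature:
// 0 means the file starts cleanly, a positive number means stray
// output was sent first, null means no signature was found at all.
function pngSignatureOffset(string $bytes): ?int
{
    $pos = strpos($bytes, PNG_SIGNATURE);

    return $pos === false ? null : $pos;
}
```

For example, calling pngSignatureOffset(file_get_contents('downloaded.png')) on the corrupted download and then printing the first few bytes with substr() should reveal whether a warning, whitespace or a BOM precedes the image data.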

Read complete content of .rtf document

I have to create a document from user-entered data plus data from a .rtf document, and place it into a web page layout I created (HTML+CSS, with PHP for scripting).
My problem is that I can't find any way to obtain the full content of the .rtf document.
Since it is a technical document, symbols, tables, graphs and images are very often included. With the methods I've found I could obtain the text with symbols in decent formatting, but I had no luck with images.
So what I need is a way to obtain the full content of a .rtf file, if possible keeping the document formatting, so I can display and organize it in a webpage; preferably in pure PHP, but use of JS/executables via PHP is fine.
I've tried:
- RTF to HTML converters, but the best I could get is plain text and symbols with no images;
- using the COM extension to open the .rtf in MS Word and save it as .html (I noticed that if I open the .rtf and save it as a web page in Word it creates a perfect HTML page), but it only changed the extension and didn't create an HTML page;
- extracting text and images separately: this works, but again, since this is a technical document, image placement is very important.
It's my first question here, after a lot of research; please bear with me in case of errors.

Acrobat Reader prompt saving pdf file when closing it (created with TCPDF)

My problem is with a PDF opened in Acrobat Reader, created with TCPDF on ZF2.
The file is created fine (except for its size, around 500 KB) and the content is fine, but when I try to close the file, Acrobat prompts me to save the changes, though there are no changes. After saving and overwriting the file, its size drops to around 40 KB. So the file size is reduced more than 10 times, but there is no visible change in the contents or otherwise.
Closest I got to any related answer was this thread here http://forums.planetpdf.com/save-file-prompt-when-closing_topic36.html
As I understand the issue is related to "The xref table is malformed", but my experience with pdf is not enough to understand the root of my problem. Sample file is available here https://dl.dropboxusercontent.com/u/29072870/test_pdf.pdf
Thanks in advance!

Only the first 7036 bytes of your file make up your actual pdf. Everything thereafter is some HTML code. Thus, you should check your pdf creation code, it seems to contain some HTML creation code (leftover from copy&paste? Added by the framework?), too.
The Adobe Reader shows these leading 7KB and eventually offers to save them as a repaired file encoded like the Reader prefers it (exploding those 7KB to your 40KB).
PS: I just saw that after the HTML code there additionally are about 80KB of null bytes.
It looks like you received a whole byte buffer 0x80000 (= 524288 in decimal) bytes in size containing your PDF, some HTML, and some as yet unused space.
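To check a file for this kind of excess data, one option is to look for the last %%EOF marker and measure what follows it. The sketch below does that; the function names are hypothetical, and it assumes the appended data does not itself contain the text %%EOF. It only treats the symptom, so the stray HTML and buffer padding still need to be fixed at the source.

```php
<?php
// Returns the offset just past the last %%EOF marker, including one
// trailing line break if present, or null if there is no marker.
function endOfPdfOffset(string $bytes): ?int
{
    $pos = strrpos($bytes, '%%EOF');
    if ($pos === false) {
        return null;
    }

    $end = $pos + strlen('%%EOF');
    if (substr($bytes, $end, 2) === "\r\n") {
        $end += 2;
    } elseif (isset($bytes[$end]) && ($bytes[$end] === "\n" || $bytes[$end] === "\r")) {
        $end += 1;
    }

    return $end;
}

// Number of bytes after the end of the PDF (0 for a clean file).
function trailingByteCount(string $bytes): ?int
{
    $end = endOfPdfOffset($bytes);

    return $end === null ? null : strlen($bytes) - $end;
}

// The PDF with everything after the last %%EOF marker removed.
function stripAfterLastEof(string $bytes): ?string
{
    $end = endOfPdfOffset($bytes);

    return $end === null ? null : substr($bytes, 0, $end);
}
```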

The problem is actually not quite solved yet :)
The issue has become much stranger now. In Chrome everything works perfectly: the created PDF is solid, with no additional data. In Firefox, the PDF displays fine, saving the file works fine and opening it with Acrobat works fine, but closing it produces the same prompt to save without any changes having been made. Apparently the portion of null bytes is still present at the end of the file. When using the "download as file" option in TCPDF output, the result is correct, with no additional data after EOF. It only happens when the PDF is output in the browser (Firefox) and saved from there. Could it be a Firefox issue? Can one check the file for this kind of excess data and remove it somehow?

Is it possible to save a word file in MYSQL database and to view the content "AS IT LOOKS" in the Browser

For example, I am uploading a Word file with some formatted contents to the database. The content in the Word document is aligned.
I have got as far as the step above. My issue is how to view the contents exactly as they look (that is, with the exact formatting) in a browser.
Kindly help me out of this issue.
Thanks in Advance
Fero

You may stream the content in the body of the response as an attachment, setting the correct MIME type. If the user's client is configured to handle the content type, it will show it (after asking for permission).
PHP MIME Content Type
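A minimal sketch of that approach, assuming the file bytes have already been read from the database. The helper names buildFileHeaders and sendFile are hypothetical; header() and the header fields themselves are standard.

```php
<?php
// Hypothetical helper: builds the response headers for a stored file.
function buildFileHeaders(string $fileName, string $mimeType, int $length, bool $inline = false): array
{
    $disposition = $inline ? 'inline' : 'attachment';
    // Drop characters that would break the quoted filename or inject headers.
    $safeName = str_replace(['"', "\r", "\n", '\\'], '', basename($fileName));

    return [
        'Content-Type: ' . $mimeType,
        'Content-Disposition: ' . $disposition . '; filename="' . $safeName . '"',
        'Content-Length: ' . $length,
    ];
}

// Hypothetical helper: sends the headers, then the raw bytes.
// Nothing else may be echoed before or after this call.
function sendFile(string $bytes, string $fileName, string $mimeType): void
{
    foreach (buildFileHeaders($fileName, $mimeType, strlen($bytes)) as $line) {
        header($line);
    }
    echo $bytes;
}
```

For a legacy .doc file the MIME type is application/msword; for .docx it is application/vnd.openxmlformats-officedocument.wordprocessingml.document. Whether the browser then displays the document or merely offers it for download depends on the client.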

Word is a format for word processing, whereas the browser is a client for displaying web pages. So no, you can't. There are some similarities between the two formats, so you can transform between them, but usually at a loss. Since Word is a proprietary format, transforming it to HTML can be tricky, but you can generally use OpenOffice for the job.

Another alternative is, instead of uploading the file, to upload the content of the file through a JavaScript WYSIWYG editor like TinyMCE. Since you will be storing the HTML markup that the editor produces from the formatted contents you copy-paste into it, it will be very straightforward to display the contents.

If the content doesn't need to be edited, why not convert it to a .PDF/.JPG on the fly, or do it once upon upload and cache the result?
