My problem is with a PDF created with TCPDF on ZF2 and opened in Acrobat Reader.
The file is created fine (except for its size, around 500 KB) and the content is fine, but when I try to close the file, Acrobat prompts me to save the changes, though no changes were made. After saving and overwriting, the file size drops to around 40 KB. So the file size is reduced more than tenfold, but there is no visible change in the contents or otherwise.
The closest I got to a related answer was this thread: http://forums.planetpdf.com/save-file-prompt-when-closing_topic36.html
As I understand it, the issue is related to "The xref table is malformed", but my experience with PDF is not enough to understand the root of my problem. A sample file is available here: https://dl.dropboxusercontent.com/u/29072870/test_pdf.pdf
Thanks in advance!
Only the first 7036 bytes of your file make up the actual PDF. Everything after that is HTML code. You should therefore check your PDF creation code; it seems to also contain some HTML generation code (a leftover from copy & paste? Added by the framework?).
Adobe Reader shows those leading 7 KB and eventually offers to save them as a repaired file, encoded the way Reader prefers it (expanding those 7 KB to your 40 KB).
PS: I just saw that after the HTML code there are additionally about 80 KB of null bytes.
It looks like a whole byte buffer of 0x80000 (= 524288 decimal) bytes was written out, containing your PDF, some HTML, and some as-yet-unused space.
The problem is actually not quite solved yet :)
The issue has become much stranger now. In Chrome everything works perfectly; the created PDF is solid, with no additional data. In Firefox, the PDF displays fine, saving the file works fine, and opening the file in Acrobat is fine, but closing it produces the same prompt to save although no changes were made. Apparently the block of null bytes is still present at the end of the file. When using the "download as file" option in TCPDF's output, the result is correct, with no additional data after the EOF. This only happens when the PDF is rendered in the browser (Firefox) and saved from there. Could it be a Firefox issue? Can one check a file for this kind of excess data and remove it somehow?
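You can check for and strip this kind of trailing data with a small script. A minimal sketch in Python, assuming the junk itself contains no %%EOF marker (true for HTML and runs of null bytes):

```python
def strip_trailing_junk(data: bytes) -> bytes:
    """Cut off anything after the last %%EOF marker of a PDF."""
    pos = data.rfind(b"%%EOF")
    if pos == -1:
        raise ValueError("no %%EOF marker found - not a complete PDF")
    end = pos + len(b"%%EOF")
    # keep the optional end-of-line right after the marker
    while data[end:end + 1] in (b"\r", b"\n"):
        end += 1
    return data[:end]
```

Read the saved file in binary mode, run it through this function, and write the result back out; if the output is shorter, the browser (or server) appended extra bytes after the PDF proper.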
So, I have the following scenario.
I am working on a system for academic papers. I have several inputs for things like author name, coauthors, title, type of paper, introduction, objectives and so on, and I store all that information in a database. The user has a Preview button which, when clicked, generates a Word document asynchronously and sends the file location back to the user; that file is then shown to the user in an iframe using Google Docs Viewer.
There's a specific use case where the user/author of the paper can attach a .docx file with a table, or a .jpeg file for a figure. That table/figure has to be included inside the final .docx file.
For the .docx generation process I am using PHPWord.
So up until this point everything works fine, but my issues start when I try to mix everything together into the final .docx file.
Approach Number One
My first approach was to do everything with PHPWord. I create the file, add the texts where required and, in the case of the image, just insert the image with the figure caption below it.
Things get tricky, though, when I try doing the same with the .docx table file. My only option was to get the table XML using this. It did the trick, but when I opened the resulting Word file, the table was there but had lost all of its styling and had transparent borders. Because of those transparent borders, the borders were ignored when converting to PDF and the table info was just scrambled text.
Approach Number Two (current one)
After fighting with Approach Number One and just complicating things further, I decided to do something different. Since I had already generated one docx file with the main paper information and needed to add another docx file, I decided to use the DocX Merge Library.
So what I basically did was generate three Word files: one for the main paper information, one for the table, and one for the table caption (the last one mainly to avoid overcomplicating the order of information; the caption data is not in the table .docx file).
Then I run this:
$dm->merge([
    'paper-info.docx',
    'attached-table.docx',
    'attached-table-caption.docx'
], 'complete-file.docx');
So, afterwards, I check and the Word file is generated just as I need it with the table maintaining its original styles and dimensions.
If I open it in LibreOffice though, I get this error message:
If I then continue and open the file, it opens correctly with all the data, with the only exception that it no longer respects the fonts of the file as they appear in Word.
So the problem comes in the next step, since I need to present a preview of the file using Google Docs Viewer with this syntax:
<iframe src="https://docs.google.com/gview?embedded=true&hl=es_LA&url=https://usersite.net/complete-file.docx?pid=explorer&efh=false&a=v&chrome=false&embedded=true" width="100%" height="600" style="border: none;"></iframe>
The document loads fine, but when I review it, it only shows the content of the first paper-info.docx file and ends right where the table and table caption should appear. If I open the exact same file in Word, it shows the table and caption.
The other issue is when I try to convert the file to PDF.
If I use PHPWord's conversion method in combination with DomPDF, I get the exact same issue as with the Google Docs Viewer: I only get the content of the first file. This is the code:
$phpWordPDF = \PhpOffice\PhpWord\IOFactory::load('complete-file.docx');
$xmlWriterPDF = \PhpOffice\PhpWord\IOFactory::createWriter($phpWordPDF, 'PDF');
$xmlWriterPDF->save('complete-file.pdf');
So my only other viable route was to use LibreOffice's command line using this command:
soffice --headless --convert-to pdf complete-file.docx
This converts the file correctly, but it has the issue mentioned above when opening the .docx in LibreOffice: the font styles come out wrong.
Another weird part is that if I try to run this in my PHP script:
shell_exec('soffice --headless --convert-to pdf complete-file.docx');
Nothing happens.
I am running Apache 2.4.25, PHP 7.4.11 on Windows 10 x64.
Conclusion
Until now my best result has been merging the files, but that also causes the issue above, so maybe the problem comes from the merging process I am using. Ideally I would just insert the table, styles and all, using PHPWord, but I haven't been able to and haven't found any examples of how to do that.
Another option I've seen is this library, but the merge feature is only in the license that costs $599 USD, and since I am pretty close to solving this, I am not sure whether it would solve my issue. If it does, I'd invest in it, since I need to get this done ASAP, but I wanted to ask what you would recommend for this case: maybe another merging library, or doing everything via PHPWord.
Help is appreciated!
After a lot of attempts to fix it, I wasn't able to achieve what I wanted with PHPWord and the merging library I mentioned.
Since I needed to fix this I decided to invest in the paid library I mentioned in my question. It was an expensive purchase, but for those who are interested, it does exactly what was required and it does it perfectly.
The two main functions I required were document merging and importing of content to a .docx file.
So I had to purchase the Premium package. Once there, the library literally does everything for you.
Example for docx files merge code:
require_once 'classes/MultiMerge.php';
$merge = new MultiMerge();
$merge->mergeDocx('document.docx', array('second.docx', 'other.docx'), 'output.docx', array());
Example for how to import a table from another docx file
require_once 'classes/CreateDocx.php';
$docx = new CreateDocxFromTemplate('document.docx');
// import tables
$referenceNode = array(
    'type' => 'table',
);
$docx->importContents('document_1.docx', $referenceNode);
$docx->createDocx('output');
As you can see, it is pretty easy. This answer is by no means an ad for the library, but for those who have the same problem as me, it is a lifesaver.
So I am developing a new course format in which a picture is associated with each activity in a course and presented visually. I created the course format, overrode the renderer, etc., and that all worked fine. However, the images are supposed to be custom-generated, and since this has to work for all existing and future activities, I put some additional code into the general course module form, enabling an image upload.
After admittedly some struggle on my part to get the File API working, it now all works: in my course format there is an additional heading under which you can upload a single image. The image gets saved to the database fine, it is not in draft, and it is viewable in my dataroot's filedir if I follow the contenthash in the database. It even gets loaded into the form as a default. However, when I try to work with the image, all tests pass (.is_valid_img() etc.) and I even get offered to download a file, but when I do, it is corrupted and my file viewer says: "Critical Error: Not a png file". Needless to say, it is not displayed on my actual course site.
When I look at the file in filedir, it very clearly is a PNG. I would be thankful for any help, since I have tried a lot and am at my wits' end.
It sounds to me like you are getting some sort of output on the page before the PNG file is sent; that output is added to the start of the file and causes it not to work as a PNG file.
I would suggest you open the file in a hex editor and check the start of the file. It should look like the header described at https://en.wikipedia.org/wiki/Portable_Network_Graphics#File_header, so look for extra characters before that.
As for where the extra characters come from: they may be an obvious warning or error message (which should be easy to track down and fix). Alternatively, you may have some stray 'echo' statements (again, fairly easy to track down). The worst problems to find are extra characters before the opening '<?php' tag of a file somewhere in your install, or after the closing tag at the end of a file (which is why you should never use closing PHP tags). Finding these comes down to searching through all your customised code files.
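If you would rather not use a hex editor, the same check is easy to script. A sketch in Python that reports any stray bytes sent before the fixed 8-byte PNG signature:

```python
PNG_SIG = b"\x89PNG\r\n\x1a\n"  # every valid PNG file starts with these 8 bytes

def leading_junk(data: bytes) -> bytes:
    """Return whatever precedes the PNG signature (empty if the file is clean)."""
    pos = data.find(PNG_SIG)
    if pos == -1:
        raise ValueError("no PNG signature anywhere in the file")
    return data[:pos]
```

If it returns anything non-empty, print its repr(): a PHP notice or a stray newline echoed before the image was streamed will show up there verbatim and usually points straight at the offending script.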
I'm using Phpdocx 2.5 to convert HTML to docx, using the embedHTML method with the 'downloadImages' parameter set to true.
When the HTML doesn't contain any images, the document is generated just fine. When images are added, the resulting document seems to be corrupt: the error details show a standard "File is corrupt and cannot be opened" message. However, Word also displays a "Do you want to recover the content?" prompt which, if accepted, displays the content (with images) looking just fine.
Basically, the content seems to be rendered correctly, but something somewhere causes Word to see the document as corrupted. Has anyone had this problem, or can you give me any ideas on how to solve it?
Thanks in advance.
I need to insert a hyperlink into a few thousand existing PDFs. I'm working with Zend_Pdf, which apparently is not able to set an invisible border. The only way I found to make the link borders invisible (found somewhere else on this site, here, to be precise) is to search for each link element of the PDF and add a /Border entry, like this:
echo str_replace('/Annot /Subtype /Link', '/Annot /Subtype /Link /Border[0 0 0]', $pdf->render());
Since I need to work on files that reside on my filesystem, I'm using the sed command for the search & replace operation.
Now, at first sight this works: the documents are displayed correctly by Acrobat 8, OS X 10.6's viewer and Ubuntu's document viewer. However, tools such as pdftk (1.41) and pdfinfo (0.12.1) report that the structure is corrupted. This is annoying, since it means no further manipulation of the PDF with pdftk is possible; the tool refuses to work on a file that has errors in it.
I looked into the files using a binary editor and found that if I add more than two bytes after "/Link", the file gets corrupted. This confuses me a lot, since according to the PDF specification (I'm using 1.4) there is no checksum except for streams, which should mean one can add as many bytes as one wants, as long as it is not inside a stream and the inserted bytes are valid PDF syntax.
What am I missing here?
Here is an example:
the original pdf
the processed pdf
Adding the additional /Border entry actually corrupts the PDF's xref table. The xref table references every object by its position, measured in bytes from the beginning of the file. Inserting the additional element of course shifts the position (offset) of all subsequent content by a few bytes.
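The size of that drift is easy to check in a couple of lines (a Python sketch; the byte strings are the exact search and replacement patterns from the question):

```python
original = b"/Annot /Subtype /Link"
patched = b"/Annot /Subtype /Link /Border[0 0 0]"

# every rewritten link annotation grows the file by this many bytes
shift_per_link = len(patched) - len(original)

# after the k-th rewritten link, every xref offset pointing past it is
# stale by k * shift_per_link bytes, which is why strict tools reject the file
```

Lenient viewers rebuild the xref table on the fly when the stated offsets do not match, which is why Acrobat still displays the file while pdftk and pdfinfo complain.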
To fix the xref table after the edit, I can use pdftk from PDF Labs (http://www.pdftk.com):
$ pdftk corrupted_file.pdf output fixed_file.pdf
As a matter of fact, I was not able to find one comprehensive PDF solution for PHP, and I had to use several different tools (Zend_Pdf, pdftk, sed) to implement my workflow.
Okay, so I have about 250,000 high-resolution images. What I want to do is go through all of them and find the ones that are corrupted. If you know what 4scrape is, then you know the nature of the images I have.
Corrupted, to me, means the image is loaded into Firefox and it says:
The image “such and such image” cannot be displayed, because it contains errors.
Now, I could select all 250,000 of my images (~150 GB) and drag and drop them into Firefox. That would be bad, though, because I don't think Mozilla designed Firefox to open 250,000 tabs. No, I need a way to programmatically check whether an image is corrupted.
Does anyone know a PHP or Python library which can do something along these lines? Or an existing piece of software for Windows?
I have already removed obviously corrupted images (such as ones that are 0 bytes) but I'm about 99.9% sure that there are more diseased images floating around in my throng of a collection.
An easy way would be to try loading and verifying the files with PIL (Python Imaging Library).
from PIL import Image

try:
    v_image = Image.open(file)
    v_image.verify()
except Exception:
    # Image.open() raises if the format cannot be identified at all;
    # verify() raises if the file is identified but broken
    print('corrupted:', file)
From the documentation:
im.verify()
Attempts to determine if the file is broken, without actually decoding the image data. If this method finds any problems, it raises suitable exceptions. This method only works on a newly opened image; if the image has already been loaded, the result is undefined. Also, if you need to load the image after using this method, you must reopen the image file.
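Applying this to the whole collection, a sketch that walks a directory tree and lists every file Pillow refuses to verify (assumes Pillow is installed; non-image files in the tree will be reported as well, so run it on the image folders only):

```python
import os

from PIL import Image


def find_corrupted(root):
    """Return the paths under root whose files fail Pillow's verify()."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with Image.open(path) as im:
                    im.verify()  # structural check, no full pixel decode
            except Exception:
                bad.append(path)
    return bad
```

Note that verify() does not decode the pixel data, so a file cut short after intact headers can still slip through; for a stricter pass, reopen each survivor and call im.load(), which forces a full decode.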
I suggest you check out ImageMagick for this: http://www.imagemagick.org/
It has a tool called identify, which you can use either from a script reading its stdout or through the programming interfaces provided.
In PHP, with exif_imagetype():
if (exif_imagetype($filename) === false)
{
unlink($filename); // image is corrupted
}
EDIT: Or you can try to fully load the image with ImageCreateFromString():
if (ImageCreateFromString(file_get_contents($filename)) === false)
{
unlink($filename); // image is corrupted
}
An image resource will be returned on success. FALSE is returned if the image type is unsupported, the data is not in a recognized format, or the image is corrupt and cannot be loaded.
If your exact requirement is that the image shows correctly in Firefox, you may have a difficult time: the only way to be sure would be to link against the exact same image-loading source code as Firefox.
Basic image corruption (file is incomplete) can be detected simply by trying to open the file using any number of image libraries.
However, many images fail to display simply because they stretch a part of the file format that the particular viewer you are using can't handle (GIF in particular has a lot of these edge cases, but you can also find JPEG and the rare PNG files that can only be displayed in specific viewers). There are also some ugly JPEG edge cases where the file appears uncorrupted in viewer X but has actually been cut short, and only displays correctly because very little information has been lost (Firefox can show some cut-off JPEGs correctly, with a grey bottom, but others seem to load halfway and then display the error message instead of the partial image).
You could use imagemagick if it is available:
If you want to check a whole folder:
identify "./myfolder/*" >log.txt 2>&1
If you want to check just one file:
identify myfile.jpg