How to validate a PDF? - php

I have a PDF file, generated in PHP (using tFDF), that opens fine in Chrome, Firefox and Safari, but not in Edge or Adobe Acrobat Reader. Edge displays a white page, and Acrobat reports "An error exists on this page. Acrobat may not be able to display the page correctly".
Is it possible to find out what that error might be? The PDF uses some embedded fonts that might be the problem, but I'm not sure.

You can use Ghostscript like this (the paths are pure conjecture):
# On Linux
gs -o /dev/null -sDEVICE=nullpage -dBATCH -dNOPAUSE /home/ebaars/sample.pdf
# On Windows, using gswin32
gswin32 -o nul -sDEVICE=nullpage -dBATCH -dNOPAUSE C:\Users\Eric\Desktop\Sample.pdf
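If the check has to happen from PHP itself, here is a minimal sketch (assuming the gs binary is on the server's PATH; the path to the PDF is just an example) that runs Ghostscript and captures whatever it complains about:
<?php
// Hypothetical helper: run Ghostscript over a PDF and collect its messages.
// Assumes `gs` is installed and on the PATH.
function validate_pdf($path) {
    $cmd = 'gs -o /dev/null -sDEVICE=nullpage -dBATCH -dNOPAUSE '
         . escapeshellarg($path) . ' 2>&1';
    exec($cmd, $output, $status);
    return array('ok' => $status === 0, 'log' => implode("\n", $output));
}

$result = validate_pdf('/home/ebaars/sample.pdf');
if (!$result['ok']) {
    echo $result['log']; // Ghostscript's description of what is broken
}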
Or you can use iText (for instance via pdftk, which is built on it) and ask it to, say, (un)compress the file and rewrite it to another file; in the process, the library will perform its consistency checks.
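If you go the pdftk route, a sketch of that round-trip from PHP could look like this (the file paths are placeholders; pdftk exits non-zero and prints a reason when it cannot parse the input):
<?php
// Sketch: ask pdftk (built on iText) to uncompress and rewrite the file;
// structural errors in the PDF surface as a non-zero exit code.
exec('pdftk ' . escapeshellarg('/home/ebaars/sample.pdf')
   . ' output ' . escapeshellarg('/tmp/uncompressed.pdf')
   . ' uncompress 2>&1', $out, $rc);
if ($rc !== 0) {
    echo implode("\n", $out);
}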
You can also check out this other answer.
Update
This error, "0,686 is not an operator", means the interpreter found a number where it expected an operator. I assume by "tFDF" you mean "TCPDF"? I suspect - I might be wrong - that we are looking at an i18n error: a number such as 2/3, which should be written "0.66666", is emitted by the server code with a decimal comma, turning it into what the PDF interpreter takes for a list ("0,666").
To be sure, I would need either the PDF itself - I would uncompress it with iText, rewrite 0,686 etc. as 0.686 etc., and see whether it works then - or the exact PHP code that generated the file, plus the server configuration (to verify whether the locale settings are appropriate).
My guess is that it is a library bug. Check the software versions; it may be possible to update the code and get rid of the problem that way.
I have run into this bug several times, since I am from Italy, where "one thousand and one cent" is written "1.000,01" or "1'000,01" instead of "1000.01".
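For what it's worth, the comma-for-point behaviour is easy to reproduce in PHP, where implicit float-to-string conversion honoured LC_NUMERIC up to PHP 7.x (PHP 8.0 made it locale-independent). A minimal sketch, assuming an Italian locale is installed on the server:
<?php
// On PHP < 8.0 with a comma-decimal locale, echoing a float yields "0,686".
setlocale(LC_NUMERIC, 'it_IT.utf8', 'it_IT');
echo 0.686;                  // may print "0,686" - invalid inside a PDF content stream
echo sprintf('%.3F', 0.686); // the %F specifier is not locale-aware: always "0.686"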

Related

ImageMagick / GraphicsMagick / libvips Images randomly corrupted

We are using ImageMagick to resize/thumbnail JPEGs to a specific size. The source file is loaded via HTTP. It works as expected, but from time to time some images come out partially broken.
We have already tried different software such as GraphicsMagick and VIPS, but the problem persists. It also only seems to happen when there are parallel processes, so the whole script is locked via semaphores, but that does not help either.
We found multiple reports of similar problems, all without any solution: https://legacy.imagemagick.org/discourse-server/viewtopic.php?t=22506
We also wonder why the behaviour is the same across all of these tools. We tried different PHP versions as well. It seems to happen more often on source images with large dimensions/file sizes.
Any idea what to do here?
I would guess the source image has been truncated for some reason - perhaps something timed out during the download?
libvips is normally permissive, meaning it will try to give you something even if the input is damaged. You can make it strict with the fail flag (i.e. fail on the first warning).
For example:
$ head -c 10000 shark.jpg > truncated.jpg
$ vipsthumbnail truncated.jpg
(vipsthumbnail:9391): VIPS-WARNING **: 11:24:50.439: read gave 2 warnings
(vipsthumbnail:9391): VIPS-WARNING **: 11:24:50.439: VipsJpeg: Premature end of JPEG file
$ echo $?
0
I made a truncated JPEG file, then ran vipsthumbnail on it. It gave a warning, but did not fail. If I run:
$ vipsthumbnail truncated.jpg[fail]
VipsJpeg: Premature end of input file
$ echo $?
1
Or in php:
$thumb = Vips\Image::thumbnail('truncated.jpg[fail]', 128);
Now there's no output, and there's an error code. I'm sure there's an ImageMagick equivalent, though I don't know it.
There's a downside: thumbnailing will now fail if there is anything at all wrong with the image, even something you may not care about, such as an invalid resolution field.
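In php-vips the [fail] hint surfaces as an exception, so a damaged file can be handled explicitly. A minimal sketch, assuming the jcupitt/vips package and example file names:
use Jcupitt\Vips;

try {
    // [fail] turns loader warnings into hard errors
    $thumb = Vips\Image::thumbnail('truncated.jpg[fail]', 128);
    $thumb->writeToFile('thumb.jpg');
} catch (Vips\Exception $e) {
    // truncated or corrupt input lands here instead of yielding a broken thumbnail
    error_log('bad image: ' . $e->getMessage());
}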
After some additional investigation we discovered that the source image was indeed already damaged. It was downloaded over a VPN connection that was not stable enough; sometimes the download stopped, so the JPEG was only half written.
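If unstable downloads are the root cause, it may be worth verifying each transfer before handing the file to the resizer. A hedged sketch using PHP's cURL bindings ($url is assumed to hold the source address, and the size check only works if the server sends a Content-Length header):
<?php
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
$data = curl_exec($ch);
$expected = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
curl_close($ch);
if ($data === false || ($expected > 0 && strlen($data) != $expected)) {
    // truncated or failed transfer: retry rather than resize a half-written JPEG
}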

Bash sanitize_file_name function

I'm attempting to find a way to sanitize/filter file names in a Bash script in exactly the same way as WordPress's sanitize_file_name function does. It has to take a filename string and emit a clean version identical to what that function produces.
You can see the function here.
GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)
Ubuntu 14.04.5 LTS (GNU/Linux 3.13.0-57-generic x86_64)
This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi
Example input file names
These can be, and often are, practically anything you can use as a filename on any operating system, especially Mac and Windows.
This File + Name.mov
Some, Other - File & Name.mov
ANOTHER FILE 2 NAME vs2_.m4v
some & file-name Alpha.m4v
Some Strange & File ++ Name__.mp4
This is a - Weird -# Filename!.mp4
Example output file names
These are how the WordPress sanitize_file_name function makes the examples above.
This-File-Name.mov
Some-Other-File-Name.mov
ANOTHER-FILE-2-NAME-vs2_.m4v
some-file-name-Alpha.m4v
Some-Strange-File-Name__.mp4
This-is-a-Weird-#-Filename.mp4
It doesn't just have to solve these cases; it has to perform the same transformations the sanitize_file_name function does, or it will produce duplicate files that won't be updated on the site.
One thought I've had is that maybe I could somehow use that function itself, but this video encoding server doesn't have PHP on it, since it's quite a tiny server that normally just encodes videos and uploads them. It doesn't have much memory, CPU power or disk space - it's a DigitalOcean 512 MB RAM server. Maybe I could set up a remote PHP script on the web server to handle it over HTTP, but I'm not entirely sure how to do that from Bash either.
It's too complicated for my limited Bash skills, so I'm wondering if anyone can help, or knows of an existing script that does this. I couldn't find one. All I could find are scripts that change spaces or special characters into underscores or dashes, but that isn't all sanitize_file_name does.
In case you are curious, the filenames have to be compatible with this WordPress function because of the way this website is set up to handle videos. It allows people to upload videos through WordPress, which are then sent to a separate video server for encoding and then to Amazon S3 and CloudFront for serving on the site. However, it also allows adding videos through Dropbox using the External Media plugin (which currently duplicates the video upload with the Dropbox sync, but that's another minor issue). The video server also syncs to a Dropbox account, whitelisting the folders in it, and has a Bash script watching a VideoServer Dropbox folder with inotifywait; the script copies videos from there to a temporary folder where the video encoder processes them. This way, when someone updates a video in their Dropbox, it is automatically re-encoded and the video shown on the site is updated. They could just upload the files through WordPress, but they don't seem to want to, or don't know how to, for some reason.
If you have Perl installed, try with:
#!/bin/bash
function sanitize_file_name {
  # strip the characters WP discards, then collapse runs of whitespace and hyphens
  echo -n "$1" | perl -pe 's/[\?\[\]\/\\=<>:;,''"&\$#*()|~`!{}%+]//g;' -pe 's/[\r\n\t -]+/-/g;'
}
filename="Wh00t? it's a -- re#lly-weird {file&name} (with + Plus and__1% #of# [\$qRots\$!]).mov"
cleaned=$(sanitize_file_name "$filename")
echo original : "$filename"
echo sanitised: "$cleaned"
Result is:
original : Wh00t? it's a -- re#lly-weird {file&name} (with + Plus and__1% #of# [$qRots$!]).mov
sanitised: Wh00t-it's-a-re#lly-weird-filename-with-Plus-and__1-of-qRots.mov
Looking at the WP function, this emulates it quite well.
Inspired by the answer above.
EscapeFilename()
{
  # same idea, but joins all arguments so the filename can be passed unquoted
  printf '%s' "$*" | perl -pe 's/[:;,\?\[\]\/\\=<>''"&\$#*()|~`!{}%+]//g; s/[\s-]+/-/g;';
}
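Alternatively, following the asker's own idea of delegating to the web server: a tiny PHP endpoint on the WordPress host can expose the real function, so the encoding box never needs PHP installed. A hypothetical sketch (the wp-load.php path and the endpoint name are assumptions, and the script should be protected from public access):
<?php
// sanitize.php - hypothetical endpoint living on the WordPress server
require_once '/var/www/html/wp-load.php'; // adjust to the real WordPress root
$name = isset($_GET['name']) ? $_GET['name'] : '';
header('Content-Type: text/plain; charset=utf-8');
echo sanitize_file_name($name);
The Bash script could then call it with something like curl -sG --data-urlencode "name=$filename" http://example.com/sanitize.php, guaranteeing output identical to WordPress's own.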

PHP: Save Dynamic URL Image to Disk

I'm having trouble capturing the following dynamic image to disk; all I get is a ~1 KB file:
http://water.weather.gov/precip/save.php?timetype=RECENT&loctype=NWS&units=engl&timeframe=current&product=observed&loc=regionER
I have set up PHP's cURL functions and they work just fine on static imagery, but not for the above link. The same goes for copy() and file_put_contents()/file_get_contents(): they all work fine for static images. There are plenty of references on SO for the usage of these PHP functions, so I will not get into details here. Just the copy command:
copy('http://water.weather.gov/precip/save.php?timetype=RECENT&loctype=NWS&units=engl&timeframe=current&product=observed&loc=regionER', 'precip5.png');
The behavior is the same - I get a precip5.png of 760 bytes - on both my Windows development box and my Linux staging box, so OS issues can be ruled out. Again, all the PHP functions do exactly the same thing: they generate a file, but a broken one. The command-line curl program also produces that same junk ~1 KB file.
So the issue seems to lie with the source, and the best I can tell is that it is a dynamically generated (streaming?) image.
Ideally, I would like this to be done in PHP or with some command-line utility like curl. I am trying to avoid adding a Java (ImageIO) dependency just for this... until I absolutely have to go there...
I am trying to understand the nature of the beast (the image) first ;-)...
The URL you are saving produces HTML output, not the image. You are missing the parameter &print=1:
http://water.weather.gov/precip/save.php?timetype=RECENT&loctype=NWS&units=engl&timeframe=current&product=observed&loc=regionER&print=1
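Since the endpoint answers with an HTML page when parameters are missing, a defensive version of the download can sniff what actually came back before writing it to disk. A minimal sketch using PHP's Fileinfo extension:
<?php
$url = 'http://water.weather.gov/precip/save.php?timetype=RECENT&loctype=NWS&units=engl'
     . '&timeframe=current&product=observed&loc=regionER&print=1';
$data = file_get_contents($url);
$finfo = new finfo(FILEINFO_MIME_TYPE);
if ($data !== false && $finfo->buffer($data) === 'image/png') {
    file_put_contents('precip5.png', $data);
} else {
    // got HTML (or nothing) back - wrong parameters or a server-side change
}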

Error when unzipping a group of images

I am importing public domain books from archive.org to my site, and I have a PHP import script set up to do it. However, when I attempt to import the images and run
exec( "unzip $images_file_arg -d $book_dir_arg", $output, $status );
it will occasionally return a $status of 1. Is this OK? I have not had any problems with the imported images so far. I looked up the man page for unzip, but it didn't tell me much. Could this cause problems? Do I have to check each picture individually, or am I safe?
EDIT: Oops. I should have checked the man page straight away. It spells out what the exit codes mean:
The exit status (or error level) approximates the exit codes defined by PKWARE and takes on the following values, except under VMS:
0 - normal; no errors or warnings detected.
1 - one or more warning errors were encountered, but processing completed successfully anyway. This includes zipfiles where one or more files was skipped due to unsupported compression method or encryption with an unknown password.
2 - a generic error in the zipfile format was detected. Processing may have completed successfully anyway; some broken zipfiles created by other archivers have simple work-arounds.
3 - a severe error in the zipfile format was detected. Processing probably failed immediately.
(many more)
So, apparently some archives may have had files in them skipped, but unzip didn't break down; it just did all it could.
It really should work, but there can be complications with certain filenames. Are any of them potentially tricky, with unusual characters? escapeshellarg is certainly something to look into. If you get a bad return status you should be concerned, because it means unzip exited with some error or other. At the very least, I would suggest logging the filenames in those cases (error_log($filename)) and seeing whether anything about them might cause problems. unzip itself runs totally independently of PHP and will do everything fine as long as it is passed the right arguments by the shell and the files really are downloaded and ready to unzip.
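Putting those two points together - quoting the arguments and acting on the documented exit codes - might look like this sketch (variable names follow the question; escapeshellarg here replaces the question's pre-escaped *_arg variables):
<?php
$cmd = 'unzip ' . escapeshellarg($images_file) . ' -d ' . escapeshellarg($book_dir);
exec($cmd, $output, $status);
if ($status > 1) {
    // a real error in the archive or the invocation
    error_log("unzip failed ($status) for $images_file");
} elseif ($status === 1) {
    // completed, but some entries were skipped - worth logging and inspecting
    error_log("unzip warnings for $images_file:\n" . implode("\n", $output));
}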
Maybe you are better served by PHP's integrated ZipArchive class:
http://www.php.net/manual/de/class.ziparchive.php
In particular http://www.php.net/manual/de/function.ziparchive-extractto.php - it returns TRUE if extraction was successful, otherwise FALSE.
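A minimal sketch of that approach, reusing the question's variables:
<?php
$zip = new ZipArchive();
// open() returns TRUE on success, or an error code such as ZipArchive::ER_NOZIP
if ($zip->open($images_file) === true) {
    $ok = $zip->extractTo($book_dir); // TRUE only if every entry was extracted
    $zip->close();
} else {
    $ok = false;
}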

PHP: Problem using passthru to stream a zip on mac os x only

I'm trying to put together a zip streaming solution using Unix's zip command and PHP's passthru function, but I've hit a snag.
The script looks something like this:
<?php
header("Content-Type: application/octet-stream");
header("Content-Disposition: attachment; filename=myfile.zip");
// write the archive to stdout (-) with no compression (-0) and stream it to the browser
passthru("zip -r -0 - /stuff/to/zip/");
exit();
?>
The zip command works OK and the output is received by the browser and saved as a zip file.
The zip can then be extracted fine on Windows and Unix, but on Mac OS X the built-in extractor (BOMArchiveHelper) can't extract the file. Other applications on OS X handle it fine, though.
The error given by BOMArchiveHelper is the same one it gives when a zip is password protected (which the application does not handle). I ran the file through a zip analyzer program, and it indicated that some of the files in the archive were flagged as password protected.
As I said, though, apparently no other extraction application pays attention to that.
When examining the zip more closely, I found that the one generated by the PHP script is a few bytes larger than one generated directly by the zip command on the server.
It seems the streaming process with passthru adds something to the file that probably causes the problems with BOMArchiveHelper.
To test this, I used passthru to stream a zip I had already created on the server: passthru("cat stuff.zip")
That worked fine with BOMArchiveHelper.
So the problem seems to lie somewhere in the process where the passthru function takes the binary data generated on the fly by the zip command and passes it to the browser.
I've tried to eliminate every place the extra bytes could be coming from (setting the zip command to quiet, and so on), but the added data remains.
A binary diff of the streamed zip against a pre-generated one shows that the extra data is scattered all over the file, not just at the beginning or the end.
Anyone have a clue, or seen this problem before and decided it's impossible to solve?
NB: Since someone else had already encountered and very well described this issue before me, without getting any answer, I simply copied/pasted his message here, after confirming that all of his tests do indeed fail for me too and that none of my own attempts passed either ...
Apparently the only way to get this to work would be to ask people to use either unzip or StuffIt Expander ...
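One thing that might be worth ruling out first, purely as a guess: a PHP output handler (zlib.output_compression, ob_gzhandler, or an mbstring output filter) rewriting the binary stream, which could account for extra bytes scattered through the file. A speculative "clean pipe" version of the script:
<?php
// Speculative: disable anything that might transform the output stream.
ini_set('zlib.output_compression', 'Off');
while (ob_get_level() > 0) {
    ob_end_clean(); // discard any active output buffers and their handlers
}
header("Content-Type: application/octet-stream");
header("Content-Disposition: attachment; filename=myfile.zip");
passthru("zip -r -0 - /stuff/to/zip/");
exit();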
If you are using nginx, take a look at http://wiki.nginx.org/NginxNgxZip
