Get the number of pages in a PDF document - php

This question is for referencing and comparing. The solution is the accepted answer below.
Many hours have I searched for a fast and easy, but mostly accurate, way to get the number of pages in a PDF document. Since I work for a graphic printing and reproduction company that works a lot with PDFs, the number of pages in a document must be precisely known before they are processed. PDF documents come from many different clients, so they aren't generated with the same application and/or don't use the same compression method.
Here are some of the answers I found insufficient or simply NOT working:
Using Imagick (a PHP extension)
Imagick requires a lot of installation work, Apache needs to be restarted, and when I finally had it working, it took amazingly long to process (2-3 minutes per document) and always returned 1 page for every document (I haven't seen a working copy of Imagick so far), so I threw it away. That was with both the getNumberImages() and identifyImage() methods.
Using FPDI (a PHP library)
FPDI is easy to use and install (just extract files and call a PHP script), BUT many of the compression techniques are not supported by FPDI. It then returns an error:
FPDF error: This document (test_1.pdf) probably uses a compression technique which is not supported by the free parser shipped with FPDI.
Opening a stream and searching with a regular expression:
This opens the PDF file in a stream and searches for some kind of string, containing the pagecount or something similar.
$f = "test1.pdf";
$stream = fopen($f, "r");
if (!$stream)
    return 0;
$content = fread($stream, filesize($f));
fclose($stream);
if (!$content)
    return 0;
$count = 0;
// Regular Expressions found by Googling (all linked to SO answers):
$regex = "/\/Count\s+(\d+)/";
$regex2 = "/\/Page\W*(\d+)/";
$regex3 = "/\/N\s+(\d+)/";
if (preg_match_all($regex, $content, $matches))
    // $matches[1] holds the captured numbers; $matches itself is an array of arrays
    $count = (int) max($matches[1]);
return $count;
/\/Count\s+(\d+)/ (looks for /Count <number>) doesn't work because only a few documents have the parameter /Count inside, so most of the time it doesn't return anything.
/\/Page\W*(\d+)/ (looks for /Page<number>) doesn't get the number of pages; it mostly matches some other data.
/\/N\s+(\d+)/ (looks for /N <number>) doesn't work either, as the documents can contain multiple values of /N; most, if not all, do not contain the page count.
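To illustrate the first pattern, here is a minimal sketch run against a hand-written fragment rather than a real PDF. Every /Pages node in a page tree carries its own /Count, so the pattern can match several times, and the largest value is normally the root total. When the page tree is stored inside a compressed object stream (possible since PDF 1.5), the raw bytes contain no readable /Count and the function returns 0, which would explain the misses described above. The function name countFromPageTree and the sample fragment are assumptions made for this example.

```php
<?php
// Hypothetical helper: largest /Count value found in raw PDF bytes, or 0.
function countFromPageTree(string $content): int
{
    if (!preg_match_all('/\/Count\s+(\d+)/', $content, $matches)) {
        // Nothing readable, e.g. the page tree sits in a compressed object stream.
        return 0;
    }
    // $matches[1] holds the captured digit strings, one per /Count entry.
    return max(array_map('intval', $matches[1]));
}

// Hand-written fragment: a root node with 13 pages and a child node with 5.
$sample = "<< /Type /Pages /Count 13 /Kids [2 0 R 9 0 R] >>\n"
        . "<< /Type /Pages /Count 5 /Kids [3 0 R 4 0 R] >>";

echo countFromPageTree($sample), "\n"; // prints 13
```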
So, what does work reliably and accurately?
See the answer below

A simple command line executable called: pdfinfo.
It is downloadable for Linux and Windows. You download a compressed file containing several little PDF-related programs. Extract it somewhere.
One of those files is pdfinfo (or pdfinfo.exe for Windows). An example of data returned by running it on a PDF document:
Title: test1.pdf
Author: John Smith
Creator: PScript5.dll Version 5.2.2
Producer: Acrobat Distiller 9.2.0 (Windows)
CreationDate: 01/09/13 19:46:57
ModDate: 01/09/13 19:46:57
Tagged: yes
Form: none
Pages: 13 <-- This is what we need
Encrypted: no
Page size: 2384 x 3370 pts (A0)
File size: 17569259 bytes
Optimized: yes
PDF version: 1.6
I haven't seen a PDF document where it returned a wrong page count (yet). It is also really fast: even with big documents of 200+ MB, the response time is just a few seconds or less.
There is an easy way of extracting the pagecount from the output, here in PHP:
// Make a function for convenience
function getPDFPages($document)
{
$cmd = "/path/to/pdfinfo"; // Linux
// $cmd = "C:\\path\\to\\pdfinfo.exe"; // Windows: use this line instead
// Parse entire output
// Surround with double quotes if file name has spaces
exec("$cmd \"$document\"", $output);
// Iterate through lines
$pagecount = 0;
foreach($output as $op)
{
// Extract the number
if(preg_match("/Pages:\s*(\d+)/i", $op, $matches) === 1)
{
$pagecount = intval($matches[1]);
break;
}
}
return $pagecount;
}
// Use the function
echo getPDFPages("test 1.pdf"); // Output: 13
Of course this command line tool can be used in other languages that can parse output from an external program, but I use it in PHP.
I know it's not pure PHP, but external programs are way better at PDF handling (as seen in the question).
I hope this can help people, because I have spent a whole lot of time trying to find the solution to this and I have seen a lot of questions about PDF pagecount in which I didn't find the answer I was looking for. That's why I made this question and answered it myself.
Security Notice: Use escapeshellarg on $document if document name is being fed from user input or file uploads.
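A minimal sketch of that advice, assuming pdfinfo is on the PATH and that the path to the binary is trusted configuration rather than user input. The names parsePdfinfoPageCount and getPdfPagesSafely are made up for this example; the parsing is split out so it can be checked without running the external program.

```php
<?php
// Hypothetical helper: pull the page count out of pdfinfo's output lines.
function parsePdfinfoPageCount(array $lines): int
{
    foreach ($lines as $line) {
        if (preg_match('/^Pages:\s*(\d+)/i', $line, $m) === 1) {
            return (int) $m[1];
        }
    }
    return 0; // no "Pages:" line found
}

// Hypothetical wrapper: $pdfinfo is assumed to be a trusted path or command name.
function getPdfPagesSafely(string $document, string $pdfinfo = 'pdfinfo'): int
{
    // escapeshellarg() quotes the file name, so spaces and shell
    // metacharacters in an uploaded file name cannot alter the command.
    $cmd = $pdfinfo . ' ' . escapeshellarg($document);
    exec($cmd, $output, $status);
    return $status === 0 ? parsePdfinfoPageCount($output) : 0;
}

// Parsing check against the sample output shown earlier in this answer.
echo parsePdfinfoPageCount(['Title: test1.pdf', 'Pages: 13', 'Encrypted: no']), "\n";
```

Returning 0 on a non-zero exit status mirrors the original function's default; callers that need to tell "zero pages" from "pdfinfo failed" may prefer to throw instead.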

Simplest of all is using ImageMagick.
Here is some sample code:
$image = new Imagick();
$image->pingImage('myPdfFile.pdf');
echo $image->getNumberImages();
Otherwise, you can also use PDF libraries like MPDF or TCPDF for PHP.

You can use qpdf like below. If a file file_name.pdf has 100 pages,
$ qpdf --show-npages file_name.pdf
100
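Calling this from PHP could look roughly like the sketch below. It assumes qpdf is on the PATH and treats only exit status 0 as success; the function names are hypothetical. Standard error is deliberately not captured, so any warning text does not get mixed into the number.

```php
<?php
// Hypothetical helper: accept the output only if its first line is a plain number.
function parseQpdfPageCount(array $outputLines): ?int
{
    $first = isset($outputLines[0]) ? trim($outputLines[0]) : '';
    return ctype_digit($first) ? (int) $first : null;
}

// Hypothetical wrapper around the command shown above.
function getPdfPagesWithQpdf(string $document): ?int
{
    exec('qpdf --show-npages ' . escapeshellarg($document), $output, $status);
    return $status === 0 ? parseQpdfPageCount($output) : null;
}

var_dump(parseQpdfPageCount(['100']));
```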

Here is a simple example to get the number of pages in PDF with PHP.
<?php
function count_pdf_pages($pdfname) {
$pdftext = file_get_contents($pdfname);
$num = preg_match_all("/\/Page\W/", $pdftext, $dummy);
return $num;
}
$pdfname = 'example.pdf'; // Put your PDF path
$pages = count_pdf_pages($pdfname);
echo $pages;
?>

If you can't install any additional packages, you can use this simple one-liner:
foundPages=$(strings < $PDF_FILE | sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)

This seems to work pretty well, without the need for special packages or parsing command output.
<?php
$target_pdf = "multi-page-test.pdf";
$cmd = sprintf("identify %s", escapeshellarg($target_pdf)); // quote the file name for the shell
exec($cmd, $output);
$pages = count($output);

Since you're ok with using command line utilities, you can use cpdf (Microsoft Windows/Linux/Mac OS X). To obtain the number of pages in one PDF:
cpdf.exe -pages "my file.pdf"

I created a wrapper class for pdfinfo, based on Richard's answer, in case it's useful to anyone:
/**
* Wrapper for pdfinfo program, part of xpdf bundle
* http://www.xpdfreader.com/about.html
*
* this will put all pdfinfo output into keyed array, then make them accessible via getValue
*/
class PDFInfoWrapper {
const PDFINFO_CMD = 'pdfinfo';
/**
* keyed array to hold all the info
*/
protected $info = array();
/**
* raw output in case we need it
*/
public $raw = "";
/**
* Constructor
* @param string $filePath - path to file
*/
public function __construct($filePath) {
exec(self::PDFINFO_CMD . ' "' . $filePath . '"', $output);
//loop each line and split into key and value
foreach($output as $line) {
$colon = strpos($line, ':');
if($colon) {
$key = trim(substr($line, 0, $colon));
$val = trim(substr($line, $colon + 1));
//use strtolower to make case insensitive
$this->info[strtolower($key)] = $val;
}
}
//store the raw output
$this->raw = implode("\n", $output);
}
/**
* get a value
* @param string $key - key name, case insensitive
* @return string value
*/
public function getValue($key) {
return @$this->info[strtolower($key)];
}
/**
* list all the keys
* @return array of key names
*/
public function getAllKeys() {
return array_keys($this->info);
}
}

This simple one-liner seems to do the job well:
strings $path_to_pdf | grep Kids | grep -o R | wc -l
There is a block in the PDF file that lists the pages in this funky string:
/Kids [3 0 R 4 0 R 5 0 R 6 0 R 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R 13 0 R 14 0 R 15 0 R 16 0 R 17 0 R 18 0 R 19 0 R 20 0 R 21 0 R 22 0 R 23 0 R 24 0 R 25 0 R 26 0 R 27 0 R 28 0 R 29 0 R 30 0 R 31 0 R 32 0 R 33 0 R 34 0 R 35 0 R 36 0 R 37 0 R 38 0 R 39 0 R 40 0 R 41 0 R]
The number of 'R' characters is the number of pages
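The same idea can be sketched in PHP. This version only looks at the first readable /Kids array and counts the "N G R" references inside it. Note the limits: in a nested page tree, a /Kids entry can point to an intermediate /Pages node rather than a page, and a compressed page tree has no readable /Kids at all, so this is only valid for simple, flat documents. The function name and sample string are assumptions for illustration.

```php
<?php
// Hypothetical helper: count indirect references ("N G R") in the first /Kids array.
function countKidsReferences(string $content): int
{
    if (preg_match('/\/Kids\s*\[([^\]]*)\]/', $content, $kids) !== 1) {
        return 0; // no readable /Kids array
    }
    return (int) preg_match_all('/\d+\s+\d+\s+R\b/', $kids[1]);
}

$sample = "<< /Type /Pages /Count 3 /Kids [3 0 R 4 0 R 5 0 R] >>";
echo countKidsReferences($sample), "\n"; // prints 3
```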

You can use mutool.
mutool show FILE.pdf trailer/Root/Pages/Count
mutool is part of the MuPDF software package.

Here is an R function that reports the number of pages in a PDF file by using the pdfinfo command.
pdf.file.page.number <- function(fname) {
a <- pipe(paste("pdfinfo", fname, "| grep Pages | cut -d: -f2"))
page.number <- as.numeric(readLines(a))
close(a)
page.number
}
if (F) {
pdf.file.page.number("a.pdf")
}

Here is a Windows command script using Ghostscript that reports the number of pages in a PDF file:
@echo off
echo.
rem
rem this file: getlastpagenumber.cmd
rem version 0.1 from commander 2015-11-03
rem need Ghostscript e.g. download and install from http://www.ghostscript.com/download/
rem Install path "C:\prg\ghostscript" for using the script without changes \\ and have less problems with UAC
rem
:vars
set __gs__="C:\prg\ghostscript\bin\gswin64c.exe"
set __lastpagenumber__=1
set __pdffile__="%~1"
set __pdffilename__="%~n1"
set __datetime__=%date%%time%
set __datetime__=%__datetime__:.=%
set __datetime__=%__datetime__::=%
set __datetime__=%__datetime__:,=%
set __datetime__=%__datetime__:/=%
set __datetime__=%__datetime__: =%
set __tmpfile__="%tmp%\%~n0_%__datetime__%.tmp"
:check
if %__pdffile__%=="" goto error1
if not exist %__pdffile__% goto error2
if not exist %__gs__% goto error3
:main
%__gs__% -dBATCH -dFirstPage=9999999 -dQUIET -dNODISPLAY -dNOPAUSE -sstdout=%__tmpfile__% %__pdffile__%
FOR /F " tokens=2,3* usebackq delims=:" %%A IN (`findstr /i "number" %__tmpfile__%`) DO set __lastpagenumber__=%%A
set __lastpagenumber__=%__lastpagenumber__: =%
if exist %__tmpfile__% del %__tmpfile__%
:output
echo The PDF-File: %__pdffilename__% contains %__lastpagenumber__% pages
goto end
:error1
echo no pdf file selected
echo usage: %~n0 PDFFILE
goto end
:error2
echo no pdf file found
echo usage: %~n0 PDFFILE
goto end
:error3
echo.can not find the ghostscript bin file
echo. %__gs__%
echo.please download it from:
echo. http://www.ghostscript.com/download/
echo.and install to "C:\prg\ghostscript"
goto end
:end
exit /b

The R package pdftools and its function pdf_info() provide information on the number of pages in a PDF.
library(pdftools)
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
info <- pdf_info(pdf_file)
nbpages <- info[2]
nbpages
$pages
[1] 65

If you have access to a shell, the simplest approach (though not usable on 100% of PDFs) is to use grep.
This should return just the number of pages:
grep -m 1 -aoP '(?<=\/N )\d+(?=\/)' file.pdf
Example: https://regex101.com/r/BrUTKn/1
Switches description:
-m 1 is necessary as some files can have more than one match of the regex pattern (volunteer needed to replace this with a match-only-first regex solution)
-a is necessary to treat the binary file as text
-o to show only the match
-P to use Perl regular expression
Regex explanation:
starting "delimiter": (?<=\/N ) lookbehind of /N (nb. space character not seen here)
actual result: \d+ any number of digits
ending "delimiter": (?=\/) lookahead of /
Nota bene: if in some case match is not found, it's safe to assume only 1 page exists.
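For PHP users, the same pattern can be applied with preg_match. In a linearized PDF, the linearization dictionary near the start of the file stores the page count under /N, which is presumably what this pattern picks up, and which would explain why it does not work for every file. The sketch below returns null when nothing matches, so the caller decides whether to assume one page; the function name and the sample dictionary are made up for illustration.

```php
<?php
// Hypothetical helper: first "/N <digits>/" match in the raw bytes, or null.
function linearizedPageCount(string $content): ?int
{
    if (preg_match('/(?<=\/N )\d+(?=\/)/', $content, $m) === 1) {
        return (int) $m[0];
    }
    return null; // not linearized, or written with different spacing
}

// Hand-written linearization dictionary, not taken from a real file.
$sample = "<</Linearized 1/L 7945/O 8/E 3524/N 13/T 7741/H [ 451 137]>>";
var_dump(linearizedPageCount($sample)); // int(13)
```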

I had problems with ImageMagick installations on a production server. After hours of attempts, I decided to get rid of IM and found another approach:
Install poppler-utils:
$ sudo apt install poppler-utils [On Debian/Ubuntu & Mint]
$ sudo dnf install poppler-utils [On RHEL/CentOS & Fedora]
$ sudo zypper install poppler-tools [On OpenSUSE]
$ sudo pacman -S poppler [On Arch Linux]
Then execute it via the shell from your programming language (e.g. PHP):
shell_exec("pdfinfo " . escapeshellarg($filePath) . " | grep Pages | cut -f 2 -d':' | xargs");

This works fine in Imagemagick.
convert image.pdf -format "%n\n" info: | head -n 1

You often see the regex /\/Page\W/ suggested, but it doesn't work for me on several PDF files.
So here is another regular expression that works for me.
$pdf = file_get_contents($path_pdf);
// Match against the file contents ($pdf), not the path
return preg_match_all("/[<|>][\r\n|\r|\n]*\/Type\s*\/Page\W/", $pdf, $dummy);

Related

Image sequence breaks in gif animation

I am trying to animate all images in a directory with ImageMagick's morph operator. I am calling the command line with PHP's exec function. The images have specific file names such as 000000.jpg, 000001.jpg, 000002.jpg, 000003.jpg, and so on.
The following code works fine for the image sequence 000000.jpg to 000009.jpg, but when I have images such as 0000010.jpg, 0000010.jpg comes right after 000001.jpg, whereas I want 000002.jpg to come after 000001.jpg. Can you please point out how to modify the code so that the sequence is retained in the GIF animation?
$command = "convert -set delay 5 -loop 0 $img_dir/00000*.jpg -morph 10 $img_dir/morph.gif";
exec($command);
Normally, the length of the name is kept constant, so if you add an extra digit in going from 9 to 10, you remove one of the leading zeroes to keep the length the same. I mean, most people do this:
000008
000009
000010
000011
whereas you appear to have done this where the length changes at 10:
000008
000009
0000010
0000011
I would suggest you change the program that generates the naming so it generates a constant length name then everything will work properly.
Failing that, if you wish to, or are obliged to, stick with your naming system, ImageMagick will accept a list of files in any order you like, so you can create a file called filelist.txt or somesuch, with contents like this
000000.jpg
000001.jpg
000002.jpg
...
000009.jpg
0000010.jpg
0000011.jpg
and use it like this:
convert -delay 100 -loop 0 @filelist.txt anim.gif
Alternatively, you could let bash print your filenames always with five leading zeros using a command such as this:
printf "00000%d.jpg " {0..4}
000000.jpg 000001.jpg 000002.jpg 000003.jpg 000004.jpg
and embed that in your ImageMagick command
convert -delay 100 -loop 0 $(printf "00000%d.jpg " {0..4}) anim.gif
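Since the question calls ImageMagick from PHP, the ordering can also be fixed on the PHP side. The sketch below sorts the frame names by their numeric value and builds the command string; it does not run it. The helper name sortFramesNumerically is hypothetical, and in real use the list would come from something like glob("$img_dir/*.jpg") instead of the literal array.

```php
<?php
// Hypothetical helper: order frame files by the number in their base name.
function sortFramesNumerically(array $files): array
{
    usort($files, function ($a, $b) {
        // "0000010" casts to 10 and "000002" to 2, so 2 sorts before 10.
        return (int) pathinfo($a, PATHINFO_FILENAME) <=> (int) pathinfo($b, PATHINFO_FILENAME);
    });
    return $files;
}

$frames = sortFramesNumerically(['0000010.jpg', '000002.jpg', '000001.jpg', '0000011.jpg']);
echo implode(' ', $frames), "\n";
// 000001.jpg 000002.jpg 0000010.jpg 0000011.jpg

// Build (but do not run) the command, quoting each file name for the shell.
$args = implode(' ', array_map('escapeshellarg', $frames));
$command = "convert -set delay 5 -loop 0 $args -morph 10 morph.gif";
```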

pdftk Error: Failed to open PDF file:

I am using the pdftk library to extract the form fields from a PDF. Everything runs fine except for one PDF file, which causes the error given below:
Error: Failed to open PDF file:
http://www.uscis.gov/sites/default/files/files/form/i-9.pdf
Done. Input errors, so no output created.
The command for this is:
root@ri8-MS-7788:/home/ri-8# pdftk http://192.168.1.43/form/i-9.pdf dump_data_fields
The same command works for all other forms.
Attempt1
I have tried to convert the PDF to an unprotected version, but it produces the same error. Here is the command:
pdftk http://192.168.1.43/forms/i-9.pdf input_pw foopass output /var/www/forms/un-i-9.pdf
Update
This is my full function to handle this:
public function Formanalysis($pdfname)
{
$pdffile=Yii::app()->getBaseUrl(true).'/uploads/forms/'.$pdfname;
exec("pdftk ".$pdffile." dump_data_fields 2>&1", $output,$retval);
//got an error for some pdf if these are secure
if(strpos($output[0],'Error') !== false)
{
$unsafepdf=Yii::getPathOfAlias('webroot').'/uploads/forms/un-'.$pdfname;
//echo "pdftk ".$pdffile." input_pw foopass output ".$unsafepdf;
exec("pdftk ".$pdffile." input_pw foopass output ".$unsafepdf);
exec("pdftk ".$unsafepdf." dump_data_fields 2>&1", $outputunsafe,$retval);
return $outputunsafe ;
//$response=array('0'=>'error','error'=>$output[0]);
//return $response;
}
//if (strpos($output[0],'Error') !== false){ echo "error to run" ; } // this is the option to handle error
return $output;
}
PdfTk is a tool that was created by compiling an obsolete version of iText to an executable using the GNU Compiler for Java (GCJ) (PdfTk is not endorsed by iText Group NV).
I have examined your PDF and it uses two technologies that weren't supported by iText at the time PdfTk was created: XFA and compressed cross-reference tables.
The latter is what causes your problem. PdfTk expects your file to end like this:
xref
0 7
0000000000 65535 f
0000000258 00000 n
0000000015 00000 n
0000000346 00000 n
0000000146 00000 n
0000000397 00000 n
0000000442 00000 n
trailer
<</ID [<c8bf0ac531b0fc7b5b9ec5daf0296834><ec4dde54d00305ebbec62f3f6bbca974>]/Root 5 0 R/Size 7/Info 6 0 R>>
%iText-5.4.3
startxref
595
%%EOF
In this snippet startxref marks the byte offset of xref which is where the cross-reference table starts. This table contains the byte-offsets of all the objects in the PDF.
When you look at the PDF you refer to, you see that it ends like this:
64 0 obj
<</DecodeParms<</Columns 5/Predictor 12>>/Encrypt 972 0 R/Filter/FlateDecode/ID[<85C47EA3EFE49E4CB0F087350055FDDC><C3F1748360D0464FBA02D711DE864630>]/Info 970 0 R/Length 283/Root 973 0 R/Size 971/Type/XRef/W[1 3 1]>>stream
hÞìÒ±JQЙ·»7J¢©ÕØ(Xþ„ù »h%¤É¤¶”€mZ+;ÁN,,ÁÆ6 XÁ&‚("î½YŒI‘Bî‡áμ]ö1Áð÷³cfþ‹ûÐÚLî`z„Ýôœùw÷N×X?ÙkNv`hÁÒj¦G[œiÀå»›œ?b½Än…ÉëàÍþ gY—i7WW‡òj®îÍ°u¸Ò‡Ñ:óÆÛ™ñÎë&'×݈§ü†ù!ÿñ€ù%,\ácçÙ9˜ì±Þ€S¼Ãd—‰Áy~×.ø¶Åìþßn_˜$9Ôüw£X9#åxzçgRüüóÙwÝ¡œÄNJ©½’Ú+©½’R{%µWR{%ÿ·á”;`_ z6Ø
endstream
endobj
startxref
116
%%EOF
In this case, startxref still refers to where the first cross-reference table starts (it's a linearized PDF), but the cross reference table is stored inside an object, and that object is compressed (see the gibberish between the stream and endstream keywords).
Compressed cross-reference tables and compressed objects were introduced in PDF 1.5 (2003), but they aren't supported by PdfTk. You'll have to find a tool that can deal with such streams (e.g. a recent version of iText, which is the real stuff when compared to PdfTk), or you have to save your PDF as a PDF 1.4 before you treat it with PdfTk (but you'll lose the XFA, because XFA was also introduced in PDF 1.5).
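To check in advance which of the two layouts a file uses, one could look at what the last startxref offset points to. The following is a rough PHP sketch under simplifying assumptions: the whole file fits in memory, and the bytes at the offset begin either with the xref keyword (classic table) or with an "N G obj" header (cross-reference stream). The function name usesXrefStream and both sample documents are invented for this illustration.

```php
<?php
// Hypothetical helper: true if the last startxref points at an object
// (cross-reference stream), false if it points at a classic "xref" table,
// null if neither can be recognized.
function usesXrefStream(string $pdfBytes): ?bool
{
    if (!preg_match_all('/startxref\s+(\d+)\s+%%EOF/', $pdfBytes, $m)) {
        return null;
    }
    $offset = (int) end($m[1]);
    $head = ltrim((string) substr($pdfBytes, $offset, 32));
    if (strncmp($head, 'xref', 4) === 0) {
        return false;
    }
    return preg_match('/^\d+\s+\d+\s+obj\b/', $head) === 1 ? true : null;
}

// Two tiny hand-built samples (not valid PDFs, just enough structure).
$header  = "%PDF-1.4\n";
$classic = $header . "xref\n0 1\n0000000000 65535 f \ntrailer\n<</Size 1>>\n"
         . "startxref\n" . strlen($header) . "\n%%EOF\n";
$modern  = $header . "64 0 obj\n<</Type/XRef/Length 0>>stream\nendstream\nendobj\n"
         . "startxref\n" . strlen($header) . "\n%%EOF\n";

var_dump(usesXrefStream($classic)); // bool(false)
var_dump(usesXrefStream($modern));  // bool(true)
```

A result of true suggests the file needs a tool that understands PDF 1.5 cross-reference streams.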
Update:
Since you are asking about form fields, I'm adding the following attachment:
This screenshot was taken using iText RUPS (which proves that iText can open the document). To the right, you see that the same form is defined twice:
If you would walk down the tree under Fields, you'd find all the fields that are stored in the PDF using AcroForm technology. To the left, you can see the description of such a field:
If you look under XFA, you notice that the same form is also defined using the XML Forms Architecture. If you click on datasets, you see the XML description of the dataset in the lower panel:
All of this information can be accessed programmatically using iText (Java) or iTextSharp (C#). PdfTk is merely a tool based on a very old version of this technology.
This may be a slightly tricky solution, but it should work for you. As @bruno said, this is an encrypted file, so you should decrypt it before passing it to pdftk. For this I found qpdf, a free open-source tool that can decrypt PDFs, remove owner and user passwords, and much more. Install it on your system and run this command:
qpdf --decrypt input.pdf output.pdf
Then use the output file in the pdftk command. It should work.

module of a too long var in php

How can I compute the modulus of a variable that contains 24 digits in PHP?
00120345030000067890142807 % 97
The result of this operation must be 1, but the problem is that the value the variable contains is too long.
You can use the BCMath PHP functions to handle large numbers. Depending on your environment, the BCMath extension is often already bundled with PHP; if not, you might be able to find it in your package manager of choice if you're using Linux. If you're on an older version of PHP, let me know and I can hopefully show you how to compile the extension manually from the PHP source tree.
The function to use is bcmod - http://www.php.net/manual/en/function.bcmod.php
You can use it like this:
<?php
$bigNumber = '00120345030000067890142807';
echo($bigNumber % 97 . "\n");
echo(bcmod($bigNumber, 97) . "\n");
If you run this, you'll see it outputs the expected result and that the standard mod doesn't:
$ php -q test.php
65
1
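If BCMath is not available, the remainder can also be computed one digit at a time, which keeps every intermediate value small. This is a sketch of that alternative, not something from the answer above: the function name bigMod is hypothetical, and it assumes a non-negative decimal string and a modulus small enough that remainder * 10 + 9 fits in a PHP integer.

```php
<?php
// Hypothetical helper: remainder of a long decimal string, digit by digit.
function bigMod(string $digits, int $modulus): int
{
    if (!ctype_digit($digits) || $modulus <= 0) {
        throw new InvalidArgumentException('Expected decimal digits and a positive modulus');
    }
    $remainder = 0;
    foreach (str_split($digits) as $digit) {
        // Same step as long division by hand: bring down the next digit.
        $remainder = ($remainder * 10 + (int) $digit) % $modulus;
    }
    return $remainder;
}

echo bigMod('00120345030000067890142807', 97), "\n"; // prints 1
```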

Return ranges of lines from text file PHP or LINUX

I have a small issue here: I need to be able to read a file of unknown size. It could be a few hundred lines or many more; the log files change all the time, depending on when I check. I would like a method, in PHP or in Linux, to read a range of lines from a file. I don't want to read the entire file into PHP memory and then remove lines, because the file may be larger than PHP's allowed memory.
I also want it to use default PHP modules or default Linux tools. I don't want to have to install anything, because it needs to be portable.
Edit:
For the Linux-based options, I would like to be able to supply more than one range, as I may need to get a few different ranges of lines. I know how to do it in PHP but not in Linux. Also, how can I avoid re-reading lines I have already read?
With awk:
awk 'NR>=10 && NR<=15' FILE
With awk (two ranges):
awk 'NR>=10 && NR<=15 || NR>=26 && NR<=28' FILE
With ed:
echo 2,5p | ed -s FILE
With ed and two ranges :
echo -e "2,5p\n7,8p" | ed -s FILE
Last but not least, a sed solution with two ranges (fastest solution, tested with time):
sed -n '2,5p;7,8p' FILE
What about something like
head -100 FILE | tail -15
which gives you lines 86-100?
$ cat input.txt
q
w
e
r
t
y
u
i
$ sed -n 2,5p input.txt
w
e
r
t
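// Assumes $filehandle, $buffersize, $first_desired_line and $last_line_desired are already set
$lines_read = 0;
$interesting_lines = array();
// Stop at the last wanted line or at end of file, whichever comes first
while ($lines_read < $last_line_desired
        && ($line = fgets($filehandle, $buffersize)) !== false) {
    $lines_read++;
    if ($lines_read >= $first_desired_line) {
        $interesting_lines[] = $line;
    }
}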
Opening the file handle, selecting the appropriately large enough buffer size to cover any expected line lengths, etc. is up to you.
If you're reading files that are regularly appended to, you should look into the ftell and fseek functions to note where you are in your data and skip past all the old stuff before reading more.
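The ftell/fseek idea might look like the sketch below: remember the byte offset where the last read stopped, and seek straight to it next time. The function name readNewLines is hypothetical, and the sketch assumes the log is only ever appended to (never truncated or rotated) and that each read happens after a complete line has been written.

```php
<?php
// Hypothetical helper: return the lines added after $offset plus the new offset.
function readNewLines(string $path, int $offset): array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return [[], $offset];
    }
    fseek($handle, $offset);          // skip everything already processed
    $lines = [];
    while (($line = fgets($handle)) !== false) {
        $lines[] = rtrim($line, "\r\n");
    }
    $newOffset = ftell($handle);      // remember where this read stopped
    fclose($handle);
    return [$lines, $newOffset];
}

// Usage with a temporary file standing in for a growing log.
$log = tempnam(sys_get_temp_dir(), 'log');
file_put_contents($log, "a\nb\n");
list($first, $pos) = readNewLines($log, 0);      // ['a', 'b'], offset 4
file_put_contents($log, "c\n", FILE_APPEND);
list($second, $pos2) = readNewLines($log, $pos); // ['c'], offset 6
unlink($log);
```

The offset has to be stored somewhere between runs (a small state file, for example), which is left out here.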

Error retrieving exec() output

I'm using exec to grab curl output (I need to use curl as linux command).
When I start my file using php_cli I see a curl output:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 75480 100 75480 0 0 55411 0 0:00:01 0:00:01 --:--:-- 60432
It means that all the file has been downloaded correctly (~ 75 KB).
I have this code:
$page = exec('curl http://www.example.com/test.html');
I get a really strange output, I only get: </html>
(that's the end of my test.html file)
I really do not understand the reason. curl seems to download the whole file, but in $page I only get 7 characters (the last 7 characters).
Why?
P.S. I know I can download the source code using other PHP functions, but I must use curl (as a Linux command).
Unless this is a really weird requirement, why not use PHP cURL library instead? You get much finer control on what happens, as well as call parameters (timeout, etc.).
If you really must use curl command line binary from PHP:
1) Use shell_exec() (this solves your problem)
2) Use 2>&1 at end of command (you might need stderr output as well as stdout)
3) Use the full path to curl utility: do not rely on PATH setting.
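A small sketch of the difference behind point 1, using a harmless echo command in place of curl so it can be run anywhere a shell is available. The variable names are only illustrative.

```php
<?php
// Stand-in for the curl call: a command that prints two lines.
$command = 'echo first && echo last';

$allLines = [];
$lastLine = exec($command, $allLines); // return value is only the last output line
$everything = shell_exec($command);    // return value is the complete stdout as one string

echo $lastLine, "\n";        // last
echo count($allLines), "\n"; // 2
echo $everything;            // both lines
```

This is why the question only saw the closing tag of the page: exec() returned the final line, while the rest of the output was available through its second parameter or through shell_exec().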
RTM for exec()
It returns
The last line from the result of the command.
You have to set the second parameter to exec() that will contain all the output from the command executed.
Example:
<?php
$allOutputLines = array();
$returnCode = 0;
$lastOutputLine = exec(
'curl http://www.example.com/test.html',
$allOutputLines,
$returnCode
);
echo 'The command was executed and with return code: ' . $returnCode . "\n";
echo 'The last line outputted by the command was: ' . $lastOutputLine . "\n";
echo 'The full command output was: ' . "\n";
echo implode("\n", $allOutputLines) . "\n";
?>
