I'm trying to access Windows SAPI5 or Text to speech (TTS) using PHP. The standard approach is to create a COM object for "SAPI.SpVoice", then get the installed voices.
Sample PHP code:
<?php
$obj = new COM('SAPI.SpVoice');
$voices = $obj->GetVoices;
$count = $voices->Count;
print $count; #prints "1"
Unfortunately the output returned from PHP's COM object is incorrect because I have 5 voices installed on my system, but PHP only returns 1.
So, just to check whether this is a PHP-specific issue, I wrote the same code in Perl 5.8 (Strawberry Perl).
Sample Perl code:
#!/usr/bin/perl
use Win32::OLE;
my $obj = Win32::OLE->new('SAPI.SpVoice');
my $voices = $obj->GetVoices;
my $count = $voices->Count;
print $count; #print "5" which is correct.
So the perl code correctly returns that I have 5 TTS voices on my system, but PHP returns only 1?
Is this a bug or am I doing something wrong? What could be the possible cause of this?
P.S. I've tried this on two different computers and results are the same.
I figured this out after some trial and error. If I use the 32-bit version of PHP, I get the correct result (5 voices). Because I had installed the 64-bit version by default, I only got 1 voice.
I think most TTS voices are 32-bit (like those installed on my system), so a 64-bit php.exe only returns the 64-bit voices. With a 32-bit php.exe it returns all voices.
Posting this as an answer in case someone faces a similar issue in the future.
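To confirm which build of PHP is actually running before debugging COM any further, a check along these lines may help. It is a minimal sketch that assumes PHP 7 or later, where a 64-bit build (including on Windows) has 8-byte integers; older 64-bit Windows builds of PHP 5 still reported 4. The name phpBitness is a hypothetical helper, not part of PHP.

```php
<?php
// Hypothetical helper: report whether this PHP build uses 32- or 64-bit integers.
// Assumes PHP 7+, where PHP_INT_SIZE is 8 on 64-bit builds (including Windows).
function phpBitness(): int
{
    return PHP_INT_SIZE * 8;
}

echo 'Running ', phpBitness(), "-bit PHP\n";
```

Running this with the same php.exe that creates the COM object shows which set of voices that process can be expected to see.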
This question is for referencing and comparing. The solution is the accepted answer below.
Many hours have I searched for a fast and easy, but mostly accurate, way to get the number of pages in a PDF document. Since I work for a graphic printing and reproduction company that works a lot with PDFs, the number of pages in a document must be precisely known before they are processed. PDF documents come from many different clients, so they aren't generated with the same application and/or don't use the same compression method.
Here are some of the answers I found insufficient or simply NOT working:
Using Imagick (a PHP extension)
Imagick requires a lot of installation work and an Apache restart, and when I finally had it working, it took amazingly long to process (2-3 minutes per document) and always returned 1 page for every document (I haven't seen a working copy of Imagick so far), so I threw it away. That was with both the getNumberImages() and identifyImage() methods.
Using FPDI (a PHP library)
FPDI is easy to use and install (just extract files and call a PHP script), BUT many of the compression techniques are not supported by FPDI. It then returns an error:
FPDF error: This document (test_1.pdf) probably uses a compression technique which is not supported by the free parser shipped with FPDI.
Opening a stream and search with a regular expression:
This opens the PDF file in a stream and searches for some kind of string, containing the pagecount or something similar.
$f = "test1.pdf";
$stream = fopen($f, "r");
$content = fread ($stream, filesize($f));
if(!$stream || !$content)
return 0;
$count = 0;
// Regular Expressions found by Googling (all linked to SO answers):
$regex = "/\/Count\s+(\d+)/";
$regex2 = "/\/Page\W*(\d+)/";
$regex3 = "/\/N\s+(\d+)/";
if(preg_match_all($regex, $content, $matches))
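// $matches[1] holds the captured numbers; $matches itself is an array of arrays
$count = max(array_map('intval', $matches[1]));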
return $count;
/\/Count\s+(\d+)/ (looks for /Count <number>) doesn't work because only a few documents have the parameter /Count inside, so most of the time it doesn't return anything. Source.
/\/Page\W*(\d+)/ (looks for /Page<number>) doesn't get the number of pages, mostly contains some other data. Source.
/\/N\s+(\d+)/ (looks for /N <number>) doesn't work either, as the documents can contain multiple values of /N ; most, if not all, not containing the pagecount. Source.
So, what does work reliably and accurately?
See the answer below
A simple command line executable called: pdfinfo.
It is downloadable for Linux and Windows. You download a compressed file containing several little PDF-related programs. Extract it somewhere.
One of those files is pdfinfo (or pdfinfo.exe for Windows). An example of data returned by running it on a PDF document:
Title: test1.pdf
Author: John Smith
Creator: PScript5.dll Version 5.2.2
Producer: Acrobat Distiller 9.2.0 (Windows)
CreationDate: 01/09/13 19:46:57
ModDate: 01/09/13 19:46:57
Tagged: yes
Form: none
Pages: 13 <-- This is what we need
Encrypted: no
Page size: 2384 x 3370 pts (A0)
File size: 17569259 bytes
Optimized: yes
PDF version: 1.6
I haven't seen a PDF document where it returned a false pagecount (yet). It is also really fast: even with big documents of 200+ MB, the response time is just a few seconds or less.
There is an easy way of extracting the pagecount from the output, here in PHP:
// Make a function for convenience
function getPDFPages($document)
{
// Keep only the line that matches your platform
$cmd = "/path/to/pdfinfo"; // Linux
// $cmd = "C:\\path\\to\\pdfinfo.exe"; // Windows
// Parse entire output
// Surround with double quotes if file name has spaces
exec("$cmd \"$document\"", $output);
// Iterate through lines
$pagecount = 0;
foreach($output as $op)
{
// Extract the number
if(preg_match("/Pages:\s*(\d+)/i", $op, $matches) === 1)
{
$pagecount = intval($matches[1]);
break;
}
}
return $pagecount;
}
// Use the function
echo getPDFPages("test 1.pdf"); // Output: 13
Of course this command line tool can be used in other languages that can parse output from an external program, but I use it in PHP.
I know it's not pure PHP, but external programs are way better at PDF handling (as seen in the question).
I hope this can help people, because I have spent a whole lot of time trying to find the solution to this and I have seen a lot of questions about PDF pagecount in which I didn't find the answer I was looking for. That's why I made this question and answered it myself.
Security Notice: Use escapeshellarg on $document if document name is being fed from user input or file uploads.
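The security notice above can be sketched as follows. This is only an illustration, assuming pdfinfo is on the PATH. The names parsePageCount and getPDFPagesSafe are hypothetical, and the parsing is split out so it can be checked without running the external program.

```php
<?php
// Hypothetical helper: pull the page count out of pdfinfo's output lines.
function parsePageCount(array $lines): int
{
    foreach ($lines as $line) {
        if (preg_match('/^Pages:\s*(\d+)/i', $line, $m) === 1) {
            return (int) $m[1];
        }
    }
    return 0; // no "Pages:" line found
}

// Hypothetical wrapper: assumes "pdfinfo" is on the PATH.
function getPDFPagesSafe(string $document, string $cmd = 'pdfinfo'): int
{
    // escapeshellarg() quotes the file name so spaces or shell
    // metacharacters in it cannot alter the command.
    exec($cmd . ' ' . escapeshellarg($document), $output);
    return parsePageCount($output);
}

// Parsing a fragment of the sample output shown above:
echo parsePageCount(['Form:           none', 'Pages:          13', 'Encrypted:      no']); // 13
```

Anchoring the pattern at the start of the line keeps the "Page size:" line from being mistaken for the page count.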
The simplest of all is to use ImageMagick.
Here is some sample code:
$image = new Imagick();
$image->pingImage('myPdfFile.pdf');
echo $image->getNumberImages();
Otherwise, you can also use PDF libraries such as mPDF or TCPDF for PHP.
You can use qpdf like below. If a file file_name.pdf has 100 pages,
$ qpdf --show-npages file_name.pdf
100
Here is a simple example to get the number of pages in PDF with PHP.
<?php
function count_pdf_pages($pdfname) {
$pdftext = file_get_contents($pdfname);
$num = preg_match_all("/\/Page\W/", $pdftext, $dummy);
return $num;
}
$pdfname = 'example.pdf'; // Put your PDF path
$pages = count_pdf_pages($pdfname);
echo $pages;
?>
If you can't install any additional packages, you can use this simple one-liner:
foundPages=$(strings < $PDF_FILE | sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)
This seems to work pretty well, without the need for special packages or parsing command output.
<?php
$target_pdf = "multi-page-test.pdf";
$cmd = sprintf("identify %s", $target_pdf);
exec($cmd, $output);
$pages = count($output);
Since you're ok with using command line utilities, you can use cpdf (Microsoft Windows/Linux/Mac OS X). To obtain the number of pages in one PDF:
cpdf.exe -pages "my file.pdf"
I created a wrapper class for pdfinfo in case it's useful to anyone, based on Richard's answer:
/**
* Wrapper for pdfinfo program, part of xpdf bundle
* http://www.xpdfreader.com/about.html
*
* this will put all pdfinfo output into keyed array, then make them accessible via getValue
*/
class PDFInfoWrapper {
const PDFINFO_CMD = 'pdfinfo';
/**
* keyed array to hold all the info
*/
protected $info = array();
/**
* raw output in case we need it
*/
public $raw = "";
/**
* Constructor
* @param string $filePath - path to file
*/
public function __construct($filePath) {
exec(self::PDFINFO_CMD . ' "' . $filePath . '"', $output);
//loop each line and split into key and value
foreach($output as $line) {
$colon = strpos($line, ':');
if($colon) {
$key = trim(substr($line, 0, $colon));
$val = trim(substr($line, $colon + 1));
//use strtolower to make case insensitive
$this->info[strtolower($key)] = $val;
}
}
//store the raw output
$this->raw = implode("\n", $output);
}
/**
* get a value
* @param string $key - key name, case insensitive
* @returns string value, or null if the key is not present
*/
public function getValue($key) {
$key = strtolower($key);
return isset($this->info[$key]) ? $this->info[$key] : null;
}
/**
* list all the keys
* @returns array of key names
*/
public function getAllKeys() {
return array_keys($this->info);
}
}
This simple one-liner seems to do the job well:
strings $path_to_pdf | grep Kids | grep -o R | wc -l
There is a block in the PDF file that lists the pages in this funky string:
/Kids [3 0 R 4 0 R 5 0 R 6 0 R 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R 13 0 R 14 0 R 15 0 R 16 0 R 17 0 R 18 0 R 19 0 R 20 0 R 21 0 R 22 0 R 23 0 R 24 0 R 25 0 R 26 0 R 27 0 R 28 0 R 29 0 R 30 0 R 31 0 R 32 0 R 33 0 R 34 0 R 35 0 R 36 0 R 37 0 R 38 0 R 39 0 R 40 0 R 41 0 R]
The number of 'R' characters is the number of pages
You can use mutool.
mutool show FILE.pdf trailer/Root/Pages/Count
mutool is part of the MuPDF software package.
Here is an R function that reports the number of pages in a PDF file by using the pdfinfo command.
pdf.file.page.number <- function(fname) {
a <- pipe(paste("pdfinfo", fname, "| grep Pages | cut -d: -f2"))
page.number <- as.numeric(readLines(a))
close(a)
page.number
}
if (F) {
pdf.file.page.number("a.pdf")
}
Here is a Windows command script using Ghostscript that reports the number of pages in a PDF file:
@echo off
echo.
rem
rem this file: getlastpagenumber.cmd
rem version 0.1 from commander 2015-11-03
rem need Ghostscript e.g. download and install from http://www.ghostscript.com/download/
rem Install path "C:\prg\ghostscript" for using the script without changes \\ and have less problems with UAC
rem
:vars
set __gs__="C:\prg\ghostscript\bin\gswin64c.exe"
set __lastpagenumber__=1
set __pdffile__="%~1"
set __pdffilename__="%~n1"
set __datetime__=%date%%time%
set __datetime__=%__datetime__:.=%
set __datetime__=%__datetime__::=%
set __datetime__=%__datetime__:,=%
set __datetime__=%__datetime__:/=%
set __datetime__=%__datetime__: =%
set __tmpfile__="%tmp%\%~n0_%__datetime__%.tmp"
:check
if %__pdffile__%=="" goto error1
if not exist %__pdffile__% goto error2
if not exist %__gs__% goto error3
:main
%__gs__% -dBATCH -dFirstPage=9999999 -dQUIET -dNODISPLAY -dNOPAUSE -sstdout=%__tmpfile__% %__pdffile__%
FOR /F " tokens=2,3* usebackq delims=:" %%A IN (`findstr /i "number" %__tmpfile__%`) DO set __lastpagenumber__=%%A
set __lastpagenumber__=%__lastpagenumber__: =%
if exist %__tmpfile__% del %__tmpfile__%
:output
echo The PDF-File: %__pdffilename__% contains %__lastpagenumber__% pages
goto end
:error1
echo no pdf file selected
echo usage: %~n0 PDFFILE
goto end
:error2
echo no pdf file found
echo usage: %~n0 PDFFILE
goto end
:error3
echo.can not find the ghostscript bin file
echo. %__gs__%
echo.please download it from:
echo. http://www.ghostscript.com/download/
echo.and install to "C:\prg\ghostscript"
goto end
:end
exit /b
The R package pdftools and its function pdf_info() provide information on the number of pages in a PDF.
library(pdftools)
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
info <- pdf_info(pdf_file)
nbpages <- info[2]
nbpages
$pages
[1] 65
If you have access to a shell, the simplest approach (though it does not work on 100% of PDFs) is to use grep.
This should return just the number of pages:
grep -m 1 -aoP '(?<=\/N )\d+(?=\/)' file.pdf
Example: https://regex101.com/r/BrUTKn/1
Switches description:
-m 1 is necessary as some files can have more than one match of the regex pattern (volunteer needed to replace this with a regex that matches only the first occurrence)
-a is necessary to treat the binary file as text
-o to show only the match
-P to use Perl regular expression
Regex explanation:
starting "delimiter": (?<=\/N ) lookbehind of /N (nb. space character not seen here)
actual result: \d+ any number of digits
ending "delimiter": (?=\/) lookahead of /
Nota bene: if in some case match is not found, it's safe to assume only 1 page exists.
I ran into problems installing ImageMagick on a production server. After hours of attempts, I decided to get rid of ImageMagick and found another approach:
Install poppler-utils:
$ sudo apt install poppler-utils [On Debian/Ubuntu & Mint]
$ sudo dnf install poppler-utils [On RHEL/CentOS & Fedora]
$ sudo zypper install poppler-tools [On OpenSUSE]
$ sudo pacman -S poppler [On Arch Linux]
Then run it via the shell from your programming language (e.g. PHP):
shell_exec("pdfinfo $filePath | grep Pages | cut -f 2 -d':' | xargs");
This works fine in ImageMagick.
convert image.pdf -format "%n\n" info: | head -n 1
The regex /\/Page\W/ is often suggested, but it doesn't work for me on several PDF files.
So here is another regular expression that works for me.
$pdf = file_get_contents($path_pdf);
return preg_match_all("/[<|>][\r\n|\r|\n]*\/Type\s*\/Page\W/", $pdf, $dummy);
I'm currently successfully reading out several properties on our switches over SNMP with PHP. Now I'm looking at making the resulting output of snmpget and snmpwalk actually useful for the consumers of our APIs.
Problem is that the responses look like this: INTEGER: up(1) and INTEGER: 10103 ...
Is there any convention/standard on how to parse this response format or is the response vendor specific for each device we are trying to read?
Is there by any chance already a PHP library, function or extension that can cast these responses into native PHP variables, or at least something useful that we can work with?
UPDATE:
I've found out a few new things, namely that there are indeed several PHP libraries that can parse binary ASN.1 strings, which are basically BER-encoded strings if I'm right. Problem is that I can't seem to find a way to get the binary data from the devices with PHP ...
You can simply use this function at the beginning of your script :
snmp_set_quick_print(TRUE);
It will return only the value you are searching for, without the leading "INTEGER" or so ;)
Hope this helps !
I'm not sure about your particular PHP methods, but the difference between your two INTEGER examples is likely to be whether your system has an SNMP MIB corresponding to the OID (e.g. to determine that 1 means "up").
If you only want the integers, you should be able to pass a parameter to your get or walk command. For example, net-snmp's snmpget or snmpwalk commands will take -Oe to remove symbolic labels. From the manpage:
$ snmpget -c public -v 1 localhost ipForwarding.0
IP-MIB::ipForwarding.0 = INTEGER: forwarding(1)
$ snmpget -c public -v 1 -Oe localhost ipForwarding.0
IP-MIB::ipForwarding.0 = INTEGER: 1
If you are parsing net-snmp output, I recommend reading the snmpcmd man page, as it has a lot of output options that will interest you, especially for the display of other types such as timeticks and strings.
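If the textual form has to be parsed in PHP anyway, the two INTEGER examples from the question could be handled roughly like this. It is a sketch, not a complete parser for every net-snmp output type; parseSnmpValue is a hypothetical helper name, and only the INTEGER type is converted.

```php
<?php
// Hypothetical helper: split a net-snmp style value such as "INTEGER: up(1)"
// into its type, optional symbolic label, and value.
function parseSnmpValue(string $raw): array
{
    $type = null;
    $rest = trim($raw);
    // The type prefix is everything before the first colon.
    if (preg_match('/^([A-Za-z][A-Za-z0-9 -]*):\s*(.*)$/s', $rest, $m) === 1) {
        $type = $m[1];
        $rest = $m[2];
    }
    $label = null;
    $value = $rest;
    if ($type === 'INTEGER') {
        if (preg_match('/^([A-Za-z_][\w-]*)\((-?\d+)\)$/', $rest, $m) === 1) {
            // Enumerated value resolved through a MIB, e.g. up(1).
            $label = $m[1];
            $value = (int) $m[2];
        } elseif (preg_match('/^-?\d+/', $rest, $m) === 1) {
            // Plain integer, possibly followed by a unit.
            $value = (int) $m[0];
        }
    }
    return ['type' => $type, 'label' => $label, 'value' => $value];
}

print_r(parseSnmpValue('INTEGER: up(1)'));
print_r(parseSnmpValue('INTEGER: 10103'));
```

Values of other types are returned as the untouched string after the type prefix, so callers can add their own handling for timeticks, strings and so on.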
If you do want to retrieve SNMP in PHP you could look at how Cacti does it.
I logged back in to my server at DreamHost to test some scripts, and I found I couldn't use str_split: I got an "undefined function" message. I checked the server, and its PHP version is 5.2.12. Which version is required? Thanks.
Testcode:
<?php
$arr = str_split("lsdjflsdjflsdjflsdjfl");
print_r($arr);
?>
Message:
Fatal error: Call to undefined function: str_split() in /test.php on line 3
Edit @Justin Johnson
I checked the server's system directory and found that there are two versions of PHP on DreamHost. In the user's webroot, files are parsed by PHP 5, which is why I got PHP 5.2.12 by putting a phpinfo.php in the webroot. But if PHP files are run directly on the command line using php test.php, another PHP version (4.x) is used. That's the reason I got the error. When I use
/usr/local/php5/bin/php test.php
Everything is fine.
Rather than use str_split, it's usually much easier to iterate through the characters of the string directly:
$s="abc";
$i=0;
while(isset($s[$i])) {
echo $s[$i++]." ";
}
see?
First off: The PHP documentation will always say what version is required for every function on that function's documentation page directly under the function name.
It is possible that an .htaccess file is somewhere in your path and is causing a previous version (<5) of PHP to be used. To double (or triple) check to make sure that you are running in the proper PHP version, place this code above the line where you call str_split
echo "version:", phpversion(),
"<br/>\nstr_split exists? ",
function_exists("str_split") ? "true" : "false";
However, as shown by Col. Shrapnel, it is not necessary to convert a string to an array of individual characters in order to iterate over the characters of that string. Strings can also be iterated over using traditional iteration methods, thus making the call to str_split unnecessary and wasteful (unless you need to segment the string into fixed length chunks, e.g.: str_split($s, 3))
foreach ( str_split($s) as $c ) {
// do something with character $c
}
can be replaced by
$s = "lsdjflsdjflsdjflsdjfl";
for ( $i=0; isset($s[$i]); ++$i ) {
// do something with character $s[$i]." ";
}
which is equally clear, if not clearer.
According to the DreamHost wiki, you need to switch to PHP 5 manually from the control panel if you created your domain before September 2008.
http://wiki.dreamhost.com/Installing_PHP5#Using_DreamHost.27s_PHP_5
PHP 5 was added to all plans by DreamHost as of June 2005. As of September 2008, support for PHP4 was discontinued, so you can no longer switch back to PHP 4 from PHP 5 from the panel.
If you haven't switched to PHP 5 yet, you can do this in the Control Panel. But, again, you will not be able to switch back to PHP 4 after switching to PHP 5.
Here's how to switch from PHP 4 to PHP 5:
1. Log into the DreamHost Control Panel.
2. Click Domains, then Manage Domains.
3. Click the wrench icon next to the domain you want to activate PHP 5 on (under the Web Hosting column).
4. Select PHP 5.x.x from the dropdown menu.
5. Click Change fully hosted settings now! at the bottom of the section.
6. Repeat steps 3-5 for each additional domain you want to activate.
You could also check your PHP version with
<?php
phpinfo();
?>
The version required is PHP 5 or later. So theoretically your program should work.
If you can't get str_split to work, just use a string as an array:
$stuff = "abcdefghijkl";
echo $stuff[3];
will produce
d
This method is fastest, anyway. I don't know if it suits your needs, but if it does, I hope it helps!
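If chunks rather than single characters are needed on an interpreter that lacks str_split, a small fallback can be written with substr. This is a sketch only: str_split_compat is a hypothetical name, it avoids type declarations so that it also parses on older PHP versions, and it returns an empty array for an empty string, which is not what every version of the built-in does.

```php
<?php
// Hypothetical fallback for interpreters without str_split():
// cut the string into chunks of $length bytes using substr().
function str_split_compat($string, $length = 1)
{
    if ($length < 1) {
        return false;
    }
    $chunks = array();
    $total = strlen($string);
    for ($i = 0; $i < $total; $i += $length) {
        $chunks[] = substr($string, $i, $length);
    }
    return $chunks;
}

print_r(str_split_compat("abcdef", 4));
```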
Could be anything in your code. How do we know its not a 10 line script or 2000 line script?
You can use preg_split() to split a string into single characters, but it will return an extra empty string at the beginning and the end.
$a = preg_split("//","abcdefg");
echo json_encode($a);
prints:
["","a","b","c","d","e","f","g",""]
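If the empty strings are unwanted, the PREG_SPLIT_NO_EMPTY flag of preg_split() drops them. A short sketch ($chars is just an illustrative variable name):

```php
<?php
// Passing -1 as the limit means "no limit"; PREG_SPLIT_NO_EMPTY
// discards the empty pieces at the start and end.
$chars = preg_split('//', 'abcdefg', -1, PREG_SPLIT_NO_EMPTY);
echo json_encode($chars); // ["a","b","c","d","e","f","g"]
```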
I am creating a very simple file search, where the search database is a text file with one file name per line. The database is built with PHP, and matches are found by grepping the file (also with PHP).
This works great in Linux, but not on Mac when non-ascii characters are used. It looks like names are encoded differently on HFS+ (MacOSX) than on e.g. ext3 (Linux). Here's a test.php:
<?php
$mystring = "abcóüÚdefå";
file_put_contents($mystring, "");
$h = dir('.');
$h->read(); // "."
$h->read(); // ".."
$filename = $h->read();
print "string: $mystring and filename: $filename are ";
if ($mystring == $filename) print "equal\n";
else print "different\n";
When run on MacOSX:
$ php test.php
string: abcóüÚdefå and filename: abcóüÚdefå are different
$ php test.php |cat -evt
string: abcóü?M-^Zdefå$ and filename: abco?M-^Au?M-^HU?M-^Adefa?M-^J are different$
When run on Linux (or on a nfs-mounted ext3 filesystem on MacOSX):
$ php test.php
string: abcóüÚdefå and filename: abcóüÚdefå are equal
$ php test.php |cat -evt
string: abcM-CM-3M-CM-<M-CM-^ZdefM-CM-% and filename: abcM-CM-3M-CM-<M-CM-^ZdefM-CM-% are equal$
Is there a way to make this script return "equal" on both platforms?
Mac OS X stores file names on HFS+ in a variant of Unicode normalization form D (NFD), while most other systems use NFC.
(from unicode.org)
There are several implementations of NFD to NFC conversion. Here I've used the PHP Normalizer class to detect NFD strings and convert them to NFC. It's available in PHP 5.3 or through the PECL Internationalization extension. The following amendment will make the script work:
...
$filename = $h->read();
if (!normalizer_is_normalized($filename)) {
$filename = normalizer_normalize($filename);
}
...
It seems that Mac OS X/HFS+ is using character combinations instead of single characters. So the ó (U+00F3) is instead encoded as o (U+006F) + ´ (U+0301, COMBINING ACUTE ACCENT, which is 0xCC 0x81 in UTF-8). See also Apple’s Unicode Decomposition Table.
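The difference is easy to see at the byte level. The sketch below writes both forms of ó as raw UTF-8 bytes; the variable names are illustrative, and the last step only runs if the intl extension (which provides the Normalizer class) happens to be loaded.

```php
<?php
// The same visible character "ó" in two Unicode normalization forms,
// written as raw UTF-8 bytes so the source file's encoding does not matter.
$nfc = "\xC3\xB3";   // U+00F3, precomposed
$nfd = "o\xCC\x81";  // U+006F followed by U+0301 (combining acute accent)

echo bin2hex($nfc), "\n"; // c3b3
echo bin2hex($nfd), "\n"; // 6fcc81
var_dump($nfc === $nfd);  // bool(false): the byte sequences differ

// With the intl extension loaded, normalizing to NFC lets the two be compared.
if (class_exists('Normalizer')) {
    var_dump(Normalizer::normalize($nfd, Normalizer::FORM_C) === $nfc);
}
```

This is why a plain string comparison of the literal and the name read back from the directory fails on HFS+ even though both print identically.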
Have you checked that both systems use the same locale?
What encoding is the PHP script using on both systems?
I would also try using strcmp instead of the equals operator. I'm not sure if the equals operator uses strcmp internally, but it's a simple thing to test out in your case.