I am using the pdftk tool to extract form fields from PDF files. Everything runs fine except for one PDF file, which causes the error given below:
Error: Failed to open PDF file:
http://www.uscis.gov/sites/default/files/files/form/i-9.pdf
Done. Input errors, so no output created.
The command is:
root@ri8-MS-7788:/home/ri-8# pdftk http://192.168.1.43/form/i-9.pdf dump_data_fields
The same command works for all other forms.
Attempt 1
I tried to convert the PDF into an unprotected version, but it produces the same error. Here is the command:
pdftk http://192.168.1.43/forms/i-9.pdf input_pw foopass output /var/www/forms/un-i-9.pdf
Update
This is my full function to handle this:
public function Formanalysis($pdfname)
{
    $pdffile = Yii::app()->getBaseUrl(true) . '/uploads/forms/' . $pdfname;
    exec("pdftk " . $pdffile . " dump_data_fields 2>&1", $output, $retval);
    // Some PDFs produce an error if they are secured
    if (strpos($output[0], 'Error') !== false)
    {
        $unsafepdf = Yii::getPathOfAlias('webroot') . '/uploads/forms/un-' . $pdfname;
        //echo "pdftk ".$pdffile." input_pw foopass output ".$unsafepdf;
        exec("pdftk " . $pdffile . " input_pw foopass output " . $unsafepdf);
        exec("pdftk " . $unsafepdf . " dump_data_fields 2>&1", $outputunsafe, $retval);
        return $outputunsafe;
        //$response=array('0'=>'error','error'=>$output[0]);
        //return $response;
    }
    //if (strpos($output[0],'Error') !== false){ echo "error to run" ; } // this is the option to handle error
    return $output;
}
PdfTk is a tool that was created by compiling an obsolete version of iText to an executable using the GNU Compiler for Java (GCJ) (PdfTk is not endorsed by iText Group NV).
I have examined your PDF and it uses two technologies that weren't supported by iText at the time PdfTk was created: XFA and compressed cross-reference tables.
The latter is what causes your problem. PdfTk expects your file to end like this:
xref
0 7
0000000000 65535 f
0000000258 00000 n
0000000015 00000 n
0000000346 00000 n
0000000146 00000 n
0000000397 00000 n
0000000442 00000 n
trailer
<</ID [<c8bf0ac531b0fc7b5b9ec5daf0296834><ec4dde54d00305ebbec62f3f6bbca974>]/Root 5 0 R/Size 7/Info 6 0 R>>
%iText-5.4.3
startxref
595
%%EOF
In this snippet startxref marks the byte offset of xref which is where the cross-reference table starts. This table contains the byte-offsets of all the objects in the PDF.
When you look at the PDF you refer to, you see that it ends like this:
64 0 obj
<</DecodeParms<</Columns 5/Predictor 12>>/Encrypt 972 0 R/Filter/FlateDecode/ID[<85C47EA3EFE49E4CB0F087350055FDDC><C3F1748360D0464FBA02D711DE864630>]/Info 970 0 R/Length 283/Root 973 0 R/Size 971/Type/XRef/W[1 3 1]>>stream
hÞìÒ±JQЙ·»7J¢©ÕØ(Xþ„ù »h%¤É¤¶”€mZ+;ÁN,,ÁÆ6 XÁ&‚("î½YŒI‘Bî‡áμ]ö1Áð÷³cfþ‹ûÐÚLî`z„Ýôœùw÷N×X?ÙkNv`hÁÒj¦G[œiÀå»›œ?b½Än…ÉëàÍþ gY—i7WW‡òj®îÍ°u¸Ò‡Ñ:óÆÛ™ñÎë&'×݈§ü†ù!ÿñ€ù%,\ácçÙ9˜ì±Þ€S¼Ãd—‰Áy~×.ø¶Åìþßn_˜$9Ôüw£X9#åxzçgRüüóÙwÝ¡œÄNJ©½’Ú+©½’R{%µWR{%ÿ·á”;`_ z6Ø
endstream
endobj
startxref
116
%%EOF
In this case, startxref still refers to where the first cross-reference table starts (it's a linearized PDF), but the cross reference table is stored inside an object, and that object is compressed (see the gibberish between the stream and endstream keywords).
Compressed cross-reference tables and compressed objects were introduced in PDF 1.5 (2003), but they aren't supported by PdfTk. You'll have to find a tool that can deal with such streams (e.g. a recent version of iText, which is the real stuff when compared to PdfTk), or you have to save your PDF as a PDF 1.4 before you treat it with PdfTk (but you'll lose the XFA, because XFA was also introduced in PDF 1.5).
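To make the difference between the two endings concrete, here is a rough PHP sketch that reads the offset after the last startxref keyword and looks at what is stored there: the xref keyword (a classic table) or an object header (a cross-reference stream). The function name xrefKind is made up for this illustration, and the check is deliberately simplistic; it is not a PDF parser and does not decode anything.

```php
<?php
/**
 * Hypothetical helper (not part of pdftk or iText): report how the
 * cross-reference data referenced by the last "startxref" is stored.
 * Returns 'table', 'stream', or 'unknown'.
 */
function xrefKind(string $pdfBytes): string
{
    // Locate the last "startxref" keyword in the file.
    $pos = strrpos($pdfBytes, 'startxref');
    if ($pos === false) {
        return 'unknown';
    }
    // The decimal byte offset follows the keyword.
    if (!preg_match('/\Gstartxref\s+(\d+)/', $pdfBytes, $m, 0, $pos)) {
        return 'unknown';
    }
    $head = (string) substr($pdfBytes, (int) $m[1], 32);

    // Classic (PDF 1.4 and earlier) table: starts with the keyword "xref".
    if (strncmp($head, 'xref', 4) === 0) {
        return 'table';
    }
    // PDF 1.5+ cross-reference stream: starts with an object header "N G obj".
    if (preg_match('/^\d+\s+\d+\s+obj/', $head)) {
        return 'stream';
    }
    return 'unknown';
}
```

A call such as xrefKind(file_get_contents($path)) returning 'stream' would suggest the file needs a tool that understands PDF 1.5 features.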
Update:
Since you are asking about form fields, I'm adding the following attachment:
This screenshot was taken using iText RUPS (which proves that iText can open the document). To the right, you see that the same form is defined twice:
If you would walk down the tree under Fields, you'd find all the fields that are stored in the PDF using AcroForm technology. To the left, you can see the description of such a field:
If you look under XFA, you notice that the same form is also defined using the XML Forms Architecture. If you click on datasets, you see the XML description of the dataset in the lower panel:
All of this information can be accessed programmatically using iText (Java) or iTextSharp (C#). PdfTk is merely a tool based on a very old version of this technology.
This may be a slightly tricky solution, but it should work for you. As @bruno said, this is an encrypted file, so you should decrypt it before passing it to pdftk. For this I found qpdf, a free, open-source tool that can decrypt PDFs, remove the owner and user passwords, and much more. Install it on your system and run this command:
qpdf --decrypt input.pdf output.pdf
Then use the output file in the pdftk command; it should work.
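From PHP, the two steps could be chained roughly like this. It is only a sketch: both function names are made up for illustration, both paths are assumed to be local files, and qpdf and pdftk are assumed to be installed and on the PATH.

```php
<?php
/**
 * Hypothetical helper: build the two shell commands for
 * "decrypt with qpdf, then dump the form fields with pdftk".
 * Arguments are escaped so odd file names cannot break the command line.
 */
function buildDecryptThenDumpCommands(string $inputPdf, string $decryptedPdf): array
{
    return [
        'qpdf --decrypt ' . escapeshellarg($inputPdf) . ' ' . escapeshellarg($decryptedPdf),
        'pdftk ' . escapeshellarg($decryptedPdf) . ' dump_data_fields 2>&1',
    ];
}

/**
 * Hypothetical wrapper: run both commands and return pdftk's output lines.
 */
function dumpFieldsAfterDecrypt(string $inputPdf, string $decryptedPdf): array
{
    [$qpdfCmd, $pdftkCmd] = buildDecryptThenDumpCommands($inputPdf, $decryptedPdf);

    exec($qpdfCmd, $qpdfOutput, $qpdfStatus);
    // qpdf exits with 0 on success and 3 on success with warnings.
    if ($qpdfStatus !== 0 && $qpdfStatus !== 3) {
        throw new RuntimeException("qpdf failed with exit status $qpdfStatus");
    }

    exec($pdftkCmd, $fields, $pdftkStatus);
    if ($pdftkStatus !== 0) {
        throw new RuntimeException("pdftk failed: " . implode("\n", $fields));
    }
    return $fields;
}
```

Checking the exit status instead of searching the first output line for 'Error' avoids a notice when the command prints nothing.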
Related
I am trying to animate all images in a directory with imagemagick's morph operator. I am calling the command-line arguments with PHP's exec command. The images hold a specific file name such as 000000.jpg, 000001.jpg, 000002.jpg, 000003.jpg, and so on.
The following code works fine for the image sequence 000000.jpg to 000009.jpg, but when I have images such as 0000010.jpg, 0000010.jpg comes right after 000001.jpg, while I want 000002.jpg to come after 000001.jpg. Can you please point out how to modify the code so that the sequence is retained in the GIF animation?
$command = "convert -set delay 5 -loop 0 $img_dir/00000*.jpg -morph 10 $img_dir/morph.gif";
exec($command);
Normally, the length of the name is kept constant, so if you add an extra digit in going from 9 to 10, you remove one of the leading zeroes to keep the length the same. I mean, most people do this:
000008
000009
000010
000011
whereas you appear to have done this where the length changes at 10:
000008
000009
0000010
0000011
I would suggest you change the program that generates the names so that it produces constant-length names; then everything will work properly.
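If the frames are written by PHP, constant-length names can be produced with sprintf. A minimal sketch, where frameName and the default width of 6 digits are assumptions chosen to match the names shown above:

```php
<?php
/**
 * Hypothetical helper: build a zero-padded frame name of constant length,
 * e.g. index 9 -> 000009.jpg and index 10 -> 000010.jpg.
 */
function frameName(int $index, int $width = 6): string
{
    // %0Nd pads the number with leading zeroes up to N digits.
    return sprintf('%0' . $width . 'd.jpg', $index);
}
```

With names like these, plain alphabetical order (which is what a shell glob gives you) is the same as numerical order.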
Failing that, if you wish to, or are obliged to, stick with your naming system, ImageMagick will accept a list of files in any order you like, so you can create a file called filelist.txt or somesuch, with contents like this
000000.jpg
000001.jpg
000002.jpg
...
000009.jpg
0000010.jpg
0000011.jpg
and use it like this:
convert -delay 100 -loop 0 @filelist.txt anim.gif
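Since the question calls ImageMagick from PHP, the file list could also be generated there by sorting the names on their numeric value. A sketch under the assumption that every file name is just a number plus the .jpg extension; sortFramesNumerically is a made-up name:

```php
<?php
/**
 * Hypothetical helper: order frame files by the integer value of their
 * base name, so 000002.jpg sorts before 0000010.jpg.
 */
function sortFramesNumerically(array $files): array
{
    usort($files, function (string $a, string $b): int {
        // pathinfo(..., PATHINFO_FILENAME) drops the directory and extension.
        return (int) pathinfo($a, PATHINFO_FILENAME) <=> (int) pathinfo($b, PATHINFO_FILENAME);
    });
    return $files;
}

// One file name per line, ready to be written to filelist.txt.
$sorted = sortFramesNumerically(['0000010.jpg', '000001.jpg', '000002.jpg']);
$listContents = implode("\n", $sorted) . "\n";
```

Writing $listContents to the list file (for example with file_put_contents) before calling convert keeps the frames in order without renaming anything.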
Alternatively, you could let bash print your filenames always with 5 leading zeroes using a command such as this:
printf "00000%d.jpg " {0..4}
000000.jpg 000001.jpg 000002.jpg 000003.jpg 000004.jpg
and embed that in your ImageMagick command
convert -delay 100 -loop 0 $(printf "00000%d.jpg " {0..4}) anim.gif
How can I insert a PJL command into PDF without having to convert PDF to PostScript
*STARTPJL
@PJL SET STAPLE=LEFTTOP
*ENDPJL
after I send it to printer via FTP or LPR.
I'm using Zend_Pdf to create PDF documents.
I tried this code, without success:
$a .= "<ESC>%-12345X@PJL<CR><LF>";
$a .= "@PJL SET OUTBIN=OUTBIN101<CR><LF>";
$a .= "@PJL SET STAPLE=LEFTTOP<CR><LF>";
$a .= "@PJL ENTER LANGUAGE = PDF<CR><LF>";
$a .= file_get_contents("/www/zendsvr/htdocs/GDA/public/pdf/test.pdf");
$a .= "<ESC>%-12345X";
$myfile = fopen("/www/zendsvr/htdocs/GDA/public/pdf/t.pdf", "w");
fwrite($myfile, $a);
fclose($myfile);
The document is printed correctly, but the output tray does not change and the pages are not stapled. Any suggestions?
I'm not going to explain how to achieve the following points with PHP. These points merely explain the most important fundamentals to be familiar with when dealing with PJL and with PJL regarding PDF-based print jobs. You have to 'translate' this generic info to PHP yourself....
You cannot insert PJL commands into PDF. But you can prepend PJL commands to a PDF print job.
Also, it is not meaningful to do this after you send it to a printer via FTP or via LPR. It is only meaningful if you do it before sending the file.
Next, your example PJL code is not valid for most purposes. The standard way to prepend PJL lines to a PDF print job file is this:
<ESC>%-12345X@PJL<CR><LF>
@PJL SET STAPLE=LEFTTOP<CR><LF>
@PJL [... more PJL commands if required ...]
@PJL ENTER LANGUAGE = PDF<CR><LF>
[... all bytes of the PDF file, starting with '%PDF-1.' ...]
[... all bytes of the PDF file ............................]
[... all bytes of the PDF file ............................]
[... all bytes of the PDF file, ending with '%%EOF' .......]
<ESC>%-12345X
Explanations:
Here <ESC> denotes the escape character (27 in decimal, 1B in hex).
<CR> denotes the carriage return character (13 in dec, 0D in hex). It is optional within PJL.
<LF> denotes the line feed character (10 in dec, 0A in hex). It is required within PJL.
<ESC>%-12345X denotes the 'Universal Exit Language' command (UEL). It is required in PJL. It defines beginning and end of any PJL-based data stream.
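For readers who do want a PHP starting point, the layout above might be assembled as in the sketch below. The important detail is that <ESC>, <CR> and <LF> are single bytes, not literal text, and that every PJL command starts with @PJL. The function name wrapPdfWithPjl is made up, and whether a given printer honors the commands is a separate matter.

```php
<?php
/**
 * Hypothetical helper: prepend PJL commands to the bytes of a PDF file
 * and terminate the job with the Universal Exit Language sequence.
 *
 * $pjlCommands holds commands without the "@PJL " prefix,
 * e.g. ['SET STAPLE=LEFTTOP'].
 */
function wrapPdfWithPjl(string $pdfBytes, array $pjlCommands): string
{
    $uel = "\x1B%-12345X"; // ESC (0x1B) followed by %-12345X
    $eol = "\r\n";         // CR (0x0D) + LF (0x0A)

    $job = $uel . '@PJL' . $eol;
    foreach ($pjlCommands as $command) {
        $job .= '@PJL ' . $command . $eol;
    }
    $job .= '@PJL ENTER LANGUAGE = PDF' . $eol;
    $job .= $pdfBytes; // untouched PDF bytes, from %PDF-1.x to %%EOF
    $job .= $uel;      // closing UEL

    return $job;
}
```

The result is a print job rather than a valid PDF, so it would be sent to the printer as-is instead of being saved under a .pdf name.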
Lastly, please note:
Not all printers and not all LPR print services are able to deal with PDF-based print jobs.
Also, not all printers and not all LPR print services are able to honor PJL commands which are prepended to print job files.
The case:
Server doesn't support exec/shell_exec (so pdftotext is excluded)
Other libraries don't accept the PDF. Pdftotext works (tested on the files locally)
Here are some excerpts from the (PDF)code:
5 0 obj
>
stream
Gat$ugPXc?%"6H'p]ofd'_qs00UX27?3p0*8m>KOQL4]:u"*$$^'f*q*SGMee*e$5&=alj\#GV7YPq9pg!Lr0>Y2n'&lmd4Br?V9N
P:_",WI.kJ\#'cs>77M9eTkA;,t#f)aaGuNS-6=Wp*uBg,Ft9Tcj#aI]nD[C6&m#9m?m!p6=IBt=o_LGHh!q>f$C.jdOXbSP/796HV`_Y]Y
l)M(]FZ9Ld-J_mMRe2q(D>`V#G`NM]crn#_V?sGC#W9^bnrY$.mqeVN^YEcqK)blO~>
endstream
endobj
About the creator:
%PDF-1.4
1 0 obj
>
endobj
I would like to get some suggestions about how to convert this to plain text in PHP, without using the exec/shell_exec functions.
Thank you.
(Other solutions like http://webcheatsheet.com/php/reading_clean_text_from_pdf.php didn't work, and I couldn't get them to at least convert this code to something looking like ASCII-code.)
You cannot just parse this stream on its own, because you then need to decode the data using lots of other data in the file (like the font encoding). You really want to use a library to do this...
I am trying to convert a PDF to a JPG with a PHP exec() call, which looks like this:
convert page.pdf -resize 716x716 page.jpg
For some reason, the JPG comes out with janky text, despite the PDF looking just fine in Acrobat and Mac Preview. Here is the original PDF:
http://whit.info/dev/conversion/page.pdf
and here is the janktastic output:
http://whit.info/dev/conversion/page.jpg
The server is a LAMP stack with PHP 5 and ImageMagick 6.2.8.
Can you help this stumped Geek?
Thanks in advance,
Whit
ImageMagick is just going to call out to Ghostscript to convert this PDF to an image. If you run gs on the pdf, you get the same badly-spaced output.
I suspect Ghostscript isn't handling the PDF's embedded TrueType fonts very well. If you could change your output to either embed Type 1 fonts or use a "core" PostScript font, you'd get better results.
I suspect it's an encoding/widths issue. Both are a tad off, though I can't put my finger on why.
Here are some suspects:
First
The text stream is defined in UTF-16 LE (char, NULL, char, NULL), using the normal string drawing command syntax:
(some text) Tj
There's a way to escape any old character value into a () string. You can also define strings in hex thusly:
<203245> Tj
Neither method is used, just the questionable inline nulls. That could cause an issue in GS if it's trying to work with pointers to char without lengths associated with them.
Second
The widths array is dumb. You can define widths in groups thusly:
[ 32 [450 525 500] 37 [600 250] 40 [0] ]
This defines
32: 450
33: 525
34: 500
37: 600
38: 250
40: 0
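To spell out the expansion rule, here is a small PHP sketch that takes the grouped form (already parsed into a nested PHP array, which is an assumption of this example) and produces the code-to-width map listed above. The name expandWidths is made up, and the sketch only handles the "start [w1 w2 ...]" form shown here.

```php
<?php
/**
 * Hypothetical helper: expand a widths array of the form
 * [start, [w1, w2, ...], start, [w1, ...], ...] into a code => width map.
 */
function expandWidths(array $w): array
{
    $widths = [];
    // Entries come in pairs: a starting code, then the list of widths
    // for that code and the codes that follow it consecutively.
    for ($i = 0; $i + 1 < count($w); $i += 2) {
        $first = $w[$i];
        foreach ($w[$i + 1] as $k => $width) {
            $widths[$first + $k] = $width;
        }
    }
    return $widths;
}

$map = expandWidths([32, [450, 525, 500], 37, [600, 250], 40, [0]]);
```

Here $map[33] is 525 and $map[38] is 250, matching the table above; codes that are not listed (35, 36, 39) simply have no entry.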
These fonts define their consecutive widths in individual arrays. Not illegal, but definitely wasteful/stupid, and if GS were coded to EXPECT gaps between the arrays, it could induce a bug.
There are also some extremely fishy values in the array. 32 through 126 are defined consecutively, but then it starts jumping all over: ...126 [600] 8364 [500] 8216 [222] 402 [500] 8222 [389]. 8230 [1000] 8224 [444]... and then goes back to being consecutive from 160 to 255.
Just weird.
Third
I'm not even remotely sure, but the CIDToGIDMap stream contains an AWFUL lot of nulls.
Bottom line
Those fonts are fishy. And I've never heard of "Bellflower Books" or "UFPDF 0.1"
That version number makes me cringe. It should make you cringe too.
Googling for "UFPDF" I found this note from the author:
Note: I wrote UFPDF as an experiment, not as a finished product. If you have problems using it, don't bug me for support. Patches are welcome though, but I don't have much time to maintain this.
UFPDF is a PHP library that sits on top of FPDF. 0.1. Just run away.
Is there a quick-and-dirty way to access the "producer" metadata of a PDF file, using Regex or XML parsing, from a PHP application?
The technique does not have to be infallible. The objective is to prompt the user if they upload a PDF created using TeX.
You can hack the value out by looking for the producer or creator tag, but it might be encoded rather than available as ASCII.
On the command line, the following outputs a matching line:
$ strings my.pdf | grep TeX
Producer (pdfTeX-1.40.10)
/Creator (TeX)
/PTEX.Fullbanner (This is pdfTeX, Version 3.1415926-1.40.10-2.2 (TeX Live 2009) kpathsea version 5.0.0)
You might do something similar in PHP, see Read plain text from binary file with PHP.
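A quick-and-dirty PHP version could look like the sketch below. Both function names are made up. It only finds a /Producer entry that is stored as a plain, unescaped literal string; a value that is hex-encoded, UTF-16, or tucked away in a compressed object will be missed, which fits the "does not have to be infallible" requirement.

```php
<?php
/**
 * Hypothetical helper: return the first plain-text /Producer value found
 * in the raw bytes of a PDF, or null if none is visible.
 */
function findPdfProducer(string $pdfBytes): ?string
{
    // Matches "/Producer (...)" without nested or escaped parentheses.
    if (preg_match('~/Producer\s*\(([^()\\\\]*)\)~', $pdfBytes, $m)) {
        return $m[1];
    }
    return null;
}

/**
 * Hypothetical helper: guess whether the PDF was produced by a TeX engine.
 * The comparison is case-sensitive on purpose, so that producers such as
 * "iText" (which contains "Tex" but not "TeX") do not match.
 */
function looksLikeTex(string $pdfBytes): bool
{
    $producer = findPdfProducer($pdfBytes);
    return $producer !== null && strpos($producer, 'TeX') !== false;
}
```

For an uploaded file, this would be called as looksLikeTex(file_get_contents($uploadedPath)), with $uploadedPath being whatever path your upload handler provides.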