PHP - Convert PDF to Text (No access to exec/shell_exec)

The case:
The server doesn't support exec/shell_exec (so pdftotext is excluded).
Other libraries don't accept the PDF. pdftotext works (tested on the files locally).
Here are some excerpts from the PDF source:
5 0 obj
>
stream
Gat$ugPXc?%"6H'p]ofd'_qs00UX27?3p0*8m>KOQL4]:u"*$$^'f*q*SGMee*e$5&=alj\#GV7YPq9pg!Lr0>Y2n'&lmd4Br?V9N
P:_",WI.kJ\#'cs>77M9eTkA;,t#f)aaGuNS-6=Wp*uBg,Ft9Tcj#aI]nD[C6&m#9m?m!p6=IBt=o_LGHh!q>f$C.jdOXbSP/796HV`_Y]Y
l)M(]FZ9Ld-J_mMRe2q(D>`V#G`NM]crn#_V?sGC#W9^bnrY$.mqeVN^YEcqK)blO~>
endstream
endobj
About the creator:
%PDF-1.4
1 0 obj
>
endobj
I would like some suggestions on how to convert this to plain text in PHP without using the exec/shell_exec functions.
Thank you.
(Other solutions like http://webcheatsheet.com/php/reading_clean_text_from_pdf.php didn't work; I couldn't even get them to convert this stream into something resembling ASCII.)

You cannot parse this stream on its own: after undoing the stream encoding, you still need other data from the file (such as the font encodings) to turn its contents into text. You really want to use a library for this.
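To illustrate why the raw stream looks nothing like text: the excerpt ends in `~>`, which is the ASCII85 end-of-data marker. The sketch below assumes the stripped stream dictionary declared `/ASCII85Decode` followed by `/FlateDecode` (the excerpt does not show this), and the function names are made up for illustration. Even when this works, the result is a page content stream full of drawing operators, not plain text, which is where the font encodings come in.

```php
<?php
// Sketch only: undoes /ASCII85Decode. The filter chain is an assumption,
// because the stream dictionary is missing from the excerpt.
// Assumes 64-bit PHP integers.

function ascii85Decode(string $data): string
{
    // Remove whitespace and cut the input at the "~>" end-of-data marker.
    $data = preg_replace('/\s+/', '', $data);
    $end = strpos($data, '~>');
    if ($end !== false) {
        $data = substr($data, 0, $end);
    }

    $out = '';
    $group = [];
    $length = strlen($data);
    for ($i = 0; $i < $length; $i++) {
        $char = $data[$i];
        if ($char === 'z' && count($group) === 0) {
            $out .= "\0\0\0\0"; // "z" abbreviates four zero bytes
            continue;
        }
        $digit = ord($char) - 33;
        if ($digit < 0 || $digit > 84) {
            throw new InvalidArgumentException("Invalid ASCII85 character at offset $i");
        }
        $group[] = $digit;
        if (count($group) === 5) {
            $out .= ascii85Group($group, 4);
            $group = [];
        }
    }

    // A trailing group of n characters encodes n - 1 bytes; pad it with "u" (84).
    $n = count($group);
    if ($n > 1) {
        $out .= ascii85Group(array_pad($group, 5, 84), $n - 1);
    }
    return $out;
}

function ascii85Group(array $digits, int $bytes): string
{
    // Five base-85 digits form one 32-bit big-endian value.
    $value = 0;
    foreach ($digits as $digit) {
        $value = $value * 85 + $digit;
    }
    return substr(pack('N', $value), 0, $bytes);
}

// If the decoded bytes are zlib data (the assumed /FlateDecode step):
// $content = gzuncompress(ascii85Decode($streamBody));
```

`gzuncompress` needs the zlib extension, which does not depend on exec/shell_exec being enabled.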

Related

How to convert stream into pdf or display stream in pdf format using php or javascript?

How to convert stream into pdf or display stream as pdf on webpage using php or javascript?
Sample stream:
13 0 obj
<<
/Length 66/Filter/FlateDecode
>> stream
[some binary data]
endstream
endobj
I would be thankful if anybody can help.
The PDF file format is not at all trivial, and even a simple file typically consists of several objects. Your stream looks like an excerpt from a PDF, but it is not a complete one. How are you encountering such streams?
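For a single object like the sample, a minimal sketch of undoing `/FlateDecode` might look like this. It assumes `/Length` is a direct integer, as in the sample (an indirect reference such as `12 0 R` is not handled), and `inflatePdfStream` is a made-up name. The inflated bytes are only one object's content; they do not form a displayable PDF by themselves.

```php
<?php
// Sketch: read /Length bytes after the "stream" keyword and inflate them.
// Returns null when the object does not match the assumptions above.

function inflatePdfStream(string $object): ?string
{
    // The "stream" keyword is followed by CRLF or LF, then the raw data.
    if (!preg_match('/stream\r?\n/', $object, $keyword, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $dictionary = substr($object, 0, $keyword[0][1]);
    $dataStart = $keyword[0][1] + strlen($keyword[0][0]);

    // Only a direct integer /Length is supported in this sketch.
    if (!preg_match('/\/Length\s+(\d+)/', $dictionary, $length)) {
        return null;
    }
    $raw = substr($object, $dataStart, (int) $length[1]);

    // FlateDecode data is zlib-wrapped, which gzuncompress() understands.
    $inflated = @gzuncompress($raw);
    return $inflated === false ? null : $inflated;
}
```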

pdftk Error: Failed to open PDF file:

I am using pdftk to extract the form fields from PDFs. Everything runs fine except for one PDF file (linked below), which causes the following error:
Error: Failed to open PDF file:
http://www.uscis.gov/sites/default/files/files/form/i-9.pdf
Done. Input errors, so no output created.
The command for this is:
root@ri8-MS-7788:/home/ri-8# pdftk http://192.168.1.43/form/i-9.pdf dump_data_fields
The same command works for all other forms.
Attempt 1
I tried to convert the PDF to an unprotected version, but it produces the same error. Here is the command:
pdftk http://192.168.1.43/forms/i-9.pdf input_pw foopass output /var/www/forms/un-i-9.pdf
Update
This is my full function to handle this:
public function Formanalysis($pdfname)
{
    $pdffile = Yii::app()->getBaseUrl(true) . '/uploads/forms/' . $pdfname;
    // escapeshellarg() keeps unusual file names from breaking the command line
    exec("pdftk " . escapeshellarg($pdffile) . " dump_data_fields 2>&1", $output, $retval);
    // pdftk reports an error for some PDFs if they are secured
    if (isset($output[0]) && strpos($output[0], 'Error') !== false)
    {
        $unsafepdf = Yii::getPathOfAlias('webroot') . '/uploads/forms/un-' . $pdfname;
        //echo "pdftk ".$pdffile." input_pw foopass output ".$unsafepdf;
        // write an unprotected copy, then dump the fields from that copy
        exec("pdftk " . escapeshellarg($pdffile) . " input_pw foopass output " . escapeshellarg($unsafepdf));
        exec("pdftk " . escapeshellarg($unsafepdf) . " dump_data_fields 2>&1", $outputunsafe, $retval);
        return $outputunsafe;
        //$response=array('0'=>'error','error'=>$output[0]);
        //return $response;
    }
    //if (strpos($output[0],'Error') !== false){ echo "error to run" ; } // this is the option to handle error
    return $output;
}
PdfTk is a tool that was created by compiling an obsolete version of iText to an executable using the GNU Compiler for Java (GCJ) (PdfTk is not endorsed by iText Group NV).
I have examined your PDF and it uses two technologies that weren't supported by iText at the time PdfTk was created: XFA and compressed cross-reference tables.
The latter is what causes your problem. PdfTk expects your file to end like this:
xref
0 7
0000000000 65535 f
0000000258 00000 n
0000000015 00000 n
0000000346 00000 n
0000000146 00000 n
0000000397 00000 n
0000000442 00000 n
trailer
<</ID [<c8bf0ac531b0fc7b5b9ec5daf0296834><ec4dde54d00305ebbec62f3f6bbca974>]/Root 5 0 R/Size 7/Info 6 0 R>>
%iText-5.4.3
startxref
595
%%EOF
In this snippet startxref marks the byte offset of xref which is where the cross-reference table starts. This table contains the byte-offsets of all the objects in the PDF.
When you look at the PDF you refer to, you see that it ends like this:
64 0 obj
<</DecodeParms<</Columns 5/Predictor 12>>/Encrypt 972 0 R/Filter/FlateDecode/ID[<85C47EA3EFE49E4CB0F087350055FDDC><C3F1748360D0464FBA02D711DE864630>]/Info 970 0 R/Length 283/Root 973 0 R/Size 971/Type/XRef/W[1 3 1]>>stream
hÞìÒ±JQЙ·»7J¢©ÕØ(Xþ„ù »h%¤É¤¶”€mZ+;ÁN,,ÁÆ6 XÁ&‚("î½YŒI‘Bî‡áμ]ö1Áð÷³cfþ‹ûÐÚLî`z„Ýôœùw÷N×X?ÙkNv`hÁÒj¦G[œiÀå»›œ?b½Än…ÉëàÍþ gY—i7WW‡òj®îÍ°u¸Ò‡Ñ:óÆÛ™ñÎë&'×݈§ü†ù!ÿñ€ù%,\ácçÙ9˜ì±Þ€S¼Ãd—‰Áy~×.ø¶Åìþßn_˜$9Ôüw£X9#åxzçgRüüóÙwÝ¡œÄNJ©½’Ú+©½’R{%µWR{%ÿ·á”;`_ z6Ø
endstream
endobj
startxref
116
%%EOF
In this case, startxref still refers to where the first cross-reference table starts (it's a linearized PDF), but the cross reference table is stored inside an object, and that object is compressed (see the gibberish between the stream and endstream keywords).
Compressed cross-reference tables and compressed objects were introduced in PDF 1.5 (2003), but they aren't supported by PdfTk. You'll have to find a tool that can deal with such streams (e.g. a recent version of iText, which is the real stuff when compared to PdfTk), or you have to save your PDF as a PDF 1.4 before you treat it with PdfTk (but you'll lose the XFA, because XFA was also introduced in PDF 1.5).
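A rough way to tell the two layouts apart before handing a file to PdfTk is to follow the last `startxref` offset and look at what it points to. This is only a sketch: `describeXref` is a hypothetical helper, and it reads the whole file into a string rather than parsing it properly.

```php
<?php
// Sketch: classify the cross-reference data that the last startxref points to.
// Returns 'table' (classic xref keyword), 'stream' (an indirect object,
// i.e. a cross-reference stream as introduced in PDF 1.5), or 'unknown'.

function describeXref(string $pdf): string
{
    $pos = strrpos($pdf, 'startxref');
    if ($pos === false) {
        return 'unknown';
    }
    // The byte offset follows the keyword on the next line.
    if (!preg_match('/\G\s*(\d+)/', $pdf, $m, 0, $pos + strlen('startxref'))) {
        return 'unknown';
    }
    $head = substr($pdf, (int) $m[1], 64);

    if (strncmp($head, 'xref', 4) === 0) {
        return 'table';
    }
    if (preg_match('/^\d+\s+\d+\s+obj/', $head)) {
        return 'stream';
    }
    return 'unknown';
}

// Usage (path is a placeholder):
// echo describeXref(file_get_contents('/path/to/form.pdf'));
```

A result of 'stream' suggests the file needs a tool that understands PDF 1.5 cross-reference streams.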
Update:
Since you are asking about form fields, I'm adding the following attachment:
This screenshot was taken using iText RUPS (which proves that iText can open the document). To the right, you see that the same form is defined twice:
If you would walk down the tree under Fields, you'd find all the fields that are stored in the PDF using AcroForm technology. To the left, you can see the description of such a field:
If you look under XFA, you notice that the same form is also defined using the XML Forms Architecture. If you click on datasets, you see the XML description of the dataset in the lower panel:
All of this information can be accessed programmatically using iText (Java) or iTextSharp (C#). PdfTk is merely a tool based on a very old version of this technology.
This may be a bit of a workaround, but it should work for you. As @bruno said, this is an encrypted file, so you should decrypt it before using it with pdftk. For this I found qpdf, a free open-source tool that can decrypt PDFs, remove owner and user passwords, and much more. Install it on your system and run this command:
qpdf --decrypt input.pdf output.pdf
Then use the output file in the pdftk command. It should work.
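Chained from PHP, the two steps could be sketched roughly as follows. The function names and file paths are placeholders, and actually running the commands needs `exec()` to be enabled with both tools installed. Local file paths are used rather than URLs.

```php
<?php
// Sketch: decrypt with qpdf first, then dump the form fields with pdftk.

function buildDecryptAndDumpCommands(string $input, string $decrypted): array
{
    return [
        'qpdf --decrypt ' . escapeshellarg($input) . ' ' . escapeshellarg($decrypted),
        'pdftk ' . escapeshellarg($decrypted) . ' dump_data_fields 2>&1',
    ];
}

function dumpFieldsViaQpdf(string $input, string $decrypted): array
{
    [$decrypt, $dump] = buildDecryptAndDumpCommands($input, $decrypted);

    exec($decrypt, $ignored, $status);
    // qpdf exits with 0 on success and 3 when it succeeded with warnings.
    if ($status !== 0 && $status !== 3) {
        throw new RuntimeException("qpdf failed with exit status $status");
    }

    exec($dump, $fields);
    return $fields; // one array element per line of pdftk output
}

// Usage (placeholder paths):
// $fields = dumpFieldsViaQpdf('/var/www/forms/i-9.pdf', '/var/www/forms/un-i-9.pdf');
```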

using ffmpeg/ffprobe to create a waveform json using php

I have many ogg & opus files on my server and need to generate json-waveform numeric arrays on an as-needed basis (example below).
Recently I discovered the Node-based waveform-util, which uses ffmpeg/ffprobe to render a JSON waveform, and it works perfectly. I am undecided whether having a Node process constantly running is the optimal solution to my issue.
Since ffmpeg seems to be able to handle anything I can throw at it, I wish to stick with an ffmpeg solution.
I have three questions:
1) Is there a PHP equivalent? I have found a couple that generate PNG images, but not one that generates JSON-waveform numeric arrays.
2) Are there any significant advantages to going with the Node-based solution rather than a PHP-based solution (assuming there is a PHP-based solution)?
3) Is there a way to generate a JSON waveform using the ffmpeg/ffprobe CLI? I saw all the -show_ options (-show_data, -show_streams, -show_frames), but nothing looked like it produced what I am looking for.
the json-waveform needs to be in this format:
[ 0.0002, 0.001, 0.15, 0.14, 0.356 .... ]
thank you all.
It sounds as if there is a conflict with the way my server handles CGI. I am using Virtualmin with the following setting:
PHP script execution mode: CGI wrapper (run as virtual server owner)
After much research, it appears that using pure Node.js is more lightweight than calling a shell executable. I had some success merely by adding a shebang line to call node, but keeping a Node.js script memory-resident is probably the way to go.
For anyone in the future looking to do this with React Native (RN):
// convert the file to pcm
await RNFFmpeg.execute(`-y -i ${filepath} -acodec pcm_s16le -f s16le -ac 1 -ar 1000 ${pcmPath}`)
// you're reading that right: we read the file as base64 only to decode the base64, because RN doesn't let us read raw data
const pcmFile = Buffer.from(await RNFS.readFile(pcmPath, 'base64'), 'base64')
let pcmData = []
// byte conversion pulled off Stack Overflow
for (var i = 0; i < pcmFile.length; i = i + 2) {
    var byteA = pcmFile[i];
    var byteB = pcmFile[i + 1];
    var sign = byteB & (1 << 7);
    var val = (((byteA & 0xFF) | (byteB & 0xFF) << 8)); // combine the two bytes into a 16-bit value
    if (sign) { // if negative
        val = 0xFFFF0000 | val; // fill in most significant bits with 1's
    }
    pcmData.push(val)
}
// pcmData is the resulting waveform array
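The same parsing step can be sketched in PHP, assuming ffmpeg has already written raw mono `s16le` PCM with the options shown above. `pcmToWaveform` is a made-up name, and scaling each sample to an absolute value between 0 and 1 is my assumption about the requested format, not something the question specifies.

```php
<?php
// Sketch: turn raw signed 16-bit little-endian PCM into a JSON-ready array.

function pcmToWaveform(string $pcm): array
{
    // Ignore a dangling odd byte so only whole 16-bit samples are read.
    $pcm = substr($pcm, 0, strlen($pcm) & ~1);
    if ($pcm === '') {
        return [];
    }

    $waveform = [];
    // "v" reads unsigned 16-bit little-endian values regardless of platform.
    foreach (unpack('v*', $pcm) as $unsigned) {
        // Reinterpret as signed: values from 0x8000 upward are negative.
        $signed = $unsigned >= 0x8000 ? $unsigned - 0x10000 : $unsigned;
        // Assumed normalisation: absolute amplitude scaled to 0..1.
        $waveform[] = round(abs($signed) / 32768, 4);
    }
    return $waveform;
}

// Usage (placeholder path):
// echo json_encode(pcmToWaveform(file_get_contents('/tmp/audio.pcm')));
```

With `-ar 1000` this yields 1000 values per second of audio, so longer files may need further downsampling before being sent to a browser.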

read and search in pdf file

I am trying to get the text from a PDF file with this code, but it returns encoded text like the output below:
$fp = fopen($filename, "r");
echo $content = fread($fp, filesize($filename));
fclose($fp);
%PDF-1.3 3 0 obj <> endobj 4 0 obj <> stream [compressed binary data rendered as garbled characters]
I have no idea how to get the text and match it against user input.
Thanks a lot...
Take a look at this: http://nl3.php.net/manual/en/ref.pdf.php
Edit: see also http://davidwalsh.name/read-pdf-doc-file-php
Have you tried PDFlib?
A PDF is a structured and compressed file format containing a number of resources, such as plain text and binary data (images, fonts, etc.). Compression is optional. The main problem with attempting to pull text strings out of the PDF is that you don't know if the text structure was maintained during conversion. Some programs do a good job maintaining words/sentences as a string, while others may break things up in a way that makes the raw text from the PDF source unreadable. The source document and PDF rendering app matter in this case.
Before we get into the details of parsing text from a PDF you should just take a quick look around the web. Unless you want the experience there's no need to reinvent the wheel.
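To make the fragmentation point concrete, here is a small sketch that pulls literal strings out of an already decompressed content stream by looking for the `Tj` and `TJ` text-showing operators. `extractShownText` is a made-up name, and the approach only works for simple font encodings: hex strings, custom encodings, CID fonts, and strings containing a literal `]` are not handled.

```php
<?php
// Sketch: collect the literal strings shown by "(...) Tj" and "[...] TJ".

function extractShownText(string $content): string
{
    $pieces = [];
    $operators = '/\((?:\\\\.|[^\\\\()])*\)\s*Tj|\[(?:[^\]\\\\]|\\\\.)*\]\s*TJ/s';
    preg_match_all($operators, $content, $ops);

    foreach ($ops[0] as $op) {
        // A TJ array may split one word into several strings with kerning
        // numbers between them, so the fragments are glued back together.
        preg_match_all('/\(((?:\\\\.|[^\\\\()])*)\)/s', $op, $strings);
        $joined = implode('', $strings[1]);
        // Undo the escapes \( \) and \\ inside literal strings.
        $pieces[] = preg_replace('/\\\\([()\\\\])/', '$1', $joined);
    }
    return implode(' ', $pieces);
}

// Usage: "World" is stored as two fragments with a kerning adjustment.
// echo extractShownText('BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld)] TJ ET');
```

Searching the raw file for "World" would fail on a stream like the one in the usage line, because the word is stored as `(Wor) -20 (ld)`.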

Finding the "PDF Producer" or source application of a PDF

Is there a quick-and-dirty way to access the "producer" metadata of a PDF file, using Regex or XML parsing, from a PHP application?
The technique does not have to be infallible. The objective is to prompt the user if they upload a PDF created using TeX.
You can hack the value out by looking for the Producer or Creator entry, but it might be encoded rather than available as ASCII.
On the command line, the following outputs the matching lines:
$ strings my.pdf | grep TeX
Producer (pdfTeX-1.40.10)
/Creator (TeX)
/PTEX.Fullbanner (This is pdfTeX, Version 3.1415926-1.40.10-2.2 (TeX Live 2009) kpathsea version 5.0.0)
You might do something similar in PHP, see Read plain text from binary file with PHP.
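A quick-and-dirty PHP version of the same idea might look like this. The function names are made up, and the sketch only finds a `/Producer` entry stored as an unencoded literal string; it returns null when the metadata sits in a compressed object stream or is written as a hex or UTF-16 string.

```php
<?php
// Sketch: regex lookup of the /Producer entry in the raw PDF bytes.

function findPdfProducer(string $pdfBytes): ?string
{
    // Matches "/Producer (...)" including escaped characters in the string.
    if (preg_match('/\/Producer\s*\(((?:\\\\.|[^\\\\()])*)\)/', $pdfBytes, $m)) {
        return $m[1];
    }
    return null;
}

function looksLikeTeX(string $pdfBytes): bool
{
    $producer = findPdfProducer($pdfBytes);
    // Case-insensitive substring test; deliberately loose, not infallible.
    return $producer !== null && stripos($producer, 'tex') !== false;
}

// Usage (placeholder upload field name):
// if (looksLikeTeX(file_get_contents($_FILES['pdf']['tmp_name']))) { /* prompt the user */ }
```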
