Finding the "PDF Producer" or source application of a PDF - php

Is there a quick-and-dirty way to access the "producer" metadata of a PDF file, using Regex or XML parsing, from a PHP application?
The technique does not have to be infallible. The objective is to prompt the user if they upload a PDF created using TeX.

You can hack the value out by looking for the producer or creator tag, but it might be encoded or compressed rather than available as plain ASCII.

On the command line, the following outputs the matching lines:
$ strings my.pdf | grep TeX
Producer (pdfTeX-1.40.10)
/Creator (TeX)
/PTEX.Fullbanner (This is pdfTeX, Version 3.1415926-1.40.10-2.2 (TeX Live 2009) kpathsea version 5.0.0)
You might do something similar in PHP; see Read plain text from binary file with PHP.
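A minimal PHP sketch of the same idea, assuming the metadata is stored uncompressed as plain literal strings (as in the strings output above). The function name looksLikeTexPdf and the regular expression are illustrative only and not part of any library. PDFs that keep their metadata in compressed streams or encode it as UTF-16 will slip through, which should be acceptable for a quick-and-dirty check.

```php
<?php
// Hypothetical helper: returns true when the raw PDF bytes contain a
// /Producer, /Creator or /PTEX.Fullbanner entry whose literal string
// mentions "TeX" (this also covers pdfTeX, LaTeX, XeTeX, ...).
function looksLikeTexPdf(string $pdfBytes): bool
{
    // Matches e.g. "/Producer (pdfTeX-1.40.10)" or "/Creator (TeX)".
    // No "u" modifier: the input is binary, so it is matched byte by byte.
    $pattern = '~/(?:Producer|Creator|PTEX\.Fullbanner)\s*\([^)]*TeX~';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $pdfBytes) === 1;
}
```

In an upload handler this could be called as looksLikeTexPdf(file_get_contents($path)), where $path is the temporary path of the uploaded file.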

Related

PHP - Check if pdf contains given text - TcpdfFpdi / pdftk / fpdi

I have a PDF document and I want to check whether a specific text occurs in it (tags that I put in while generating the PDF). However, using these libraries (TcpdfFpdi, pdftk or FPDI), I couldn't figure out whether it's possible or how to do it.
$str = "{hello}";
$pdf = new TcpdfFpdi();
$pdf->setSourceFile($filePath);
$pdf->searchForText($str); // something like this which returns boolean
If I try without any library to dd(file_get_contents($filePath)), it returns a very long output that doesn't seem to contain the text I want, so I think it's better to use one of those libraries.
Just an idea…
It's not an actual PHP solution, but you could use a tool like pdftotext, which I know from this post (where a PDF file is converted into a string to count its words): https://superuser.com/a/221367/535203
You can install it, play around with the command, and call it from within your PHP application.
As far as I remember (it's been a long time since I used pdftotext), the output text is not exactly the PDF's content, but for searching a few tags in it, it's at least worth a try.
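Calling pdftotext from PHP could look roughly like this. It is only a sketch: it assumes a POSIX shell, that shell_exec() is enabled, and that the pdftotext binary is installed and on the PATH. The function name pdfContainsText is made up for illustration.

```php
<?php
// Hypothetical helper: true/false when text could be extracted and searched,
// null when pdftotext produced no output (tool missing, unreadable file,
// or a PDF without extractable text).
function pdfContainsText(string $pdfPath, string $needle): ?bool
{
    // "-" as the output file makes pdftotext write the text to stdout;
    // error messages are discarded.
    $cmd = 'pdftotext ' . escapeshellarg($pdfPath) . ' - 2>/dev/null';
    $text = shell_exec($cmd);

    // shell_exec() returns null (or false) when there is nothing to read.
    if (!is_string($text) || $text === '') {
        return null;
    }

    return strpos($text, $needle) !== false;
}
```

With the variables from the question, the check would then be pdfContainsText($filePath, $str).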

File reading from PHP using python script

Okay, this is driving me crazy. I have a small file. Here is the dropbox link https://www.dropbox.com/s/74nde57f07jj0zj/transcript.txt?dl=0.
If I try to read the content of the file using python f.read(), I can easily read it. But, if I try to run the same python program using php shell_exec(), the file read fails. This is the error I get.
Traceback (most recent call last):
  File "/var/www/python_code.py", line 2, in <module>
    transcript = f.read()
  File "/opt/anaconda/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 107: ordinal not in range(128)
I have checked all the permission issues and there is no problem with that.
Can anyone kindly shed some light?
Here is my python code.
f = open('./transcript/transcript.txt', 'r')
transcript = f.read()
print(transcript)
Here is my PHP code.
$output = shell_exec("/opt/anaconda/bin/python /var/www/python_code.py");
Thank you!
EDIT: I think the problem is in the file content. If I replace the content with simple 'I eat rice', then I can read the content from php. But the current content cannot be read. Still don't know why.
The problem appears to be that your file contains non-ASCII characters, but you're trying to read it as ASCII text.
Either it is text, but is in some encoding or other that you haven't told us (probably UTF-8, Latin-1, or cp1252, but there are countless other possibilities), or it's not text at all, but rather arbitrary binary data.
When you open a text file without specifying an encoding, Python has to guess. When you're running from inside the terminal or whatever IDE you use, presumably, it's guessing the same encoding that you used in creating the file, and you're getting lucky. But when you're running from PHP, Python doesn't have as much information, so it's just guessing ASCII, which means it fails to read the file because the file has bytes that aren't valid as ASCII.
If you want to understand how Python guesses, see the docs for open, but briefly: it calls locale.getpreferredencoding(), which, at least on non-Windows platforms, reads it from the locale settings in the environment. On a typical Linux system that's not new enough to be based on systemd but not too old, the user's shell will be set up for a UTF-8 locale, but services will be set up for the C locale. If all of that makes sense to you, you may see a way to work around your problem. If it all sounds like gobbledegook, just ignore it.
If the file is meant to be text, then the right solution is to just pass the encoding to the open call. For example, if the file is UTF-8, do this:
f = open('./transcript/transcript.txt', 'r', encoding='utf-8')
Then Python doesn't have to guess.
If, on the other hand, the file is arbitrary binary data, then don't open it in text mode:
f = open('./transcript/transcript.txt', 'rb')
In this case, of course, you'll get bytes instead of str every time you read from it, and print is just going to print something ugly like b'aq\x9bz' that makes no sense; you'll have to figure out what you actually want to do with the bytes instead of printing them as a bytes object.
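The locale-based workaround hinted at above can also be sketched from the PHP side: set a UTF-8 locale for the child process so that Python's guess changes. This is only an illustration under assumptions. The locale name en_US.UTF-8 is a guess and must actually be installed on the server (locale -a lists what is available), the quoting shown is what escapeshellarg() produces on Linux, and buildPythonCommand is a hypothetical helper. Passing encoding= to open(), as described above, remains the more robust fix.

```php
<?php
// Hypothetical helper: builds a shell command that runs a Python script
// with LC_ALL set only for that one child process.
function buildPythonCommand(string $python, string $script, string $locale = 'en_US.UTF-8'): string
{
    // A "VAR=value command" prefix sets the variable for that command only.
    return 'LC_ALL=' . escapeshellarg($locale) . ' '
        . escapeshellarg($python) . ' '
        . escapeshellarg($script);
}

// Paths taken from the question.
$cmd = buildPythonCommand('/opt/anaconda/bin/python', '/var/www/python_code.py');

// $output = shell_exec($cmd); // left commented out: requires the script to exist
```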

PHP library to generate code diff (github style)?

I'm looking for a free PHP library that can generate code diff HTML, basically just like GitHub's code diff pages.
I've been searching all around and can't find anything. Does anyone know of anything out there that does what I'm looking for?
It looks like I found what I'm looking for after doing more Google searches with different wording.
php-diff seems to do exactly what I want: just a PHP function that accepts two strings and generates all the HTML to display the diff in a web page.
To add my two cents here...
Unfortunately, there are no really good diff libraries for displaying/generating diffs in PHP. That said, I recently did find a circuitous way to do this using PHP. The solution involved:
A pure JavaScript approach for rendering the Diff
Shelling out to git with PHP to generate the Diff to render
First, there is an excellent JavaScript library for rendering GitHub-style diffs called diff2html. This renders diffs very cleanly and with modern styling. However, diff2html requires a true git diff to render, as it is intended to literally render git diffs, just like GitHub.
If we let diff2html handle the rendering of the diff, then all we have left to do is create the git diff to have it render.
To do that in PHP, you can shell out to the local git binary running on the server. You can use git to calculate a diff on two arbitrary files using the --no-index option. You can also specify how many lines before/after the found diffs to return with the -U option.
On the server it would look something like this:
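// Write both versions to unique temporary files (fixed names would collide between concurrent requests)
$leftFile = tempnam(sys_get_temp_dir(), 'diff');
$rightFile = tempnam(sys_get_temp_dir(), 'diff');
file_put_contents($leftFile, $leftData);
file_put_contents($rightFile, $rightData);
// Generate git diff and save shell output; shell_exec() returns null when there is no output
$diff = (string) shell_exec('git diff -U1000 --no-index ' . escapeshellarg($leftFile) . ' ' . escapeshellarg($rightFile));
// Strip off first line of output
$firstNewline = strpos($diff, "\n");
$diff = $firstNewline === false ? '' : substr($diff, $firstNewline + 1);
// Delete the files we just created
unlink($leftFile);
unlink($rightFile);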
Then you need to get $diff back to the front-end. You should review the docs for diff2html but the end result will look something like this in JavaScript (assuming you pass $diff as diffString):
function renderDiff(el, diffString) {
var diff2htmlUi = new Diff2HtmlUI({diff: diffString});
diff2htmlUi.draw(el);
}
I think what you're looking for is xdiff.
xdiff extension enables you to create and apply patch files containing differences between different revisions of files.
This extension supports two modes of operation, on strings and on files, as well as two different patch formats, unified and binary. Unified patches are excellent for text files as they are human-readable and easy to review. For binary files like archives or images, binary patches are an adequate choice as they are binary safe and handle non-printable characters well.

Unconventional .ZIP handling and find/replace in PHP

I have a slightly unconventional task I am trying to accomplish with .ZIP archives in PHP. I have a zip archive used for an automation task (it's a startup package for Amazon EC2 instances) which contains a number of text and XML files. What I need to do is find/replace a few pieces of text within those files and output a Base64-encoded string (not write a new .zip file) using PHP on the fly.
I have no problem getting the file contents with file_get_contents(), Base64-encoding them with base64_encode(), or doing the find/replace. It's the unzipping and zipping to and from strings that I can't seem to figure out.
I would like to avoid unzipping the archive, copying the files, editing the files, writing a new .zip to disk, and then getting the contents and encoding that. I was hoping there might be a solution that looks more like this:
Get the contents of the zip file into a string.
$originalZipFile = file_get_contents('Path/To/ZipFile');
"Unzip" the data in that string, to a new string to expose the bits of text I want to find/replace.
$unzippedFile = someFunction($originalZipFile);
Find and replace bits of text.
$processedString = str_replace($find, $replace, $unzippedFile);
"Rezip" the processed string into a new string.
$rezippedFile = someOtherFunction($processedString);
Base64 encode the "rezipped" string.
$desiredOutputString = base64_encode($rezippedFile);
I have looked at the PHP ZipArchive class, but it doesn't seem to have the functions I'm looking for.
Any insights are greatly appreciated!
-Oliver
Well, I believe I found a pretty good solution to this. For any others looking for similar solutions, I would recommend looking at the ZipStream-PHP class by Paul Duncan.
With this class, you are able to dynamically write files, contents, and directories to a zip file, which is then streamed without writing a file to disk.
Pretty automagical.
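For comparison, here is a sketch of the find/replace steps from the question using PHP's bundled ZipArchive class instead of ZipStream-PHP. It is a different technique from the one recommended above: it edits a temporary copy of the archive on disk rather than working purely in memory. It assumes the zip extension is loaded and that every entry is text that is safe to run str_replace() over. The function name replaceInZipAsBase64 is hypothetical.

```php
<?php
// Hypothetical helper: applies find/replace pairs to every file entry of a
// zip archive and returns the resulting archive as a Base64 string.
function replaceInZipAsBase64(string $zipPath, array $replacements): string
{
    // Work on a temporary copy so the original archive stays untouched.
    $tmp = tempnam(sys_get_temp_dir(), 'zip');
    if ($tmp === false || !copy($zipPath, $tmp)) {
        throw new RuntimeException("Could not copy $zipPath");
    }

    $zip = new ZipArchive();
    if ($zip->open($tmp) !== true) {
        unlink($tmp);
        throw new RuntimeException("Could not open $zipPath as a zip archive");
    }

    // Collect the entry names first, so that replacing entries below
    // cannot interfere with the index-based loop.
    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }

    foreach ($names as $name) {
        if ($name === false || substr($name, -1) === '/') {
            continue; // skip unreadable names and directory entries
        }
        $contents = $zip->getFromName($name);
        if ($contents === false) {
            continue; // entry could not be read
        }
        $edited = str_replace(array_keys($replacements), array_values($replacements), $contents);
        // Adding under an existing name replaces that entry.
        $zip->addFromString($name, $edited);
    }

    $zip->close(); // writes the modified archive to $tmp

    $encoded = base64_encode(file_get_contents($tmp));
    unlink($tmp);

    return $encoded;
}
```

The $replacements argument maps each search string to its replacement, for example ['OLD_HOST' => 'new.example.com'] (placeholder values).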

convert txt or doc to pdf using php

Has anyone come across PHP code that converts text or doc files to PDF?
It has to follow the same format as the original txt or doc file, meaning the line feeds as well as new paragraphs...
Converting from DOC to PDF is possible using phpLiveDocx:
$phpLiveDocx = new Zend_Service_LiveDocx_MailMerge();
$phpLiveDocx->setUsername('username')
->setPassword('password');
$phpLiveDocx->setLocalTemplate('document.doc');
// necessary as of LiveDocx 1.2
$phpLiveDocx->assign('dummyFieldName', 'dummyFieldValue');
$phpLiveDocx->createDocument();
$document = $phpLiveDocx->retrieveDocument('pdf');
file_put_contents('document.pdf', $document);
unset($phpLiveDocx);
For text to PDF, you can use the pdf extension in PHP.
You can view the examples here.
Have a look at this SO question. Using OpenOffice in command line mode for conversions can be done, though you'd have to search a bit for the conversion macros. I'm not saying it's light-weight though :)
See HTML_ToPDF. It also works for text.
It has been a long time since I touched PHP, but if you can make web service calls from it then try this product. It provides excellent conversion fidelity. It also supports additional formats including Infopath, Excel, PowerPoint etc as well as Watermarking support.
Please note that I have worked on this product so the usual disclaimers apply.
