I want to make PDFs from external URLs searchable. I'm using pdftotext from XPDF.
It's working fine with PDFs already on my webspace, but I keep getting an error message when trying to use external PDFs instead. Specifically I get:
"Error: Couldn't open file 'https://www.vericoa.com/sandbox/test2.pdf' "
Here is my code
$path = 'https://www.vericoa.com/sandbox/test2.pdf';
echo shell_exec('pdftotext -enc UTF-8 '.$path.' pdf.txt 2>&1');
$file = file_get_contents('pdf.txt');
echo $file;
Is it even possible to extract text from external PDF sources? Are there any alternatives (I spent the last hours searching, but found nothing).
Thanks in advance
Matthias
You could perhaps try downloading the external URL in php, saving it to a file and passing that to the pdftotext script?
Related
I am creating .docx files from a template using PHPWord. It works fine but now I want to convert the generated file to PDF.
First I tried using tcpdf in combination with PHPWord
$wordPdf = \PhpOffice\PhpWord\IOFactory::load($filename.".docx");
\PhpOffice\PhpWord\Settings::setPdfRendererPath(dirname(__FILE__)."/../../Office/tcpdf");
\PhpOffice\PhpWord\Settings::setPdfRendererName('TCPDF');
$pdfWriter = \PhpOffice\PhpWord\IOFactory::createWriter($wordPdf , 'PDF');
if (file_exists($filename.".pdf")) unlink($filename.".pdf");
$pdfWriter->save($filename.".pdf");
but when I try to load the file to convert it to PDF I get the following exception while loading the file
Fatal error: Uncaught exception 'BadMethodCallException' with message 'Cannot add PreserveText in Section.'
After some research I found that some others also have this bug (phpWord - Cannot add PreserveText in Section)
EDIT
After trying around some more I found out, that the Exception only occurs when I have some mail merge fields in my document. Once I removed them the Exception does not come up anymore, but the converted PDF files look horrible. All style information are gone and I can't use the result, so the need for an alternative stays.
I thought about using another way to generate the PDF, but I could only find 4 ways:
Using OpenOffice - Impossible as I cannot install any software on the Server. Also going the way mentioned here did not work either as my hoster (Strato) uses SunOS as the OS and this needs Linux
Using phpdocx - I do not have any budget to pay for it and the demo cannot create PDF
Using PHPLiveDocx - This works, but has the limitation of 250 documents per day and 20 per hour and I have to convert arround 300 documents at once, maybe even multiple times a day
Using PHP-Digital-Format-Convert - The output looks better than with PHPWord and tcpdf, but still not usable as images are missing, and most (not all!) of the styles
Is there a 5th way to generate the PDF? Or is there any solution to make the generated PDF documents look nice?
I used Gears/pdf to convert the docx file generated by phpword to PDF:
$success = Gears\Pdf::convert(
'file_path/file_name.docx',
'file_path/file_name.pdf');
You're trying to unlink the PDF file before saving it, and you have also to unlink the DOCX document, not the PDF one.
Try this.
$pdfWriter = \PhpOffice\PhpWord\IOFactory::createWriter($wordPdf , 'PDF');
$pdfWriter->save($filename.".pdf");
unlink($wordPdf);
I don't think I'm correct..
You save the document as HTML content
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
After than you read the HTML file content and write the content as PDF file with the help of mPDF or tcPdf or fpdf.
Try this:
// get the name of the input PDF
$inputFile = "C:\\PHP\\Test1.docx";
// get the name of the output MS-WORD file
$outputFile = "C:\\PHP\\Test1.pdf";
try
{
$oLoader = new COM("easyPDF.Loader.8");
$oPrinter = $oLoader->LoadObject("easyPDF.Printer.8");
$oPrintJob = $oPrinter->PrintJob;
$oPrintJob->PrintOut ($inputFile, $outputFile);
print "Success";
}
catch(com_exception $e)
{
Print "error code".$e->getcode(). "\n";
print $e->getMessage();
}
I have a function that uploads a file into a web storage and prior to saving the file on the storage system if the file is a pdf file i would like to determine how many pages a pdf file has.
Currently i have the following:
$pdftext = file_get_contents($path);
$num = preg_match_all("/\/Page\W/", $pdftext, $dummy);
return $num;
Where $path is the temporary path that i use with fopen to open the document
This function works at times but is not reliable. I know theres also this function
exec('/usr/bin/pdfinfo '.$pdf_file.' | awk \'/Pages/ {print $2}\'', $output);
But this requires the file to donwloaded on the server. Any ideas or suggestions to accomplish this?
PHP is a server-side language, meaning all processing happens on your server. There's no way for PHP to determine details of a file on the client side, it has no knowledge of it neither the required access to it.
So the answer to your question as it is now is: It's not possible. But you probably have a goal in mind why you want to check this, sharing this goal might help to get more constructive answers/suggestions.
As Oldskool already explained this is not possible with PHP on the client side. You would have to upload the PDF file to the server and then determine the amount of pages. There are libraries and command line tools that could accomplish this.
In case you don't want to upload the PDF file to the server (which seems to be the case here) you could use the pdf.js library. Now the client is able to determine the amount of pages in a PDF document on its own.
PDFJS.getDocument(data).then(function (doc) {
var numPages = doc.numPages;
}
There are other libraries as well but I'm not certain about their browser support (http://www.electronmedia.in/wp/pdf-page-count-javascript/)
Now you just submit the amount of pages from javascript to your php file that needs this information. In order to achive this you simply use ajax. In case you don't know ajax, just google it there are enough examples out there.
As a side note; Always remember to not trust the client. The client is able to modify the page count and send a completely different one.
For those of you running linux servers this actually is possible. You need the pdfinfo extension installed and using the function
$pages = exec('/usr/bin/pdfinfo '.$pdf_file.' | awk \'/Pages/ {print $2}\'', $output);
outputs the correct page number where $pdf_file is the temporary path on the server upon upload.
The reason it wasnt working for me was because i didnt have the PDFinfo installed.
I have a DB system built in PHP/MySql. I'm fairly new at this. The system allows the user to upload an invoice. Others give permission to pay the invoice. The accounting person uploads the check. After check is uploaded, it generates a PDF as a cover, then uses PDFTK (using Ben Squire's PDFTK-PHP-Library) to combine all of the files together and present the user with a single PDF to download.
Some users upload PDF files which cause PDFTK to hang indefinitely when it tries to combine the PDF with others (but most of the time it works fine). No returned error, just hangs. In order to get back onto the sytem, user must clear cache and re-log in. There are no error messages logged by the server, it just freezes. The only difference I can find in the files that do or do not work in looking at them with Acrobat is that the bad files are legal sized (8.5 x 14) ... but if I create my own legal sized file and try that, it works fine.
Using Putty I've gone to command line and replicated the same problem, PDFTK can't read the file, it hangs on the command line as well. I tried using PDFMerge which uses FPDF to combine the files and get an error with the file as well (The error I get back from this is: FPDF error: Unable to find object (4, 0) at expected location). On the command line I was able to use ImageMagick to convert PDF to JPG, but it gives me an error: "Warning: File has an invalid xref entry: 2. Rebuilding xref table." and then it converts it to a jpg but gives a few other less helpful warnings.
If I could get PHP to check the PDF file to determine if is valid without hanging the system, I could use ImageMagick to convert the file and then convert it back to a PDF, but I don't want to do this to all files. How can I get it to check the validity of the file when uploaded to see if it needs to be converted without causing the system to hang?
Here is a link to a file that is causing problems: http://www.cssc-testing.org/accounting/school_9/20130604-a1atransportation-1.pdf
Thanks in advance for any guidance you can offer!
My Code (which I'm guessing is not very clean, as I'm new):
$pdftk = new pdftk();
if($create_cover) { $pdftk->setInputFile(array("filename" => $cover_page['server'])); }
// Load a list of attachments
$sql = "SELECT * FROM actg_attachments WHERE trans_id = {$trans_id}";
$attachments = Attachment::find_by_sql($sql);
foreach($attachments as $attachment) {
// Check if the file exists from the attachments
$attachment->set_variables();
$file = $attachment->abs_path . DS . $attachment->filename;
if(file_exists($file)){
// Use the pdftk tool to attach the documents to this PDF
$pdftk->setInputFile(array("filename" => $file));
}
}
$pdftk->setOutputFile($save_file);
$pdftk->_renderPdf();
the $pdftk class it is calling is from: https://github.com/bensquire/php-pdtfk-toolkit
You could possibly use Ghostscript using exec() to check the file.
The non-accepted answer here may help:
How can you find a problem with a programmatically generated PDF?
I wont say this is an appropriate/best fix, but it may resolve your problem,
In: pdf_parser.php, comment out the line:
$this->error("Unable to find object ({$obj_spec[1]}, {$obj_spec[2]}) at expected location");
It should be near line 544.
You'll also likely need to replace:
if (!is_array($kids))
$this->error('Cannot find /Kids in current /Page-Dictionary');
with:
if (!is_array($kids)){
// $this->error('Cannot find /Kids in current /Page-Dictionary');
return;
}
in the fpdi_pdf_parser.php file
Hope that helps. It worked for me.
I can convert html page to pdf and send it via email with out any problem, but I am facing trouble with converting php file to pdf.
is it possible to convert php file to pdf using mpdf or do I need to use some other php class for this?
Thanks!
The option seems to be like, first convert the output of the php file to html file, save it and pass that file to mpdf.
Thought my solution may seem other way round but it worked well for me.
Or otherwise this Link
<?php
$file = '/home/user/Desktop/myfile.html';
$result = file_get_contents("url/of/ur/page");
echo $result; //view source now
file_put_contents($file, $result);
?>
now u can pass this file to mpdf. Realie sorie bt i havnt used mpdf till date. May be this solution works for you.
Also, other option is curl.
So, I have a little PHP code that uses Kaywa to generate and display a QR code:
echo "<img src='http://qrcode.kaywa.com/img.php?s=10&d=$qr_url' alt='QR code' />";
Easy peasy. But for the sake of having a backup, I'd like to save this image to my server, maybe in "myserver/qrbackups". I know how to make PHP upload a file from a form, but can I do from a retrieved image URL?
See file_get_contents.
$data = file_get_contents("http://qrcode.kaywa.com/img.php?s=10&d=$qr_url");
$saved = file_put_contents('/path/to/myserver/qrbackups/the-code.png', $data);
Keep in mind /path/to/myserver/qrbackups/the-code.png should be a unique file name for each individual QR code.
You can use cURL to programmatically access URLs.
http://us3.php.net/manual/en/curl.examples-basic.php
You can try to use curl or file_get_contents to pull the file from the server.
For example:
http://www.phpriot.com/articles/download-with-curl-and-php
And after getting the file just display the one you got from your server or remote.
copy('http://domain.com/path', '/tmp/file.jpeg');
should work for you. And, as others have pointed you, cURL will work too.
If you have the GD extension on your server, you can use a library like PHP QR Code to eliminate the dependency on Kaywa. Usage requires only one line of code wherever you want to generate a QR code.