Laravel - PDF: Cannot encode text from PDF to Text


I am trying to upload a PDF file and convert it to text. Some files convert cleanly, but others have problems, as shown in the screenshots. There are two different cases (three screenshots, but two of them show the same problem). The first and second show output that I think is not properly encoded (I am not sure). The third captures only about half of the PDF: the main content I need comes after the point where it stops.
How can I fix this?
use App\FilePdf;
use Spatie\PdfToText\Pdf;
$name=$file->getClientOriginalName();
$file->move(public_path().'/pdftotext/', $name);
$path = public_path('/pdftotext/'. $name);
$reader = new \Asika\Pdf2text;
$output = $reader->decode($path);
$data[] = $name;
$output = str_replace(array("\n", "\r"), '', trim($output));
dd($output);
If there is an alternative solution to this problem, please suggest it.
Thanks, I appreciate your time.

Use the following call to get the text of the PDF file as a string:
use Spatie\PdfToText\Pdf;
$pdf_string = Pdf::getText(public_path() . "/<foldername>/<pdffilename>");

Related

Trim a string purely to text, without leaving any whitespace in PHP?

I have an image upload script that has worked fine so far. It runs with jQuery fileupload. The PHP script generates a new name for the uploaded image and returns it through exit($imgname);. Strangely, I always get a response with a lot of whitespace, as you can see in the screenshot.
My whole website uses jQuery, and I thought about using $.trim() to reduce the result to plain text, but I don't know if this is a good idea, since I am not sure it works in every common browser.
Additional:
The strangest thing is that it used to work fine without any whitespace. Today I uploaded something and suddenly it behaves like this...
PHP:
$upload = $image->upload();
$imgname = $image->getName();
$imgmime = $image->getMime();
$fullimgname = $imgname . "." . $imgmime;
if($upload){
// POST TO DATABASE ETC.
exit($fullimgname);
}

The whitespace may come from somewhere else in the PHP file...
Look for whitespace outside the <?php and ?> tags.
Like this, for example:
// <-- 4 tabs on this line
<?php
$upload = $image->upload();
$imgname = $image->getName();
$imgmime = $image->getMime();
$fullimgname = $imgname . "." . $imgmime;
if($upload){
// POST TO DATABASE ETC.
exit($fullimgname);
}

Try a regex, because it can take the full Unicode character set into account:
$imgname = preg_replace('/^\s+/u', "", $imgname);
Although it looks like ltrim() should be enough.
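As a minimal sketch of the trimming idea above: the helper name clean_response() is hypothetical and not part of the original script. It only helps when the whitespace is inside the value itself; whitespace emitted outside the PHP tags is sent separately and has to be removed at its source.

```php
<?php
// Hypothetical helper (not part of the original script): removes leading and
// trailing whitespace from the value the upload script is about to return.
function clean_response(string $raw): string
{
    // With the /u modifier the subject is treated as UTF-8. preg_replace()
    // returns null on invalid UTF-8, so fall back to plain trim() in that case.
    $clean = preg_replace('/^\s+|\s+$/u', '', $raw);
    return $clean ?? trim($raw);
}

// Example: a file name surrounded by stray tabs and line breaks.
echo clean_response("\t\t\t\t \nimage_123.png\r\n"); // prints "image_123.png"
```

In the upload script this would be used as exit(clean_response($fullimgname));.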

Add text to another file on top of it without overwriting anything in php

I want to add 3 to 4 lines to all the files of my website. I am able to modify all PHP files:
$source = file_get_contents($path);
$source = str_replace($this->oldTxt, '', $source);
$source = preg_replace('#\<\?php#',"<?php\n".$this->newTxt,$source,1);
file_put_contents($path, $source);
echo $path."\n";
With this code I can modify PHP files because they start with <?php, but how can I add the same text at the top of JS and other text files?

If you want to add the text at the top of all files:
$source = file_get_contents($path);
file_put_contents($path, $this->newTxt . $source);
Is there any special reason you insert the text after the <?php tag? The comment syntax is different in PHP, JavaScript and HTML, so you probably need to adjust the text you include based on the file type.
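One way to adjust the text per file type could look like the sketch below. The function names comment_for() and prepend_banner() and the extension-to-syntax mapping are assumptions for illustration, not something from the question.

```php
<?php
// Hypothetical helper: wraps $text in the comment syntax for the file type.
function comment_for(string $path, string $text): string
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    switch ($ext) {
        case 'html':
        case 'htm':
            return "<!-- " . $text . " -->\n";
        case 'js':
        case 'css':
            return "/* " . $text . " */\n";
        default:
            // Plain text and unknown types get the text unchanged.
            return $text . "\n";
    }
}

// Hypothetical helper: puts the banner in front of the existing content,
// so nothing already in the file is overwritten.
function prepend_banner(string $path, string $text): void
{
    $source = file_get_contents($path);
    if ($source === false) {
        throw new RuntimeException("Cannot read $path");
    }
    file_put_contents($path, comment_for($path, $text) . $source);
}
```

PHP files are left out of the mapping on purpose: there the text has to go after the opening <?php tag, as the code in the question already does.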

PHP Creating dynamic pdf

I'm trying to create a dynamic PDF using PHP.
I can load the PDF once the form is submitted, but as soon as I try to edit the PDF it fails to load.
I shortened the code to keep it straightforward.
PDF Structure
Your Name is : <<NAME>>
PHP Dynamic Script
set_time_limit(180);
//Create Variable Names
$name = $_POST['name'];
function pdf_replace($pattern, $replacement, $string) {
$len = strlen($pattern);
$regexp = '';
for($i = 0; $i <$len; $i++) {
$regexp .= $pattern[$i];
if($i < $len - 1) {
$regexp .= "(\)\-{0,1}[0-9]*\(){0,1}";
}
}
return ereg_replace ($regexp, $replacement, $string);
}
header('Content-Disposition: filename=cert.pdf');
header('Content-type: application/pdf');
//$date = date('F d, Y');
$filename = 'testform.pdf';
$fp = fopen($filename, 'r');
$output = fread($fp, filesize($filename));
fclose($fp);
//Replace the holders
$output = pdf_replace('<<NAME>>', $name, $output);
echo $output;
If I comment out the replacement step it loads the form fine, but as soon as I run the function to replace the placeholder it fails to load. Has anyone done something like this before?

I've tried your code, and I assume one of the following causes the problem:
When you call pdf_replace(), PHP emits a Deprecated notice for the ereg_replace() function. The notice text ends up in the output, which breaks the PDF structure and causes the PDF to fail to load. This function has been deprecated since PHP 5.3.0. The simple solution is to use preg_replace() instead.
function pdf_replace($pattern, $replacement, $string) {
return preg_replace('/'.preg_quote($pattern,'/').'/', $replacement, $string);
}
If you can't do that, you can either edit the error_reporting parameter in php.ini, adding "^ E_DEPRECATED" to your current config value to disable Deprecated notices, or call error_reporting() at the beginning of your script with an appropriate value.
I have not seen the PDF you use, but some PDF generators encode the PDF source, which makes it hard to find text there. For example, I tried the "Print to PDF" feature on a Mac and could not find plain text in the source. So either fopen or ereg_replace can complain about a wrong file format. In that case you should use a library that handles PDF in a more clever way. I prefer FPDF, but there are plenty of such libraries.
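The error_reporting() option mentioned above could be sketched as follows. This assumes the script still runs on a PHP version that ships ereg_replace(); the function was removed in PHP 7.0, so on newer versions switching to preg_replace() is the only option.

```php
<?php
// Report every error level except deprecation notices.
error_reporting(E_ALL & ~E_DEPRECATED);

// Keep whatever is still reported out of the response body, so stray
// messages cannot end up inside the PDF bytes; log them instead.
ini_set('display_errors', '0');
ini_set('log_errors', '1');
```

These lines belong at the very top of the script, before any header() call or output.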

I'm a web designer and I use ezPDF to create PDF files for reports and to view them in any browser. Here's a page with a tutorial that I found: http://www.weberdev.com/get_example.php3?ExampleID=4804
I hope this is helpful :)

Your code seems to have two problems.
First, ereg_replace() is a deprecated function. Because of this the PHP script emits an error, and the error message breaks the PDF structure. Change the function to preg_replace() and it should work.
Second, I tried your code with a sample PDF form. It seems that ereg_replace() cannot process some characters. In the PDF form that I used, the function truncates the string (the PDF form data) when it meets a specific character, namely §. That's why the code will not work even if you suppress the error using error_reporting(E_ALL ^ E_DEPRECATED);.
So you had better go for preg_replace():
<?php
set_time_limit(180);
//Create Variable Names
$name = $_POST['name'];
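function pdf_replace($pattern, $replacement, $string) {
$len = strlen($pattern);
$regexp = '';
for($i = 0; $i <$len; $i++) {
// Quote each character so it is matched literally inside the regex
$regexp .= preg_quote($pattern[$i], '/');
if($i < $len - 1) {
// Allow an optional break such as ")-12(" between two characters
$regexp .= "(\)\-{0,1}[0-9]*\(){0,1}";
}
}
// Unlike ereg_replace(), preg_replace() requires pattern delimiters
return preg_replace('/' . $regexp . '/', $replacement, $string);
}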
header('Content-Disposition: filename=cert.pdf');
header('Content-type: application/pdf');
//$date = date('F d, Y');
$filename = 'testform.pdf';
$fp = fopen($filename, 'r');
$output = fread($fp, filesize($filename));
//fclose($fp);
//Replace the holders
$output = pdf_replace('<<NAME>>', $name, $output);
echo $output;
?>

It is not possible to replace a text string in a PDF unless the original PDF document meets very specific requirements.
Some comments on such a project:
A PDF document uses byte offsets to allow fast access to specific objects within the document. Changing a string invalidates these offsets, and the PDF document can then be treated as damaged.
Most content streams in a PDF document are compressed, so the string you are searching for may be inside one of these streams and not "visible".
A string you "see" after a PDF is rendered does not have to be the same as the one in the PDF source.
The replaced/new string may use characters that are not available in the font used, or that are mapped to other characters by a separate encoding.
And there are many more things to consider...
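The first two points can be illustrated with a small sketch. Both function names are hypothetical. The approach assumes the placeholder is stored uncompressed and unsplit in the file, and it does nothing about the font and encoding issues listed above.

```php
<?php
// Hypothetical helper: reports whether the placeholder is visible in the raw
// bytes and whether the file declares Flate-compressed streams.
function inspect_pdf_source(string $raw, string $placeholder): array
{
    return [
        'placeholder_visible' => strpos($raw, $placeholder) !== false,
        'has_flate_streams'   => strpos($raw, '/FlateDecode') !== false,
    ];
}

// Hypothetical helper: replaces the placeholder without changing the file
// length, so the byte offsets of all later objects stay the same.
function replace_keeping_length(string $raw, string $placeholder, string $value): string
{
    if (strlen($value) > strlen($placeholder)) {
        throw new InvalidArgumentException('Replacement is longer than the placeholder.');
    }
    // Pad the value with spaces up to the placeholder's length.
    return str_replace($placeholder, str_pad($value, strlen($placeholder)), $raw);
}
```

If placeholder_visible is false, a plain string replacement cannot work on that file, and a PDF library is the safer route.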

Why are you not using str_replace?
Please check whether this code works for you (assuming you can see the words you want to replace in a simple text editor):
// hide all errors so there will be no errors output which will break your PDF file
error_reporting(0);
ini_set("display_errors", 0);
//Create Variable Names
$name = $_POST['name'];
$filename = 'testform.pdf';
$fp = fopen($filename, 'r');
$output = fread($fp, filesize($filename));
fclose($fp);
//Replace the holders
$output = str_replace('<<NAME>>', $name, $output);
// I added the time just to avoid browser caching of the PDF file
header('Content-Disposition: filename=cert'.time().'.pdf');
header('Content-type: application/pdf');
echo $output;
There are many PHP projects for generating dynamic PDF files, so why not use them instead of this regular expression? You will probably have designs to add to the PDF in the future, such as tables and logos, and the easiest way is to convert HTML to PDF.
If you are looking for performance and output that looks almost 100% like your HTML and CSS, you should use wkhtmltopdf. It is free and easy to use, and it runs on your server, but it is not PHP: it is an executable that you have to run from your PHP code.
http://wkhtmltopdf.org/
Another alternative is TCPDF, which is pure PHP, unlike the previous one, which is an executable:
http://www.tcpdf.org/
I have used both and prefer wkhtmltopdf because it is faster and more precise with HTML and CSS.

Generating different content when creating picture on different machines

I want to create a picture from a passed binary string:
$fileName = uniqid().".".$imgType;
$fileName = "../tmp/".$fileName;
$f = fopen($fileName,'wb');
$picture = mb_convert_encoding($picture, "UUENCODE", "UTF-8");
fwrite($f, $picture);
fclose($f);
This works quite well on one machine with PHP 5.3.10-1ubuntu3.4: the picture is created properly. If I try it on a different machine with PHP 5.3.19, the output is very strange. For example, if you open the file with less, you will find \0 instead of the desired ^# characters.
Why does this happen?
The binary string is part of a POST request from a website using HTML5 FormData, encoded with UTF-8 in both cases.

$picture = convert_uudecode($picture);
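A minimal sketch of that suggestion, assuming the incoming string really is uuencoded. The helper name save_uuencoded_picture() is hypothetical.

```php
<?php
// Hypothetical helper: decodes a uuencoded string and writes the raw bytes.
function save_uuencoded_picture(string $encoded, string $path): int
{
    $binary = convert_uudecode($encoded);
    if ($binary === false) {
        throw new RuntimeException('Input is not valid uuencoded data.');
    }
    // file_put_contents() writes the bytes unchanged, NUL bytes included.
    $written = file_put_contents($path, $binary);
    if ($written === false) {
        throw new RuntimeException("Cannot write $path");
    }
    return $written;
}
```

Unlike mb_convert_encoding(), convert_uudecode() is part of the PHP core and does not depend on the mbstring extension.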

Editing a .doc in PHP

I am trying to replace strings in a Word document by reading the file into a variable $content and then using str_ireplace() to change the string. I can read the content from the file, but str_ireplace() does not seem to be able to replace the string. I assumed it would, because the function is 'binary safe' according to the PHP documentation. Sorry, I am a beginner with PHP file manipulation, so all this is quite new to me.
This is what I have written:
copy('jack.doc' , 'newFile.doc');
$handle = fopen('newFile.doc','rb');
$content = '';
while (!feof($handle))
{
$content .= fread($handle, 1);
}
fclose($handle);
$handle = fopen('newFile.doc','wb');
$content = str_ireplace('USING_ICT_BOX', 'YOUR ICT CONTENT', $content);
fwrite($handle, $content);
fclose($handle);
When I download the new file, it opens as it should in MS Word, but it shows the old string and not the replacement.
Can I fix this issue? Is there a better tool I can use for replacing strings in MS Word documents through PHP?

I had the same requirement to edit a .doc or .docx file using PHP, and I found a solution for it.
I have written a post about it: http://www.onlinecode.org/update-docx-file-using-php/
copy('jack.doc' , 'newFile.doc');
$full_path = 'newFile.doc';
// Note: ZipArchive can only open Open XML (.docx) files; a binary .doc is not a zip archive
$zip_val = new ZipArchive;
// open() returns true on success and an error code otherwise, so compare strictly
if($zip_val->open($full_path) === true)
{
// In the Open XML Wordprocessing format the content is stored
// in the document.xml file located in the word directory.
$key_file_name = 'word/document.xml';
$message = $zip_val->getFromName($key_file_name);
$timestamp = date('d-M-Y H:i:s');
// Replace the placeholders with actual values
$message = str_replace("client_full_name", "onlinecode org", $message);
$message = str_replace("client_email_address", "ingo#onlinecode.org", $message);
$message = str_replace("date_today", $timestamp, $message);
$message = str_replace("client_website", "www.onlinecode.org", $message);
$message = str_replace("client_mobile_number", "+1999999999", $message);
//Replace the content with the new content created above.
$zip_val->addFromString($key_file_name, $message);
$zip_val->close();
}

Maybe this will point you in the right direction: http://davidwalsh.name/read-pdf-doc-file-php

Solutions I've found so far (not tested though):
Docvert - works for Doc, free, but not directly usable
PHPWordLib - works for Doc, not free
PHPDocX - DocX only, needs Zend.

I am going to opt for PHPWord (www.phpword.codeplex.com), as I believe teachers are going to get Office 2007 next year. I will also try to find some way to convert between .docx and .doc through PHP to support them in the meantime.

If you can reach a web service, look at Docmosis Cloud services, since it can mail-merge a doc file with your data and give you back a doc/pdf/other. You can HTTPS POST to the service to make the request, so it is pretty straightforward from PHP.

There are many ways to handle Word document files on Linux:
antiword - not very effective, as it converts to plain text.
pyODconvert
OpenOffice or LibreOffice - through UNO
unoconv utility - needs installation permission on the server
There is one Python script that is the most usable for online file conversion, but you need to convert the files through the command line.
There is no specific, satisfactory solution for handling Word files using only PHP code.
I searched for a long time before reaching this suggestion.
