PHP - Check if pdf contains given text - TcpdfFpdi / pdftk / fpdi - php

I have a PDF document and I want to check whether a specific text occurs in it (tags that I inserted while generating the PDF). However, using these libraries (TcpdfFpdi, pdftk or FPDI), I couldn't figure out whether this is possible or how to do it.
$str = "{hello}";
$pdf = new TcpdfFpdi();
$pdf->setSourceFile($filePath);
$pdf->searchForText($str); // something like this which returns boolean
If I try dd(file_get_contents($filePath)) without any library, it returns a very long output that doesn't seem to contain the text I want, so I think it's better to use one of those libraries.

Just an idea…
It's not an actual PHP solution, but you could use a tool like pdftotext, which I know from this post (where a PDF file is converted into a string to count its words): https://superuser.com/a/221367/535203
You can install it, play around with the command, and call it from within your PHP application.
As far as I remember (it has been a long time since I used pdftotext), the output text is not exactly the PDF's content, but for finding a few tags in it, it's at least worth a try.
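A minimal sketch of that idea, assuming the pdftotext binary is installed and on the PATH of a Unix-like server. The function names extractPdfText, textContainsTag and pdfContainsText are made up for this example; they are not part of TCPDF, FPDI or pdftk.

```php
<?php
// Hypothetical helper: runs the external pdftotext tool and returns the
// extracted text, or null if the command failed or produced no output.
function extractPdfText(string $filePath): ?string
{
    // "-" tells pdftotext to write to stdout instead of a .txt file.
    // "2>/dev/null" hides error messages and assumes a Unix-like shell.
    $cmd = 'pdftotext ' . escapeshellarg($filePath) . ' - 2>/dev/null';
    $out = shell_exec($cmd);
    return is_string($out) && $out !== '' ? $out : null;
}

// Hypothetical helper: plain substring check on already extracted text.
function textContainsTag(string $text, string $tag): bool
{
    return strpos($text, $tag) !== false;
}

// Hypothetical helper combining both steps.
function pdfContainsText(string $filePath, string $tag): bool
{
    $text = extractPdfText($filePath);
    return $text !== null && textContainsTag($text, $tag);
}
```

With this in place, pdfContainsText($filePath, '{hello}') would give the boolean asked for in the question. Since the extracted text may differ from what was written into the PDF, a tag could in principle be split or altered, so it is worth testing against a real generated file.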

Related

ImageMagick with PHP text overflowing PDF to JPG conversion

I'm trying to convert a PDF file to JPG, using ImageMagick with PHP and CakePHP. The PDF is in perfect shape and looks exactly the way it should, but the image generated from the PDF always overflows the borders of the file.
So far I've tried tweaking the generation code with no success, and read a lot of the PHP docs (http://php.net/manual/pt_BR/book.imagick.php).
Here is the conversion code:
$image = new Imagick();
$image->setResolution(300,300);
$image->setBackgroundColor('white');
$image->readImage($workfile);
$image->setGravity(Imagick::GRAVITY_CENTER);
$image->setOption('pdf:fit-to-page',true);
$image->setImageFormat('jpeg');
$image->setImageCompression(imagick::COMPRESSION_JPEG);
$image->setImageCompressionQuality(60);
$image->scaleImage(1200,1200, true);
$image->mergeImageLayers(Imagick::LAYERMETHOD_FLATTEN);
$image->setImageAlphaChannel(Imagick::ALPHACHANNEL_REMOVE);
$image->writeImage(WWW_ROOT . 'files' . DS . 'Snapshots' . DS . $filename);
Here are the results:
https://imgur.com/a/ISBmDMv
The first image is the PDF before the conversion and the second one, the image generated from the PDF where the right side text overflows.
So why is this happening? If someone has an alternative for any of the technologies used (Ghostscript, ImageMagick, etc.), that is also welcome!
Thanks everyone!
It's very hard to say why you see the result you do without seeing the original PDF file, rather than a picture of it.
The most likely explanation is that your original PDF file uses a font but does not embed that font in the PDF. When Ghostscript comes to render it to an image, it must then substitute 'something' in place of the missing font. If the metrics (e.g. spacing) of the substituted font do not precisely match the metrics of the missing font, then the rendered text will be misplaced or incorrectly sized. Of course, since it's not using the same font, it won't match the shapes of the characters either.
This can result in several different kinds of problems, but what you show is pretty typical of one such class of problem. Although you haven't mentioned it, I can also see several places in the document where text overwrites other text, which is another symptom of exactly the same problem.
If this is the case, then the Ghostscript back channel transcript will have told you that it was unable to find a font and is substituting a named font for the missing one. I can't tell you if ImageMagick stores that anywhere; my guess would be that it doesn't. However, you can copy the command line from the ImageMagick delegates.xml file and use it to run Ghostscript yourself, and then you will be able to see whether that's what is happening.
If this is what is happening, then you must do one of the following:
Create your PDF file with the fonts embedded (this is good practice anyway)
Supply Ghostscript with a copy of the missing font as a substitute
Live with the text as it is
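One way to check the missing-font hypothesis before touching Ghostscript is to list the fonts in the PDF. The sketch below is not from the answer: it assumes the pdffonts tool (shipped with Poppler/Xpdf, alongside pdftotext) is installed, and that its table has the usual columns ending in "emb sub uni object ID". The function name findNonEmbeddedFonts is hypothetical.

```php
<?php
// Hypothetical helper: parses the table printed by `pdffonts file.pdf` and
// returns the names of fonts whose "emb" (embedded) column says "no".
function findNonEmbeddedFonts(string $pdffontsOutput): array
{
    $missing = [];
    $lines = preg_split('/\R/', trim($pdffontsOutput));
    // The first two lines are the column header and a dashed separator.
    foreach (array_slice($lines, 2) as $line) {
        $cols = preg_split('/\s+/', trim($line));
        if (count($cols) < 6) {
            continue; // not a font row
        }
        // Counting from the right: generation, object number, uni, sub, emb.
        // Counting from the right copes with types such as "Type 1".
        $emb = $cols[count($cols) - 5];
        if ($emb === 'no') {
            $missing[] = $cols[0]; // assumes the font name has no spaces
        }
    }
    return $missing;
}

// Usage, assuming pdffonts is on the PATH and $workfile is the PDF path:
// $report = (string) shell_exec('pdffonts ' . escapeshellarg($workfile));
// print_r(findNonEmbeddedFonts($report));
```

If the returned list is non-empty, those fonts are candidates for the substitution described above.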

Concatenate RTF files with PHP without header

I have some RTF files generated by users with Microsoft Word. I need to be able to concatenate these files, and the resulting file should still be readable by LibreOffice. I'm using LibreOffice to convert the resulting file into a PDF file.
To concatenate two files, my application removes the last character of the first file and the first character of the other file. The file headers are not removed (I'm not talking about page headers).
For some reason, LibreOffice does not like the headers inserted by Microsoft Word. But it works fine if I open these files with WordPad and save them.
Another way to remove these headers is to convert these files into RTF before I concatenate them. This way I can convert to PDF, but LibreOffice makes a serious mess of my tabs when I convert my files to RTF.
So how can I remove the headers through PHP without messing up the tabs? Or do you have another way to get the same result?
Edit:
In a nutshell, I must be able to concatenate these files so that LibreOffice can open the result. And my tabs must still display nicely in Microsoft Word.
As you can guess, users don't want to use WordPad. And my customer's IT department has to comply with that wish (office politics).
UPDATE:
I have to do the merging first, because of business rules. The files are merged, then my users can modify the result using Word (no problems here). Then they ask their boss to validate it. If the boss agrees to validate it, the RTF file becomes a PDF file.
UPDATE 2:
I have the beginning of a solution. If the RTF file starts with plain text or a picture, you have to remove everything until you reach \pard. But this does not work if your file starts with a tab.
UPDATE 3:
If you want to support tabs too, you have to remove everything until you reach \pard or \trowd. I'm going to post the complete solution once I have working code. This will work fine as long as you don't need colours and all your files use the same font (because we don't remove the RTF headers of the first file).
If the limitations with the 'pure RTF' approach come back to bite you, you could use LibreOffice to convert your RTF files to docx, then use a tool to merge the docx files.
There are such tools for .NET and Java (such as our MergeDocx product); I'm not sure what you'll find for PHP.
I succeeded in building reliable code that makes it possible to manipulate the RTF files created with Microsoft Word. It works as long as you only need text, pictures and tabs, and don't need fancy things such as color. Color works for text, but beyond that ...
// The function name is illustrative; the original snippet was a bare function body.
function stripRtfHeader($RTFstring)
{
    // stristr() returns the haystack from the first occurrence of the needle to the end, or false.
    // Single quotes matter here: in a double-quoted string "\t" is a tab character,
    // which is why stristr() failed to detect "\trowd".
    $tmp_pard = stristr($RTFstring, '\pard');
    $tmp_tab = stristr($RTFstring, '\trowd');
    if ($tmp_pard === false && $tmp_tab === false) {
        // Neither marker found: leave the content untouched.
        return $RTFstring;
    }
    // We pick the longer string, because we want the first occurrence of \pard or \trowd.
    $rest = (strlen((string) $tmp_pard) > strlen((string) $tmp_tab)) ? $tmp_pard : $tmp_tab;
    // "{" is added so the concatenation code still works. We just remove headers.
    return "{" . $rest;
}
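The concatenation step described in the question (dropping the last character of the first file and the first character of the next one) could be sketched as follows. The function name concatRtf is hypothetical, and the sketch assumes the second string has already had its header removed so that it starts with "{".

```php
<?php
// Hypothetical helper: joins two RTF strings into one group by removing the
// closing "}" of the first and the opening "{" of the second.
function concatRtf(string $first, string $second): string
{
    // Word often leaves a trailing newline after the final brace.
    $first = rtrim($first);
    $second = ltrim($second);
    if (substr($first, -1) === '}') {
        $first = substr($first, 0, -1);
    }
    if (substr($second, 0, 1) === '{') {
        $second = substr($second, 1);
    }
    return $first . $second;
}
```

The first file keeps its header, so its font and colour tables apply to everything that follows, which matches the limitation mentioned in UPDATE 3.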

Auto New Line In GD Library

I'm using the GD Library to create images from data I'm pulling from an API.
The strings that are returned can sometimes be kind of lengthy, and I'm hoping to find a way to automatically create a new line for text if the string goes too far.
Is there something like this built into the GD library, or will I have to write some code to count the characters and move everything to a new line if it goes too long?
GD is strictly for drawing. You'll need a text layout engine such as Pango.
I am not familiar with a built-in function that automatically creates new lines,
so I guess you need to write a PHP function that splits the string into substrings
according to your maximum width and then draws them in your image one by one.
Consider looking at this post:
http://www.php.net/manual/en/function.imagestring.php#90481
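A minimal sketch of such a function, assuming one of GD's built-in fixed-width fonts so that the line width is simply the character count times imagefontwidth(). The names wrapToWidth and renderWrappedText, the 10 pixel margin and the colours are choices made for this example.

```php
<?php
// Hypothetical helper: splits $text into lines that fit $maxWidthPx when each
// character is $charWidthPx wide (true for GD's built-in fonts).
function wrapToWidth(string $text, int $maxWidthPx, int $charWidthPx): array
{
    $maxChars = max(1, intdiv($maxWidthPx, $charWidthPx));
    // wordwrap() breaks at spaces; the last argument also cuts overlong words.
    return explode("\n", wordwrap($text, $maxChars, "\n", true));
}

// Hypothetical helper: draws the wrapped text on a new image (requires GD).
function renderWrappedText(string $text, int $widthPx, int $font = 3)
{
    $margin = 10;
    $lineHeight = imagefontheight($font);
    $lines = wrapToWidth($text, $widthPx - 2 * $margin, imagefontwidth($font));
    $img = imagecreatetruecolor($widthPx, 2 * $margin + count($lines) * $lineHeight);
    $white = imagecolorallocate($img, 255, 255, 255);
    $black = imagecolorallocate($img, 0, 0, 0);
    imagefilledrectangle($img, 0, 0, imagesx($img) - 1, imagesy($img) - 1, $white);
    foreach ($lines as $i => $line) {
        // Each line is drawn one line height below the previous one.
        imagestring($img, $font, $margin, $margin + $i * $lineHeight, $line, $black);
    }
    return $img;
}

// Usage:
// imagepng(renderWrappedText($apiString, 300), 'out.png');
```

For TrueType text drawn with imagettftext(), character widths vary, so the width of each candidate line would have to be measured with imagettfbbox() instead of counting characters.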

Search Text In Files Using PHP

How can I search for text in files such as PDF, doc, docx or txt using PHP?
I want something similar to Full Text Search in MySQL,
but this time I'm searching directly through files, not a database.
The search should cover many files located in a folder.
Any suggestions, tips or solutions for this problem?
I also noticed that Google searches through files as well.
For searching PDFs you'll need a program like pdftotext, which converts the content of a PDF to text. For Word documents a similar tool may be available (needed because of all the styling and encryption in Word files).
An example of searching through PDFs, copied from one of my scripts (it's a snippet, not the entire code, but it should give you some understanding), where I extract keywords and store matches in a PDF results array:
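// $pdfFiles, $i, $keywords and $matchesOnPDF come from the surrounding script.
$file = ABSOLUTE_PATH_SITE."_uploaded/files/Transcripties/".$pdfFiles[$i];
// escapeshellarg() quotes the path safely; "-" makes pdftotext write to stdout.
$content = strtolower((string) shell_exec('/usr/bin/pdftotext '.escapeshellarg($file).' -'));
foreach($keywords as $keyword)
{
    $result = substr_count($content, strtolower($keyword));
    if($result > 0)
    {
        // $matchesOnPDF holds arrays, so compare against their "pdfFile" entries.
        if(!in_array($pdfFiles[$i], array_column($matchesOnPDF, "pdfFile")))
        {
            array_push($matchesOnPDF, array(
                "matches" => $result,
                "type" => "PDF",
                "pdfFile" => $pdfFiles[$i]));
        }
    }
}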
Depending on the file type, you should convert the file to text and then search through it using e.g. file_get_contents() and strpos(). To convert files to text, you have, among others, the following tools available:
catdoc for word files
xlhtml for excel files
ppthtml for powerpoint files
unrtf for RTF files
pdftotext for pdf files
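The list above could be wired up roughly like this. It is only a sketch: the function names are hypothetical, the tools are assumed to be installed and on the PATH, and the exact arguments (especially for xlhtml, ppthtml and unrtf) should be checked against the versions on your server.

```php
<?php
// Hypothetical helper: picks a converter from the list above based on the
// file extension and returns the shell command, or null if none applies.
function buildExtractCommand(string $path): ?string
{
    $tools = [
        'pdf' => 'pdftotext %s -',   // "-" sends the text to stdout
        'doc' => 'catdoc %s',
        'rtf' => 'unrtf --text %s',
        'xls' => 'xlhtml %s',        // prints HTML
        'ppt' => 'ppthtml %s',       // prints HTML
    ];
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (!isset($tools[$ext])) {
        return null;
    }
    return sprintf($tools[$ext], escapeshellarg($path));
}

// Hypothetical helper: case-insensitive search in one file.
function fileContainsText(string $path, string $needle): bool
{
    $cmd = buildExtractCommand($path);
    // Anything without a converter is read as plain text.
    $text = $cmd === null ? (string) @file_get_contents($path) : (string) shell_exec($cmd);
    // strip_tags() removes the markup produced by the HTML converters.
    return stripos(strip_tags($text), $needle) !== false;
}
```

Searching a folder is then a loop over its files, for example with glob() or scandir(), calling fileContainsText() on each one.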
If you are on a Linux server, you may use
grep -R "text to be searched for" ./
(the location ./ means everything under the current directory), called from PHP using exec, resulting in
$cmd = 'grep -R "text to be searched for" ./';
exec($cmd, $result); // $result receives every output line as an array
print_r($result);
2021: I came across this and found something, so I figured I would link to it...
Note: docx, PDF and similar files are not regular text files; reading or editing each type requires more scripting and/or a different library, unless you can find an all-in-one library. This means you have to handle each file type you want to search through separately, including normal text files. If you don't want to script it completely yourself, you have to install a library for each file type you want to read, and you still need code that calls each library's functions.
I found the basic answer here on the stack.

convert txt or doc to pdf using php

Has anyone come across PHP code that converts text or doc files into PDF?
It has to keep the same format as the original txt or doc file, meaning the line feeds as well as new paragraphs...
Converting from DOC to PDF is possible using phpLiveDocx:
$phpLiveDocx = new Zend_Service_LiveDocx_MailMerge();
$phpLiveDocx->setUsername('username')
->setPassword('password');
$phpLiveDocx->setLocalTemplate('document.doc');
// necessary as of LiveDocx 1.2
$phpLiveDocx->assign('dummyFieldName', 'dummyFieldValue');
$phpLiveDocx->createDocument();
$document = $phpLiveDocx->retrieveDocument('pdf');
file_put_contents('document.pdf', $document);
unset($phpLiveDocx);
For text to PDF, you can use the pdf extension in PHP.
You can view the examples here.
Have a look at this SO question. Using OpenOffice in command line mode for conversions can be done, though you'd have to search a bit for the conversion macros. I'm not saying it's lightweight though :)
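As a variation on the command line idea, current LibreOffice builds can convert without a macro through the --convert-to option. The sketch below swaps in LibreOffice for OpenOffice and assumes its soffice binary is on the PATH; the function name buildPdfConvertCommand and the paths are placeholders.

```php
<?php
// Hypothetical helper: builds the headless LibreOffice command that converts
// $inputFile to PDF and writes the result into $outputDir.
function buildPdfConvertCommand(string $inputFile, string $outputDir): string
{
    return 'soffice --headless --convert-to pdf --outdir '
        . escapeshellarg($outputDir) . ' ' . escapeshellarg($inputFile);
}

// Usage (paths are placeholders):
// exec(buildPdfConvertCommand('/tmp/document.doc', '/tmp/out'), $output, $exitCode);
// On success, /tmp/out/document.pdf should exist.
```

Starting an office suite per request is slow, so in practice this tends to be run from a queue or a background job rather than directly inside a web request.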
See HTML_ToPDF. It also works for text.
It has been a long time since I touched PHP, but if you can make web service calls from it, then try this product. It provides excellent conversion fidelity. It also supports additional formats, including InfoPath, Excel and PowerPoint, as well as watermarking.
Please note that I have worked on this product so the usual disclaimers apply.
