Read the content of a PDF with PHP? - php

I need to read certain parts of a complex PDF. I searched the web and some people recommend FPDF, but it can't read PDFs; it can only write them. Is there a library that can extract specific content from a given PDF?
If not, what's a good way to read certain parts of a given PDF?
Thanks!

I see two solutions here:
Convert your PDF file into something else first, such as text or HTML.
Use a library to do so. The bad news is that most of them are written in Java.
https://whatisprymas.wordpress.com/2010/04/28/lucene-how-to-index-pdf-files/
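The first option (convert the PDF to text first) could be sketched as below. This assumes the pdftotext command-line tool from poppler-utils is installed and on the PATH, which the answer does not mention; both function names are hypothetical.

```php
<?php
// Hypothetical helper: builds the pdftotext command line for a given PDF path.
function pdftotextCommand(string $pdfPath): string
{
    // "-layout" keeps the physical layout; the trailing "-" sends the text to stdout.
    return 'pdftotext -layout ' . escapeshellarg($pdfPath) . ' -';
}

// Hypothetical helper: returns the extracted text, or null if nothing came back
// (for example because pdftotext is not installed or the file is unreadable).
function pdfToText(string $pdfPath): ?string
{
    $text = shell_exec(pdftotextCommand($pdfPath));
    return is_string($text) && $text !== '' ? $text : null;
}
```

Calling pdfToText('sample.pdf') would then give a plain string that can be searched with the usual string or regex functions.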

What about this?
http://www.phpclasses.org/package/702-PHP-Searches-pdf-documents-for-text.html
PS: I haven't tested this class; I only read the description.

$result = pdf2text ('sample.pdf');
echo "<pre>$result</pre>";
How to get “clean” text (source code of pdf2text):
http://webcheatsheet.com/php/reading_clean_text_from_pdf.php

Related

extract info from jpeg with PHP

I want to extract variable-length pieces of information from a JPEG file using PHP, but they are not EXIF data.
If I open the JPEG with a simple text editor, I can see that the information I want is at the end of the file and separated by \00.
Like this:
\00DATA\00DATA\00DATA\00DATA\000\00DATA
Now if I use PHP's file_get_contents() to load the file into a string, the dividers \00 are gone and other symbols show up.
Like so:
ÿëžDATADATADATADATADATA ÿÙ
Could somebody please explain:
Why do the \00 dividers vanish?
How can I get the information using PHP?
EDIT
The question is solved, but for those seeking a smarter solution, here is the file I am trying to obtain the DATA parts from: https://www.dropbox.com/s/5cwnlh2kadvi6f7/test-img.jpg?dl=0 (yes, I know it's corrupted)
Use $data = exif_read_data("PATH/some.jpg") instead. It returns all the header data of the image; see the manual here - http://php.net/manual/en/function.exif-read-data.php
I came up with a solution on my own. May not be pretty, but works for me.
Using urlencode(file_get_contents()) I was able to retrieve the \00 parts as %00.
So now it reads like this:
%00DATA%00DATA%00DATA%00DATA%000%00DATA
I can split the string at the %00 parts.
I am going to accept this answer once SO lets me do so, unless somebody comes up with a better solution.
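For what it's worth, file_get_contents() is binary-safe, so the \00 bytes are still in the string; they are simply not printable, which is why they seem to vanish on output. A minimal sketch of splitting on them directly, without urlencode(), follows. The sample bytes and the splitOnNul() name are made up for illustration and do not come from the linked test image.

```php
<?php
// Hypothetical helper: splits a binary string on NUL bytes.
function splitOnNul(string $bytes): array
{
    return explode("\0", $bytes);
}

// Stand-in for file_get_contents('test-img.jpg'): some binary bytes,
// then NUL-separated fields, then the JPEG end-of-image marker.
$fields = ['NAME', 'VALUE', '0', 'END'];
$raw = "\xFF\xEB\0" . implode("\0", $fields) . "\0\xFF\xD9";

$parts = splitOnNul($raw);

// Drop the leading and trailing binary chunks to keep only the fields.
$recovered = array_slice($parts, 1, -1);
print_r($recovered);
```

On a real file, the slice offsets would depend on where the data block starts and ends.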

Convert HTML code to doc using PHP and PHPWord

I am using PHPWord to load a docx template and replace tags like {test}. This works perfectly fine.
But I want to replace a value with HTML code. Replacing it directly in the template is not possible. There is no way to do this using PHPWord, as far as I know.
I looked at htmltodocx, but it seems it will not work either. Is it possible to transform a piece of code like <p>Test<b>test</b><br>test</p> into working doc markup? I only need the basics, no styling, but line breaks have to work.
Here is the link to the GitHub project, Html-Docx-js. It works fine.
A demo is also available here.
Another option is this link.
$toOpenXML = HTMLtoOpenXML::getInstance()->fromHTML("<p>te<b>s</b>t</p>");
$templateProcessor->setValue('test', $toOpenXML);
The other answers propose H2OXML, which only supports:
Bold, italic and underlined text
Bulleted lists
as described in their docs, and its last update was in 2012.
I did some research and found a pretty nice solution:
$var = 'Some text';
$xml = "<w:p><w:r><w:rPr><w:strike/></w:rPr><w:t>". $var."</w:t></w:r></w:p>";
$templateProcessor->setValue('param_1', $xml);
The example above shows how to produce struck-through text. Instead of "w:strike" you can use "w:i" for italic or "w:b" for bold, and so on. I'm not sure whether it works with all tags.
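One caveat with hand-built XML like this is that characters such as & or < in the text would make the markup invalid. A minimal sketch of a helper that escapes the text first is shown below; wordRunXml() is a hypothetical name, not part of PHPWord.

```php
<?php
// Hypothetical helper: wraps text in a WordprocessingML paragraph
// with a single run property such as w:strike, w:i or w:b.
function wordRunXml(string $text, string $property = 'w:strike'): string
{
    // Escape &, <, > and quotes so the text cannot break the XML.
    $escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    return "<w:p><w:r><w:rPr><{$property}/></w:rPr><w:t>{$escaped}</w:t></w:r></w:p>";
}

echo wordRunXml('Fish & Chips', 'w:b'), "\n";
```

The returned string could then be passed to $templateProcessor->setValue() in the same way as $xml above.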
Thanks for your answer, Varun.
The simple PHP library H2OXML works for me: https://h2openxml.codeplex.com/
$toOpenXML = HTMLtoOpenXML::getInstance()->fromHTML("<p>te<b>s</b>t</p>");
$templateProcessor->setValue('test', $toOpenXML);
I can now convert HTML code and insert it using PHPWord.
Call addHtml() before IOFactory::createWriter():
$content = '<p>Test<b>test</b><br>test</p>';
\PhpOffice\PhpWord\Shared\Html::addHtml($section, $content);

Php - pdf parser

I'm trying to find a PDF parser. I searched Stack Overflow but found no satisfactory answers. Some say that Zend is good for this, but I don't want to use it. Is there a good class for this?
I don't know how deep you need to go with the PDF parsing, but here is something I did very recently to extract PDF text into a JSON string. It also extracts the images, but if you don't want them you can comment out these two lines in the run function in pdfreader/main.py:
extract_images(pdf_file)
dict_book = get_images_update_dict(dict_book, image_folder)
Yes, it's in Python, not PHP, but you can get the result back as JSON in the following way:
exec("./parser.py pdfreader/book.pdf './images/' 2>&1", $output);
// exec() fills $output with an array of lines, so join them before decoding
$data = json_decode(implode("\n", $output));
var_dump($data);

converting html text to an image with php

What's the best way to convert text embedded in an HTML tag into an image using PHP, keeping the style written in the HTML tag? For example,
convert:
<span class="Apple-style-span" style="font-size: xx-large;"><font class="Apple-style-span" color="#F4A460">Stack </font><font class="Apple-style-span" color="#800000">Overflow</font></span>
into an image of the same styled text.
Is there a class for this, or should I explode the markup and read the tags one by one? Any suggestions?
Might want to have a look at Painty. Although it isn't exactly what you're looking for because you'll have to feed it an array of options, it should be a good resource on which you can expand.
Not sure if you also want to render the font(s) being used in your HTML snippet, but if you do, you would also have to get all the commonly used web-fonts and put them all in a folder from where the script can read.
Hope this helps.
With PHP GD Library support, yes:
http://visionmasterdesigns.com/tutorial-convert-text-into-transparent-png-image-using-php/ (font/size technique included)
http://corpocrat.com/2009/06/23/php-script-to-convert-textemail-address-to-image/
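As a rough idea of what the GD route looks like, here is a minimal sketch that draws colored text segments with GD's built-in bitmap font and returns PNG bytes. It does not parse HTML or load real fonts; the function name is hypothetical, and the two colors are copied by hand from the markup in the question.

```php
<?php
// Hypothetical helper: draws [text, [r, g, b]] segments side by side
// with GD's built-in font and returns the image as PNG bytes.
function renderSegmentsPng(array $segments): string
{
    $font = 5; // largest built-in GD bitmap font
    $text = '';
    foreach ($segments as [$part, $rgb]) {
        $text .= $part;
    }
    $width  = max(1, imagefontwidth($font) * strlen($text));
    $height = imagefontheight($font);

    $im = imagecreatetruecolor($width, $height);
    $white = imagecolorallocate($im, 255, 255, 255);
    imagefilledrectangle($im, 0, 0, $width - 1, $height - 1, $white);

    // Draw each segment in its own color, advancing the x position.
    $x = 0;
    foreach ($segments as [$part, $rgb]) {
        $color = imagecolorallocate($im, $rgb[0], $rgb[1], $rgb[2]);
        imagestring($im, $font, $x, 0, $part, $color);
        $x += imagefontwidth($font) * strlen($part);
    }

    // imagepng() writes to the output stream, so capture it in a buffer.
    ob_start();
    imagepng($im);
    return ob_get_clean();
}
```

For the example in the question, the call would be renderSegmentsPng([['Stack ', [244, 164, 96]], ['Overflow', [128, 0, 0]]]), and the result can be written out with file_put_contents().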
Check this one out
http://code.google.com/p/wkhtmltopdf/downloads/list
The project is centered around HTML to PDF conversion using the WebKit engine, but there are also binaries and source for HTML to image. It's an external binary though, so it might not be useful in your use case.
Otherwise I would look into ImageMagick.

rtf format to pdf

Is there any way to convert rtf format to pdf using PHP?
Thanks
If you want to stick with pure PHP, you can probably use HTML as an intermediary:
Convert RTF to HTML
http://freshmeat.net/projects/rtf2htm/ , http://www.phpclasses.org/package/1930-PHP-RTF-to-HTML-converter-with-latin-character-support.html
Optionally: clean up the HTML
http://htmlpurifier.org/
Convert HTML to PDF
http://dompdf.github.io/
You can use OpenOffice command line interface for that. Check my answer to a similar question.
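A minimal sketch of the command-line route follows. It assumes LibreOffice (the OpenOffice fork), whose soffice binary accepts --headless, --convert-to and --outdir; the linked answer may use a different invocation, and sofficeConvertCommand() is a hypothetical name.

```php
<?php
// Hypothetical helper: builds a headless LibreOffice command that
// converts an RTF file to PDF and writes the result into $outDir.
function sofficeConvertCommand(string $rtfPath, string $outDir): string
{
    return 'soffice --headless --convert-to pdf --outdir '
        . escapeshellarg($outDir) . ' ' . escapeshellarg($rtfPath);
}

// Example (not executed here): exec(sofficeConvertCommand('letter.rtf', '/tmp/out'), $lines, $exitCode);
```

The PDF ends up in the output directory under the input's base name with a .pdf extension.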
Ted is the tool you're looking for. Ted also ships with a script called rtf2pdf.sh that you can execute from PHP to create a PDF file.
You should try out LiveDocx (livedocx.com). The latest Zend Framework, 1.10, has a ready-built module to help you out. You can read more about it here: http://www.phpfreaks.com/tutorial/template-based-document-generation-using-livedocx-and-zend-framework
