PHP - PDF parser - php

I am trying to find a PDF parser. I searched Stack Overflow, but there are no satisfactory answers. Some say that Zend is good for this, but I don't want to use it. Is there a good class to do that?

I don't know how deep you need to go with the PDF parsing, but here is something I wrote very recently to extract PDF text into a JSON string. It also extracts the images; if you don't want them, comment out these two lines in the run function in pdfreader/main.py:
extract_images(pdf_file)
dict_book = get_images_update_dict(dict_book, image_folder)
Yes, it's in Python, not PHP, but you can get the result back as JSON the following way:
// exec() fills $output with an array of lines, so join them before decoding.
exec("./parser.py pdfreader/book.pdf './images/' 2>&1", $output);
$data = json_decode(implode("\n", $output));
var_dump($data);

Related

extract info from jpeg with PHP

I want to extract variable-length pieces of information from a JPEG file using PHP, but they are not EXIF data.
If I open the JPEG with a simple text editor, I can see that the information I want is at the end of the file, separated by \00.
Like this:
\00DATA\00DATA\00DATA\00DATA\000\00DATA
Now if I use PHP's file_get_contents() to load the file into a string, the \00 dividers are gone and other symbols show up.
Like so:
ÿëžDATADATADATADATADATA ÿÙ
Could somebody please explain:
Why do the \00 dividers vanish?
How can I get the information using PHP?
EDIT
The question is solved, but for those seeking a smarter solution, here is the file I am trying to obtain the DATA parts from: https://www.dropbox.com/s/5cwnlh2kadvi6f7/test-img.jpg?dl=0 (yes, I know it's corrupted)
Use $data = exif_read_data("PATH/some.jpg") instead. It returns all the header data about the image; see the manual at http://php.net/manual/en/function.exif-read-data.php
I came up with a solution on my own. It may not be pretty, but it works for me.
Using urlencode(file_get_contents()), I was able to retrieve the \00 parts as %00.
So now it reads like this:
%00DATA%00DATA%00DATA%00DATA%000%00DATA
I can split the string at the %00 parts.
I am going to accept this answer once SO lets me do so, unless somebody comes up with a better solution.
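As a possible alternative to the urlencode() round trip: file_get_contents() is binary-safe, so the \00 (NUL) bytes are most likely still in the string and simply do not show up when it is printed. The sketch below splits on the NUL byte directly. The helper name splitNulFields and the sample string are assumptions made for illustration; the real file is not read here.

```php
<?php
// Hypothetical helper: split a binary string into its NUL-separated fields.
function splitNulFields(string $bytes): array
{
    // explode() is binary-safe, so "\0" can be used as the delimiter directly.
    $parts = explode("\0", $bytes);
    // Drop only truly empty pieces; a bare array_filter() would also drop "0".
    $kept = array_filter($parts, function ($part) {
        return $part !== '';
    });
    return array_values($kept);
}

// Stand-in for the tail of file_get_contents('test-img.jpg').
// "\0" . "0" is written in two pieces so PHP does not read "\00" as one octal escape.
$sample = "\0DATA\0DATA\0DATA\0DATA\0" . "0\0DATA";
print_r(splitNulFields($sample));
```

The explicit callback matters because one of the fields in the question is the single character 0, which PHP treats as falsy.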

Convert HTML code to doc using PHP and PHPWord

I am using PHPWord to load a docx template and replace tags like {test}. This is working perfectly fine.
But I want to replace a value with HTML code. Directly replacing it into the template is not possible. There is no way to do this using PHPWord, as far as I know.
I looked at htmltodocx, but it seems it will not work either. Is it possible to transform a piece of code like <p>Test<b>test</b><br>test</p> into working doc markup? I only need the basics, no styling, but line breaks have to work.
Html-Docx-js on GitHub works fine for this.
A demo is also available.
Another option is the following:
$toOpenXML = HTMLtoOpenXML::getInstance()->fromHTML("<p>te<b>s</b>t</p>");
$templateProcessor->setValue('test', $toOpenXML);
The other answers propose H2OXML, which, as described in its docs, only supports:
Bold, italic and underlined text
Bulleted lists
Its last update was in 2012.
I did some research and found a pretty nice solution:
$var = 'Some text';
$xml = "<w:p><w:r><w:rPr><w:strike/></w:rPr><w:t>". $var."</w:t></w:r></w:p>";
$templateProcessor->setValue('param_1', $xml);
The above example shows how to produce struck-through text. Instead of "w:strike" you can use "w:i" for italic or "w:b" for bold, and so on. I'm not sure whether it works with all tags.
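One caveat with building the XML by hand is that characters such as & or < in the inserted text could break the surrounding markup. A minimal sketch of the same idea with escaping added; buildFormattedParagraph is a hypothetical helper, not part of PHPWord, and it only builds the string that would then be passed to setValue() as above.

```php
<?php
// Hypothetical helper (not part of PHPWord): wrap text in a WordprocessingML
// paragraph with one run-level formatting flag such as "strike", "i" or "b".
function buildFormattedParagraph(string $text, string $flag): string
{
    // Escape &, <, > and quotes so the text cannot break the surrounding XML.
    $escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    return '<w:p><w:r><w:rPr><w:' . $flag . '/></w:rPr><w:t>'
        . $escaped . '</w:t></w:r></w:p>';
}

echo buildFormattedParagraph('Fish & chips', 'b'), "\n";
```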
Thanks for your answer, Varun.
The simple PHP library H2OXML works for me https://h2openxml.codeplex.com/
$toOpenXML = HTMLtoOpenXML::getInstance()->fromHTML("<p>te<b>s</b>t</p>");
$templateProcessor->setValue('test', $toOpenXML);
I can now convert HTML code and insert it using PHPWord.
$content = '<p>Test<b>test</b><br>test</p>';
// Call this before IOFactory::createWriter():
\PhpOffice\PhpWord\Shared\Html::addHtml($section, $content);

PHP pdf form parse regex

I have two PDF forms that I'd like to fill in with values using PHP. There don't seem to be any open source solutions. The only solution seems to be SetaSign, which is over $400. So instead I'm trying to dump the data as a string, parse it using a regex and then save it. This is what I have so far:
$pdf = file_get_contents("../forms/mypdf.pdf");
$decode = utf8_decode($pdf);
$re = "/(\d+)\s(?:0 obj <>\/AP<>\/)(.*)(?:>> endobj)/U";
preg_match_all($re, $decode, $matches);
print_r($matches);
However, my print_r output is empty, even after testing the pattern here. The matches on the right are first a numerical identifier for the field (I think) and then V(XX1), where "XX1" is the text I manually entered into the form and saved (as a test to find how and where that data is stored). I'm assuming (but haven't tested) that N<>>>/AS/Off is a checkbox.
Is there something I need to change in my regex to find matches like (2811 0 obj <>/AP<>/V(XX2)>> endobj), where the first capture is the key and the second is the value?
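For reference, here is a sketch of a pattern that matches the sample shape quoted in the question. It assumes the field objects are stored uncompressed and look exactly like that sample; many PDFs compress their objects, in which case nothing will match. The helper name extractFieldValues and the test string are illustrative, and the first sample line is an invented object without a /V entry. Note also that utf8_decode() is deprecated as of PHP 8.2, so the sketch works on the raw string.

```php
<?php
// Hypothetical helper: collect "object number => /V(...) value" pairs from
// uncompressed PDF text shaped like the sample in the question.
function extractFieldValues(string $pdfText): array
{
    // (?:(?!endobj).)*? keeps a match inside one object, so an object
    // without /V(...) cannot borrow the value of the next one.
    $re = '~\b(\d+)\s+0\s+obj\b(?:(?!endobj).)*?/V\(([^)]*)\)\s*>>\s*endobj~s';
    $fields = [];
    if (preg_match_all($re, $pdfText, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $fields[$m[1]] = $m[2]; // key: object number, value: field text
        }
    }
    return $fields;
}

// Illustrative input; the first object has no /V entry and is skipped.
$sample = "2810 0 obj <>/AS/Off>> endobj\n"
        . "2811 0 obj <>/AP<>/V(XX2)>> endobj\n"
        . "2812 0 obj <>/AP<>/V(XX1)>> endobj";
print_r(extractFieldValues($sample));
```

The [^)]* part stops at the first closing parenthesis, so values that themselves contain escaped parentheses would need a more careful pattern.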
Part 1 - Extract text from PDF
Download class.pdf2text.php from http://pastebin.com/dvwySU1a (updated on 5 April 2014) or http://www.phpclasses.org/browse/file/31030.html (registration required)
Usage:
include('class.pdf2text.php');
$a = new PDF2Text();
$a->setFilename('test.pdf');
$a->decodePDF();
echo $a->output();
The class doesn't work with every PDF I've tested; give it a try and you may get lucky :)
Part 2 - Write to PDF
To write the PDF contents, use TCPDF, which is an enhanced and maintained version of FPDF.
Thanks to those who've looked into this. I decided to convert the PDFs (since I'm not doing this as a batch) into SVG files. This online converter kept the form fields, and with some small edits I've made them printable. Now I'll be able to populate the values and have a visual representation of the PDF. I may try TCPDF in the event I want to make it an actual PDF again, though I'm assuming it won't keep the form fields.

A way to return XML output from a PHP shell_exec?

I need to be able to run the Linux find command from a PHP program, and I want it to return the output of find as XML. Is this possible? I want to do this so I can easily find the parent (directory) of each child (file). Or is there a better way to do this? Thanks!
You won't get XML output from the find command; it just returns text (as it should).
Your best bet is probably to build the XML you want from the text that is returned when you use exec to run find.
Example pseudocode:
get all the info you want to find: exec(find);
create a bare-bones XML string;
create an XML object (I'd use SimpleXML in this example);
simplexml->addChild(info found from exec find);
Sorry for only giving pseudocode; I couldn't write anything up in my current situation.
Helpful reference if you don't know about SimpleXML:
http://us3.php.net/manual/en/book.simplexml.php
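The pseudocode above could be sketched in PHP roughly as follows. The function name pathsToXml, the element and attribute names, and the /var/www paths are assumptions chosen for illustration; the list of paths is hard-coded so the example runs without calling find.

```php
<?php
// Hypothetical helper: turn a list of file paths (one per line of find output)
// into XML where every <file> records its parent directory.
function pathsToXml(array $paths): SimpleXMLElement
{
    $xml = new SimpleXMLElement('<files/>'); // bare-bones XML root
    foreach ($paths as $path) {
        $file = $xml->addChild('file');
        // Store the file name and its parent directory as attributes.
        $file->addAttribute('name', basename($path));
        $file->addAttribute('parent', dirname($path));
    }
    return $xml;
}

// In real use the paths would come from find, for example:
//   exec('find ' . escapeshellarg('/var/www') . ' -type f', $paths);
$paths = ['/var/www/index.php', '/var/www/img/logo.png'];
echo pathsToXml($paths)->asXML();
```

With the parent stored as an attribute, looking up the directory of a given file becomes a simple attribute read or XPath query on the result.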

Read the content of a PDF with PHP?

I need to read certain parts of a complex PDF. I searched the net and some say FPDF is good, but it can't read PDFs, it can only write them. Is there a library out there that can extract certain content from a given PDF?
If not, what's a good way to read certain parts of a given PDF?
Thanks!
I see two solutions here:
Convert your PDF file into something else first: text or HTML.
Use a library to do so; the bad news here is that most of them are written in Java.
https://whatisprymas.wordpress.com/2010/04/28/lucene-how-to-index-pdf-files/
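The first option (converting to text first) could be sketched like this, assuming the pdftotext command-line tool from poppler-utils is installed on the server; that tool is not mentioned in the answer and is only one possible converter. Both function names are hypothetical.

```php
<?php
// Hypothetical helper: build the shell command that converts a PDF to text.
// Assumes the pdftotext CLI (poppler-utils) is installed and on the PATH.
function buildPdfToTextCommand(string $pdfPath): string
{
    // "-layout" keeps the page layout; the trailing "-" sends text to stdout.
    return 'pdftotext -layout ' . escapeshellarg($pdfPath) . ' -';
}

// Hypothetical wrapper: returns the extracted text, or null on failure.
function pdfToText(string $pdfPath): ?string
{
    exec(buildPdfToTextCommand($pdfPath), $lines, $exitCode);
    return $exitCode === 0 ? implode("\n", $lines) : null;
}

echo buildPdfToTextCommand('sample.pdf'), "\n";
```

Once the text is available as a plain string, the specific parts can be picked out with ordinary string functions or regular expressions.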
What about this?
http://www.phpclasses.org/package/702-PHP-Searches-pdf-documents-for-text.html
PS: I haven't tested this class; I only read the description.
$result = pdf2text ('sample.pdf');
echo "<pre>$result</pre>";
How to get “clean” text: source code of pdf2text
http://webcheatsheet.com/php/reading_clean_text_from_pdf.php
