Text extraction from a PDF/A - PHP

Do you know of any library that can extract the text from a PDF/A file so that I can read it in PHP?
I have tried many libraries, but none of them has been able to read the content.
I need help.

You could try PDF Parser, an open-source library available on GitHub.
It will look something like this, but check the documentation for further details:
<?php
// Include Composer autoloader if not already done.
include 'vendor/autoload.php';
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
$text = $pdf->getText();
echo $text;
?>

Related

Converting doc or docx to PDF with TCPDF: result does not match the original

I am trying to convert a doc or docx file to PDF, but the result doesn't match the original doc/docx file, and the PDF has no styling. I don't know why; I'm using TCPDF and PHPWord.
This is my code for the conversion:
$filetarget = FileHelper::normalizePath($pathdirectory.'/'.$filename);
$objReader = \PhpOffice\PhpWord\IOFactory::createReader('Word2007');
$contents = $objReader->load($filetarget);
$tcpdfPath = Yii::getAlias('@baseApp') . '/vendor/tecnickcom/tcpdf';
\PhpOffice\PhpWord\Settings::setPdfRendererPath($tcpdfPath);
\PhpOffice\PhpWord\Settings::setPdfRendererName('TCPDF');
$objWriter = \PhpOffice\PhpWord\IOFactory::createWriter($contents,'PDF');
$fileresult = str_replace('.docx', '.pdf', $filetarget);
$objWriter->save($fileresult);
$toPdf = FileHelper::normalizePath($fileresult);
This is part of the result after converting from docx to PDF,
and this is part of the original docx file.
What's wrong with my code?
Unfortunately, PHPWord's PDF writer is very basic. As you can see in the DocX to PDF output, there is no ability to preserve text breaks or page breaks, and no support for lists or image export.
For the current list of features see
https://phpword.readthedocs.io/en/latest/intro.html#writers
If OpenOffice is available as a converter, you could try other PHP methods that run the conversion directly.
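A minimal sketch of running the conversion directly from PHP, assuming LibreOffice is installed on the server and its `soffice` binary is on the PATH. The function names `buildConvertCommand` and `convertToPdf` are hypothetical helpers, not part of PHPWord or TCPDF, and the fidelity of the output depends on the installed office suite.

```php
<?php
// Build the LibreOffice command line for a headless conversion to PDF.
// Assumption: the `soffice` binary is installed and on the PATH.
function buildConvertCommand(string $inputFile, string $outputDir): string
{
    return 'soffice --headless --convert-to pdf --outdir '
        . escapeshellarg($outputDir) . ' '
        . escapeshellarg($inputFile);
}

// Run the conversion and return the path where the PDF is expected.
function convertToPdf(string $inputFile, string $outputDir): string
{
    exec(buildConvertCommand($inputFile, $outputDir), $output, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException('Conversion failed: ' . implode("\n", $output));
    }
    // LibreOffice keeps the base name and swaps the extension for .pdf.
    return rtrim($outputDir, '/') . '/'
        . pathinfo($inputFile, PATHINFO_FILENAME) . '.pdf';
}
```

With the variables from the question, this could be called as `$fileresult = convertToPdf($filetarget, dirname($filetarget));`.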

Get content of PDF file in PHP

I have a FlipBook jQuery page and a large number of ebooks (in PDF format) to display on it. I need to keep these PDFs hidden, so I would like to get their content with PHP and display it with my FlipBook jQuery page (instead of serving the whole PDF, I would like to serve it in parts).
Is there any way I can get the whole content of a PDF file with PHP?
I need to separate the content by page.
You can use PDF Parser (a PHP PDF library) to extract the contents of PDFs.
PDF Parser Library Link: https://github.com/smalot/pdfparser
Usage Guide Link: https://github.com/smalot/pdfparser/blob/master/doc/Usage.md
Documentation Link: https://github.com/smalot/pdfparser/tree/master/doc
Sample Code:
<?php
// Include Composer autoloader if not already done.
include 'vendor/autoload.php';
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
$text = $pdf->getText();
echo $text;
?>
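Since the question asks for the content separated by page, here is a rough sketch built on PDF Parser's `getPages()` method, whose page objects expose `getText()`. The helper name `extractPageTexts` is hypothetical, and the usage lines are commented out because they assume the Composer autoloader and a file named document.pdf.

```php
<?php
// Collect the text of each page into an array keyed by page number (from 1).
// $pages is expected to hold objects exposing getText(), such as the
// page objects returned by PDF Parser's $pdf->getPages().
function extractPageTexts(iterable $pages): array
{
    $texts = [];
    $pageNumber = 1;
    foreach ($pages as $page) {
        $texts[$pageNumber++] = $page->getText();
    }
    return $texts;
}

// Usage with PDF Parser (assumes the Composer autoloader is available):
// include 'vendor/autoload.php';
// $parser = new \Smalot\PdfParser\Parser();
// $pdf = $parser->parseFile('document.pdf');
// $texts = extractPageTexts($pdf->getPages());
// echo $texts[1]; // text of the first page
```

Each array entry can then be sent to the FlipBook page on its own, so the full PDF never has to be exposed.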
Regarding the other part of your question,
how to convert your PDF pages into images:
you need ImageMagick (with the Imagick PHP extension) and Ghostscript.
<?php
$im = new Imagick('file.pdf[0]');
$im->setImageFormat('jpg');
header('Content-Type: image/jpeg');
echo $im;
?>
The [0] means page 1.

How to parse a PDF file line by line in PHP?

I want to parse a PDF file in PHP. For this, I have written the following code (using the PDF Parser library).
Code:
<?php
// Include Composer autoloader if not already done.
include 'vendor/autoload.php';
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('XA035 - Luis gui Lopes esteves.pdf');
$text = $pdf->getText();
echo $text;
?>
With this code I'm able to read the text from the PDF file, but I'm not able to parse the information line by line. For example, if the file contains these lines:
PERSONAL INFORMATION Marco Mengoni
Italia
Via della giustizia
then when I load my page, echo $text; prints this:
PERSONAL INFORMATION Marco Mengoni Italia Via Della Giustizia.
Is there a way to parse the text one line at a time?
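One possible explanation, assuming the string returned by getText() still contains newline characters: a browser collapses newlines when it renders HTML, so the text only appears to be on a single line (viewing the page source would confirm this). Under that assumption, the text can be split into an array of lines. The helper name `splitIntoLines` is hypothetical.

```php
<?php
// Split extracted text into trimmed, non-empty lines.
// Handles Windows (\r\n), old Mac (\r) and Unix (\n) line endings.
function splitIntoLines(string $text): array
{
    $lines = preg_split('/\r\n|\r|\n/', $text);
    $lines = array_map('trim', $lines);
    // Drop blank lines and reindex from 0.
    return array_values(array_filter($lines, function ($line) {
        return $line !== '';
    }));
}

// Usage with the text from the question:
// foreach (splitIntoLines($text) as $line) {
//     echo htmlspecialchars($line), "<br>\n";
// }
```

If the goal is only to show the line breaks in the browser, `echo nl2br(htmlspecialchars($text));` is a shorter option.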

Reading PDF files in PHP or JS, then extracting the contents, ideally as text

I have a task that involves reading PDF files after they are uploaded to the DB or to a folder.
The question is: how can I read PDF files in PHP or JS (jQuery, AJAX)?
Then I want to retrieve the data and inject it into form fields.
There is a lot of information on doing this with text files, but PDF seems complicated. Is there a PHP class for that? I'm not used to classes in PHP, but with some pointers I could work it out.
Thanks a lot for your help!
Have a great one!
I managed to do this using http://www.pdfparser.org/
I needed to get the specifications from a PDF file as raw text. This is the code I used:
<?php
include 'pdfparser-master/vendor/autoload.php';
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('specs.pdf');
$text = $pdf->getText();
echo $text;
?>
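To get from the raw text to form fields, the text still has to be picked apart. A rough sketch, assuming purely for illustration that the PDF text contains simple "Label: value" lines; the helper name `extractFields` and the labels in the usage comment are hypothetical, and a real document will likely need its own patterns.

```php
<?php
// Turn "Label: value" lines into an associative array.
// Lines without a colon, or with an empty value, are ignored.
function extractFields(string $text): array
{
    $fields = [];
    $pattern = '/^[ \t]*([^:\r\n]+?)[ \t]*:[ \t]*(\S.*?)[ \t\r]*$/m';
    if (preg_match_all($pattern, $text, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $fields[$match[1]] = $match[2];
        }
    }
    return $fields;
}

// Usage: prefill a form input, escaping the value for HTML.
// $fields = extractFields($text);
// echo '<input name="weight" value="'
//     . htmlspecialchars($fields['Weight'] ?? '') . '">';
```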

Output Buffer + Pdf - PHP

I have a page that uses the glob function and file_get_contents to load a few HTML files and store them in the output buffer.
I want to convert this buffer (ob_get_contents()) to a PDF file.
What is the best way to do that, and how?
Thanks in advance.
For creating PDF files from HTML and CSS, check out DOMpdf.
While this solution doesn't support the full range of HTML and CSS and its rendering can be a pain sometimes, it has one advantage: it does not require any special binaries to be installed (like wkhtmltopdf). It should run on your average shared PHP hosting.
Usage example:
<?php
require_once("dompdf_config.inc.php");
$html =
'<html><body>'.
'<p>Put your html here, or generate it with your favourite '.
'templating system.</p>'.
'</body></html>';
$dompdf = new DOMPDF();
$dompdf->load_html($html);
$dompdf->render();
$dompdf->stream("sample.pdf");
?>
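To connect this to the output buffer from the question, here is a small sketch that captures echoed HTML into a string. The helper name `captureOutput` and the `pages/*.html` path are hypothetical. The commented usage assumes a newer, Composer-installed Dompdf release (0.7 or later), where the class is namespaced as `Dompdf\Dompdf` and the method is `loadHtml()` rather than `load_html()`.

```php
<?php
// Run a callback and return everything it echoed as a string.
function captureOutput(callable $render): string
{
    ob_start();
    try {
        $render();
    } catch (Throwable $e) {
        // Discard the partial buffer before re-throwing.
        ob_end_clean();
        throw $e;
    }
    return (string) ob_get_clean();
}

// Usage (assumes Dompdf installed via Composer):
// require 'vendor/autoload.php';
// $html = captureOutput(function () {
//     foreach (glob('pages/*.html') as $file) {
//         echo file_get_contents($file);
//     }
// });
// $dompdf = new \Dompdf\Dompdf();
// $dompdf->loadHtml($html);
// $dompdf->render();
// $dompdf->stream('sample.pdf');
```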
Why use the output buffer for this? You already have the content in variables from file_get_contents and can simply create your PDF from those variables. All ob_get_contents does is return the output buffer, and what you normally do with the result is save it into a variable anyway.
By the way, do you want to convert HTML into PDF? If so, have a look at wkhtmltopdf.
If ob_get_contents contains HTML, there are many solutions out there that can achieve what you want. I think you should look at the following:
PrinceXML
FPDF
TCPDF
HTML to PDF converter (PHP5)
wkhtmltopdf
Example of a simple HTML to PDF conversion in PHP, using HTML2FPDF:
$html = ob_get_contents();
ob_end_clean();
$pdf = new HTML2FPDF();
$pdf->SetTopMargin(1);
$pdf->AddPage();
$pdf->WriteHTML($html);
$pdf->Output('test.pdf','D');
