I'm using pdfparser to extract text from a PDF file. It works for PDF files in older versions, but it does not work for files in newer versions.
My PDF is version 1.7.
<?php
include 'vendor/autoload.php';
// Parse the PDF file and build the necessary objects.
$parser = new Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('sample.pdf');
// Retrieve all pages from the PDF file.
$pages = $pdf->getPages();
// Loop over each page to extract its text.
$content = array();
foreach ($pages as $page) {
    $content[] = $page->getTextArray();
}
// Print once after the loop; printing inside it repeats the earlier pages.
echo "<pre>";
print_r($content);
I experienced the same behaviour!
Now I use a tool to check the PDF version before I try to parse the file. If it is not 1.4, I convert it to 1.4 and then parse it.
Here is a PHP library for that, if needed: https://github.com/xthiago/pdf-version-converter
Code example:
use Symfony\Component\Filesystem\Filesystem;
use Xthiago\PDFVersionConverter\Converter\GhostscriptConverter;
use Xthiago\PDFVersionConverter\Converter\GhostscriptConverterCommand;
use Xthiago\PDFVersionConverter\Guesser\RegexGuesser;

function searchablePdfParser($systemPath) {
    // Work on a temporary copy because the file might need to be converted.
    // Note: getPathWithIdAndTimestamp() is not defined in this snippet.
    $tempPath = getPathWithIdAndTimestamp($systemPath) . 'tmp.pdf';
    copy($systemPath, $tempPath);
    // Check whether the copy needs to be converted, and convert it if so.
    $guesser = new RegexGuesser();
    $pdfVersion = $guesser->guess($tempPath); // returns something like '1.4'
    if ( $pdfVersion != '1.4' ) {
        $command = new GhostscriptConverterCommand();
        $filesystem = new Filesystem();
        $converter = new GhostscriptConverter($command, $filesystem);
        $converter->convert($tempPath, '1.4');
    }
    // Parse the copy, which by now is either an original 1.4 file or the converted one.
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($tempPath);
    $text = $pdf->getText();
    unlink($tempPath);
    // Treat very short results as "no usable text".
    if ( strlen($text) < 30 ) {
        return '';
    }
    return $text;
}
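To illustrate the version check, here is a minimal sketch of how a PDF version can be read from the file header. It assumes the file starts with the usual "%PDF-x.y" marker; pdfHeaderVersion() is a hypothetical helper for illustration and is not how RegexGuesser is necessarily implemented. A document can also override its header version elsewhere in the file, which this sketch ignores.

```php
<?php
// Hypothetical helper (not part of pdfparser or pdf-version-converter):
// reads the version from the "%PDF-x.y" marker at the start of a PDF file.
function pdfHeaderVersion(string $firstBytes): ?string
{
    if (preg_match('/^%PDF-(\d\.\d)/', $firstBytes, $matches) === 1) {
        return $matches[1];
    }
    return null; // no recognizable header at the start of the data
}

// Usage idea: only the first few bytes of the file are needed, e.g.
// $version = pdfHeaderVersion((string) file_get_contents('sample.pdf', false, null, 0, 16));
```

Reading just the header keeps the check cheap, so it can run before deciding whether a conversion is needed at all.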
Related
I am using the PdfParser library to parse PDFs. When parsing a 20-page PDF file, some pages are read and some are not. This is the code I am using:
$str_path = 'example_book.pdf';
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($str_path);
$pages = $pdf->getPages();
$page = $pages[7];
$text = $page->getText();
echo $text;
When I run the php script I get this error:
Call to undefined method Smalot\PdfParser\Encoding::__toString()
Smalot\PdfParser\Font::translateChar
vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php:104
Is there any other way to do this?
Follow the steps below:
1. Open the following file in an editor:
vendor\smalot\pdfparser\src\Smalot\PdfParser\Font.php
2. Search for the function translateChar.
3. Comment out (or remove) the following code:
if (\strlen($char) < 2 && $this->has('Encoding') && 'WinAnsiEncoding'
=== $this->get('Encoding')->__toString()) {
    $fallbackDecoded = self::uchr($dec);
}
4. Save the file and exit.
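If editing files under vendor is not an option (such changes are lost when Composer reinstalls the package), one alternative is to catch the failure per page so that the readable pages are still returned. This is only a sketch: extractTextPerPage() is a hypothetical helper, and it assumes each page object exposes getText() as in the question's code.

```php
<?php
// Hypothetical helper: collects text page by page and records failures
// instead of aborting. $pages is any iterable of objects with getText(),
// such as the array returned by $pdf->getPages().
function extractTextPerPage(iterable $pages): array
{
    $texts = [];
    $failed = [];
    foreach ($pages as $index => $page) {
        try {
            $texts[$index] = $page->getText();
        } catch (\Throwable $e) {
            // "Call to undefined method" is raised as an \Error in PHP 7+,
            // so \Throwable is caught to cover errors as well as exceptions.
            $failed[$index] = $e->getMessage();
        }
    }
    return ['texts' => $texts, 'failed' => $failed];
}
```

The 'failed' entries show which page indexes could not be read, which helps when reporting the problem to the library maintainers.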
I am using PDFTOHTML (a PHP library) to convert PDF files to HTML. It works fine, but it only shows the converted file in the browser and does not store it in a local folder. I want to store the converted HTML in a local folder using PHP, with the same name as the PDF, i.e. mydata.pdf becomes mydata.html.
The code that converts the PDF to HTML is:
<?php
// if you are using composer, just use this
include 'vendor/autoload.php';
$pdf = new \TonchikTm\PdfToHtml\Pdf('cv.pdf', [
'pdftohtml_path' => 'C:/wamp64/www/new/poppler-0.51/bin/pdftohtml.exe',
'pdfinfo_path' => 'C:/wamp64/www/new/poppler-0.51/bin/pdfinfo.exe'
]);
// get the content of all pages and loop over them
foreach ($pdf->getHtml()->getAllPages() as $page) {
echo $page . '<br/>';
}
?>
Just change your foreach to
$filePdf = 'cv'; // your PDF filename without the extension
$pdf = new \TonchikTm\PdfToHtml\Pdf($filePdf.'.pdf', [
    'pdftohtml_path' => 'C:/wamp64/www/new/poppler-0.51/bin/pdftohtml.exe',
    'pdfinfo_path' => 'C:/wamp64/www/new/poppler-0.51/bin/pdfinfo.exe'
]);
$counterPage = 1;
foreach ($pdf->getHtml()->getAllPages() as $page) {
    // directory and filename where you want to save the page
    $filename = $filePdf . "_" . $counterPage.'.html';
    if (file_exists($filename)) {
        // the file already exists: do something
    } else {
        // otherwise write the page to a new file
        $fileOpen = fopen($filename, 'w+');
        fputs($fileOpen, $page);
        fclose($fileOpen);
    }
    $counterPage++;
    echo $page . '<br/>';
}
This will create one file per page, for example cv_1.html, cv_2.html, and so on.
If this does not help, you probably need to use file_put_contents() together with ob_start() and ob_get_contents().
Look at this:
<?php
// if you are using composer, just use this
include 'vendor/autoload.php';
$pdf = new \TonchikTm\PdfToHtml\Pdf('cv.pdf', [
    'pdftohtml_path' => 'C:/wamp64/www/new/poppler-0.51/bin/pdftohtml.exe',
    'pdfinfo_path' => 'C:/wamp64/www/new/poppler-0.51/bin/pdfinfo.exe'
]);
// get the content of all pages and collect it in one string
$file = fopen('cv.html', 'w+');
$data = '';
foreach ($pdf->getHtml()->getAllPages() as $page) {
    $data .= $page."<br/>";
}
fputs($file, $data);
fclose($file);
I did not test this code
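The question asks for the HTML file to take its name from the PDF (mydata.pdf to mydata.html). A minimal sketch of that naming step, combined with a single file_put_contents() call, could look like this. Both htmlPathForPdf() and savePagesAsHtml() are hypothetical helpers introduced for illustration; only getHtml()->getAllPages() comes from the code above.

```php
<?php
// Hypothetical helper: maps "mydata.pdf" to "<outputDir>/mydata.html".
function htmlPathForPdf(string $pdfPath, string $outputDir): string
{
    $base = pathinfo($pdfPath, PATHINFO_FILENAME); // file name without extension
    return rtrim($outputDir, '/\\') . '/' . $base . '.html';
}

// Hypothetical helper: joins the per-page HTML strings and writes one file.
// Returns the number of bytes written.
function savePagesAsHtml(iterable $pages, string $htmlPath): int
{
    $html = '';
    foreach ($pages as $page) {
        $html .= $page . "<br/>\n";
    }
    $bytes = file_put_contents($htmlPath, $html);
    if ($bytes === false) {
        throw new \RuntimeException("Could not write $htmlPath");
    }
    return $bytes;
}
```

With the library from the question, the call would then be along the lines of savePagesAsHtml($pdf->getHtml()->getAllPages(), htmlPathForPdf('cv.pdf', __DIR__)).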
I recently tried dompdf. After my file is converted to PDF, the call to $this->event_registration(); no longer works. How can I deal with this issue?
My code:
$data['info'] = $this->reg_model->getInfo();
$id = $data['info']->id;
$file = $data['info']->Register_num;
$data['ed_info'] = $this->reg_model->getEdInfo($id);
$this->load->view('info.html',$data);
$html = $this->output->get_output();
// Load library
$this->load->library('dompdf_generate');
// Convert to PDF
$this->dompdf->load_html($html);
$this->dompdf->render();
$this->dompdf->stream($file.".pdf");
$this->event_registration();
I'm trying to load images from an HTTP URL, but they won't display in my generated PDF.
$this->layout = '//layouts/pdftemplate';
$pdf = Yii::app()->toPDF->mpdf();
$pdf->shrink_tables_to_fit = 1;
$pdf->defaultfooterline = false;
$stylesheet = file_get_contents(Yii::app()->basePath.'/../webroot/admin/themes/admin/css/formbuilder-print.css');
$pdf->WriteHTML($stylesheet, 1);
$pdf->WriteHTML($_POST['html_string']);
$pdf->Output(sys_get_temp_dir()."/test.pdf", 'F');
I'm passing the HTML to the PHP function in an AJAX call. The images are hosted on Amazon CloudFront.
Update
Thanks to Asped and Latheesan Kanes I got the issue resolved. I also used PHP's DOMDocument class to replace the image urls with the local copy of the image. This is for future reference if anyone also runs into a similar issue
$doc = new DOMDocument();
// Suppress warnings about malformed HTML while loading.
@$doc->loadHTML($_POST['html_string']);
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img) {
    $src = $img->getAttribute('src');
    // Strip the query string to get a plain file name.
    $name = explode('?', basename($src));
    $name = $name[0];
    $tmp = sys_get_temp_dir().'/'.$name;
    // Download the image and point the tag at the local copy.
    copy($src, $tmp);
    $img->setAttribute('src', $tmp);
}
$html = $doc->saveHTML(); // you can write this to the pdf: $pdf->WriteHTML($html);
I had a similar issue once when displaying an SVG file in the PDF: it would not work. I then converted it to a PNG (on the fly), stored it locally in a temp folder, and passed the temporary file to mPDF, which helped.
UPDATE - Actually, now I remember that I didn't even have to convert it; I just had to store it locally in a temp folder.
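When replacing remote image URLs with local copies, the local file name has to be derived from the URL. Taking basename() of the whole URL can go wrong if the query string itself contains a slash, so here is a hedged sketch that looks only at the URL path. localPathForImageUrl() is a hypothetical helper, and the md5 fallback is an assumption for URLs without a file name.

```php
<?php
// Hypothetical helper: derives a local temp path for a remote image URL,
// ignoring the query string (for example, URL signing parameters).
function localPathForImageUrl(string $url, string $tmpDir): string
{
    $path = parse_url($url, PHP_URL_PATH);
    $name = basename(is_string($path) ? $path : '');
    if ($name === '') {
        // Fall back to a hash when the URL has no usable file name.
        $name = md5($url);
    }
    return rtrim($tmpDir, '/\\') . '/' . $name;
}

// Usage idea inside the loop over <img> tags:
// $tmp = localPathForImageUrl($src, sys_get_temp_dir());
// copy($src, $tmp);
// $img->setAttribute('src', $tmp);
```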
I am currently trying to convert an ODT file to PDF silently from PHP, using OpenOffice. Here is the code for that:
function MakePropertyValue($name, $value, $osm) {
    $oStruct = $osm->Bridge_GetStruct("com.sun.star.beans.PropertyValue");
    $oStruct->Name = $name;
    $oStruct->Value = $value;
    return $oStruct;
}
function odt2pdf($doc_url, $output_url) {
    $osm = new COM("com.sun.star.ServiceManager") or die ("Please be sure that OpenOffice.org is installed.\n");
    // Open the document without showing a window.
    $args = array(MakePropertyValue("Hidden", true, $osm));
    $oDesktop = $osm->createInstance("com.sun.star.frame.Desktop");
    $oWriterDoc = $oDesktop->loadComponentFromURL($doc_url, "_blank", 0, $args);
    // Export filter options.
    $aFilterData = array();
    $aFilterData[0] = $osm->Bridge_GetStruct("com.sun.star.beans.PropertyValue");
    $aFilterData[0]->Name = "SelectPdfVersion";
    $aFilterData[0]->Value = 1;
    $obj = $osm->Bridge_GetValueObject();
    $obj->set("[]com.sun.star.beans.PropertyValue", $aFilterData);
    // Store properties: the export filter and its options.
    $storePDF = array();
    $storePDF[0] = $osm->Bridge_GetStruct("com.sun.star.beans.PropertyValue");
    $storePDF[0]->Name = "FilterName";
    $storePDF[0]->Value = "writer_pdf_Export";
    $storePDF[1] = $osm->Bridge_GetStruct("com.sun.star.beans.PropertyValue");
    $storePDF[1]->Name = "FilterData";
    $storePDF[1]->Value = $obj;
    $oWriterDoc->storeToURL($output_url, $storePDF);
    $oWriterDoc->close(true);
}
$output_dir = "C:/wamp/www/cert/pdf/";
$doc_file = "C:/wamp/www/cert/output2.odt";
$pdf_file = "output2.pdf";
$output_file = $output_dir . $pdf_file;
// OpenOffice expects file URLs rather than plain paths.
$doc_file = "file:///" . $doc_file;
$output_file = "file:///" . $output_file;
odt2pdf($doc_file, $output_file);
I have managed to get it to convert to PDF/A-1a as can be seen, however it still does not preserve the fonts being used. If I convert this from within the GUI in Open Office, then the fonts are kept. What's wrong?
UPDATE: I decided to stop converting to PDF and do a silent print of the odt to the printer with Open Office using exec() in PHP.
As an additional note, make sure to switch off Font substitution on the printer driver to prevent any font substitution when printing to a PostScript printer.
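The conversion code above builds its file URLs by prefixing "file:///" to the path, which works for simple paths like the ones shown. For paths with backslashes or spaces, a small helper along these lines may be more robust. This is a sketch under assumptions: toFileUrl() is a hypothetical name, and it expects an absolute path.

```php
<?php
// Hypothetical helper: turns an absolute Windows or Unix path into a file URL.
function toFileUrl(string $absolutePath): string
{
    // Normalize Windows separators to forward slashes.
    $path = str_replace('\\', '/', $absolutePath);
    // Percent-encode each segment, then restore colons so that a
    // drive letter such as "C:" stays readable.
    $segments = array_map('rawurlencode', explode('/', $path));
    $encoded = str_replace('%3A', ':', implode('/', $segments));
    return 'file:///' . ltrim($encoded, '/');
}

// Usage idea: odt2pdf(toFileUrl($doc_file), toFileUrl($output_file));
```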
Try first converting the ODT to HTML (to preserve the formatting) using this library:
https://gist.github.com/1918801
Then use the FPDF library to convert the HTML to PDF.
FPDF does not convert from HTML out of the box, so there is a plugin for that:
plugin for FPDF to convert from HTML