I have a series of PDF files on my shared hosting web server, and I'm writing a PHP script to catalogue them on screen. I've added metadata to the PDF files: Document Title, Author and Subject. The filename is composed of the Author and Title, so I can construct the catalogue text from that. However, I want to display the contents of the 'Subject' metadata field as well.
Because I'm using shared hosting, I cannot install any extra PHP extensions. The host provides the free version of PDFLib, but it doesn't include any functions to load a PDF file or to extract its metadata.
This is the script so far, which just displays a list of the filenames...
function catalogue($folder){
    $files = preg_grep('/^([^.])/', scandir($folder));
    foreach($files as $file){
        echo($file.'<br/>');
    }
}
So, I've not made much progress :(
I've tried PDF_open_pdi_document() but this is not part of the installed PDFLib extension. I've tried PDF_pcos_get_string() but all I get with...
PDF_pcos_get_string($file,0,'author');
...is...
pdf_pcos_get_string(): supplied resource is not a valid pdf object resource
...and I can find literally ZERO help on the web for this function. Literally nothing!
I am running PHP 7.4 on the shared hosting.
The metadata isn't encrypted like the rest of the PDF, so you can load the file with file_get_contents(), find the pattern for the subject (<</Subject), and extract it using either a regex or a simple combination of strpos()/substr().
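A minimal sketch of the regex route described above. It assumes the Info dictionary is stored uncompressed and that the value is a plain literal string such as `/Subject (My subject)`; hex strings, UTF-16 values and octal escapes are not handled. The function name `pdf_info_field` is hypothetical.

```php
<?php
/**
 * Return the value of a literal-string entry such as "/Subject (...)"
 * found in raw PDF bytes, or null when no such entry is present.
 * Assumption: the entry is uncompressed and written as a literal string.
 */
function pdf_info_field(string $pdfBytes, string $key): ?string
{
    // Match "/Key (value)"; the value may contain backslash-escaped characters.
    $pattern = '/\/' . preg_quote($key, '/') . '\s*\(((?:\\\\.|[^\\\\()])*)\)/s';
    if (!preg_match($pattern, $pdfBytes, $m)) {
        return null;
    }
    // Undo the escaping of parentheses and backslashes.
    return preg_replace('/\\\\([()\\\\])/', '$1', $m[1]);
}
```

With the contents of one file loaded via file_get_contents(), a call such as `pdf_info_field($bytes, 'Subject')` would return the subject text, or null if the pattern is not found.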
Thank you @drdlp. I've used file_get_contents() to load in the PDF and extract and display the metadata.
function catalogue($folder){
    $files = preg_grep('/^([^.])/', scandir($folder));
    foreach($files as $file){
        // scandir() returns bare filenames, so prepend the folder
        $page = file_get_contents($folder.'/'.$file);
        $metadata = preg_match_all('/\/[^\(]*\(([^\/\)]*)/',$page,$matches);
        $author = $matches[1][0];
        $subject = $matches[1][4];
        $title = $matches[1][5];
        echo($title.'/'.$subject.'/'.$author.'<br>');
    }
}
However, this is very slow for 40 odd PDF articles in a folder.
How can I speed this up?
I've begun experimenting with pdf.js for which I can load all the basic details from files first (filename etc) and then update them with Javascript after the page has loaded.
However, I clearly don't know enough about Javascript to make this work. This is what I have so far and I am very stuck. I've imported pdf.js from mozilla.github.io/pdf.js/build/pdf.js...
function pdf_metadata(file_url,id){
    var pdfjsLib = window['pdfjs-dist/build/pdf'];
    pdfjsLib.GlobalWorkerOptions.workerSrc = '//mozilla.github.io/pdf.js/build/pdf.worker.js';
    var loadingTask = pdfjsLib.getDocument(file_url);
    loadingTask.promise.then(function(pdf) {
        pdf.getMetadata().then(function(details) {
            console.log(details);
            document.getElementById(id).innerHTML=details;
        }).catch(function(err) {
            console.log('Error getting meta data');
            console.log(err);
        });
    });
}
The line console.log(details); outputs an object to the console. From there I have no idea how to extract any data at all. Therefore document.getElementById(id).innerHTML=details; displays nothing.
This is the object which is output to the console.
What I'm trying to accomplish is populating a PDF form with PHP.
I tried many approaches and found that FPDM (FPDF) works well when I create a new form or use the form from the source file they provide.
My problem arises when I use an existing PDF form that has restrictions such as an owner password, or that is signed and certified. I used an app to remove those restrictions, but some of them remain. The picture below shows what my current PDF looks like.
That PDF was also compressed, and because FPDM was throwing the error that 'Object Stream' is not supported, I decompressed it with PDFTK, so the file went from 1.48 MB to 6.78 MB.
I also used PDFTK to get all the form field names, so I have them in a txt file.
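As an illustration of working with that field list, here is a small sketch that pulls the names out of the text file. It assumes the file was produced by `pdftk form.pdf dump_data_fields`, where each field record contains a line starting with `FieldName: `; the function name `pdftk_field_names` is hypothetical.

```php
<?php
/**
 * Extract the field names from pdftk dump_data_fields output.
 * Assumption: each field record has one line of the form "FieldName: <name>".
 */
function pdftk_field_names(string $dump): array
{
    $names = [];
    foreach (preg_split('/\R/', $dump) as $line) {
        if (strpos($line, 'FieldName: ') === 0) {
            $names[] = substr($line, strlen('FieldName: '));
        }
    }
    return $names;
}
```

Comparing the returned names byte for byte with the keys passed to FPDM can help rule out a simple typo before looking for deeper causes.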
Following FPDM's instructions, there are two ways to do this:
The first way is to pass only an array of field_name => value pairs along with the PDF I want to change, and that's it. When I use the PDF described above, I get the error:
'FPDF-Merge Error: field form1[0].#subform[0].Line1_GivenName[0] not found'
Note that I have all the field names, and this name exists.
<?php
require('fpdm.php');
$fields = array(
    'form1[0].#subform[0].Line1_GivenName[0]' => 'my name'
);
$pdf = new FPDM('test.pdf');
$pdf->Load($fields, false); // second parameter: false if field values are in ISO-8859-1, true if UTF-8
$pdf->Merge();
$pdf->Output('new_pdf.pdf', 'F');
?>
The other way is to create an FDF file with the createXFDF function and then use FPDM to merge the FDF into the PDF. This creates 'new_file.pdf' as I want, but it is empty :)
function createXFDF($file, $info, $enc = 'UTF-8') {
    $data = '<?xml version="1.0" encoding="'.$enc.'"?>' . "\n" .
        '<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">' . "\n" .
        '<fields>' . "\n";
    foreach($info as $field => $val) {
        $data .= '<field name="' . $field . '">' . "\n";
        if(is_array($val)) {
            foreach( $val as $opt )
                $data .= '<value>' .
                    htmlentities( $opt, ENT_COMPAT, $enc ) .
                    '</value>' . "\n";
        } else {
            $data .= '<value>' .
                htmlentities( $val, ENT_COMPAT, $enc ) .
                '</value>' . "\n";
        }
        $data .= '</field>' . "\n";
    }
    $data .= '</fields>' . "\n" .
        '<ids original="' . md5( $file ) . '" modified="' .
        time() . '" />' . "\n" .
        '<f href="' . $file . '" />' . "\n" .
        '</xfdf>' . "\n";
    return $data;
}
require('fpdm.php');
$pdf = new FPDM('test.pdf', 'posted-0.fdf');
$pdf->Merge();
$pdf->Output('new_file.pdf', 'F');
One more thing: if I try to open the FDF file in Acrobat, I get this message:
'The file you are attempting to open contains comments or form data that are supposed to be placed on test.pdf. This document cannot be found. It may have been moved, or deleted. Would you like to browse to attempt to locate this document?'
but the file is there, not moved or deleted. When I find it manually the form populates.
If anyone has experience with this, any help or advice would help a lot.
Thank you in advance, Vukasin
EDIT:
More info about the PDF file
I spent more than a full day working through issues with FPDM, and was hard pressed to find anyone who had similar problems.
The following format worked for me: PDF 1.4 (Acrobat 5). I actually had to go to Save As -> choose Adobe PDF Optimized, then click the Settings button. From there I had to choose the version from the drop-down/fly-out menu.
I received the error 'not compatible with fast web view' or similar. In the PDF Optimized settings, if you click 'clean up' on the left side, you can untoggle fast web view.
Next I received the error: FPDF-Merge Error: field 'fieldname' not found. When I ran the file through pdftk hoping to resolve this, I received the error: FPDF-Merge Error: Number of objects (35) differs with number of xrefs (36), something, pdf xref table is corrupted :(
To fix this issue, I had to download and install the pdftk server utility on Ubuntu:
sudo apt-get install pdftk
After installing it, I ran this command to repair the PDF's corrupted XREF table and stream lengths, if possible:
pdftk broken.pdf output fixed.pdf
When I open fixed.pdf it has no issues whatsoever and populates the fields correctly. Hallelujah, this was the most annoying issue in the world. To summarize, I had to put the PDF through the following steps:
Edit the PDF to preference
Save As > PDF Optimized > Settings > Acrobat 5 > uncheck Fast Web View under Clean Up
Open the file in pdftk and resave
Install command-line pdftk on Ubuntu
Run the command: pdftk broken.pdf output fixed.pdf
Done.
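Where exec() is available, the resave step above could be driven from PHP. This is only a sketch: the helper name `pdftk_resave_command` is hypothetical, it assumes pdftk is on the PATH, and many shared hosts disable exec() altogether.

```php
<?php
/**
 * Build the shell command that resaves a PDF through pdftk,
 * i.e. "pdftk <input> output <output>", with both paths shell-escaped.
 */
function pdftk_resave_command(string $input, string $output): string
{
    return 'pdftk ' . escapeshellarg($input) . ' output ' . escapeshellarg($output);
}

// Usage, only where exec() is permitted (file names are examples):
// exec(pdftk_resave_command('broken.pdf', 'fixed.pdf'), $lines, $exitCode);
// A non-zero $exitCode would indicate that pdftk reported a failure.
```

Escaping the paths with escapeshellarg() matters as soon as the file names come from user uploads.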
After revealing the creator/producer info the problem is clear.
You do not have a real PDF form; you have an XFA form (created by LiveCycle Designer), wrapped in a PDF wrapper so that Adobe Reader can display it.
XFA forms do not support (X)FDF. You have to import data using XML. You can try to export the data from a filled version, and then use this as a sample for creating the import XML.
Note that the XML export/import format XFA forms use is not the same as XFDF (which is simply an XML representation of FDF, the PDF-native forms data format).
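A rough way to check for this from PHP is to look for an /XFA entry in the file. This is a heuristic sketch, not a parser: it assumes the AcroForm dictionary is stored uncompressed (for example after the file has been decompressed with pdftk), and the function name `looks_like_xfa` is hypothetical.

```php
<?php
/**
 * Heuristically report whether raw PDF bytes contain an /XFA entry.
 * Assumption: the form dictionary is not hidden inside a compressed object stream.
 */
function looks_like_xfa(string $pdfBytes): bool
{
    // "/XFA" is followed either by an array "[" or by an indirect reference like "12 0 R".
    return preg_match('/\/XFA\s*(\[|\d+\s+\d+\s+R)/', $pdfBytes) === 1;
}
```

If this returns true for a form that FPDM refuses to fill, the XFA explanation above is the likely cause.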
Thanks to Kyle's answer, I resolved a similar issue.
I have a PDF created in Adobe Acrobat Pro DC and need to populate its form fields from web input.
On my development machine, I created an FDF file from my web form data and merged it into the PDF using pdftk (called from PHP with exec()).
But I couldn't put that on the cloud Linux web server where my site is hosted, because it requires deprecated libraries.
So I switched to FPDM and got the following errors:
'Fast Web View mode is not supported'. I fixed that by setting preferences in 'save as' in Adobe Acrobat Pro (save as -> pdf optimized -> settings -> clean up -> uncheck Fast Web View).
'Object streams are not supported' - again, fixed in the 'save as' preferences (clean up -> object compression options -> remove compression).
'Incremental updates are not supported' - again, fixed using 'save as' in Acrobat.
Then FPDM ran, but couldn't read any field names.
The One-Step Solution:
Take the original PDF file - with incremental updates, object compression and fast web view - and pass it through pdftk on Windows, exactly as Kyle describes.
> pdftk broken.pdf output fixed.pdf
Now FPDM populates the fields correctly from the fdf file.
I found a tool called Scribus after examining the template used in the fpdf example. You can use it to create pdf templates and the format created plays nice with fpdm. It isn't a complicated program and allows you to create form fields with permissions/parameters around them (like making a form field read-only after you populate data in it from an online form). For my application, I needed to have some fields pre-populated from values in a database that were non-editable, have other fields that were pre-populated, but still editable and some fields that were empty and required completion (force required). It was all possible using the template that Scribus has generated.
MAGNIFICENT work!
I created a PDF with Acrobat, then "fixed" it with pdftk, and the FPDM class merged the data perfectly...
I have a photo community (www.jungledragon.com) that allows users to upload photos. My platform is PHP/CodeIgniter.
As part of the upload process I'm already reading EXIF info using PHP's exif_read_data function, which works fine. I read camera details and show these on an info tab.
On top of that, users are expected to manually set the photo title, description and tags on the website after uploading the photo. However, some users manage these fields in their image management program, for example Lightroom. It would be great if I could read those as well; uploading would become a total joy.
I already improved my EXIF reading to read the "caption", this way users don't have to set the image title after uploading anymore. Now I'm looking to read keywords, which is where I am stuck. Here's a partial screenshot of an image in Lightroom:
I can read the Metadata, but how do I read the keywords? The fact that it is not inside metadata makes me wonder if it's at all possible? I've tried reading every value I can get (ANY_TAG, IFD0, EXIF, APP12) using exif_read_data, but the keywords are not to be found.
Any thoughts?
As suggested you may have to use another method of reading metadata.
http://www.foto-biz.com/Lightroom/Exif-vs-iptc-vs-xmp
Image keywords may be stored in IPTC and not in EXIF. I don't know if there is a standard platform method for reading IPTC, but a quick Google search turns up this:
http://php.net/manual/en/function.iptcparse.php
Try using PEL, a much more comprehensive library than exif_read_data() for exif data.
After a lot of research, I found a solution for reading the keywords that Lightroom exports into a JPG file:
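// getimagesize() fills $info with the APPn marker segments it finds
$image = getimagesize($imagepath, $info);
if (isset($info['APP13']))
{
    $iptc = iptcparse($info['APP13']);
    // 2#025 is the IPTC keywords record; it may be absent
    if (is_array($iptc) && isset($iptc["2#025"]))
    {
        foreach ($iptc["2#025"] as $keyword)
        {
            echo "keyword : " . $keyword . "<br/>";
        }
    }
}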
What I'm looking for is something which will take any file extension given to it and return a description of that extension.
For example :
$extension = 'PNG';
$description = ext_description($extension);
echo $description; // Outputs 'Portable Network Graphic'
OR
$extension = 'DOC';
$description = ext_description($extension);
echo $description; // Outputs 'Microsoft Office Word Document'
I've searched google and nothing came up. It would be a huge time saver if anyone knew if such a script existed.
Thanks in advance.
Yes, there are: one is the getID3 library, and the other is PHP's built-in fileinfo extension.
Since everyone wants something different for the description, you can create a simple look-up function that works on array data. You only need to add your file types to the array and you're done:
function ext_description($extension) {
    static $extensions = array(
        'png' => 'Portable Network Graphic',
        'doc' => 'Microsoft Office Word Document'
    );
    $extension = strtolower($extension);
    return isset($extensions[$extension])
        ? $extensions[$extension]
        : sprintf('Unknown File (%s)', $extension)
    ;
}
It doesn't exist, but you can create one.
function ext_description($extension) {
    // Lower-case first so that 'PNG' and 'png' are treated the same
    switch (strtolower($extension)) {
        case "png":
            return "Portable Network Graphic";
        case "doc":
            return "Microsoft Office Word Document";
        default:
            return "Unknown extension";
    }
}
You can find here a complete list of all existing extensions : http://en.wikipedia.org/wiki/List_of_file_formats_(alphabetical)
No, there is no easy way to do this in PHP. The closest alternative would be the MIME type which you can get using finfo or mime_content_type.
If you are running on windows you could also use the SHGetFileInfo function which does exactly what you're looking for.
http://msdn.microsoft.com/en-us/library/bb762179(v=vs.85).aspx
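The MIME-type route mentioned above can be sketched with the fileinfo extension. Note that this inspects file contents rather than the extension, and it yields a MIME type rather than a friendly description. The wrapper name `mime_type_of` is hypothetical, and the sketch assumes the fileinfo extension is enabled.

```php
<?php
/**
 * Return the MIME type that fileinfo detects for a chunk of file content.
 * Falls back to a generic type when detection fails.
 */
function mime_type_of(string $bytes): string
{
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $type = $finfo->buffer($bytes);
    return $type === false ? 'application/octet-stream' : $type;
}

// For a file on disk, $finfo->file($path) can be used instead of buffer().
```

The result could then be mapped to a readable description with a look-up array like the ones in the other answers.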
This is what I am currently using for the above question if anyone else ever needs it.
/**
 * Takes a file extension and returns a readable file type.
 *
 * @param string $ext
 * @return string
 * @example .doc returns Microsoft Word Document
 */
function file_extension_2_type($ext) {
    $files = json_decode('{".doc":"Microsoft Word Document",".docx":"Microsoft Word Open XML Document",".log":"Log File",".msg":"Outlook Mail Message",".pages":"Pages Document",".rtf":"Rich Text Format File",".txt":"Plain Text File",".wpd":"WordPerfect Document",".wps":"Microsoft Works Word Processor Document",".csv":"Comma Separated Values File",".dat":"Data File",".efx":"eFax Document",".gbr":"Gerber File",".key":"Keynote Presentation",".pps":"PowerPoint Slide Show",".ppt":"PowerPoint Presentation",".pptx":"PowerPoint Open XML Presentation",".sdf":"Standard Data File",".vcf":"vCard File",".xml":"XML File",".aif":"Audio Interchange File Format",".iff":"Interchange File Format",".m3u":"Media Playlist File",".m4a":"MPEG-4 Audio File",".mid":"MIDI File",".mp3":"MP3 Audio File",".mpa":"MPEG-2 Audio File",".wav":"WAVE Audio File",".wma":"Windows Media Audio File",".3g2":"3GPP2 Multimedia File",".3gp":"3GPP Multimedia File",".asf":"Advanced Systems Format File",".asx":"Microsoft ASF Redirector File",".avi":"Audio Video Interleave File",".flv":"Flash Video File",".mov":"Apple QuickTime Movie",".mp4":"MPEG-4 Video File",".mpg":"MPEG Video File",".swf":"Shockwave Flash Movie",".vob":"DVD Video Object File",".wmv":"Windows Media Video File",".3dm":"Rhino 3D Model",".max":"3ds Max Scene File",".bmp":"Bitmap Image File",".gif":"Graphical Interchange Format File",".jpg":"JPEG Image File",".png":"Portable Network Graphic",".psd":"Adobe Photoshop Document",".pspimage":"PaintShop Pro Image",".thm":"Thumbnail Image File",".tif":"Tagged Image File",".yuv":"YUV Encoded Image File",".ai":"Adobe Illustrator File",".drw":"Drawing File",".eps":"Encapsulated PostScript File",".ps":"PostScript File",".svg":"Scalable Vector Graphics File",".indd":"Adobe InDesign Document",".pct":"Picture File",".pdf":"Portable Document Format File",".qxd":"QuarkXPress Document",".qxp":"QuarkXPress Project File",".rels":"Open Office XML Relationships File",".xlr":"Works Spreadsheet",".xls":"Excel Spreadsheet",
    ".xlsx":"Microsoft Excel Open XML Spreadsheet",".accdb":"Access 2007 Database File",".db":"Database File",".dbf":"Database File",".mdb":"Microsoft Access Database",".pdb":"Program Database",".sql":"Structured Query Language Data",".app":"Mac OS X Application",".bat":"DOS Batch File",".cgi":"Common Gateway Interface Script",".com":"DOS Command File",".exe":"Windows Executable File",".gadget":"Windows Gadget",".jar":"Java Archive File",".pif":"Program Information File",".vb":"VBScript File",".wsf":"Windows Script File",".gam":"Saved Game File",".nes":"Nintendo (NES) ROM File",".rom":"N64 Game ROM File",".sav":"Saved Game",".dwg":"AutoCAD Drawing Database File",".dxf":"Drawing Exchange Format File",".gpx":"GPS Exchange File",".kml":"Keyhole Markup Language File",".asp":"Active Server Page",".cer":"Internet Security Certificate",".csr":"Certificate Signing Request File",".css":"Cascading Style Sheet",".htm":"Hypertext Markup Language File",".html":"Hypertext Markup Language File",".js":"JavaScript File",".jsp":"Java Server Page",".php":"Hypertext Preprocessor File",".rss":"Rich Site Summary",".xhtml":"Extensible Hypertext Markup Language File",".8bi":"Photoshop Plug-in",".plugin":"Mac OS X Plug-in",".xll":"Excel Add-In File",".fnt":"Windows Font File",".fon":"Generic Font File",".otf":"OpenType Font",".ttf":"TrueType Font",".cab":"Windows Cabinet File",".cpl":"Windows Control Panel Item",".cur":"Windows Cursor",".dll":"Dynamic Link Library",".dmp":"Windows Memory Dump",".drv":"Device Driver",".lnk":"File Shortcut",".sys":"Windows System File",".cfg":"Configuration File",".ini":"Windows Initialization File",".keychain":"Mac OS X Keychain File",".prf":"Outlook Profile File",".bin":"Macbinary Encoded File",".hqx":"BinHex 4.0 Encoded File",".mim":"Multi-Purpose Internet Mail Message File",".uue":"Uuencoded File",".7z":"7-Zip Compressed File",".deb":"Debian Software Package",".gz":"Gnu Zipped Archive",".pkg":"Mac OS X Installer Package",".rar":"WinRAR Compressed Archive",
    ".rpm":"Red Hat Package Manager File",".sit":"Stuffit Archive",".sitx":"Stuffit X Archive",".tar.gz":"Tarball File",".zip":"Zipped File",".zipx":"Extended Zip File",".dmg":"Mac OS X Disk Image",".iso":"Disc Image File",".toast":"Toast Disc Image",".vcd":"Virtual CD",".c":"C\/C++ Source Code File",".class":"Java Class File",".cpp":"C++ Source Code File",".cs":"Visual C# Source Code File",".dtd":"Document Type Definition File",".fla":"Adobe Flash Animation",".java":"Java Source Code File",".m":"Objective-C Implementation File",".pl":"Perl Script",".py":"Python Script",".bak":"Backup File",".gho":"Norton Ghost Backup File",".ori":"Original File",".tmp":"Temporary File",".dbx":"Outlook Express E-mail Folder",".msi":"Windows Installer Package",".part":"Partially Downloaded File",".torrent":"BitTorrent File"}', true);
    if (isset($files['.' . strtolower($ext)]))
        return $files['.' . strtolower($ext)] . ' (' . strtoupper($ext) . ')';
    return strtoupper($ext);
}
You could use mime types of the file http://www.feedforall.com/mime-types.htm
I guess no one has been lucky enough to find the best solution for handling reports in PHP, especially when it's a .doc/.docx report or file. I searched for some time and then found phpdocx.com. It looks like an amazing PHP script, but it just doesn't work, and I don't know exactly where to find the output file. Unfortunately, the documentation doesn't help at all.
Now I need to know how this script works: how the results come out and become usable, and what the script needs in order to work, because it simply doesn't work on my localhost. I am using Apache 2 and PHP 5.2.6.
I don't actually need more than writing HTML into a real .doc file (not renaming an HTML file to .doc!). So if there is any solution (without the COM library, since I am not on a Windows server) to generate a real .doc file from HTML, please post it here.
Thanks very much in advance :)
I guess no one has been lucky enough to find the best solution for handling
reports in PHP, especially when it's a .doc/.docx report or file
This doesn't quite match the question in the title, but you should try OpenTBS.
It's an open source PHP library that builds DOCX files using templates.
No temp directory and no extra executable are needed. First create your DOCX, XLSX or PPTX with MS Office (ODT, ODS and ODP, the OpenOffice formats, are also supported). Then use OpenTBS to load the template and change the content using the template engine (easy, see the demo). At the end, you save the result where you need it: a new file, a download stream, or a PHP binary string.
OpenTBS can also change pictures and charts in a document.
Demo page
Documentation
The documentation of PHPDocX has been greatly improved.
Have you tried to look at the PHPDocX tutorial?
You may also have a look at the Forum.
require_once "Path of phpdocx library/CreateDocx.inc";
$docx = new CreateDocx();
$html = 'your data will store in this variable';
$docx->embedHTML(
    $html,
    array(
        'parseDivsAsPs' => true,
        'downloadImages' => true,
        'WordStyles' => array(
            '<table>' => 'MediumGrid3-accent5PHPDOCX'
        ),
        'tableStyle' => 'NormalTablePHPDOCX'
    )
);
$docx->createDocx($varPublicPath.'/word_export_file/example1_'.time());
// this is the location where your docx file will be generated (it is stored inside word_export_file)