Reading pdf or word file with php - php

How can I open and read a PDF or Word file with PHP?
<?php
echo "hello";
$myfile = fopen("SoftwareTesting&Engineering_day1.pdf", "r") or die("Unable to open file!");
echo fread($myfile,filesize("SoftwareTesting&Engineering_day1.pdf"));
fclose($myfile);
?>

You cannot do it with PHP's fopen(); at best you will display junk data. Word documents and PDF files are not simple text files but binary formats with a complex internal structure. If you have a plain text file, the PHP file handling functions are all you need. If you really want to manipulate Word and PDF documents, you have to use a library written for that purpose.
https://github.com/PHPOffice/PHPWord
http://phpword.codeplex.com/
https://github.com/smalot/pdfparser
are some of them.

You could set the Content-Type header to the content type of the file (see PHP.net for information). Note: you must not print any content before setting the header. Also note that the client needs the correct plugins to display the file; otherwise the browser will offer it as a download or show the binary content of the file.
Then you could echo file_get_contents($path) to print out the file content itself.
(untested).
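The header approach above can be sketched roughly as follows. This is only an illustration: the function name downloadHeaders and the small extension-to-MIME map are assumptions made for this example, not part of any library.

```php
<?php
// Hypothetical helper: builds the header lines to send before the file body.
// The extension-to-type map is a small illustrative subset.
function downloadHeaders(string $path): array
{
    $types = [
        'pdf'  => 'application/pdf',
        'doc'  => 'application/msword',
        'docx' => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
        'rtf'  => 'application/rtf',
    ];
    $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // Fall back to a generic binary type for unknown extensions.
    $type = $types[$ext] ?? 'application/octet-stream';

    return [
        'Content-Type: ' . $type,
        'Content-Disposition: inline; filename="' . basename($path) . '"',
    ];
}
```

In a real script, each returned line would be passed to header() before any other output, followed by readfile($path) or echo file_get_contents($path).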

Related

Using php how I check a pdf file contents is valid or invalid

I am trying to implement a feature that detects whether a file is a PDF and whether its content is valid. Using the following script I can easily detect whether the file has a .pdf extension:
$info = pathinfo("test.pdf");
if ($info["extension"] == "pdf"){
echo "PDF file";
}
Now, if the file has a .pdf extension, I also want to check that its content is valid.
How can I check that the contents of a PDF file are valid, i.e. not corrupted or in an invalid format?
The content of a PDF file starts with %PDF- followed by the version number, so first read the contents of the file:
$filecontent = file_get_contents("test.pdf");
Then check the $filecontent variable against the following regular expression to detect whether the format is valid:
if (preg_match("/^%PDF-1\.5/", $filecontent)) {
echo "Valid pdf";
} else {
echo "Invalid pdf";
}
Note: the PDF version can differ (1.0, 1.5, 1.7, etc.); in my case it was 1.5. Also make sure this code runs only inside the condition that checks for the .pdf extension.
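A version-independent variant of the header check above might look like the sketch below. The function name looksLikePdf is made up for this example, and the additional search for the %%EOF marker near the end of the file is an assumption about what counts as "not truncated". It is a sanity check only, not a full validation of the PDF structure.

```php
<?php
// Hypothetical helper: cheap plausibility check on raw PDF bytes.
function looksLikePdf(string $bytes): bool
{
    // The file must start with "%PDF-" followed by a version such as 1.5.
    if (preg_match('/^%PDF-\d\.\d/', $bytes) !== 1) {
        return false;
    }
    // A complete PDF normally has an %%EOF marker close to the end.
    $tail = substr($bytes, -1024);
    return strpos($tail, '%%EOF') !== false;
}
```

It could be called as looksLikePdf(file_get_contents("test.pdf")). A file that passes can still be corrupt internally; detecting that requires an actual PDF parser.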
PHP can create PDF files using extensions such as Haru and PDFlib, but it has no built-in way to read, parse or validate PDF files. You will need an external library or tool for this. You can look into pdftk, a command-line tool that is available for Linux and macOS as well as Windows.

PHPRtfLite - RTF file opens as raw

I am using PHPRtfLite library (http://sigma-scripts.de/phprtflite/docs/index.html) to produce an RTF file using PHP and Yii.
So far, I've made a simple "Hello world" function.
Yii::import('ext.phprtf.PHPRtfLite');
Yii::registerAutoloader(array('PHPRtfLite','registerAutoloader'), true);
$rtf = new PHPRtfLite();
$sect = $rtf->addSection();
$sect->writeText('Hello world!', new PHPRtfLite_Font(), new PHPRtfLite_ParFormat());
//save rtf document
$rtf->sendRtf('takis.rtf');
The file is created successfully, but when I open it (in either WordPad or MS Word) I do not see the actual content of the file but the raw RTF code:
{\rtf\ansi\deff0\fs20
{\fonttbl{\f0 Times New Roman;}}
{\colortbl;\red0\green0\blue0;}
{\info
}
\paperw11907 \paperh16840 \deftab1298 \margl1701 \margr1701 \margt567 \margb1134 \pgnstart1\ftnnar \aftnnrlc \ftnstart1 \aftnstart1
\pard \ql {\fs20 Hello world!}
}
Do you have any idea on how to solve this?
Thank you very much in advance.
To answer my own question, in case someone runs into the same issue in the future:
It seems to be a problem with the sendRtf function. Now I save the created file locally:
$rtf->save('takis.rtf');
and then generate a link for the user to download the file. This works well.
I have experienced the same thing myself. I'm not sure whether the cause was the same for you, but in my case there was an extra newline at the beginning of the PHP file, before the <?php tag. When I used sendRtf to download the file from the browser, that newline also ended up in the RTF file, making it invalid, and as a result the raw RTF code was displayed. When using save, such extra characters never reach the file.
So one thing to check in similar situations: open the RTF file in Notepad and examine the beginning of the file.
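The manual Notepad check can also be automated. The sketch below (the function name leadingJunkLength is invented for this example) reports how many bytes precede the opening {\rtf group; anything other than 0 means stray output was sent before the document.

```php
<?php
// Hypothetical helper: number of bytes before the "{\rtf" signature,
// or null if the signature is not present at all.
function leadingJunkLength(string $bytes): ?int
{
    // Inside single quotes, \r is a literal backslash followed by "r".
    $pos = strpos($bytes, '{\rtf');
    return $pos === false ? null : $pos;
}
```

For a downloaded file, leadingJunkLength(file_get_contents('takis.rtf')) returning 1 would match the stray-newline case described above.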

File_get_contents and fwrite with word document (images are not saved)

I'm writing a PHP script that reads an XML document with file_get_contents, replaces some characters with str_replace, and writes the result to a Word document with fwrite.
Example:
$myContent = file_get_contents("../ressources/fichiers/modeles_conventions/modele_convention.xml");
$lettre = str_replace("#NOMENT#",utf8_encode($data['nomentreprise']),$myContent);
$newFileHandler = fopen("../ressources/fichiers/conventions/lettre_convention_1.doc","a");
fwrite($newFileHandler,$lettre);
fclose($newFileHandler);
On localhost it works, but on the server there is a problem:
My XML file contains images, but the final .doc document does not include them.
I don't understand why my images are not retrieved.
Update: I still haven't found a solution to my problem.
I read my XML file (which is in reality a .doc file with an .xml extension):
$myContent = file_get_contents("../ressources/fichiers/modeles_conventions/modele_convention.xml");
I replace some placeholders:
$myContent = str_replace("#NOM_ENTREPRISE#",stripslashes($data['nomentreprise']),$myContent);
$myContent = str_replace("#STATUT_ENTREPRISE#",stripslashes($data['juridique']),$myContent);
I save my document:
// Generate the agreement document
$newFileHandler = fopen("../ressources/fichiers/conventions/convention_".$data2['nomeleve']."_".$data2['prenomeleve']."_".$data3['idstage'].".doc","ab");
fwrite($newFileHandler,$myContent);
fclose($newFileHandler);
The XML document contains images; on localhost the images are retrieved, but not on the server.
Example of the XML code:
<w:r>
<w:rPr>
<w:rFonts w:ascii="Arial" w:h-ansi="Arial" w:cs="Arial"/>
<wx:font wx:val="Arial"/>
</w:rPr>
<w:pict>
<v:shape id="_x0000_i1028" type="#_x0000_t75" style="width:48.75pt;height:24pt">
<v:imagedata src="wordml://06000003.emz" o:title=""/>
</v:shape>
</w:pict>
</w:r>
</w:p>
Creating Word files from HTML is a way to make Word think it is opening a Word document. Naming the file .doc means it is opened in Word by default. However, it isn't an actual Word document; you are faking it. It only works because Word also supports opening HTML. Other clients may not support HTML, or may support only part of it. For instance, the image tag does not work with TextMate on Mac, although the bold tag works just fine.
In your XML, you must refer to the image using an absolute path, i.e. a URL on the internet or a local file system path. For instance, <img src="image.png"> will not work, since the Word file does not know how to locate it. However, you may use <img src="http://yoursite.com/image.png">. I'm sure that you can also refer to your local file system, for example with the file: 'protocol'. This only works when the image is present on the file system where the document is opened, though.
If this does not solve your problem, you should probably post your XML file here.
However, if you are creating this for a client or an external system (so for anyone but yourself), I'd suggest using one of the following:
COM objects
This only works when Word is actually installed on the system where the web application runs.
<?php
$word = new COM("word.application") or die ("Can't create Word file");
$word->visible = 1;
$word->Documents->Add();
$word->Selection->TypeText("this is some sample text in the document");
$word->Documents[1]->SaveAs("sampleword.doc");
$word->Quit();
$word->Release();
$word = null;
?>
Office Open XML or other format
The newer XML-based format used by Word (Office Open XML) is an open standard and can be modified more easily. I don't know the exact details, but it is basically a set of XML files compressed into a zip archive and given the extension .docx.
If possible, you can also use the ODT format of OpenOffice. Recent Word versions can read this format as well, and it is also an open standard. It is also more feasible to create PDF files than Word files using PHP.
phpLiveDocs
phpLiveDocs is an extension for PHP and can be used to create Word files.

how to use php to include an image in a word file?

Somebody has asked me to make an app in php that will generate a .doc file with an image and a few tables in it. My first approach was:
<?php
function data_uri($file, $mime)
{
$contents = file_get_contents($file);
$base64 = base64_encode($contents);
return ('data:' . $mime . ';base64,' . $base64);
}
$file = 'new.doc';
$fh = fopen($file,'w');
$uri = data_uri('pic.png','image/png');
fwrite($fh,'<table border="1"><tr><td><b>something</b></td><td>something else</td></tr><tr><td></td><td></td></tr></table>
<br/><img src="'.$uri.'" alt="some text" />
<br/>
<table border="1"><tr><td><b>ceva</b></td><td>altceva</td></tr><tr><td></td><td></td></tr></table>');
fclose($fh);
?>
This uses the data URI technique of embedding an image.
It generates an HTML file that renders fine in web browsers, but the image is missing in Microsoft Office Word, at least in the standard setup. Then, while editing the file with Word, I replaced the image with an image from a file; Microsoft Word changed the contents of the file into Open XML and added a folder, new_files, where it put the imported image (which was a .png), a .gif version of the image, and an XML file:
<xml xmlns:o="urn:schemas-microsoft-com:office:office">
<o:MainFile HRef="../new.doc" />
<o:File HRef="image001.jpg" />
<o:File HRef="filelist.xml" />
</xml>
This isn't good enough either, since I want everything kept in a single .doc file.
Is there a way to embed an image in an OpenXML-formatted .doc file?
Look here: http://www.tkachenko.com/blog/archives/000106.html
<w:pict>
<v:shapetype id="_x0000_t75" ...>
... VML shape template definition ...
</v:shapetype>
<w:binData w:name="wordml://02000001.jpg">
... Base64 encoded image goes here ...
</w:binData>
<v:shape id="_x0000_i1025" type="#_x0000_t75"
style="width:212.4pt;height:159pt">
<v:imagedata src="wordml://02000001.jpg"
o:title="Image title"/>
</v:shape>
</w:pict>
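Generating that fragment from PHP could look roughly like the sketch below. The function name wordMlPicture and its parameters are made up for illustration. It emits only the w:binData and v:shape parts and deliberately leaves out the v:shapetype definition that the excerpt above elides, so it assumes the surrounding document already defines shape type _x0000_t75.

```php
<?php
// Hypothetical helper: wraps raw image bytes in a WordML <w:pict> fragment.
// Assumption: the v:shapetype "_x0000_t75" is defined elsewhere in the document.
function wordMlPicture(string $imageBytes, string $name, float $widthPt, float $heightPt): string
{
    $src  = 'wordml://' . htmlspecialchars($name, ENT_QUOTES);
    // Base64 payload, wrapped at 76 characters per line.
    $data = chunk_split(base64_encode($imageBytes), 76, "\n");

    return "<w:pict>\n"
        . '<w:binData w:name="' . $src . '">' . "\n"
        . $data
        . "</w:binData>\n"
        . '<v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:'
        . $widthPt . 'pt;height:' . $heightPt . 'pt">' . "\n"
        . '<v:imagedata src="' . $src . '" o:title=""/>' . "\n"
        . "</v:shape>\n"
        . '</w:pict>';
}
```

A call such as wordMlPicture(file_get_contents('pic.png'), '02000001.png', 212.4, 159) would produce a string to splice into the document template in place of an image placeholder.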
There is the PHPWord project for creating MS Word documents from within PHP:
"PHPWord is a library written in PHP that creates Word documents. No Windows operating system is needed for usage because the results are docx files (Office Open XML) that can be opened by all major office software."
PHPWord can write them: http://phpword.codeplex.com/ (note: it's still in beta. I've used PHPExcel by the same author a lot, but never tried the Word version).
Have a look at the phpdocx library for generating real .docx files rather than HTML files with a .doc extension.
PS: the extension should strictly be .docx rather than .doc for Open XML Word 2007 files.
OpenTBS can create dynamic DOCX documents (and other OpenXML files) in PHP using templates.
No temporary files needed, no command lines, all in PHP.
It can add or delete pictures. The created document can be produced as a HTML download, a file saved on the server, or as binary contents in PHP.
It can also merge OpenDocument files (ODT, ODS, ODF, ...)
http://www.tinybutstrong.com/opentbs.php
I would use PHPExcel. It can work with OpenXML too.
Here's the link: http://phpexcel.codeplex.com/

Is it possible to read a pdf file as a txt?

I need to find a certain key in a PDF file. As far as I know, the only way to do that is to treat the PDF as a text file. I want to do this in PHP without installing an add-on, framework, etc.
Thanks
You can certainly open a PDF file as text. PDF file format is actually a collection of objects. There is a header in the first line that tells you the version. You would then go to the bottom to find the offset to the start of the xref table that tells where all the objects are located. The contents of individual objects in the file, like graphics, are often binary and compressed. The 1.7 specification can be found here.
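As a rough illustration of the "go to the bottom" step, the sketch below reads the byte offset that follows the startxref keyword near the end of the file. The function name findStartXref is invented for this example, and it assumes the keyword appears within the last 1024 bytes, which is typical but not guaranteed.

```php
<?php
// Hypothetical helper: returns the xref table offset recorded after the
// last "startxref" keyword, or null if none is found in the file's tail.
function findStartXref(string $pdfBytes): ?int
{
    $tail = substr($pdfBytes, -1024);
    // Use the last occurrence, since updated PDFs can contain several.
    $pos = strrpos($tail, 'startxref');
    if ($pos === false) {
        return null;
    }
    if (preg_match('/startxref\s+(\d+)/', $tail, $m, 0, $pos) === 1) {
        return (int) $m[1];
    }
    return null;
}
```

The returned number is the position in the file where the cross-reference table (or cross-reference stream) begins; reading the objects it points to still requires handling compressed streams.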
I found this function, hope it helps.
http://community.livejournal.com/php/295413.html
You can't just open the file as text, because it is a binary dump of the objects used to create the PDF display, including encoding, fonts, text and images. I wrote a blog post explaining how text is stored at http://pdf.jpedal.org/java-pdf-blog/bid/27187/Understanding-the-PDF-file-format-text-streams
Thank you all for your help. I owe you this piece of code:
// Proceed if file exists
if(file_exists($sourcePath)){
$pdfFile = fopen($sourcePath,"rb");
$data = fread($pdfFile, filesize($sourcePath));
fclose($pdfFile);
// Check if file is encrypted or not
if(stripos($data,$searchFor) !== false){ // $searchFor = "/Encrypt"
$counterEncrypted++;
}else{
$counterNotEncrpyted++;
}
}else{
$counterNotExisting++;
}
