How to count words in doc, xls, pdf and txt files - PHP

I have a scenario in which I need to count the number of words in a file.
I have different file formats such as .doc, .xls, .pdf and .txt. I am using this method for counting:
<form method="post" action="" enctype="multipart/form-data">
<input type="file" name="docfile" />
<input type="submit" name="submit" />
</form>
<?php
if(isset($_POST['submit'])){
$file = $_FILES['docfile']['name'];
$file = str_replace(" ","_",$file);
//$file = file_get_contents($file);
$ext = pathinfo($file, PATHINFO_EXTENSION);
move_uploaded_file($_FILES['docfile']['tmp_name'],"uploads/".$file);
if($ext == "txt" || $ext == "pdf" || $ext == "doc" || $ext == "docx"){
$file = file_get_contents("uploads/".$file);
echo str_word_count($file);
}
}
?>
But it is not returning the correct word count for the file.

Apache Tika is a Java framework that can recognize many document types and extract text and meta information from them. It can report word counts for many of the document types it recognizes.
I mention this Java framework for your PHP question because there is a PHP wrapper for it called PhpTikaWrapper. I have never used the wrapper, but Apache Tika can extract the meta information you are after, so investigating the wrapper may prove beneficial.

You've got a difficult task there. .doc, .pdf and .xls files are not plain text, so they cannot simply be read as strings. To test this, try opening a PDF with a basic text editor like Notepad or gedit. You will see what appears to be gibberish. This is the same thing PHP sees when you read the file's contents, which is why str_word_count returns a meaningless number.
.doc and .xls can probably be parsed with PHPWord and PHPExcel respectively, both from PHPOffice. You will need to look into these libraries. I don't know of anything for PDFs, but there's probably something.
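For .docx specifically (not the older binary .doc), the file is a ZIP archive whose main text lives in word/document.xml, so a rough count is possible without a library. This is only a minimal sketch, assuming PHP's zip extension (ZipArchive) is available; countDocxWords is a hypothetical helper name, and the count ignores headers, footers and footnotes. Note that str_word_count is locale dependent and may miscount non-ASCII text.

```php
<?php
// Hypothetical helper: rough word count for a .docx file.
// Returns null when the file is not a readable .docx archive.
function countDocxWords(string $path): ?int
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a ZIP archive, so not a .docx
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null; // a ZIP archive, but without the main document part
    }
    // Turn paragraph ends, line breaks and tabs into spaces so that
    // words on either side of them do not merge when tags are stripped.
    $xml = str_replace(['</w:p>', '<w:br/>', '<w:tab/>'], ' ', $xml);
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return str_word_count($text);
}
```

Called as countDocxWords("uploads/" . $file), it would replace the file_get_contents/str_word_count pair for the docx branch only.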
I would suggest writing a series of classes that all implement a similar interface so you can switch them out depending on the extension.
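A minimal sketch of that idea follows. The names WordCounter, TxtWordCounter and counterFor are hypothetical, and only the plain-text counter is implemented; counters for the other formats would wrap whichever parsing library you settle on.

```php
<?php
// Common interface: every format-specific counter exposes the same method.
interface WordCounter
{
    public function countWords(string $path): int;
}

// Plain text needs no parsing, so str_word_count works directly.
final class TxtWordCounter implements WordCounter
{
    public function countWords(string $path): int
    {
        $text = @file_get_contents($path);
        if ($text === false) {
            throw new RuntimeException("Cannot read $path");
        }
        return str_word_count($text);
    }
}

// Pick a counter from the (client-supplied) file name's extension.
function counterFor(string $filename, array $counters): WordCounter
{
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    if (!isset($counters[$ext])) {
        throw new InvalidArgumentException("No word counter for .$ext files");
    }
    return $counters[$ext];
}

// Register one counter per extension; add 'doc', 'xls', 'pdf' as you write them.
$counters = ['txt' => new TxtWordCounter()];
```

Usage would look like counterFor($_FILES['docfile']['name'], $counters)->countWords($_FILES['docfile']['tmp_name']).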

I've been working on a general purpose class that incorporates various methods found around the web and on Stack Overflow that provides word, line and page counts for doc, docx, pdf and txt files. I hope it's of use to people. If anyone can get RTF working with it I'd love a pull request! https://github.com/joeblurton/doccounter

Related

Php finfo detecting php file as a jpeg

I am trying to write an upload form and I want users to send only image or PDF files. To detect the MIME type I use finfo, but it's really easy to fool. Here is an example:
<?php
$cnt ='<form action="" method="get">\x0aCommand: <input type="text" name="cmd" /><input type="submit" value="Exec" />\x0a</form>\x0aOutput:<br />\x0a<pre><?php passthru($_REQUEST["cmd"], $result); ?></pre>\x0a';
echo $cnt."\n";
$finfo = new \finfo(FILEINFO_MIME);
echo $finfo->buffer($cnt) . "\n"; // text/plain; charset=us-ascii
$cnt ="\xff\xd8\xff\xe0\x0a".$cnt; // prepend the JPEG magic bytes (FF D8 FF E0) to the payload
echo $cnt."\n";
$finfo = new \finfo(FILEINFO_MIME);
echo $finfo->buffer($cnt) . "\n"; // image/jpeg; charset=iso-8859-1
Does anybody know how to do it properly?
Update:
OK, so let's reveal the magic trick: finfo, like many other tools (the file command on Unix, for example), uses a "magic table" to work out what kind of file it is looking at. Look at those examples.
Short version: finfo searches for a series of specific bytes in the stream and, if it finds them, returns the MIME type associated with those bytes.
To trick it, you just have to add those bytes to your file...
Which does not answer the question of how to detect the type properly...
There are a few methods to improve file security.
The most effective for images is to convert the image in PHP into another format, such as from JPEG to PNG, and then reload it (in PHP memory) and convert it back to the desired format.
If the original image data is malformed (as in your example) it will not convert successfully; PHP will report this (the image function returns false or similar).
There are additional parallel tests you can do, such as using getimagesize to check that the values returned for the original file and for the converted file match each other and your expectations.
For an image, a rough process could be:
1. Check the file's finfo MIME type (as you already do).
2. Convert the file to another format (e.g. JPG --> PNG).
3. Test the file dimensions (getimagesize). Remember these.
4. Convert the file data in PHP memory back to the original format (e.g. PNG --> JPG).
5. Test that these file dimensions (getimagesize) match the values from above.
This can't apply to PDF but can ensure images are genuine and also can remove potentially dangerous metadata from JPG images by converting them to PNG and then back to JPG again.
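The round trip described above can be sketched as follows. This assumes the GD extension is installed; reencodeImage is a hypothetical helper name, and the JPEG quality of 90 is an arbitrary choice. It returns clean JPEG bytes, or null when any step fails.

```php
<?php
// Hypothetical helper: decode, convert to PNG, convert back to JPEG,
// checking that the dimensions survive each step.
function reencodeImage(string $bytes): ?string
{
    // Decode the upload; malformed data makes GD return false.
    $original = @imagecreatefromstring($bytes);
    if ($original === false) {
        return null;
    }
    $width = imagesx($original);
    $height = imagesy($original);

    // Convert to PNG in memory (imagepng writes to the output buffer).
    ob_start();
    imagepng($original);
    $png = ob_get_clean();

    // Reload the PNG and confirm the dimensions are unchanged.
    $reloaded = @imagecreatefromstring($png);
    if ($reloaded === false || imagesx($reloaded) !== $width || imagesy($reloaded) !== $height) {
        return null;
    }

    // Convert back to JPEG; only pixel data written by GD remains.
    ob_start();
    imagejpeg($reloaded, null, 90);
    $jpeg = ob_get_clean();

    // Final dimension check on the bytes that will actually be stored.
    $size = getimagesizefromstring($jpeg);
    if ($size === false || $size[0] !== $width || $size[1] !== $height) {
        return null;
    }
    return $jpeg;
}
```

Store the returned bytes rather than the uploaded file, so that whatever the client appended or embedded never reaches disk.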
This post might also be useful to you re PDF
Essentially, files are only going to be blocks of data and to this extent MIME type checks (in isolation) will never be a surefire way of promising the file is "genuine". MIME type is like the cover of a book; the cover looks like a literary masterpiece, but until you flick through the pages and find it's all 1980s pornography you won't be sure.
Some useful links:
http://www.php.net/manual/en/function.exif-imagetype.php
https://stackoverflow.com/a/11039258/3536236
PHP check if file is an image
Here is how you can get the file extension of an uploaded image:
$path = $_FILES['image']['name'];
$ext = pathinfo($path, PATHINFO_EXTENSION);
Here is the HTML code:
<form action="" method="POST" enctype="multipart/form-data">
<input type="file" name="image" id="image" accept="application/pdf,image/jpeg">
<input type="submit" name="submit" value="Upload">
</form>
Why are you using finfo?

Restrict upload to display only csv files, in PHP

I have an application which uploads data from a csv file and this is working fine. It would be useful, but not essential, if I could limit the dialog window to only show csv files, and if possible a file template say 'abc*.csv'.
The attached image shows an example of a dialog box which will only allow files that start with abc*.csv
[Image: example of a csv file dialog box]
Thanks
Harry
It depends on how you're handling the uploads.
You can either use plain HTML to filter the .csv extension or handle it using PHP, or both.
Using HTML:
<input type="file" name="upload" accept=".csv">
Using PHP:
$ext = pathinfo($filename, PATHINFO_EXTENSION);
if( $ext !== 'csv' ) {
echo 'Invalid extension.';
}
Note that this only verifies the extension and not the actual filetype.
Also, the accept attribute of <input type="file"> does indeed provide a filter in the file select dialog.
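The accept attribute takes extensions and MIME types only, so a name template such as 'abc*.csv' cannot be enforced in the dialog; it can be checked on the server after upload. A minimal sketch, where isAllowedCsvName is a hypothetical helper and the set of characters allowed after "abc" is an assumption you may want to loosen:

```php
<?php
// Hypothetical helper: accept names like abc*.csv, case-insensitively.
function isAllowedCsvName(string $clientName): bool
{
    // Ignore any directory part the client may have sent.
    $base = basename($clientName);
    // "abc", then letters, digits, underscore, dot or hyphen, then ".csv" at the very end.
    return preg_match('/\Aabc[\w.-]*\.csv\z/i', $base) === 1;
}
```

For example, isAllowedCsvName($_FILES['upload']['name']) is true for abc_2024.csv and ABC1.CSV, and false for xyz.csv or abc.csv.php. Like the extension check above, this looks only at the name, not at the file's content.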

How to restrict file upload using PHP?

I am trying to do a restricted file upload using PHP.
I have used
if (($_FILES["file"]["type"] == "application/dbase")
||($_FILES["file"]["type"] == "application/dbf")
||($_FILES["file"]["type"] == "application/x-dbase")
||($_FILES["file"]["type"] == "application/x-dbf")
||($_FILES["file"]["type"] == "zz-application/zz-winassoc-dbf"))
For me .dbf (i.e Microsoft Visual FoxPro Table type) files are not working. Please suggest to me what I should put for the content type for .dbf .
The browser uploading the file probably doesn't know that it has the application/dbf MIME type, and sends it as the generic "application/octet-stream". The MIME type is set by the client/browser on upload, and it can be altered by the user!
Thus MIME-type isn't reliable. If you want to be sure that it's the correct file-type/format, you'll have to examine the uploaded file.
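One way to examine the file is to look at its first few bytes. The sketch below assumes the commonly documented DBF header layout (byte 0 is a version marker, bytes 1 to 3 hold the last-update date as year, month, day). The list of version bytes is an assumption covering a few common variants, including 0x30 for Visual FoxPro; verify it against your own files. looksLikeDbf is a hypothetical helper name, and the check is a plausibility test, not full validation.

```php
<?php
// Hypothetical helper: plausibility check on the first four header bytes.
function looksLikeDbf(string $path): bool
{
    $fh = @fopen($path, 'rb');
    if ($fh === false) {
        return false;
    }
    $header = fread($fh, 4);
    fclose($fh);
    if ($header === false || strlen($header) < 4) {
        return false;
    }
    // Assumed version markers: dBASE III (0x03, 0x83), FoxPro 2.x with memo (0xF5),
    // Visual FoxPro (0x30). Extend or trim this list for your data.
    $knownVersions = [0x03, 0x30, 0x83, 0xF5];
    $month = ord($header[2]);
    $day = ord($header[3]);
    return in_array(ord($header[0]), $knownVersions, true)
        && $month >= 1 && $month <= 12
        && $day >= 1 && $day <= 31;
}
```

It would be called on $_FILES["file"]["tmp_name"] in place of the comparison against $_FILES["file"]["type"].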
There is another easy way to approach this problem: instead of inspecting the MIME type,
we can check the file extension of the uploaded file. Note that the extension must be taken from the original name, because tmp_name is a temporary path with no extension.
$filename = $_FILES["file"]["name"]; // original client file name, not tmp_name
$ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
// Extensions we are willing to accept
$allowed = array("png", "gif", "jpg", "jpeg", "pdf",
"doc", "docx", "xls",
"xlsx", "xlsm", "dbf");
if (in_array($ext, $allowed, true))
{
// your code whatever you want to write;
}
Find an easy blob-upload and download of file here Blob-upload
Defining the content type is up to the browser (or other client application), making it easy to tamper with and cannot be relied upon. My guess is that your browser doesn't recognize the .dbf file and defaults to "application/octet-stream".
You can't depend on the type field of a file upload to actually determine its type. First, it can be spoofed by the client. Secondly, the client simply might not know what the file type actually is and just report 'application/octet-stream' instead.
You'll have to determine what kind of file was uploaded yourself. Fortunately, PHP provides the fileinfo extension, which can help you determine the type of a file.
Code example based on one from php.net:
<?php
$finfo = finfo_open(FILEINFO_MIME_TYPE); // return mime type ala mimetype extension
echo finfo_file($finfo, $_FILES["file"]["tmp_name"]) . "\n";
finfo_close($finfo);
?>
http://www.php.net/manual/en/ref.fileinfo.php
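To turn the detected type into an accept/reject decision, compare it against an allow-list. A minimal sketch, assuming the fileinfo extension is enabled; hasAllowedMimeType is a hypothetical helper, and the types you put in the list should be whatever finfo actually reports for your sample files:

```php
<?php
// Hypothetical helper: true when finfo's detected type is in the allow-list.
function hasAllowedMimeType(string $path, array $allowed): bool
{
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime = $finfo->file($path); // detected from the file's content, not its name
    return $mime !== false && in_array($mime, $allowed, true);
}
```

For example, hasAllowedMimeType($_FILES["file"]["tmp_name"], ['application/pdf', 'image/png']). Detection relies on signature bytes in the content, so treat the result as one check among several rather than proof.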
Try inspecting the MIME type being passed to you when you upload a file of that type. Insert a temporary print $_FILES["file"]["type"]; somewhere in your code, then upload the file to run the code and see what it prints out! You can then copy that type and use it in your if-statement.

Validate the excel sheet

I have used the PHP method <form action="upload_file.php" method="post" enctype="multipart/form-data"> for uploading the excel sheets. Now I want to validate the excel sheet.
Validation:
If the sheet contains an image instead of text I need to give an error to the user. Is it possible to validate the sheet's content without opening the sheet manually?
If you intend to use the data from the Excel file, you will have to parse and read it anyway. If it parses as an Excel file, then it's probably Excel, and you can rely on a PHP library like PHPExcel.
On the other hand, if you don't plan on using the data from the Excel file, I personally think that it would be overkill to use an entire library to validate whether a file is in the right format.
$filename = $_FILES['your_field_file']['tmp_name']; // uploaded files arrive in $_FILES, not $_POST
$finfo = finfo_open(FILEINFO_MIME_TYPE);
echo finfo_file($finfo, $filename);
or this
$filename = "/home/user1/whatever.xls"; // or $_FILES['your_field_file']['tmp_name']
$finfo = new finfo(FILEINFO_MIME_TYPE);
echo $finfo->file($filename);
It will return the MIME type, so you can verify that it is an Excel file; then you can look at the file size and apply some restrictions. But if you want to know what is inside the file, I think you need to open or parse it. You could try file_get_contents() and check whether there is an image in there. (I'm not sure about that.)
I hope this helps.
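For the newer .xlsx format only, there is a lighter check than a full parse: an .xlsx file is a ZIP archive, and embedded pictures are normally stored under xl/media/ inside it. A minimal sketch, assuming the zip extension is available; xlsxContainsImages is a hypothetical helper, and it does not apply to the older binary .xls format, which needs a real parser.

```php
<?php
// Hypothetical helper: true/false for .xlsx archives, null if not a ZIP at all.
function xlsxContainsImages(string $path): ?bool
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a ZIP archive, so not an .xlsx
    }
    $found = false;
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        // Embedded pictures are normally stored as xl/media/image1.png and so on.
        if ($name !== false && strpos($name, 'xl/media/') === 0) {
            $found = true;
            break;
        }
    }
    $zip->close();
    return $found;
}
```

A true result would trigger the error message for the user; null means the upload is not an .xlsx file in the first place.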
Sounds like you should make use of the fileinfo PHP extension to check the mime types of the files users are uploading.

how to use php to include an image in a word file?

Somebody has asked me to make an app in php that will generate a .doc file with an image and a few tables in it. My first approach was:
<?php
function data_uri($file, $mime)
{
$contents = file_get_contents($file);
$base64 = base64_encode($contents);
return ('data:' . $mime . ';base64,' . $base64);
}
$file = 'new.doc';
$fh = fopen($file,'w');
$uri = data_uri('pic.png','image/png');
fwrite($fh,'<table border="1"><tr><td><b>something</b></td><td>something else</td></tr><tr><td></td><td></td></tr></table>
<br/><img src="'.$uri.'" alt="some text" />
<br/>
<table border="1"><tr><td><b>ceva</b></td><td>altceva</td></tr><tr><td></td><td></td></tr></table>');
fclose($fh);
?>
This uses the data uri technique of embedding an image.
This will generate an HTML file that renders fine in web browsers, but the image is missing in Microsoft Office Word, at least in the standard setup. Then, while editing the file with Word, I replaced the image with an image from a file, and Microsoft Word changed the contents of the file into Open XML and added a folder, new_files, where it put the imported image (which was a .png), a .gif version of the image and an XML file:
<xml xmlns:o="urn:schemas-microsoft-com:office:office">
<o:MainFile HRef="../new.doc" />
<o:File HRef="image001.jpg" />
<o:File HRef="filelist.xml" />
</xml>
Now this isn't good enough either, since I want everything kept in a single .doc file.
Is there a way to embed an image in an OpenXML-formatted .doc file?
Look here: http://www.tkachenko.com/blog/archives/000106.html
<w:pict>
<v:shapetype id="_x0000_t75" ...>
... VML shape template definition ...
</v:shapetype>
<w:binData w:name="wordml://02000001.jpg">
... Base64 encoded image goes here ...
</w:binData>
<v:shape id="_x0000_i1025" type="#_x0000_t75"
style="width:212.4pt;height:159pt">
<v:imagedata src="wordml://02000001.jpg"
o:title="Image title"/>
</v:shape>
</w:pict>
There is the PHPWord project for manipulating MS Word documents from within PHP.
"PHPWord is a library written in PHP that create word documents. No Windows operating system is needed for usage because the result are docx files (Office Open XML) that can be opened by all major office software."
PHPWord can write them: http://phpword.codeplex.com/ (note: it's still in beta. I've used PHPExcel by the same guy a lot... never tried the Word version).
Have a look at the phpdocx library for generating real .docx files rather than html files with a .doc extension
PS the extension should strictly be .docx rather than .doc for Open XML Word 2007 files
OpenTBS can create DOCX (and other OpenXML files) dynamic documents in PHP using the technique of templates.
No temporary files needed, no command lines, all in PHP.
It can add or delete pictures. The created document can be produced as an HTML download, a file saved on the server, or as binary contents in PHP.
It can also merge OpenDocument files (ODT, ODS, ODF, ...)
http://www.tinybutstrong.com/opentbs.php
I would use PHPExcel. It can work with OpenXML too.
Here's the link: http://phpexcel.codeplex.com/
