How to count words from .doc file using php script? - php

I have tried many things like How to extract text from word file .doc,docx,.xlsx,.pptx php.
But this isn't a solution.
My server is Linux based so enabling extension=php_com_dotnet.dll is not the solution.
Another solution was installing LIBRE office on server and converting the .doc file to .txt on the fly and then counting the words from that file. This is very tedious job and time consuming.
I just need a simple php script that removes the special characters from the .doc file and count the number of words.

You can try with this PHP class that claims to be able to convert both .doc and .docx files in textual format.
http://www.phpclasses.org/package/7934-PHP-Convert-MS-Word-Docx-files-to-text.html
According to the example given, that's how you can use it:
require("doc2txt.class.php");
$docObj = new Doc2Txt("test.docx");
//$docObj = new Doc2Txt("test.doc");
$txt = $docObj->convertToText();
echo $txt;
As you pointed out, the core function of this library, as of many others, is something like this:
<?php
function read_doc($filename)
{
$fileHandle = fopen($filename, "r");
$line = #fread($fileHandle, filesize($filename));
$lines = explode(chr(0x0D) , $line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE) || (strlen($thisline) == 0))
{
}
else
{
$outtext.= $thisline . " ";
}
}
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/_()]/", "", $outtext);
return $outtext;
}
echo read_doc("sample.doc");
?>
I've tested this function with a .doc file and it seems to work quite well. It needs some fixes with the last part of the document (there is still some random text that is generated at the end of the output), but with some fine tuning it works reasonably.
EDIT:
You are right, this functions works correctly only with .docx documents (the document I tested was probably made using the same mechanism). Saving a file with .doc extension, this function doesn't work!
The only help I'm able to give you right now is the .doc binary specifications link (here is an even more complete file), where you can actually see how the binary structure is made and extract the informations from there. I can't do it now, so I hope that somebody else may help you through this!

At the end i had to use Libreoffice. But its very efficient to use it. It solved my all the problem.
So my advice would be to install the 'HEADLESS' package of libreoffice on server and use the command line conversion

I've built a tool that incorporates various methods found around the web and on Stack Overflow that provides word, line and page counts for doc, docx, pdf and txt files. I hope it's of use to people. If anyone can get rtf working with it I'd love a pull request! https://github.com/joeblurton/doccounter

Related

How to read any of the file type using php7?

I have tried to extract the user email addresses from my server. But the problem is maximum files are .txt but some are CSV files with txt extension. When I am trying to read and extract, I could not able to read the CSV files which with TXT extension. Here is my code:
<?php
$handle = fopen('2.txt', "r");
while(!feof($handle)) {
$string = fgets($handle);
$pattern = '/[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4}/i';
preg_match_all($pattern, $string, $matches);
foreach($matches[0] as $match)
{
echo $match;
echo '<br><br>';
}
}
?>
I have tried to use this code for that. The program is reading the complete file which are CSV, and line by line which are Text file. There are thousands of file and hence it is difficult to identify.
Kindly, suggest me what I should do to resolve my problem? Is there any solution which can read any format, then it will be awesome.
Well your files are different. Because of that you will have to take a different approach for each of those. In more general terms this is usually calling adapting and is mostly provided using the Adapter design pattern.
Should you use the adapter design pattern you would have a code inspecting the extension of a file to be opened and a switch with either txt or csv. Based on the value you would retrieve aTxtParseror aCsvParser` respectively.
However, before diving deep into this territory you might want to have a look at the files first. I cannot say this for sure without seeing the structures but you can. If the contents of both the text and csv files are the same then a very simple approach is to change the extension to either txt or a csv for all files and then process them using same logic, knowing files with the same extension will now be processed in the same manner.
But from what I understood the file structures actually differ. So to keep your code concise the adapter pattern, having two separate classes/functions for parsing and another one on top of that for choosing the right parsing function (this top function would actually be a form of a strategy) and running it.
Either way, I very much doubt so there is a solution for the problem you are facing as a file structure is mostly your and your own.
Ok, so problem is when CSV file has too long string line. Based on this restriction I suggest you to use example from php.net Here is an example:
$handle = #fopen("/tmp/inputfile.txt", "r");
if ($handle) {
while (($buffer = fgets($handle, 4096)) !== false) {
echo $buffer;
// do your operation for searching here
}
if (!feof($handle)) {
echo "Error: unexpected fgets() fail\n";
}
fclose($handle);
}

Editing a .doc in PHP

I am trying to replace strings in a word document by reading the file into a variable $content and then using str_ireplace() to change the string. I can read the content from the file but I str_ireplace() does not seem to be able to replace the string. I assumed it would because the string is 'binary safe' according to the PHP documentation. Sorry, I am a beginner with PHP file manipulation so all this is quite new to me.
This is what I have written.
copy('jack.doc' , 'newFile.doc');
$handle = fopen('newFile.doc','rb');
$content = '';
while (!feof($handle))
{
$content .= fread($handle, 1);
}
fclose($handle);
$handle = fopen('newFile.doc','wb');
$content = str_ireplace('USING_ICT_BOX', 'YOUR ICT CONTENT', $content);
fwrite($handle, $content);
fclose($handle);
When I download the new file, it opens as it should in MS Word but it shows the old string and not the one that should be replaced.
Can I fix this issue? Is there any better tool I can use for replacing strings in MS Word thourgh PHP?
I have same requirement for Edit .doc or .docx file using php and i have find solution for it.
And i have write post on It :: http://www.onlinecode.org/update-docx-file-using-php/
copy('jack.doc' , 'newFile.doc');
$full_path = 'newFile.doc';
if($zip_val->open($full_path) == true)
{
// In the Open XML Wordprocessing format content is stored.
// In the document.xml file located in the word directory.
$key_file_name = 'word/document.xml';
$message = $zip_val->getFromName($key_file_name);
$timestamp = date('d-M-Y H:i:s');
// this data Replace the placeholders with actual values
$message = str_replace("client_full_name", "onlinecode org", $message);
$message = str_replace("client_email_address", "ingo#onlinecode.org", $message);
$message = str_replace("date_today", $timestamp, $message);
$message = str_replace("client_website", "www.onlinecode.org", $message);
$message = str_replace("client_mobile_number", "+1999999999", $message);
//Replace the content with the new content created above.
$zip_val->addFromString($key_file_name, $message);
$zip_val->close();
}
Maybe this would point you to the right direction: http://davidwalsh.name/read-pdf-doc-file-php
Solutions I've found so far (not tested though):
Docvert - works for Doc, free, but not directly usable
PHPWordLib - works for Doc, not free
PHPDocX - DocX only, needs Zend.
I am going to opt for PHPWord www.phpword.codeplex.com as I believe teachers are going to get Office 2007 next year and also I will try and find some way to convert between .docx and .doc through PHP to support them in the mean time.
If you can reach a web-service, look at Docmosis Cloud services since it can mailmerge a doc file with your data and give you back a doc/pdf/other. You can https post to the service to make the request so is pretty straight forward from PHP.
There is many way to handle word document file on linux
antiword - not much effective as it converts into plain text.
pyODconvert
open-office or liboffice - through UNO
unoconv utility - need to installation permission on server
There is one python script which is most usable for online file conversion but you need to convert those file through command line.
There is no specific and satisfied solution to handle word files by only using php code.
I hunted for a long time to reach at this suggestion.

Verifying a CSV file is really a CSV file

I want to make sure a CSV file uploaded by one of our clients is really a CSV file in PHP. I'm handling the upload itself just fine. I'm not worried about malicious users, but I am worried about the ones that will try to upload Excel workbooks instead. Unless I'm mistaken, an Excel workbook and a CSV can still have the same MIME, so checking that isn't good enough.
Is there one regular expression that can handle verifying a CSV file is really a CSV file? (I don't need parsing... that's what PHP's fgetcsv() is for.) I've seen several, but they are usually followed by comments like "it didn't work for case X."
Is there some other better way of handling this?
(I expect the CSV to hold first/last names, department names... nothing fancy.)
Unlike other file formats, CSV has no tell-tale bytes in the file header. It starts straight away with the actual data.
I don't see any way except to actually parse it, and to count whether there is the expected number of columns in the result.
It may be enough to read as many characters as are needed to determine the first line (= until the first line break).
You can write a RE that will give you a guess if the file is valid CSV or not - but perhaps a better approach would be to try and parse the file as if it was CSV (with your fgetcsv() call), and assume it's NOT a valid one if the call fails?
In other words, the best way to see if the file is a valid CSV file is to try and parse it as such, and assume that if you failed to parse, it wasn't a CSV!
The easiest way is to try parsing the CSV and attempting to read value from it. Parse it using str_getcsv and then attempt to read a value from it. If you are able to read and validate at least a couple of values, then the CSV is valid.
EDIT
If you don't have access to str_getcsv, use this, a drop-in replacement for str_getcsv from http://www.electrictoolbox.com/php-str-getcsv-function/:
if (!function_exists('str_getcsv')) {
function str_getcsv($input, $delimiter = ",", $enclosure = '"', $escape = "\\") {
$fp = fopen("php://memory", 'r+');
fputs($fp, $input);
rewind($fp);
$data = fgetcsv($fp, null, $delimiter, $enclosure); // $escape only got added in 5.3.0
fclose($fp);
return $data;
}
}
Technically speaking, almost any text file could be a CSV file (barring quotes that don't match, etc.). You can try to guess if it's a binary file, but there isn't a reliable way to do that unless your data only has ASCII or something of the sort. If all you care is that people don't upload Excel files by mistake, check the file extension.
Any text file is a valid CSV file so it is impossible to come up with a standard way of verifying its correctness because it depends on what you really expect it to be.
Before you even start, you have to know what delimiter is used in that CSV file. After that, the easiest way to verify is to use fgetcsv function. For example:
<?php
$row = 1;
if (($handle = fopen("test.csv", "r")) !== FALSE) {
while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
$num = count($data); // Number of fields in a row.
if ($num !== 5)
{
// OMG! Column count is not five!
}
else if (intval($data[$c]) == 0)
{
// OMG! Customer thinks we sold a car for $0!
}
}
fclose($handle);
}
?>

Efficient flat file searching in PHP

I'd like to store 0 to ~5000 IP addresses in a plain text file, with an unrelated header at the top. Something like this:
Unrelated data
Unrelated data
----SEPARATOR----
1.2.3.4
5.6.7.8
9.1.2.3
Now I'd like to find if '5.6.7.8' is in that text file using PHP. I've only ever loaded an entire file and processed it in memory, but I wondered if there was a more efficient way of searching a text file in PHP. I only need a true/false if it's there.
Could anyone shed any light? Or would I be stuck with loading in the whole file first?
Thanks in advance!
5000 isn't a lot of records. You could easily do this:
$addresses = explode("\n", file_get_contents('filename.txt'));
and search it manually and it'll be quick.
If you were storing a lot more I would suggest storing them in a database, which is designed for that kind of thing. But for 5000 I think the full load plus brute force search is fine.
Don't optimize a problem until you have a problem. There's no point needlessly overcomplicating your solution.
I'm not sure if perl's command line tool needs to load the whole file to handle it, but you could do something similar to this:
<?php
...
$result = system("perl -p -i -e '5\.6\.7\.8' yourfile.txt");
if ($result)
....
else
....
...
?>
Another option would be to store the IP's in separate files based on the first or second group:
# 1.2.txt
1.2.3.4
1.2.3.5
1.2.3.6
...
# 5.6.txt
5.6.7.8
5.6.7.9
5.6.7.10
...
... etc.
That way you wouldn't necessarily have to worry about the files being so large you incur a performance penalty by loading the whole file into memory.
You could shell out and grep for it.
You might try fgets()
It reads a file line by line. I'm not sure how much more efficient this is though. I'm guessing that if the IP was towards the top of the file it would be more efficient and if the IP was towards the bottom it would be less efficient than just reading in the whole file.
You could use the GREP command with backticks in your on a Linux server. Something like:
$searchFor = '5.6.7.8';
$file = '/path/to/file.txt';
$grepCmd = `grep $searchFor $file`;
echo $grepCmd;
I haven't tested this personally, but there is a snippet of code in the PHP manual that is written for large file parsing:
http://www.php.net/manual/en/function.fgets.php#59393
//File to be opened
$file = "huge.file";
//Open file (DON'T USE a+ pointer will be wrong!)
$fp = fopen($file, 'r');
//Read 16meg chunks
$read = 16777216;
//\n Marker
$part = 0;
while(!feof($fp)) {
$rbuf = fread($fp, $read);
for($i=$read;$i > 0 || $n == chr(10);$i--) {
$n=substr($rbuf, $i, 1);
if($n == chr(10))break;
//If we are at the end of the file, just grab the rest and stop loop
elseif(feof($fp)) {
$i = $read;
$buf = substr($rbuf, 0, $i+1);
break;
}
}
//This is the buffer we want to do stuff with, maybe thow to a function?
$buf = substr($rbuf, 0, $i+1);
//Point marker back to last \n point
$part = ftell($fp)-($read-($i+1));
fseek($fp, $part);
}
fclose($fp);
The snippet was written by the original author: hackajar yahoo com
are you trying to compare the current IP with the text files listed IP's? the unrelated data wouldnt match anyway.
so just use strpos on the on the full file contents (file_get_contents).
<?php
$file = file_get_contents('data.txt');
$pos = strpos($file, $_SERVER['REMOTE_ADDR']);
if($pos === false) {
echo "no match for $_SERVER[REMOTE_ADDR]";
}
else {
echo "match for $_SERVER[REMOTE_ADDR]!";
}
?>

Can PHP read text from a PowerPoint file?

I want to have PHP read an (uploaded) powerpoint presentation, and minimally extract the text from each slide (grabbing more info like images and layouts would even be better, but I would settle for just the text at this point).
I know that google apps does it in its presentation app, so I am guessing there is some way to translate the powerpoint binary, but I can't seem to find any info on how to do it.
Any ideas on what to try?
Thanks -
Depending on the version, you can take a look on the Zend Framework as Zend_Search_Lucene is able to index PowerPoint 2007 files. Just take a look at the corresponding class file, i think it's something like Zend_Search_Lucene_Document_Pptx.
Yes of course it's possible.
[Here's a start.](http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/PowerPoint97-2007BinaryFileFormat(ppt)Specification.pdf) I wouldn't say it's very well documented/formated, but it's not that hard once you get started. Start by focusing only on elements you need (slides, text, etc).
A less detailed and simpler approach would be to open .ppt file in hex editor and look for information you are interesed in (you should be able to see text within the binary data) and what surrounds it. Then based on what surrounds that information you could write a parser which extracts this information.
Here's a sample function I created form a similar one that extracts text from Word documents. I tested it with Microsoft PowerPoint files, but it won't decode OpenOfficeImpress files saved as .ppt
For .pptx files you might want to take a look at Zend Lucene.
function parsePPT($filename) {
// This approach uses detection of the string "chr(0f).Hex_value.chr(0x00).chr(0x00).chr(0x00)" to find text strings, which are then terminated by another NUL chr(0x00). [1] Get text between delimiters [2]
$fileHandle = fopen($filename, "r");
$line = #fread($fileHandle, filesize($filename));
$lines = explode(chr(0x0f),$line);
$outtext = '';
foreach($lines as $thisline) {
if (strpos($thisline, chr(0x00).chr(0x00).chr(0x00)) == 1) {
$text_line = substr($thisline, 4);
$end_pos = strpos($text_line, chr(0x00));
$text_line = substr($text_line, 0, $end_pos);
$text_line = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/\_\(\)]/","",$text_line);
if (strlen($text_line) > 1) {
$outtext.= substr($text_line, 0, $end_pos)."\n";
}
}
}
return $outtext;
}
I wanted to post my resolution to this.
Unfortunately, I was unable to get PHP to reliably read the binary data.
My solution was to write a small vb6 app that does the work by automating PowerPoint.
Not what I was looking for, but, solves the issue for now.
That being said, the Zend option looks like it may be viable at some point, so I will watch that.
Thanks.

Categories