Extract text from doc and docx - php

I would like to know how can I read the contents of a doc or docx. I'm using a Linux VPS and PHP, but if there is a simpler solution using other language, please let me know, as long as it works under a linux webserver.

Here i have added the solution to get the text from .doc,.docx word files
How to extract text from word file .doc,docx php
For .doc
private function read_doc() {
$fileHandle = fopen($this->filename, "r");
$line = #fread($fileHandle, filesize($this->filename));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{
} else {
$outtext .= $thisline." ";
}
}
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/\_\(\)]/","",$outtext);
return $outtext;
}
For .docx
private function read_docx(){
$striped_content = '';
$content = '';
$zip = zip_open($this->filename);
if (!$zip || is_numeric($zip)) return false;
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
if (zip_entry_name($zip_entry) != "word/document.xml") continue;
$content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
zip_entry_close($zip_entry);
}// end while
zip_close($zip);
$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = strip_tags($content);
return $striped_content;
}

This is a .DOCX solution only. For .DOC or .PDF you'll need to use something else like pdf2text.php for PDF
function docx2text($filename) {
return readZippedXML($filename, "word/document.xml");
}
function readZippedXML($archiveFile, $dataFile) {
// Create new ZIP archive
$zip = new ZipArchive;
// Open received archive file
if (true === $zip->open($archiveFile)) {
// If done, search for the data file in the archive
if (($index = $zip->locateName($dataFile)) !== false) {
// If found, read it to the string
$data = $zip->getFromIndex($index);
// Close archive file
$zip->close();
// Load XML from a string
// Skip errors and warnings
$xml = new DOMDocument();
$xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
// Return data without XML formatting tags
return strip_tags($xml->saveXML());
}
$zip->close();
}
// In case of failure return empty string
return "";
}
echo docx2text("test.docx"); // Save this contents to file

Parse .docx, .odt, .doc and .rtf documents
I wrote a library that parses the docx, odt and rtf documents based on answers here and elsewhere.
The major improvement I have made to the .docx and .odt parsing is the that the library processes the XML that describes the document and attempts to conform it to HTML tags, i.e. em and strong tags. This means that if you're using the library for a CMS, text formatting is not lost
You can get it here

My solution is Antiword for .doc and docx2txt for .docx
Assuming a linux server that you control, download each one, extract then install. I installed each one system wide:
Antiword: make global_install
docx2txt: make install
Then to use these tools to extract the text into a string in php:
//for .doc
$text = shell_exec('/usr/local/bin/antiword -w 0 ' .
escapeshellarg($docFilePath));
//for .docx
$text = shell_exec('/usr/local/bin/docx2txt.pl ' .
escapeshellarg($docxFilePath) . ' -');
docx2txt requires perl
no_freedom's solution does extract text from docx files, but it can butcher whitespace. Most files I tested had instances where words that should be separated had no space between them. Not good when you want to full text search the documents you're processing.

Try ApachePOI. It works well for Java. I suppose you won't have any difficulties installing Java on Linux.

I would suggest, Extract text using apache Tika, you can extract multiple type of file content like .doc/.docx and pdf and many other.

I used docxtotxt to extract docx file content. My code is as follows:
if($extention == "docx")
{
$docxFilePath = "/var/www/vhosts/abc.com/httpdocs/writers/filename.docx";
$content = shell_exec('/var/www/vhosts/abc.com/httpdocs/docx2txt/docx2txt.pl
'.escapeshellarg($docxFilePath) . ' -');
}

I insert little improvements in doc to txt converter function
private function read_doc() {
$line_array = array();
$fileHandle = fopen( $this->filename, "r" );
$line = #fread( $fileHandle, filesize( $this->filename ) );
$lines = explode( chr( 0x0D ), $line );
$outtext = "";
foreach ( $lines as $thisline ) {
$pos = strpos( $thisline, chr( 0x00 ) );
if ( $pos !== false ) {
} else {
$line_array[] = preg_replace( "/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/\_\(\)]/", "", $thisline );
}
}
return implode("\n",$line_array);
}
Now it saves empty rows and txt file looks row by row .

You can use Apache Tika as complete solution it provides REST API.
Another good library is RawText, as it can do an OCR over images, and extract text from any doc. It's non-free, and it works over REST API.
The sample code extracting your file with RawText:
$result = $rawText->extract($your_file)

Related

Scan word document using php [duplicate]

Is it possible to read and write Word (2003 and 2007) files in PHP without using a COM object?
I know that I can:
$file = fopen('c:\file.doc', 'w+');
fwrite($file, $text);
fclose();
but Word will read it as an HTML file not a native .doc file.
Reading binary Word documents would involve creating a parser according to the published file format specifications for the DOC format. I think this is no real feasible solution.
You could use the Microsoft Office XML formats for reading and writing Word files - this is compatible with the 2003 and 2007 version of Word. For reading you have to ensure that the Word documents are saved in the correct format (it's called Word 2003 XML-Document in Word 2007). For writing you just have to follow the openly available XML schema. I've never used this format for writing out Office documents from PHP, but I'm using it for reading in an Excel worksheet (naturally saved as XML-Spreadsheet 2003) and displaying its data on a web page. As the files are plainly XML data it's no problem to navigate within and figure out how to extract the data you need.
The other option - a Word 2007 only option (if the OpenXML file formats are not installed in your Word 2003) - would be to ressort to OpenXML. As databyss pointed out here the DOCX file format is just a ZIP archive with XML files included. There are a lot of resources on MSDN regarding the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be much more complicated I think - it just depends on how much time you'll invest.
Perhaps you can have a look at PHPExcel which is a library able to write to Excel 2007 files and read from Excel 2007 files using the OpenXML standard. You could get an idea of the work involved when trying to read and write OpenXML Word documents.
this works with vs < office 2007 and its pure PHP, no COM crap, still trying to figure 2007
<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc)
{
$fileHandle = fopen($userDoc, "r");
$line = #fread($fileHandle, filesize($userDoc));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{
} else {
$outtext .= $thisline." ";
}
}
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/\_\(\)]/","",$outtext);
return $outtext;
}
$userDoc = "cv.doc";
$text = parseWord($userDoc);
echo $text;
?>
You can use Antiword, it is a free MS Word reader for Linux and most popular OS.
$document_file = 'c:\file.doc';
$text_from_doc = shell_exec('/usr/local/bin/antiword '.$document_file);
I don't know about reading native Word documents in PHP, but if you want to write a Word document in PHP, WordprocessingML (aka WordML) might be a good solution. All you have to do is create an XML document in the correct format. I believe Word 2003 and 2007 both support WordML.
Just updating the code
<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc)
{
$fileHandle = fopen($userDoc, "r");
$word_text = #fread($fileHandle, filesize($userDoc));
$line = "";
$tam = filesize($userDoc);
$nulos = 0;
$caracteres = 0;
for($i=1536; $i<$tam; $i++)
{
$line .= $word_text[$i];
if( $word_text[$i] == 0)
{
$nulos++;
}
else
{
$nulos=0;
$caracteres++;
}
if( $nulos>1996)
{
break;
}
}
//echo $caracteres;
$lines = explode(chr(0x0D),$line);
//$outtext = "<pre>";
$outtext = "";
foreach($lines as $thisline)
{
$tam = strlen($thisline);
if( !$tam )
{
continue;
}
$new_line = "";
for($i=0; $i<$tam; $i++)
{
$onechar = $thisline[$i];
if( $onechar > chr(240) )
{
continue;
}
if( $onechar >= chr(0x20) )
{
$caracteres++;
$new_line .= $onechar;
}
if( $onechar == chr(0x14) )
{
$new_line .= "</a>";
}
if( $onechar == chr(0x07) )
{
$new_line .= "\t";
if( isset($thisline[$i+1]) )
{
if( $thisline[$i+1] == chr(0x07) )
{
$new_line .= "\n";
}
}
}
}
//troca por hiperlink
$new_line = str_replace("HYPERLINK" ,"<a href=",$new_line);
$new_line = str_replace("\o" ,">",$new_line);
$new_line .= "\n";
//link de imagens
$new_line = str_replace("INCLUDEPICTURE" ,"<br><img src=",$new_line);
$new_line = str_replace("\*" ,"><br>",$new_line);
$new_line = str_replace("MERGEFORMATINET" ,"",$new_line);
$outtext .= nl2br($new_line);
}
return $outtext;
}
$userDoc = "custo.doc";
$userDoc = "Cultura.doc";
$text = parseWord($userDoc);
echo $text;
?>
Most probably you won't be able to read Word documents without COM.
Writing was covered in this topic
2007 might be a bit complicated as well.
The .docx format is a zip file that contains a few folders with other files in them for formatting and other stuff.
Rename a .docx file to .zip and you'll see what I mean.
So if you can work within zip files in PHP, you should be on the right path.
www.phplivedocx.org is a SOAP based service that means that you always need to be online for testing the Files also does not have enough examples for its use . Strangely I found only after 2 days of downloading (requires additionaly zend framework too) that its a SOAP based program(cursed me !!!)...I think without COM its just not possible on a Linux server and the only idea is to change the doc file in another usable file which PHP can parse...
Source gotten from
Use following class directly to read word document
class DocxConversion{
private $filename;
public function __construct($filePath) {
$this->filename = $filePath;
}
private function read_doc() {
$fileHandle = fopen($this->filename, "r");
$line = #fread($fileHandle, filesize($this->filename));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{
} else {
$outtext .= $thisline." ";
}
}
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/\_\(\)]/","",$outtext);
return $outtext;
}
private function read_docx(){
$striped_content = '';
$content = '';
$zip = zip_open($this->filename);
if (!$zip || is_numeric($zip)) return false;
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
if (zip_entry_name($zip_entry) != "word/document.xml") continue;
$content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
zip_entry_close($zip_entry);
}// end while
zip_close($zip);
$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = strip_tags($content);
return $striped_content;
}
/************************excel sheet************************************/
function xlsx_to_text($input_file){
$xml_filename = "xl/sharedStrings.xml"; //content file name
$zip_handle = new ZipArchive;
$output_text = "";
if(true === $zip_handle->open($input_file)){
if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
$xml_datas = $zip_handle->getFromIndex($xml_index);
$xml_handle = DOMDocument::loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
$output_text = strip_tags($xml_handle->saveXML());
}else{
$output_text .="";
}
$zip_handle->close();
}else{
$output_text .="";
}
return $output_text;
}
/*************************power point files*****************************/
function pptx_to_text($input_file){
$zip_handle = new ZipArchive;
$output_text = "";
if(true === $zip_handle->open($input_file)){
$slide_number = 1; //loop through slide files
while(($xml_index = $zip_handle->locateName("ppt/slides/slide".$slide_number.".xml")) !== false){
$xml_datas = $zip_handle->getFromIndex($xml_index);
$xml_handle = DOMDocument::loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
$output_text .= strip_tags($xml_handle->saveXML());
$slide_number++;
}
if($slide_number == 1){
$output_text .="";
}
$zip_handle->close();
}else{
$output_text .="";
}
return $output_text;
}
public function convertToText() {
if(isset($this->filename) && !file_exists($this->filename)) {
return "File Not exists";
}
$fileArray = pathinfo($this->filename);
$file_ext = $fileArray['extension'];
if($file_ext == "doc" || $file_ext == "docx" || $file_ext == "xlsx" || $file_ext == "pptx")
{
if($file_ext == "doc") {
return $this->read_doc();
} elseif($file_ext == "docx") {
return $this->read_docx();
} elseif($file_ext == "xlsx") {
return $this->xlsx_to_text();
}elseif($file_ext == "pptx") {
return $this->pptx_to_text();
}
} else {
return "Invalid File Type";
}
}
}
$docObj = new DocxConversion("test.docx"); //replace your document name with correct extension doc or docx
echo $docText= $docObj->convertToText();
Office 2007 .docx should be possible since it's an XML standard. Word 2003 most likely requires COM to read, even with the standards now published by MS, since those standards are huge. I haven't seen many libraries written to match them yet.
I don't know what you are going to use it for, but I needed .doc support for search indexing; What I did was use a little commandline tool called "catdoc"; This transfers the contents of the Word document to plain text so it can be indexed. If you need to keep formatting and stuff this is not your tool.
phpLiveDocx is a Zend Framework component and can read and write DOC and DOCX files in PHP on Linux, Windows and Mac.
See the project web site at:
http://www.phplivedocx.org
One way to manipulate Word files with PHP that you may find interesting is with the help of PHPDocX.
You may see how it works having a look at its online tutorial.
You can insert or extract contents or even merge multiple Word files into a asingle one.
Would the .rtf format work for your purposes? .rtf can easily be converted to and from .doc format, but it is written in plaintext (with control commands embedded). This is how I plan to integrate my application with Word documents.
even i'm working on same kind of project [An Onlinw Word Processor]!
But i've choosen c#.net and ASP.net. But through the survey i did; i got to know that
By Using Open XML SDK and VSTO [Visual Studio Tools For Office]
we may easily work with a word file manipulate them and even convert internally to different into several formats such as .odt,.pdf,.docx etc..
So, goto msdn.microsoft.com and be thorough about the office development tab. Its the easiest way to do this as all functions we need to implement are already available in .net!!
But as u want to do ur project in PHP, u can do it in Visual Studio and .net as PHP is also one of the .net Compliant Language!!
I have the same case
I guess I am going to use a cheap 50 mega windows based hosting with free domain to use it to convert my files on, for PHP server. And linking them is easy.
All you need is make an ASP.NET page that recieves the doc file via post and replies it via HTTP
so simple CURL would do it.

how to parse a Microsoft Word 97-2004 .doc file with PHP

How can i parse with PHP a .doc file "Microsoft Word 97-2004 document"?
I can parse "normal" .doc files with
private function read_doc() {
$fileHandle = fopen($this->filename, "r");
$line = #fread($fileHandle, filesize($this->filename));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{
} else {
$outtext .= $thisline." ";
}
}
//print_r($outtext);die();
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/\_\(\)]/","",$outtext);
return $outtext;
}
but that doesn't work with Microsoft Word 97-2004 .doc files.
I just want to extract the pure text. Nothing else.
--> Solution is PHPWord like Mark Baker recommends in his comment.
At the end i had to install linux catdoc 0.94.2 to resolve the problem. PHPWord couldn't convert all the files in a correct way to pure .txt format.
So heres a solution for linux (for example Unbuntu) users:
On command line install catdoc
sudo apt-get install catdoc
If you are on a Windows Server, have a look at this. It worked also for me:
http://blog.brush.co.nz/2009/09/catdoc-windows/
Then in your PHP Code you can call it like this (for Linux calls):
$escapeFile = escapeshellarg($data['tmp_name']);
$command = "catdoc $escapeFile";
$output = array();
exec($command,$output);
$text = implode("\n",$output);
then you can do for example
$text = strip_tags($text);
$text = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $text);
echo nl2br($text) ;
to see the result on screen.
That's what for me works best up to now.
If someone has a better solution, please tell me.

how to get first page content of docx in php

i want to get docx file fist page text only, suppose business.docx file has 5 pages, now i want to show only first page content and hide rest of the 4 pages text.
following is the code that read docx file.
function read_file_docx($filename)
{
$striped_content = '';
$content = '';
if(!$filename || !file_exists($filename)) return false;
$zip = zip_open($filename);
if (!$zip || is_numeric($zip)) return false;
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
if (zip_entry_name($zip_entry) != "word/document.xml") continue;
$content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
zip_entry_close($zip_entry);
}// end while
zip_close($zip);
//echo $content;
//echo "<hr>";
//file_put_contents('1.xml', $content);
$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = strip_tags($content);
return $striped_content;
}
i also want ms word text formatting like bold, italic, bullet, etc... i have google but cant get code which give me option to show docx first page content.
I suggest you to look into below php library
https://github.com/PHPOffice/PHPWord
you can find samples https://github.com/PHPOffice/PHPWord/tree/master/samples

Showing different characters when I convert .doc extenstion file to text format using php

I've word file with .doc extension which have only this content Test1001. When I convert this word file it's showing me following characters
Y, dXiJ(x(I_TS1EZBmU/xYy5g/GMGeD3Vqq8K)fw9 xrxwrTZaGy8IjbRcXI u3KGnD1NIBs
RuKV.ELM2fiVvlu8zH (W )6-rCSj id DAIqbJx6kASht(QpmcaSlXP1Mh9MVdDAaVBfJP8AVf 6Q
instead of Test1001
Do you know why it's showing this characters ?
I'm using following class:
private function read_doc() {
$fileHandle = fopen($this->filename, "r");
$line = #fread($fileHandle, filesize($this->filename));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{
} else {
$outtext .= $thisline." ";
}
}
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/\_\(\)]/","",$outtext);
return $outtext;
}
Word .doc formats are binary not text, you cannot simply assume they are text as you have done in your PHP.
To get the data out of a Word file, you would need to actually use Word or a library that understands and converts the Word binary format into text.
Most people try to use the .docx format that newer versions of Word create as this uses XML inside a zip archive. There are a number of libraries that will handle that format. Such as PHPOffice.
For the binary .doc format, there is Docvert that will convert from .doc to HTML on the fly.

Reading/Writing a MS Word file in PHP

Is it possible to read and write Word (2003 and 2007) files in PHP without using a COM object?
I know that I can:
$file = fopen('c:\file.doc', 'w+');
fwrite($file, $text);
fclose();
but Word will read it as an HTML file not a native .doc file.
Reading binary Word documents would involve creating a parser according to the published file format specifications for the DOC format. I think this is no real feasible solution.
You could use the Microsoft Office XML formats for reading and writing Word files - this is compatible with the 2003 and 2007 version of Word. For reading you have to ensure that the Word documents are saved in the correct format (it's called Word 2003 XML-Document in Word 2007). For writing you just have to follow the openly available XML schema. I've never used this format for writing out Office documents from PHP, but I'm using it for reading in an Excel worksheet (naturally saved as XML-Spreadsheet 2003) and displaying its data on a web page. As the files are plainly XML data it's no problem to navigate within and figure out how to extract the data you need.
The other option - a Word 2007 only option (if the OpenXML file formats are not installed in your Word 2003) - would be to ressort to OpenXML. As databyss pointed out here the DOCX file format is just a ZIP archive with XML files included. There are a lot of resources on MSDN regarding the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be much more complicated I think - it just depends on how much time you'll invest.
Perhaps you can have a look at PHPExcel which is a library able to write to Excel 2007 files and read from Excel 2007 files using the OpenXML standard. You could get an idea of the work involved when trying to read and write OpenXML Word documents.
this works with vs < office 2007 and its pure PHP, no COM crap, still trying to figure 2007
<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc)
{
$fileHandle = fopen($userDoc, "r");
$line = #fread($fileHandle, filesize($userDoc));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{
} else {
$outtext .= $thisline." ";
}
}
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/\_\(\)]/","",$outtext);
return $outtext;
}
$userDoc = "cv.doc";
$text = parseWord($userDoc);
echo $text;
?>
You can use Antiword, it is a free MS Word reader for Linux and most popular OS.
$document_file = 'c:\file.doc';
$text_from_doc = shell_exec('/usr/local/bin/antiword '.$document_file);
I don't know about reading native Word documents in PHP, but if you want to write a Word document in PHP, WordprocessingML (aka WordML) might be a good solution. All you have to do is create an XML document in the correct format. I believe Word 2003 and 2007 both support WordML.
Just updating the code
<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc)
{
$fileHandle = fopen($userDoc, "r");
$word_text = #fread($fileHandle, filesize($userDoc));
$line = "";
$tam = filesize($userDoc);
$nulos = 0;
$caracteres = 0;
for($i=1536; $i<$tam; $i++)
{
$line .= $word_text[$i];
if( $word_text[$i] == 0)
{
$nulos++;
}
else
{
$nulos=0;
$caracteres++;
}
if( $nulos>1996)
{
break;
}
}
//echo $caracteres;
$lines = explode(chr(0x0D),$line);
//$outtext = "<pre>";
$outtext = "";
foreach($lines as $thisline)
{
$tam = strlen($thisline);
if( !$tam )
{
continue;
}
$new_line = "";
for($i=0; $i<$tam; $i++)
{
$onechar = $thisline[$i];
if( $onechar > chr(240) )
{
continue;
}
if( $onechar >= chr(0x20) )
{
$caracteres++;
$new_line .= $onechar;
}
if( $onechar == chr(0x14) )
{
$new_line .= "</a>";
}
if( $onechar == chr(0x07) )
{
$new_line .= "\t";
if( isset($thisline[$i+1]) )
{
if( $thisline[$i+1] == chr(0x07) )
{
$new_line .= "\n";
}
}
}
}
//troca por hiperlink
$new_line = str_replace("HYPERLINK" ,"<a href=",$new_line);
$new_line = str_replace("\o" ,">",$new_line);
$new_line .= "\n";
//link de imagens
$new_line = str_replace("INCLUDEPICTURE" ,"<br><img src=",$new_line);
$new_line = str_replace("\*" ,"><br>",$new_line);
$new_line = str_replace("MERGEFORMATINET" ,"",$new_line);
$outtext .= nl2br($new_line);
}
return $outtext;
}
$userDoc = "custo.doc";
$userDoc = "Cultura.doc";
$text = parseWord($userDoc);
echo $text;
?>
Most probably you won't be able to read Word documents without COM.
Writing was covered in this topic
2007 might be a bit complicated as well.
The .docx format is a zip file that contains a few folders with other files in them for formatting and other stuff.
Rename a .docx file to .zip and you'll see what I mean.
So if you can work within zip files in PHP, you should be on the right path.
www.phplivedocx.org is a SOAP based service that means that you always need to be online for testing the Files also does not have enough examples for its use . Strangely I found only after 2 days of downloading (requires additionaly zend framework too) that its a SOAP based program(cursed me !!!)...I think without COM its just not possible on a Linux server and the only idea is to change the doc file in another usable file which PHP can parse...
Source gotten from
Use following class directly to read word document
class DocxConversion{
private $filename;
public function __construct($filePath) {
$this->filename = $filePath;
}
private function read_doc() {
$fileHandle = fopen($this->filename, "r");
$line = #fread($fileHandle, filesize($this->filename));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{
} else {
$outtext .= $thisline." ";
}
}
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/\_\(\)]/","",$outtext);
return $outtext;
}
private function read_docx(){
$striped_content = '';
$content = '';
$zip = zip_open($this->filename);
if (!$zip || is_numeric($zip)) return false;
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
if (zip_entry_name($zip_entry) != "word/document.xml") continue;
$content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
zip_entry_close($zip_entry);
}// end while
zip_close($zip);
$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = strip_tags($content);
return $striped_content;
}
/************************excel sheet************************************/
function xlsx_to_text($input_file){
$xml_filename = "xl/sharedStrings.xml"; //content file name
$zip_handle = new ZipArchive;
$output_text = "";
if(true === $zip_handle->open($input_file)){
if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
$xml_datas = $zip_handle->getFromIndex($xml_index);
$xml_handle = DOMDocument::loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
$output_text = strip_tags($xml_handle->saveXML());
}else{
$output_text .="";
}
$zip_handle->close();
}else{
$output_text .="";
}
return $output_text;
}
/*************************power point files*****************************/
function pptx_to_text($input_file){
$zip_handle = new ZipArchive;
$output_text = "";
if(true === $zip_handle->open($input_file)){
$slide_number = 1; //loop through slide files
while(($xml_index = $zip_handle->locateName("ppt/slides/slide".$slide_number.".xml")) !== false){
$xml_datas = $zip_handle->getFromIndex($xml_index);
$xml_handle = DOMDocument::loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
$output_text .= strip_tags($xml_handle->saveXML());
$slide_number++;
}
if($slide_number == 1){
$output_text .="";
}
$zip_handle->close();
}else{
$output_text .="";
}
return $output_text;
}
public function convertToText() {
if(isset($this->filename) && !file_exists($this->filename)) {
return "File Not exists";
}
$fileArray = pathinfo($this->filename);
$file_ext = $fileArray['extension'];
if($file_ext == "doc" || $file_ext == "docx" || $file_ext == "xlsx" || $file_ext == "pptx")
{
if($file_ext == "doc") {
return $this->read_doc();
} elseif($file_ext == "docx") {
return $this->read_docx();
} elseif($file_ext == "xlsx") {
return $this->xlsx_to_text();
}elseif($file_ext == "pptx") {
return $this->pptx_to_text();
}
} else {
return "Invalid File Type";
}
}
}
$docObj = new DocxConversion("test.docx"); //replace your document name with correct extension doc or docx
echo $docText= $docObj->convertToText();
Office 2007 .docx should be possible since it's an XML standard. Word 2003 most likely requires COM to read, even with the standards now published by MS, since those standards are huge. I haven't seen many libraries written to match them yet.
I don't know what you are going to use it for, but I needed .doc support for search indexing; What I did was use a little commandline tool called "catdoc"; This transfers the contents of the Word document to plain text so it can be indexed. If you need to keep formatting and stuff this is not your tool.
phpLiveDocx is a Zend Framework component and can read and write DOC and DOCX files in PHP on Linux, Windows and Mac.
See the project web site at:
http://www.phplivedocx.org
One way to manipulate Word files with PHP that you may find interesting is with the help of PHPDocX.
You may see how it works having a look at its online tutorial.
You can insert or extract contents or even merge multiple Word files into a asingle one.
Would the .rtf format work for your purposes? .rtf can easily be converted to and from .doc format, but it is written in plaintext (with control commands embedded). This is how I plan to integrate my application with Word documents.
even i'm working on same kind of project [An Onlinw Word Processor]!
But i've choosen c#.net and ASP.net. But through the survey i did; i got to know that
By Using Open XML SDK and VSTO [Visual Studio Tools For Office]
we may easily work with a word file manipulate them and even convert internally to different into several formats such as .odt,.pdf,.docx etc..
So, goto msdn.microsoft.com and be thorough about the office development tab. Its the easiest way to do this as all functions we need to implement are already available in .net!!
But as u want to do ur project in PHP, u can do it in Visual Studio and .net as PHP is also one of the .net Compliant Language!!
I have the same case
I guess I am going to use a cheap 50 mega windows based hosting with free domain to use it to convert my files on, for PHP server. And linking them is easy.
All you need is make an ASP.NET page that recieves the doc file via post and replies it via HTTP
so simple CURL would do it.

Categories