How to read math equations from a docx file using PHP (CodeIgniter) - php

I am trying to read a docx file from PHP. The text is read successfully, but the equations in the Word document are missing from the output. I am new to PHP and do not know how to read them, so please suggest some ideas. The function I have tried for reading the document is:
function index()
{
$document = 'file_path';
$text_output = $this->read_docx($document);
echo nl2br($text_output);
}
private function read_docx($filename)
{
var_dump($filename);
$striped_content = '';
$content = '';
$zip = zip_open($filename);
if (!$zip || is_numeric($zip))
return false;
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry) == FALSE)
continue;
if (zip_entry_name($zip_entry) != "word/document.xml")
continue;
$content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
zip_entry_close($zip_entry);
}// end while
zip_close($zip);
$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = strip_tags($content);
return $striped_content;
}
This is the sample math equation in the docx file that I am trying to read and render to an HTML page. Thanks.

I went through https://msdn.microsoft.com/en-us/library/aa982683(v=office.12).aspx#Office2007ManipulatingXMLDocs_exploring in full and parsed the XML using PHP's XMLReader:
$document = 'url';
/*Function to extract images*/
function readZippedImages($filename)
{
$for_image = $filename;
/*Create a new ZIP archive object*/
$zip = new ZipArchive;
/*Open the received archive file*/
$final_arr=array();
$repo = array();
if (true === $zip->open($filename))
{
for ($i=0; $i<$zip->numFiles;$i++)
{
if($i==3)//should be document.xml
{
//======function using xml parser================================//
$check = $zip->getFromIndex($i);
//Create a new XMLReader Instance
$reader = new XMLReader();
//Loading from a XML File or URL
//$reader->open($check);
//Loading from PHP variable
$reader->xml($check);
//====================parsing through the document==================//
while($reader->read())
{
$node_loc = $reader->localName;
if($reader->nodeType == XMLREADER::ELEMENT && $reader->localName == 'body')
{
$reader->read();
$read_content = $reader->value. "\n";
}
if($node_loc == '#text')//parsing all the text from document using #text tag
{
$temp_value = array("text"=>$reader->value);
array_push($final_arr,$temp_value);
$reader->read();
$read_content = $reader->value. "\n";
}
if($node_loc == 'blip')//parsing all the images using blip tag which is under drawing tag
{
$attri_r = $reader->getAttribute("r:embed");
$current_image_name = $repo[$attri_r];
$image_stream = showimage($for_image,$current_image_name);//showimage() is a plain function here, so no $this; returns the base64 string
$temp_value = array("image"=>$image_stream);
array_push($final_arr,$temp_value);
}
}
//==================xml parser end============================//
}
if($i==2)//should be rels.xml
{
$check_id = $zip->getFromIndex($i);
$reader_relation = new XMLReader();
$reader_relation->xml($check_id);
//====================parsing through the document==================//
while($reader_relation->read())
{
$node_loc = $reader_relation->localName;
if($reader_relation->nodeType == XMLREADER::ELEMENT && $reader_relation->localName == 'Relationship')
{
$read_content_id = $reader_relation->getAttribute("Id");
$read_content_name = $reader_relation->getAttribute("Target");
$repo[$read_content_id]=$read_content_name;
}
}
}
}
}
return $final_arr;// hand the collected text and images back to the caller
}
function showimage($zip_file_original, $file_name_image)
{
$file_name_image = 'word/'.$file_name_image;// getting the image in the zip using its name
$z_show = new ZipArchive();
if ($z_show->open($zip_file_original) !== true) {
echo "File not found.";
return false;
}
$fp = $z_show->getStream($file_name_image);
if(!$fp) {
echo "Could not load image.";
return false;
}
// No header() calls here: the image is returned as a base64 string, not streamed to the browser
$image = stream_get_contents($fp);
fclose($fp);// close the stream before returning; code placed after return never runs
$picture = base64_encode($image);
return $picture;//return the base64 string for the current image.
}
$final_arr = readZippedImages($document);
Print $final_arr and you will get all the text and images in the document.

First of all, it is a very bad idea to parse XML using a regular expression. Instead, use one of PHP's XML parsers, which are designed for this kind of task.
You need to read the Office Open XML specification (the standard used by Microsoft Office) to learn the internal data structure that Word uses to store these math equations.
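As a rough illustration of that advice: in Office Open XML, Word stores equations as Office Math Markup Language (OMML) elements such as m:oMath, which strip_tags() flattens along with everything else. The sketch below pulls the literal characters of each equation out of a document.xml string with DOMXPath. The helper name extractMathText() is made up for this example, and the result is only the flattened text; structure such as fractions or superscripts is lost, so treat it as a starting point rather than a renderer.

```php
<?php
// Sketch only: extractMathText() is a hypothetical helper, not a library API.

/**
 * Return the flattened text of every <m:oMath> equation found in the
 * contents of word/document.xml (passed in as a string).
 */
function extractMathText(string $documentXml): array
{
    // Namespace used by Office Math Markup Language (OMML).
    $mathNs = 'http://schemas.openxmlformats.org/officeDocument/2006/math';

    $dom = new DOMDocument();
    if (!@$dom->loadXML($documentXml)) {
        return []; // not well-formed XML
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('m', $mathNs);

    $equations = [];
    foreach ($xpath->query('//m:oMath') as $math) {
        $text = '';
        // <m:t> elements hold the literal characters of each math run.
        foreach ($xpath->query('.//m:t', $math) as $run) {
            $text .= $run->textContent;
        }
        $equations[] = $text;
    }

    return $equations;
}
```

The XML string can come from ZipArchive::getFromName('word/document.xml'), as in the answers above.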

Related

How to get first page content of docx in PHP

I want to get only the first page of text from a docx file. Suppose business.docx has 5 pages; I want to show only the first page's content and hide the text of the other 4 pages.
The following code reads the docx file:
function read_file_docx($filename)
{
$striped_content = '';
$content = '';
if(!$filename || !file_exists($filename)) return false;
$zip = zip_open($filename);
if (!$zip || is_numeric($zip)) return false;
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
if (zip_entry_name($zip_entry) != "word/document.xml") continue;
$content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
zip_entry_close($zip_entry);
}// end while
zip_close($zip);
//echo $content;
//echo "<hr>";
//file_put_contents('1.xml', $content);
$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = strip_tags($content);
return $striped_content;
}
I also want the MS Word text formatting such as bold, italic, bullets, etc. I have googled but can't find code that lets me show only the first page content of a docx.
I suggest you look into the PHP library below:
https://github.com/PHPOffice/PHPWord
You can find samples at https://github.com/PHPOffice/PHPWord/tree/master/samples
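Independently of any library, it may help to know that a docx file has no fixed pages: page boundaries are computed when the document is laid out. What document.xml can contain are explicit page breaks (w:br with w:type="page") and the w:lastRenderedPageBreak markers Word writes when it saves. The sketch below is a heuristic built on that assumption: it collects paragraph text until the first such marker. The helper name firstPageText() is invented for this example, it only looks at top-level paragraphs (not tables), and it returns the whole text if the file contains no break markers.

```php
<?php
// Sketch only: firstPageText() is a hypothetical helper, and the
// "first page" it returns is a heuristic based on page-break markers.

function firstPageText(string $documentXml): string
{
    $wordNs = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $dom = new DOMDocument();
    if (!@$dom->loadXML($documentXml)) {
        return '';
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', $wordNs);

    $lines = [];
    foreach ($xpath->query('//w:body/w:p') as $paragraph) {
        $line = '';
        // Text runs and page-break markers, returned in document order.
        $nodes = $xpath->query(
            './/w:t | .//w:br[@w:type="page"] | .//w:lastRenderedPageBreak',
            $paragraph
        );
        foreach ($nodes as $node) {
            if ($node->localName !== 't') {
                // First page break reached: keep what we have and stop.
                if ($line !== '') {
                    $lines[] = $line;
                }
                return implode("\n", $lines);
            }
            $line .= $node->textContent;
        }
        $lines[] = $line;
    }

    return implode("\n", $lines);
}
```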

Is there a way to read Doc files in PHP similar to Docx?

I am able to extract the text content of a Docx File, I want to do the same for Doc file. I tried using the same code but could not read anything. I guess the reason is "Doc formats are not zipped archives." Here is the code:
function readDocx ($filePath)
{
// Create new ZIP archive
$zip = new ZipArchive;
$dataFile = 'word/document.xml';
// Open received archive file
if (true === $zip->open($filePath)) {
// If done, search for the data file in the archive
if (($index = $zip->locateName($dataFile)) !== false) {
// If found, read it to the string
$data = $zip->getFromIndex($index);
// Close archive file
$zip->close();
// Load XML from a string
// Skip errors and warnings
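// loadXML() is an instance method; calling it statically fails on current PHP versions
$xml = new DOMDocument();
$xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
// "\n" must be double-quoted to mean a newline
$contents = explode("\n",strip_tags($xml->saveXML()));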
$text = '';
foreach($contents as $i=>$content) {
$text .= $contents[$i];
}
return $text;
}
$zip->close();
}
return "";
}
Please let me know if there is a way to fetch the text content from a Doc file.
Well, I finally got the answer, so I thought I should share it here. I simply used COM objects (this requires Windows with Microsoft Word installed):
$DocumentPath="C:/xampp/htdocs/abcd.doc";
$word = new COM("word.application") or die("Unable to instantiate application object");
$wordDocument = new COM("word.document") or die("Unable to instantiate document object");
$word->Visible = 0;
$wordDocument = $word->Documents->Open($DocumentPath);
$HTMLPath = substr_replace($DocumentPath, 'html', -3, 3);
$wordDocument->SaveAs($HTMLPath, 3);
$wordDocument = null;
$word->Quit();
$word = null;
readfile($HTMLPath);
unlink($HTMLPath);
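Because a legacy .doc file is an OLE2 compound file rather than a ZIP archive, code that handles both formats can check the file signature before choosing an extraction path. The sketch below does that by reading the first eight bytes; the helper name detectWordFormat() and its return values are assumptions made for this example. Note that "zip" only means the file is a ZIP container, which is true of .docx but also of many other formats.

```php
<?php
// Sketch only: detectWordFormat() is a hypothetical helper.

/**
 * Classify a file by its leading bytes:
 *   'zip'  - ZIP container (what .docx files are)
 *   'ole2' - OLE2 compound file (what legacy .doc files are)
 */
function detectWordFormat(string $path): string
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return 'unreadable';
    }
    $magic = (string) fread($handle, 8);
    fclose($handle);

    // ZIP local file header signature.
    if (strncmp($magic, "PK\x03\x04", 4) === 0) {
        return 'zip';
    }
    // OLE2 compound file signature.
    if ($magic === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'ole2';
    }
    return 'unknown';
}
```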

Extract text from doc and docx

I would like to know how I can read the contents of a doc or docx file. I'm using a Linux VPS and PHP, but if there is a simpler solution using another language, please let me know, as long as it works on a Linux web server.
Here I have added a solution for getting the text from .doc and .docx Word files
How to extract text from word file .doc,docx php
For .doc
private function read_doc() {
$fileHandle = fopen($this->filename, "r");
$line = @fread($fileHandle, filesize($this->filename));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{
} else {
$outtext .= $thisline." ";
}
}
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
return $outtext;
}
For .docx
private function read_docx(){
$striped_content = '';
$content = '';
$zip = zip_open($this->filename);
if (!$zip || is_numeric($zip)) return false;
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
if (zip_entry_name($zip_entry) != "word/document.xml") continue;
$content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
zip_entry_close($zip_entry);
}// end while
zip_close($zip);
$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = strip_tags($content);
return $striped_content;
}
This is a .DOCX solution only. For .DOC or .PDF you'll need something else, such as pdf2text.php for PDF.
function docx2text($filename) {
return readZippedXML($filename, "word/document.xml");
}
function readZippedXML($archiveFile, $dataFile) {
// Create new ZIP archive
$zip = new ZipArchive;
// Open received archive file
if (true === $zip->open($archiveFile)) {
// If done, search for the data file in the archive
if (($index = $zip->locateName($dataFile)) !== false) {
// If found, read it to the string
$data = $zip->getFromIndex($index);
// Close archive file
$zip->close();
// Load XML from a string
// Skip errors and warnings
$xml = new DOMDocument();
$xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
// Return data without XML formatting tags
return strip_tags($xml->saveXML());
}
$zip->close();
}
// In case of failure return empty string
return "";
}
echo docx2text("test.docx"); // Save this contents to file
Parse .docx, .odt, .doc and .rtf documents
I wrote a library that parses the docx, odt and rtf documents based on answers here and elsewhere.
The major improvement I have made to the .docx and .odt parsing is that the library processes the XML that describes the document and attempts to map it to HTML tags, i.e. em and strong tags. This means that if you're using the library for a CMS, text formatting is not lost.
You can get it here
My solution is Antiword for .doc and docx2txt for .docx
Assuming a linux server that you control, download each one, extract then install. I installed each one system wide:
Antiword: make global_install
docx2txt: make install
Then to use these tools to extract the text into a string in php:
//for .doc
$text = shell_exec('/usr/local/bin/antiword -w 0 ' .
escapeshellarg($docFilePath));
//for .docx
$text = shell_exec('/usr/local/bin/docx2txt.pl ' .
escapeshellarg($docxFilePath) . ' -');
docx2txt requires perl
no_freedom's solution does extract text from docx files, but it can butcher whitespace. Most files I tested had instances where words that should be separated had no space between them. Not good when you want to full text search the documents you're processing.
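The missing spaces come from strip_tags() discarding the elements that carry the document's whitespace: paragraph boundaries, tabs and line breaks. One way around this, sketched below, is to walk the XML and translate those elements explicitly. The helper name docxXmlToText() is invented for this example, and it assumes the input is the contents of word/document.xml; paragraphs nested in text boxes may appear twice.

```php
<?php
// Sketch only: docxXmlToText() is a hypothetical helper.

/**
 * Convert the contents of word/document.xml to plain text, keeping
 * paragraph breaks, tabs and line breaks instead of dropping them.
 */
function docxXmlToText(string $documentXml): string
{
    $wordNs = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $dom = new DOMDocument();
    if (!@$dom->loadXML($documentXml)) {
        return '';
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', $wordNs);

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $paragraph) {
        $line = '';
        // Only look inside runs (w:r) so that tab-stop definitions in the
        // paragraph properties are not mistaken for tab characters.
        $nodes = $xpath->query('.//w:r/w:t | .//w:r/w:tab | .//w:r/w:br', $paragraph);
        foreach ($nodes as $node) {
            switch ($node->localName) {
                case 't':
                    $line .= $node->textContent;
                    break;
                case 'tab':
                    $line .= "\t";
                    break;
                case 'br':
                    $line .= "\n";
                    break;
            }
        }
        $paragraphs[] = $line;
    }

    // One output line per paragraph, so words in adjacent paragraphs
    // or table cells never run together.
    return implode("\n", $paragraphs);
}
```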
Try Apache POI. It works well for Java. I suppose you won't have any difficulties installing Java on Linux.
I would suggest extracting the text with Apache Tika; it can extract content from many file types, such as .doc, .docx, PDF and many others.
I used docx2txt to extract the docx file content. My code is as follows:
if($extention == "docx")
{
$docxFilePath = "/var/www/vhosts/abc.com/httpdocs/writers/filename.docx";
$content = shell_exec('/var/www/vhosts/abc.com/httpdocs/docx2txt/docx2txt.pl ' .
escapeshellarg($docxFilePath) . ' -');
}
I made a few small improvements to the doc-to-txt converter function:
private function read_doc() {
$line_array = array();
$fileHandle = fopen( $this->filename, "r" );
$line = @fread( $fileHandle, filesize( $this->filename ) );
$lines = explode( chr( 0x0D ), $line );
$outtext = "";
foreach ( $lines as $thisline ) {
$pos = strpos( $thisline, chr( 0x00 ) );
if ( $pos !== false ) {
} else {
$line_array[] = preg_replace( "/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $thisline );
}
}
return implode("\n",$line_array);
}
Now it preserves empty rows, so the resulting text keeps the document's row-by-row layout.
You can use Apache Tika as a complete solution; it provides a REST API.
Another good library is RawText, as it can do an OCR over images, and extract text from any doc. It's non-free, and it works over REST API.
The sample code extracting your file with RawText:
$result = $rawText->extract($your_file);

Need PHP script to decompress and loop through zipped file

I am using a fairly straight-forward script to open and parse several xml files that are gzipped. I also need to do the same basic operation with a ZIP file. It seems like it should be simple, but I haven't been able to find what looked like equivalent code anywhere.
Here is the simple version of what I am already doing:
$import_file = "source.gz";
$sfp = gzopen($import_file, "rb"); ///// OPEN GZIPPED data
while ($string = gzread($sfp, 4096)) { //Loop through the data
/// Parse Output And Do Stuff with $string
}
gzclose($sfp);
What would do the same thing for a zipped file?
If you have PHP 5 >= 5.2.0, PECL zip >= 1.5.0 then you may use the ZipArchive libraries:
$zip = new ZipArchive;
if ($zip->open('source.zip') === TRUE)
{
for($i = 0; $i < $zip->numFiles; $i++)
{
$fp = $zip->getStream($zip->getNameIndex($i));
if(!$fp) exit("failed\n");
while (!feof($fp)) {
$contents = fread($fp, 8192);
// do some stuff
}
fclose($fp);
}
}
else
{
echo 'Error reading zip-archive!';
}
There is a smart way to do it with ZipArchive. You can use a for loop with ZipArchive::statIndex() to get all the information you need. You can access the files by their index (ZipArchive::getFromIndex()) or name (ZipArchive::getFromName()).
For example:
function processZip(string $zipFile): bool
{
$zip = new ZipArchive();
if ($zip->open($zipFile) !== true) {
echo '<p>Can\'t open zip archive!</p>';
return false;
}
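// As long as statIndex() does not return false keep iterating
for ($idx = 0; $entry = $zip->statIndex($idx); $idx++) {
$directory = \dirname($entry['name']);
// Directory entries inside the archive end with a slash
if (\substr($entry['name'], -1) !== '/') {
// file contents
$contents = $zip->getFromIndex($idx);
}
}
$zip->close();
// The declared bool return type requires an explicit return value
return true;
}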
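To stay closer to the gzread() loop from the question, the two ideas above can be combined: iterate by index, skip directory entries, and read each file through a stream in fixed-size chunks so large entries never have to fit in memory at once. This is only a sketch; the helper name zipEntrySizes() is made up, and counting bytes stands in for whatever parsing the real script does with each chunk.

```php
<?php
// Sketch only: zipEntrySizes() is a hypothetical helper. It counts the
// bytes of each file in the archive as a stand-in for real processing.

function zipEntrySizes(string $zipPath): array
{
    $zip = new ZipArchive();
    if ($zip->open($zipPath) !== true) {
        return [];
    }

    $sizes = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        // Directory entries end with a slash and have no content to read.
        if ($name === false || substr($name, -1) === '/') {
            continue;
        }

        $stream = $zip->getStream($name);
        if ($stream === false) {
            continue;
        }

        $bytes = 0;
        while (!feof($stream)) {
            $chunk = fread($stream, 8192);
            if ($chunk === false) {
                break;
            }
            // Parse or otherwise process $chunk here.
            $bytes += strlen($chunk);
        }
        fclose($stream);

        $sizes[$name] = $bytes;
    }

    $zip->close();
    return $sizes;
}
```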

How can I read PNG Metadata from PHP?

This is what I have so far:
<?php
$file = "18201010338AM16390621000846.png";
$test = file_get_contents($file, FILE_BINARY);
echo str_replace("\n","<br>",$test);
?>
The output is sorta what I want, but I really only need lines 3-7 (inclusively). This is what the output looks like now: http://silentnoobs.com/pbss/collector/test.php. I am trying to get the data from "PunkBuster Screenshot (±) AAO Bridge Crossing" to "Resulting: w=394 X h=196 sample=2". I think it'd be fairly straight forward to read through the file, and store each line in an array, line[0] would need to be "PunkBuster Screenshot (±) AAO Bridge Crossing", and so on. All those lines are subject to change, so I can't just search for something finite.
I've tried for a few days now, and it doesn't help much that I'm poor at php.
The PNG file format defines that a PNG document is split up into multiple chunks of data. You must therefore navigate your way to the chunk you desire.
The data you want to extract seem to be defined in a tEXt chunk. I've written the following class to allow you to extract chunks from PNG files.
class PNG_Reader
{
private $_chunks;
private $_fp;
function __construct($file) {
if (!file_exists($file)) {
throw new Exception('File does not exist');
}
$this->_chunks = array ();
// Open the file
$this->_fp = fopen($file, 'r');
if (!$this->_fp)
throw new Exception('Unable to open file');
// Read the magic bytes and verify
$header = fread($this->_fp, 8);
if ($header != "\x89PNG\x0d\x0a\x1a\x0a")
throw new Exception('Is not a valid PNG image');
// Loop through the chunks. Byte 0-3 is length, Byte 4-7 is type
$chunkHeader = fread($this->_fp, 8);
while ($chunkHeader) {
// Extract length and type from binary data
$chunk = @unpack('Nsize/a4type', $chunkHeader);
// Store position into internal array
if (!isset($this->_chunks[$chunk['type']]))
$this->_chunks[$chunk['type']] = array ();
$this->_chunks[$chunk['type']][] = array (
'offset' => ftell($this->_fp),
'size' => $chunk['size']
);
// Skip to next chunk (over body and CRC)
fseek($this->_fp, $chunk['size'] + 4, SEEK_CUR);
// Read next chunk header
$chunkHeader = fread($this->_fp, 8);
}
}
function __destruct() { fclose($this->_fp); }
// Returns all chunks of said type
public function get_chunks($type) {
if (!isset($this->_chunks[$type]))
return null;
$chunks = array ();
foreach ($this->_chunks[$type] as $chunk) {
if ($chunk['size'] > 0) {
fseek($this->_fp, $chunk['offset'], SEEK_SET);
$chunks[] = fread($this->_fp, $chunk['size']);
} else {
$chunks[] = '';
}
}
return $chunks;
}
}
You can use it like this to extract the desired tEXt chunk:
$file = '18201010338AM16390621000846.png';
$png = new PNG_Reader($file);
$rawTextData = $png->get_chunks('tEXt');
$metadata = array();
foreach($rawTextData as $data) {
$sections = explode("\0", $data);
if(count($sections) > 1) {
$key = array_shift($sections);
$metadata[$key] = implode("\0", $sections);
} else {
$metadata[] = $data;
}
}
<?php
$fp = fopen('18201010338AM16390621000846.png', 'rb');
$sig = fread($fp, 8);
if ($sig != "\x89PNG\x0d\x0a\x1a\x0a")
{
print "Not a PNG image";
fclose($fp);
die();
}
while (!feof($fp))
{
$data = unpack('Nlength/a4type', fread($fp, 8));
if ($data['type'] == 'IEND') break;
if ($data['type'] == 'tEXt')
{
list($key, $val) = explode("\0", fread($fp, $data['length']));
echo "<h1>$key</h1>";
echo nl2br($val);
fseek($fp, 4, SEEK_CUR);
}
else
{
fseek($fp, $data['length'] + 4, SEEK_CUR);
}
}
fclose($fp);
?>
It assumes a basically well formed PNG file.
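For reference, the chunk layout both answers rely on is: a 4-byte big-endian length, a 4-byte type, the data, then a 4-byte CRC, and a tEXt chunk's data is a keyword and a value separated by a NUL byte. The sketch below applies the same walk to a PNG already held in a string, which can be handy when the image comes from an upload or a database rather than a file. The helper name parsePngTextChunks() is invented for this example, and it does not verify CRCs.

```php
<?php
// Sketch only: parsePngTextChunks() is a hypothetical helper.
// It trusts the chunk lengths and does not verify the CRC fields.

function parsePngTextChunks(string $png): array
{
    // Every PNG starts with this 8-byte signature.
    if (substr($png, 0, 8) !== "\x89PNG\x0d\x0a\x1a\x0a") {
        return [];
    }

    $metadata = [];
    $offset = 8;
    $total = strlen($png);

    // Chunk layout: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    while ($offset + 8 <= $total) {
        $header = unpack('Nlength/a4type', substr($png, $offset, 8));
        if ($header['type'] === 'IEND') {
            break;
        }
        if ($header['type'] === 'tEXt') {
            $data = substr($png, $offset + 8, $header['length']);
            // Keyword and value are separated by a single NUL byte.
            $parts = explode("\0", $data, 2);
            if (count($parts) === 2) {
                $metadata[$parts[0]] = $parts[1];
            }
        }
        // Skip header (8), data and CRC (4) to reach the next chunk.
        $offset += 8 + $header['length'] + 4;
    }

    return $metadata;
}
```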
I found this problem a few days ago, so I made a library to extract the metadata (Exif, XMP and GPS) of a PNG in PHP, 100% native, I hope it helps. :) PNGMetadata
How about:
http://www.php.net/manual/en/function.getimagesize.php
