My PHP script successfully reads all the text from a .docx file, but I cannot work out where the line breaks belong, so the text comes out bunched up as one huge paragraph that is hard to read. I have gone through all of the XML files in the archive by hand and still cannot find what marks a line break.
Here are the functions I use to retrieve the file data and return the plain text.
public function read($FilePath)
{
// Save name of the file
parent::SetDocName($FilePath);
$Data = $this->docx2text($FilePath);
$Data = str_replace("<", "&lt;", $Data);
$Data = str_replace(">", "&gt;", $Data);
$Breaks = array("\r\n", "\n", "\r");
$Data = str_replace($Breaks, '<br />', $Data);
$this->Content = $Data;
}
function docx2text($filename) {
return $this->readZippedXML($filename, "word/document.xml");
}
function readZippedXML($archiveFile, $dataFile)
{
// Create new ZIP archive
$zip = new ZipArchive;
// Open received archive file
if (true === $zip->open($archiveFile))
{
// If done, search for the data file in the archive
if (($index = $zip->locateName($dataFile)) !== false)
{
// If found, read it to the string
$data = $zip->getFromIndex($index);
// Close archive file
$zip->close();
// Load XML from a string
// Skip errors and warnings
// loadXML() is an instance method, so create the document first
$xml = new DOMDocument();
$xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
$xmldata = $xml->saveXML();
//$xmldata = str_replace("</w:t>", "\r\n", $xmldata);
// Return data without XML formatting tags
return strip_tags($xmldata);
}
$zip->close();
}
// In case of failure return empty string
return "";
}
It is actually quite a simple answer. All you need to do is add this line in readZippedXML(), before the call to strip_tags() (it can take the place of the commented-out str_replace line):
$xmldata = str_replace("</w:p>", "\r\n", $xmldata);
This is because </w:p> is what Word uses to mark the end of a paragraph. E.g.
<w:p>This is a paragraph.</w:p>
<w:p>And a second one.</w:p>
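As a minimal sketch of that idea on its own, here is the replacement applied to a small hand-written sample. The function name wordXmlToText() and the sample markup are made up for illustration; real document.xml files wrap the text in further elements such as w:r and w:t, which strip_tags() removes along with everything else.

```php
<?php
// Hypothetical helper: turn WordprocessingML markup into plain text,
// keeping one line break per paragraph.
function wordXmlToText(string $xml): string
{
    // Mark paragraph ends before the tags are stripped,
    // otherwise the information is lost.
    $xml = str_replace('</w:p>', "\r\n", $xml);

    return strip_tags($xml);
}

// Hand-written sample, not taken from a real .docx file.
$sample = '<w:document><w:body>'
    . '<w:p><w:r><w:t>This is a paragraph.</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>And a second one.</w:t></w:r></w:p>'
    . '</w:body></w:document>';

echo wordXmlToText($sample);
```

Because read() later converts "\r\n" into <br />, each paragraph should then end up on its own line in the HTML output.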
Actually, why don't you use OpenXML? I think it works with PHP too. And then you don't have to go down to the nitty gritty file xml details.
Here is a link:
http://openxmldeveloper.org/articles/4606.aspx
I have this code:
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->Load('/home/dom/public_html/cache/feed.xml');
$xmlor = '/home/dom/public_html/cache/feed.xml';
// open file and prepare mods
$fh = fopen($xmlor, 'r+');
$data = fread($fh, filesize($xmlor));
$dmca_claim_jpg = array( 'baduser_.jpg','user78.jpg' );
$dmca_claim_link = array( 'mydomain.com/baduser_','mydomain.com/user78' );
echo "Opening local XML for edit..." . PHP_EOL;
$new_data = str_replace("extdomain.com", "mydomain.com", $data);
$new_data2 = str_replace($dmca_claim_jpg, "DMCA.jpg", $data);
$new_data3 = str_replace($dmca_claim_link, "#", $data);
fclose($fh);
// run mods
$fh = fopen($xmlor, 'r+');
fwrite($fh, $new_data);
fwrite($fh, $new_data2);
fwrite($fh, $new_data3);
echo "Updated feed URL and DMCA claims in local XML..." . PHP_EOL;
fclose($fh);
It does not report any errors when it runs, but it corrupts the XML file: the first two lines go missing (weird) once $new_data2 and $new_data3 are also written to the file.
Writing only $new_data works fine.
I think it has to do with the $dmca_claim_jpg and $dmca_claim_link arrays.
Parse the XML using SimpleXML or DOMDocument. It's cleaner, and you get a standard OOP way of accessing nodes.
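If you do stay with string replacement, note that the question's code builds $new_data, $new_data2 and $new_data3 each from the untouched $data and then writes all three to the file one after another, so no single copy contains every change. A rough sketch of chaining the replacements on one string instead (the function name applyFeedMods() and the sample markup are invented for the example):

```php
<?php
// Hypothetical helper: apply all three replacements to the same string,
// so each step sees the result of the previous one.
function applyFeedMods(string $data, array $claimJpgs, array $claimLinks): string
{
    $data = str_replace('extdomain.com', 'mydomain.com', $data);
    $data = str_replace($claimJpgs, 'DMCA.jpg', $data);
    $data = str_replace($claimLinks, '#', $data);

    return $data;
}

// Made-up sample standing in for the feed contents.
$feed = '<a href="http://extdomain.com/baduser_">'
    . '<img src="http://extdomain.com/img/baduser_.jpg"/></a>';

$result = applyFeedMods(
    $feed,
    array('baduser_.jpg', 'user78.jpg'),
    array('mydomain.com/baduser_', 'mydomain.com/user78')
);

echo $result . PHP_EOL;
// The result would then be written once, e.g. with file_put_contents().
```

Writing the final string a single time also avoids leaving leftover bytes from the old contents, which can happen when a file opened in 'r+' mode receives less data than it held before.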
I have a JSON file badly formatted (doc1.json):
{"text":"xxx","user":{"id":96525997,"name":"ss"},"id":29005752194568192}
{"text":"yyy","user":{"id":32544632,"name":"cc"},"id":29005753951977472}
{...}{...}
And I have to change it in this:
{"u":[
{"text":"xxx","user":{"id":96525997,"name":"ss"},"id":29005752194568192},
{"text":"yyy","user":{"id":32544632,"name":"cc"},"id":29005753951977472},
{...},{...}
]}
Can I do this in a PHP file?
//Get the contents of file
$fileLocation = 'doc1.json';
$fileStr = file_get_contents($fileLocation);
//Make proper json: put a comma between adjacent objects,
//whether or not a line break separates them
$fileStr = preg_replace('/\}\s*\{/', '},{', trim($fileStr));
//Create new json
$fileStr = '{"u":[' . $fileStr . ']}';
//Insert the new string into the file
file_put_contents($fileLocation, $fileStr);
I would build the data structure you want from the file:
$file_path = '/path/to/file';
$array_from_file = file($file_path);
// set up object container
$obj = new StdClass;
$obj->u = array();
// iterate through lines from file
// load data into object container
foreach($array_from_file as $json) {
$line_obj = json_decode($json);
if(is_null($line_obj)) {
throw new Exception('We have some bad JSON here.');
} else {
$obj->u[] = $line_obj;
}
}
// encode to JSON
$json = json_encode($obj);
// overwrite existing file
// use 'w' mode to truncate file and open for writing
$fh = fopen($file_path, 'w');
// write JSON to file
$bytes_written = fwrite($fh, $json);
fclose($fh);
This assumes each of the JSON object representations in your original file is on a separate line.
I prefer this approach over string manipulation, as you can then have built in checks where you are decoding JSON to see if the input is valid JSON format that can be de-serialized. If the script operates successfully, this guarantees that your output will be able to be de-serialized by the caller to the script.
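A small variation on the same idea, working on a string rather than a file so it is easy to try out. It is only a sketch: wrapJsonLines() is a made-up name, and skipping blank lines and checking json_last_error() are my additions, not part of the code above.

```php
<?php
// Hypothetical helper: wrap newline-separated JSON objects in {"u":[...]}.
function wrapJsonLines(string $contents): string
{
    $wrapper = array('u' => array());

    foreach (preg_split('/\R/', $contents) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // tolerate blank lines, e.g. a trailing newline
        }

        $decoded = json_decode($line);
        // json_decode() also returns null for the literal "null",
        // so check the error state as well.
        if ($decoded === null && json_last_error() !== JSON_ERROR_NONE) {
            throw new InvalidArgumentException('Invalid JSON line: ' . $line);
        }

        $wrapper['u'][] = $decoded;
    }

    return json_encode($wrapper);
}

$input = '{"text":"xxx","user":{"id":96525997,"name":"ss"},"id":29005752194568192}' . "\n"
    . '{"text":"yyy","user":{"id":32544632,"name":"cc"},"id":29005753951977472}' . "\n";

echo wrapJsonLines($input) . PHP_EOL;
```

The large id values stay integers only on a 64-bit PHP build. On other builds json_decode() would need the JSON_BIGINT_AS_STRING option to avoid losing precision.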
Just wondering if anyone can point me in the direction of some tips / a script that will help me create an XML from an original CSV File, using PHP.
Cheers
This is quite easy to do: use fgetcsv to read the CSV file and DomDocument to write the XML file. This version uses the headers from the file as the element names in the XML document.
<?php
error_reporting(E_ALL | E_STRICT);
ini_set('display_errors', true);
ini_set('auto_detect_line_endings', true);
$inputFilename = 'input.csv';
$outputFilename = 'output.xml';
// Open csv to read
$inputFile = fopen($inputFilename, 'rt');
// Get the headers of the file
$headers = fgetcsv($inputFile);
// Create a new dom document with pretty formatting
$doc = new DomDocument();
$doc->formatOutput = true;
// Add a root node to the document
$root = $doc->createElement('rows');
$root = $doc->appendChild($root);
// Loop through each row creating a <row> node with the correct data
while (($row = fgetcsv($inputFile)) !== FALSE)
{
$container = $doc->createElement('row');
foreach($headers as $i => $header)
{
$child = $doc->createElement($header);
$child = $container->appendChild($child);
$value = $doc->createTextNode($row[$i]);
$value = $child->appendChild($value);
}
$root->appendChild($container);
}
$strxml = $doc->saveXML();
$handle = fopen($outputFilename, "w");
fwrite($handle, $strxml);
fclose($handle);
The code given above creates an XML document, but does not store it on any physical device. So replace echo $doc->saveXML(); with
$strxml = $doc->saveXML();
$handle = fopen($outputFilename, "w");
fwrite($handle, $strxml);
fclose($handle);
There are a number of sites out there that will do it for you.
If this is going to be a regular process rather than a one-time thing it may be ideal to just parse the CSV and output the XML yourself:
$csv = file("path/to/csv.csv");
foreach($csv as $line)
{
$data = explode(",", $line);
echo "<xmltag>".$data[0]."</xmltag>";
//etc...
}
Look up PHP's file and string functions.
Sorry to bring up an old thread but I tried this script and I'm getting a DOM Exception error
The headers of our CSV files are display_name, office_number and mobile_number, and the error I'm receiving is DOMDocument->createElement('\xEF\xBB\xBFdisplay_name') #1 {main}
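The bytes \xEF\xBB\xBF in front of display_name are a UTF-8 byte order mark, which some editors and spreadsheet exports put at the very start of a file. It ends up in the first header and is not valid at the start of an element name. A minimal sketch of stripping it before the headers are used (stripUtf8Bom() is a made-up helper name):

```php
<?php
// Hypothetical helper: remove a leading UTF-8 byte order mark, if present.
function stripUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";

    if (strncmp($text, $bom, strlen($bom)) === 0) {
        return substr($text, strlen($bom));
    }

    return $text;
}

// Headers as fgetcsv() would return them for a file that starts with a BOM.
$headers = array("\xEF\xBB\xBFdisplay_name", 'office_number', 'mobile_number');

// Only the first field can carry the BOM.
$headers[0] = stripUtf8Bom($headers[0]);

echo implode(', ', $headers) . PHP_EOL;
```

In the script above this would go right after the $headers = fgetcsv($inputFile); line.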
I had an interesting task today and couldn't find much on the subject. I wanted to share this, and ask for any suggestions on how this could have been done more elegantly. I consider myself a mediocre programmer who really wants to improve so any feedback is highly appreciated. There is also a strange bug I can't figure out. So here goes..and hopefully this helps someone who ever has to do something similar.
A client was redoing a site, moving content around, and had a couple thousand redirects that needed to be made. Marketing sent me an XLS with old URLs in one column, new URLs in the next. These were the actions I took:
Saved the XLS as CSV
Wrote a script which:
Formatted the list as valid 301 redirects
Exported the list to a text file
I then copy / pasted all the new directives into my .htaccess file.
Then, I wrote another script that checked to make sure each of the new links was valid (no 404s). The first script worked exactly as expected. For some reason, I can get the second script to print out all the 404 errors (there were several), but the script doesn't die when it finishes traversing the loop, and it doesn't write to the file, it just hangs in command line. No errors get reported. Any idea what's going on? Here is the code for both scripts:
Formatting 301s:
<?php
$source = "301.csv";
$output = "301.txt";
//grab the contents of the source file as an array, prepare the output file for writing
$sourceArray = file($source);
$handleOutput = fopen($output, "w");
//Set the strings we want to replace in an array. The first array holds the strings to search for and the second holds their replacements
$originalLines = array(
'http://hipaasecurityassessment.com',
','
);
$replacementStrings = array(
'',
' '
);
//Split each item from the array into two strings, one which occurs before the comma and the other which occurs after
function setContent($sourceArray, $originalLines = array(), $replacementStrings = array()){
$outputArray = array();
$text = 'redirect 301 ';
foreach ($sourceArray as $number => $item){
$pattern = '/[,]/';
$item = preg_split($pattern, $item);
$item = array(
$item[0],
preg_replace('#"#', '', $item[1])
);
$item = implode(' ', $item);
$item = str_replace($originalLines, $replacementStrings, $item);
array_push($outputArray,$text,$item);
}
$outputString = implode('', $outputArray);
return $outputString;
}
//Invoke the set content function
$outputString = setContent($sourceArray, $originalLines, $replacementStrings);
//Finally, write to the text file!
fwrite($handleOutput, $outputString);
Checking for 404s:
<?php
$source = "301.txt";
$output = "print404.txt";
//grab the contents of the source file as an array, prepare the output file for writing
$sourceArray = file($source);
$handleOutput = fopen($output, "w");
//Split each item from the array into two strings, one which occurs before the space and the other which occurs after
function getUrls($sourceArray = array()){
$outputArray = array();
foreach ($sourceArray as $number => $item){
$item = str_replace('redirect 301', '', $item);
$pattern = '#[ ]+#';
$item = preg_split($pattern, $item);
$item = array(
$item[0],
$item[1],
$item[2]
);
array_push($outputArray, $item[2]);
}
return $outputArray;
}
//Check each URL for a 404 error via a curl request
function check404($url = array(), $handleOutput){
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
$content = curl_exec( $handle );
$response = curl_getinfo( $handle );
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 404) {
//fwrite($handleOutput, $url);
print $url;
}
};
$outputArray = getUrls($sourceArray);
foreach ($outputArray as $url)
{
$errors = check404($url, $handleOutput);
}
You should have used fgetcsv() for generating the original URL list. This splits up CSV files into an array, simplifying the transformation.
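Roughly what I mean, as a sketch only. It uses str_getcsv(), the string counterpart of fgetcsv(), so it can run on the array that file() returns. The function name buildRedirects() and the example.com URLs are invented, and stripping the host from both columns just mirrors what the original script does.

```php
<?php
// Hypothetical helper: turn "old,new" CSV lines into redirect directives.
function buildRedirects(array $csvLines, string $host): string
{
    $out = '';

    foreach ($csvLines as $line) {
        // str_getcsv() handles the quoting, so no manual quote removal is needed.
        $fields = str_getcsv(trim($line), ',', '"', '\\');
        if (count($fields) < 2) {
            continue; // skip blank or incomplete lines
        }

        // Same host stripping as the original script, applied to both URLs.
        $old = str_replace($host, '', $fields[0]);
        $new = str_replace($host, '', $fields[1]);

        $out .= 'redirect 301 ' . $old . ' ' . $new . "\n";
    }

    return $out;
}

// Made-up sample rows in place of file('301.csv').
$rows = array(
    'http://example.com/old-page,"http://example.com/new-page"' . "\n",
    "\n",
    'http://example.com/about.html,http://example.com/about/' . "\n",
);

echo buildRedirects($rows, 'http://example.com');
```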
I can't say anything about the 404s or the cause of the hang. But using the wacky curl functions is almost always a bad sign. For testing purposes I would have used a command-line tool like wget instead, so the results can be checked manually.
But maybe you could try PHP's own get_headers() instead. It returns the raw response headers. Note that it follows redirects by default, so the result can contain more than one status line.
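A sketch of how that could look. The helper names are made up. Only the header parsing is exercised in the example, since the get_headers() call itself needs network access. Taking the last status line is my choice, on the assumption that you care about where a redirect chain finally lands.

```php
<?php
// Hypothetical helper: find the status code of the last response
// in a header list as returned by get_headers().
function lastStatusCode(array $headers): ?int
{
    $code = null;

    foreach ($headers as $header) {
        // Status lines look like "HTTP/1.1 404 Not Found".
        if (is_string($header) && preg_match('#^HTTP/\S+\s+(\d{3})#', $header, $m)) {
            $code = (int) $m[1];
        }
    }

    return $code;
}

// Hypothetical wrapper: true if the final response for $url is a 404.
function isNotFound(string $url): bool
{
    $headers = @get_headers($url);

    return $headers !== false && lastStatusCode($headers) === 404;
}

// Hand-written header list, shaped like a redirect followed by a 404.
$sampleHeaders = array(
    'HTTP/1.1 301 Moved Permanently',
    'Location: /new-page',
    'HTTP/1.1 404 Not Found',
    'Content-Type: text/html',
);

var_dump(lastStatusCode($sampleHeaders));
```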
I'm trying to write a parser for a xml postback listener, but can't seem to get it to dump the xml for a sample. The API support guy told me to use 'DOMDocument', maybe 'SimpleXML'? Anyways here's the code: (thanks!)
<?php
$xml_document = file_get_contents('php://input');
$doc = new DOMDocument();
$doc->loadXML($xml_document);
$doc->save("test2/" . time() . ".sample.xml");
?>
How about using this to create an XML file?
/**
* Will output in a similar form to print_r, but the nodes are xml so can be collapsed in browsers
*
* @param mixed $mixed
*/
function print_r_xml($mixed)
{
// capture the output of print_r
$out = print_r($mixed, true);
// Replace the root item with a struct
// MATCH : '<start>element<newline> ('
$root_pattern = '/[ \t]*([a-z0-9 \t_]+)\n[ \t]*\(/i';
$root_replace_pattern = '<struct name="root" type="\\1">';
$out = preg_replace($root_pattern, $root_replace_pattern, $out, 1);
// Replace array and object items structs
// MATCH : '[element] => <newline> ('
$struct_pattern = '/[ \t]*\[([^\]]+)\][ \t]*\=\>[ \t]*([a-z0-9 \t_]+)\n[ \t]*\(/miU';
$struct_replace_pattern = '<struct name="\\1" type="\\2">';
$out = preg_replace($struct_pattern, $struct_replace_pattern, $out);
// replace ')' on its own on a new line (surrounded by whitespace is ok) with '</struct>'
$out = preg_replace('/^\s*\)\s*$/m', '</struct>', $out);
// Replace simple key=>values with vars
// MATCH : '[element] => value<newline>'
$var_pattern = '/[ \t]*\[([^\]]+)\][ \t]*\=\>[ \t]*([a-z0-9 \t_\S]+)/i';
$var_replace_pattern = '<var name="\\1">\\2</var>';
$out = preg_replace($var_pattern, $var_replace_pattern, $out);
$out = trim($out);
$out='<?xml version="1.0"?><data>'.$out.'</data>';
return $out;
}
In my application I passed all of the $_POST variables to it:
$handle = fopen("data.xml", "w+");
$content = print_r_xml($_POST);
fwrite($handle,$content);
fclose($handle);
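One caveat with the print_r approach is that values containing characters like < or & go into the output unescaped. As an alternative sketch, not a drop-in replacement, the same struct/var layout can be built directly from the array with escaping applied. arrayToXml() is a made-up name and the sample data stands in for $_POST.

```php
<?php
// Hypothetical helper: build the same <struct>/<var> layout directly
// from an array, escaping names and values along the way.
function arrayToXml(array $data, string $name = 'root'): string
{
    $flags = ENT_QUOTES | ENT_XML1;
    $xml = '<struct name="' . htmlspecialchars($name, $flags) . '">';

    foreach ($data as $key => $value) {
        if (is_array($value)) {
            // Nested arrays become nested structs.
            $xml .= arrayToXml($value, (string) $key);
        } else {
            $xml .= '<var name="' . htmlspecialchars((string) $key, $flags) . '">'
                . htmlspecialchars((string) $value, $flags)
                . '</var>';
        }
    }

    return $xml . '</struct>';
}

// Made-up sample in place of $_POST.
$post = array('name' => 'a<b', 'tags' => array('x', 'y'));

$content = '<?xml version="1.0"?><data>' . arrayToXml($post) . '</data>';
echo $content . PHP_EOL;
```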