Remove whitespace from XML file with PHP - php

I have a script that appends XML data to the end of an XML file via PHP. The only problem is that after each new line of XML I add via the PHP script, an extra blank line (whitespace) is created. Is there a way to remove the whitespace from the XML file with PHP without losing the neatly formatted XML? Here is my PHP code that writes to the XML file:
<?php
function formatXmlString($xml) {
// add marker linefeeds to aid the pretty-tokeniser (adds a linefeed between all tag-end boundaries)
$xml = preg_replace('/(>)(<)(\/*)/', "$1\n$2$3", $xml);
// now indent the tags
$token = strtok($xml, "\n");
$result = ''; // holds formatted version as it is built
$pad = 0; // initial indent
$matches = array(); // returns from preg_matches()
// scan each line and adjust indent based on opening/closing tags
while ($token !== false) :
// test for the various tag states
// 1. open and closing tags on same line - no change
if (preg_match('/.+<\/\w[^>]*>$/', $token, $matches)) :
$indent=0;
// 2. closing tag - outdent now
elseif (preg_match('/^<\/\w/', $token, $matches)) :
$pad=0;
// 3. opening tag - don't pad this one, only subsequent tags
elseif (preg_match('/^<\w[^>]*[^\/]>.*$/', $token, $matches)) :
$indent=4;
// 4. no indentation needed
else :
$indent = 0;
endif;
// pad the line with the required number of leading spaces
$line = str_pad($token, strlen($token)+$pad, ' ', STR_PAD_LEFT);
$result .= $line . "\n"; // add to the cumulative result, with linefeed
$token = strtok("\n"); // get the next token
$pad += $indent; // update the pad size for subsequent lines
endwhile;
return $result;
}
function append_xml($file, $content, $sibling, $single = false) {
$doc = file_get_contents($file);
if ($single) {
$pos = strrpos($doc, "<$sibling");
$pos = strpos($doc, ">", $pos) + 1;
}
else {
$pos = strrpos($doc, "</$sibling>") + strlen("</$sibling>");
}
return file_put_contents($file, substr($doc, 0, $pos) . "\n$content" . substr($doc, $pos));
}
$content = "<product><id>3</id><name>Product 3</name><price>63.00</price></product>";
append_xml('prudcts.xml', formatXmlString($content), 'url');
?>

Don't put everything on one line; splitting the statement up gives you more flexibility:
return file_put_contents($file, substr($doc, 0, $pos) . "\n$content" . substr($doc, $pos));
Instead (suggestion):
$buffer = substr($doc, 0, $pos) . "\n$content" . substr($doc, $pos);
$buffer = rtrim($buffer);
return file_put_contents($file, $buffer);
P.S.: Using DOMDocument might be more straightforward and safer for XML processing than the string functions.
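As a rough illustration of the DOMDocument suggestion, here is a minimal sketch. The helper name appendXmlFragment is hypothetical, and it assumes the new element should be appended to the root element rather than inserted after a particular sibling as append_xml() does. Setting preserveWhiteSpace to false before loading drops the existing indentation-only text nodes, so stray blank lines should not accumulate between saves.

```php
<?php
// Hypothetical helper (not from the original post): appends an XML fragment
// to the root element and lets DOMDocument re-indent the whole document.
function appendXmlFragment(string $xml, string $fragmentXml): string
{
    $doc = new DOMDocument();
    $doc->preserveWhiteSpace = false; // must be set before loading
    $doc->formatOutput = true;        // indent the output on save
    $doc->loadXML($xml);

    // Parse the new markup as a fragment owned by this document.
    $fragment = $doc->createDocumentFragment();
    $fragment->appendXML($fragmentXml);
    $doc->documentElement->appendChild($fragment);

    return $doc->saveXML();
}

$existing = "<?xml version=\"1.0\"?>\n<products>\n    <product><id>1</id></product>\n\n</products>";
$out = appendXmlFragment($existing, '<product><id>3</id><name>Product 3</name></product>');
echo $out;
```

To work on a file, the string could be read with file_get_contents() and the result written back with file_put_contents().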

Instead of appending new data to $result and then a newline, do the reverse.
Use something like if (!empty($result)) { $result .= "\n"; } to avoid beginning the XML data with a newline.

Related

PHP - After getting a csv file parsed into an array, can not match two exact strings [duplicate]

Using PHP5 (cgi) to output template files from the filesystem and having issues spitting out raw HTML.
private function fetch($name) {
$path = $this->j->config['template_path'] . $name . '.html';
if (!file_exists($path)) {
dbgerror('Could not find the template "' . $name . '" in ' . $path);
}
$f = fopen($path, 'r');
$t = fread($f, filesize($path));
fclose($f);
if (substr($t, 0, 3) == b'\xef\xbb\xbf') {
$t = substr($t, 3);
}
return $t;
}
Even though I've added the BOM fix I'm still having problems with Firefox accepting it. You can see a live copy here: http://ircb.in/jisti/ (and the template file I threw at http://ircb.in/jisti/home.html if you want to check it out)
Any idea how to fix this? o_o
You can use the following code to remove the UTF-8 BOM:
//Remove UTF8 Bom
function remove_utf8_bom($text)
{
$bom = pack('H*','EFBBBF');
$text = preg_replace("/^$bom/", '', $text);
return $text;
}
Try:
// -------- read the file-content ----
$str = file_get_contents($source_file);
// -------- remove the utf-8 BOM ----
$str = str_replace("\xEF\xBB\xBF",'',$str);
// -------- get the Object from JSON ----
$obj = json_decode($str);
:)
Another way to remove the BOM, which is the Unicode code point U+FEFF:
$str = preg_replace('/\x{FEFF}/u', '', $file);
b'\xef\xbb\xbf' stands for the literal string "\xef\xbb\xbf". If you want to check for a BOM, you need to use double quotes, so the \x sequences are actually interpreted into bytes:
"\xef\xbb\xbf"
Your files also seem to contain a lot more garbage than just a single leading BOM:
$ curl http://ircb.in/jisti/ | xxd
0000000: efbb bfef bbbf efbb bfef bbbf efbb bfef ................
0000010: bbbf efbb bf3c 2144 4f43 5459 5045 2068 .....<!DOCTYPE h
0000020: 746d 6c3e 0a3c 6874 6d6c 3e0a 3c68 6561 tml>.<html>.<hea
...
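The two points above can be checked with a short sketch: the single-quoted form keeps the backslash sequences as literal characters, while the double-quoted form yields the three BOM bytes. The helper stripLeadingBoms is a hypothetical name; it removes every BOM at the start of the string, which is what the repeated efbb bf sequence in the dump calls for.

```php
<?php
// Single quotes: the backslashes stay literal, so this is 12 characters long.
$literal = '\xef\xbb\xbf';
// Double quotes: each \x sequence becomes one byte, so this is the 3-byte BOM.
$bom = "\xef\xbb\xbf";

// Hypothetical helper: strips all BOMs at the start, not only the first one.
function stripLeadingBoms(string $text): string
{
    $bom = "\xef\xbb\xbf";
    while (strncmp($text, $bom, 3) === 0) {
        $text = substr($text, 3);
    }
    return $text;
}

var_dump(strlen($literal)); // int(12)
var_dump(strlen($bom));     // int(3)
var_dump(stripLeadingBoms($bom . $bom . '<!DOCTYPE html>')); // string(15) "<!DOCTYPE html>"
```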
If anybody is importing CSV files, the code below may be useful:
$header = fgetcsv($handle);
foreach($header as $key=> $val) {
$bom = pack('H*','EFBBBF');
$val = preg_replace("/^$bom/", '', $val);
$header[$key] = $val;
}
This global function solves it for systems that use UTF-8 as the base charset. Thanks!
function prepareCharset($str) {
// set default encode
mb_internal_encoding('UTF-8');
// pre filter
if (empty($str)) {
return $str;
}
// get charset
$charset = mb_detect_encoding($str, array('ISO-8859-1', 'UTF-8', 'ASCII'));
if (stristr($charset, 'utf') || stristr($charset, 'iso')) {
$str = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', utf8_decode($str));
} else {
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}
// remove BOM
$str = urldecode(str_replace("%C2%81", '', urlencode($str)));
// prepare string
return $str;
}
An extra method to do the same job:
function remove_utf8_bom_head($text) {
if(substr(bin2hex($text), 0, 6) === 'efbbbf') {
$text = substr($text, 3);
}
return $text;
}
The other methods I found did not work in my case.
Hope it helps in some special case.
A solution without pack function:
$a = "\xEF\xBB\xBF1"; // BOM followed by "1"
var_dump($a); // string(4) "1"
function deleteBom($text)
{
return preg_replace("/^\xEF\xBB\xBF/", '', $text);
}
var_dump(deleteBom($a)); // string(1) "1"
I'm not so fond of using preg_replace or preg_match for simple tasks. What about this alternative method of detecting and removing the BOM?
function remove_utf8_bom(string $text): string
{
$bomStart = mb_substr($text, 0, 1);
return ($bomStart == pack('H*','EFBBBF')) ?
mb_substr($text, 1) :
$text;
}
If you are reading some API using file_get_contents and get an inexplicable NULL from json_decode, check the value of json_last_error(): sometimes the value returned from file_get_contents has an extraneous BOM that is almost invisible when you inspect the string, but that makes json_last_error() return JSON_ERROR_SYNTAX (4).
>>> $json = file_get_contents("http://api-guiaserv.seade.gov.br/v1/orgao/all");
=> "\t{"orgao":[{"Nome":"Tribunal de Justi\u00e7a","ID_Orgao":"59","Condicao":"1"}, ...]}"
>>> json_decode($json);
=> null
>>>
In this case, check the first 3 bytes - echoing them is not very useful because the BOM is invisible on most settings:
>>> substr($json, 0, 3)
=> " "
>>> substr($json, 0, 3) == pack('H*','EFBBBF');
=> true
>>>
If the line above returns TRUE for you, then a simple test may fix the problem:
>>> json_decode($json[0] == "{" ? $json : substr($json, 3))
=> {#204
+"orgao": [
{#203
+"Nome": "Tribunal de Justiça",
+"ID_Orgao": "59",
+"Condicao": "1",
},
],
...
}
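The check above can be wrapped in a small function. This is only a sketch: the name decodeJsonIgnoringBom is hypothetical, and it compares against the BOM bytes directly instead of testing whether the first character is "{", so that JSON starting with "[" is handled as well.

```php
<?php
// Hypothetical helper: strips one leading UTF-8 BOM, then decodes the JSON.
function decodeJsonIgnoringBom(string $json)
{
    if (strncmp($json, "\xEF\xBB\xBF", 3) === 0) {
        $json = substr($json, 3);
    }

    $data = json_decode($json, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
    }
    return $data;
}

$withBom = "\xEF\xBB\xBF" . '{"orgao":[{"ID_Orgao":"59"}]}';
var_dump(json_decode($withBom));            // NULL, because of the BOM
$decoded = decodeJsonIgnoringBom($withBom); // decodes to an array
```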
With faulty software, it can happen that another BOM is added every time the file is saved.
So I am using this to get rid of all of them.
function remove_utf8_bom($text) {
$bom = pack('H*','EFBBBF');
while (preg_match("/^$bom/", $text)) {
$text = preg_replace("/^$bom/", '', $text);
}
return $text;
}
How about this:
function removeUTF8BomHeader($data) {
if (substr($data, 0, 3) == pack('CCC', 0xef, 0xbb, 0xbf)) {
$data = substr($data, 3);
}
return $data;
}
I tested it a lot, and it works perfectly without any issues.

Compressed HTML in Laravel still contains whitespace

I am using the below code to compress HTML output in Laravel; I put it in the filters/App::after function
if (App::Environment() != 'local') {
if ($response instanceof Illuminate\Http\Response) {
$output = $response->getOriginalContent();
// Clean comments
$output = preg_replace('/<!--([^\[|(<!)].*)/', '', $output);
$output = preg_replace('/(?<!\S)\/\/\s*[^\r\n]*/', '', $output);
// Clean Whitespace
$output = preg_replace('/\s{2,}/', '', $output);
$output = preg_replace('/(\r?\n)/', '', $output);
$response->setContent($output);
}
}
But with character "à" I get this symbol: � instead. I tried to remove the line:
$output = preg_replace('/\s{2,}/', '', $output);
It fixed the symbol error, but the HTML output does not work perfectly (some white space was not removed). Can anyone help me?
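One possible cause, offered here as an assumption since the thread has no accepted answer: without the u modifier, PCRE processes the string byte by byte, and on some setups \s also matches the byte 0xA0, which is the second byte of "à" in UTF-8 (0xC3 0xA0). Removing that byte leaves an invalid sequence that browsers render as �. A minimal sketch of the two whitespace steps with the u modifier added (collapseWhitespace is a hypothetical helper name):

```php
<?php
// Hypothetical helper mirroring the two whitespace steps from the question,
// with the u modifier so the patterns operate on UTF-8 characters, not bytes.
function collapseWhitespace(string $html): string
{
    // preg_replace() returns null on failure (e.g. invalid UTF-8 input),
    // in which case the previous value is kept unchanged.
    $html = preg_replace('/\s{2,}/u', '', $html) ?? $html;
    $html = preg_replace('/\r?\n/u', '', $html) ?? $html;
    return $html;
}

$out = collapseWhitespace("<p>voilà</p>\n    <p>déjà  vu</p>");
echo $out;
```

Note that replacing runs of whitespace with an empty string also joins words that were separated by two or more spaces, exactly as the original pattern does.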

PHP save file on IE Vista, adding newlines

I tried to format the XML output in a PHP function with formatOutput = true, but that didn't work, so I want to do it with a function instead. I found two different scripts for that, but they both have the same issue: they do the indentation, but the newline "\n" doesn't appear in the file. Is there a different way to get the newline?
PHP script
<?php
ini_set('display_errors', 1);
error_reporting(E_ALL);
function make_update( $nodeid, $name, $top, $left, $width, $height ) {
$nodes = new SimpleXMLElement('linkcards.xml', null, true);
$returnArray = $nodes->xpath("//LINKCARD[@ID='$nodeid']");
$node = $returnArray[0];
$node->NAME = $name;
$node->TOP = $top;
$node->LEFT = $left;
$node->WIDTH = $width;
$node->HEIGHT = $height;
$nodes->asXML('linkcards.xml');
$formatted = formatXmlString($nodes->asXML());
$file = fopen ('linkcards.xml', "w");
fwrite($file, $formatted);
fclose ($file);
}
echo make_update(trim($_REQUEST['nodeid']),trim($_REQUEST['name']),trim($_REQUEST['top']),trim($_REQUEST['left']),trim($_REQUEST['width']),trim($_REQUEST['height']));
function formatXmlString($xml) {
// add marker linefeeds to aid the pretty-tokeniser (adds a linefeed between all tag-end boundaries)
$xml = preg_replace('/(>)(<)(\/*)/', "$1\n$2$3", $xml);
// now indent the tags
$token = strtok($xml, "\n");
$result = ''; // holds formatted version as it is built
$pad = 0; // initial indent
$matches = array(); // returns from preg_matches()
// scan each line and adjust indent based on opening/closing tags
while ($token !== false) :
// test for the various tag states
// 1. open and closing tags on same line - no change
if (preg_match('/.+<\/\w[^>]*>$/', $token, $matches)) :
$indent=0;
// 2. closing tag - outdent now
elseif (preg_match('/^<\/\w/', $token, $matches)) :
$pad--;
// 3. opening tag - don't pad this one, only subsequent tags
elseif (preg_match('/^<\w[^>]*[^\/]>.*$/', $token, $matches)) :
$indent=1;
// 4. no indentation needed
else :
$indent = 0;
endif;
// pad the line with the required number of leading spaces
$line = str_pad($token, strlen($token)+$pad, ' ', STR_PAD_LEFT);
$result .= $line . "\n"; // add to the cumulative result, with linefeed
$token = strtok("\n"); // get the next token
$pad += $indent; // update the pad size for subsequent lines
endwhile;
return $result;
}
?>
Found the answer. If I write "\r\n" for the newline, it works!!!!!!
$result .= $line . "\r\n";
Using that function instead of formatOutput = true might be a useful workaround for anyone else who has trouble formatting the XML output.
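For comparison, a common reason formatOutput appears to do nothing is that the document was loaded with its existing whitespace preserved. The sketch below is an assumption about the setup, not the poster's code: it edits a node with SimpleXML, then re-parses the result with DOMDocument with preserveWhiteSpace disabled before saving. The sample XML and the root element name LINKCARDS are made up for illustration.

```php
<?php
// Made-up sample data standing in for linkcards.xml.
$nodes = new SimpleXMLElement(
    '<LINKCARDS><LINKCARD ID="1"><NAME>old</NAME><TOP>0</TOP></LINKCARD></LINKCARDS>'
);

// Attribute tests in XPath use @, e.g. [@ID='1'].
$matches = $nodes->xpath("//LINKCARD[@ID='1']");
$matches[0]->NAME = 'new';

// Re-parse with DOMDocument so the output can be re-indented.
$dom = new DOMDocument('1.0');
$dom->preserveWhiteSpace = false; // set before loading
$dom->formatOutput = true;
$dom->loadXML($nodes->asXML());

$formatted = $dom->saveXML();
echo $formatted;
```

The formatted string could then be written with file_put_contents(), or $dom->save() could be called with a file path.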

Find linebreaks in a docx file using PHP

My PHP script successfully reads all text from a .docx file, but I cannot figure out where the line breaks should be, so the text ends up bunched together and hard to read (one huge paragraph). I have manually gone through all of the XML files, but I still cannot work it out.
Here are the functions I use to retrieve the file data and return the plain text.
public function read($FilePath)
{
// Save name of the file
parent::SetDocName($FilePath);
$Data = $this->docx2text($FilePath);
$Data = str_replace("<", "&lt;", $Data);
$Data = str_replace(">", "&gt;", $Data);
$Breaks = array("\r\n", "\n", "\r");
$Data = str_replace($Breaks, '<br />', $Data);
$this->Content = $Data;
}
function docx2text($filename) {
return $this->readZippedXML($filename, "word/document.xml");
}
function readZippedXML($archiveFile, $dataFile)
{
// Create new ZIP archive
$zip = new ZipArchive;
// Open received archive file
if (true === $zip->open($archiveFile))
{
// If done, search for the data file in the archive
if (($index = $zip->locateName($dataFile)) !== false)
{
// If found, read it to the string
$data = $zip->getFromIndex($index);
// Close archive file
$zip->close();
// Load XML from a string
// Skip errors and warnings
// loadXML() is an instance method, so create the document first
$xml = new DOMDocument();
$xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
$xmldata = $xml->saveXML();
//$xmldata = str_replace("</w:t>", "\r\n", $xmldata);
// Return data without XML formatting tags
return strip_tags($xmldata);
}
$zip->close();
}
// In case of failure return empty string
return "";
}
It is actually quite a simple answer. All you need to do is add this line in readZippedXML():
$xmldata = str_replace("</w:p>", "\r\n", $xmldata);
This is because </w:p> is what Word uses to mark the end of a paragraph. E.g.
<w:p>This is a paragraph.</w:p>
<w:p>And a second one.</w:p>
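The same idea can be sketched with DOM calls instead of string replacement. In real files the text sits in w:t elements nested inside each w:p, so this hypothetical helper (docxXmlToText is not part of the original code) joins the w:t text of every paragraph and separates paragraphs with "\r\n". It assumes the standard WordprocessingML namespace and ignores tabs, explicit line breaks, and other run content.

```php
<?php
// Hypothetical helper: returns the text of word/document.xml, one line per paragraph.
function docxXmlToText(string $documentXml): string
{
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $doc = new DOMDocument();
    $doc->loadXML($documentXml);

    $lines = [];
    foreach ($doc->getElementsByTagNameNS($ns, 'p') as $paragraph) {
        $text = '';
        // Concatenate every text run that belongs to this paragraph.
        foreach ($paragraph->getElementsByTagNameNS($ns, 't') as $run) {
            $text .= $run->textContent;
        }
        $lines[] = $text;
    }
    return implode("\r\n", $lines);
}

$sample = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>'
    . '<w:p><w:r><w:t>This is a paragraph.</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>And a </w:t></w:r><w:r><w:t>second one.</w:t></w:r></w:p>'
    . '</w:body></w:document>';
$text = docxXmlToText($sample);
echo $text;
```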
Actually, why don't you use OpenXML? I think it works with PHP too, and then you don't have to go down to the nitty-gritty XML details of the file.
Here is a link:
http://openxmldeveloper.org/articles/4606.aspx

Dump XML Posts from 'php://input' to file

I'm trying to write a parser for an XML postback listener, but I can't seem to get it to dump the XML for a sample. The API support guy told me to use 'DOMDocument', or maybe 'SimpleXML'? Anyway, here's the code: (thanks!)
<?php
$xml_document = file_get_contents('php://input');
$doc = new DOMDocument();
$doc->loadXML($xml_document);
$doc->save("test2/" . time() . ".sample.xml");
?>
How about using this to create an XML file?
/**
* Will output in a similar form to print_r, but the nodes are xml so can be collapsed in browsers
*
* @param mixed $mixed
*/
function print_r_xml($mixed)
{
// capture the output of print_r
$out = print_r($mixed, true);
// Replace the root item with a struct
// MATCH : '<start>element<newline> ('
$root_pattern = '/[ \t]*([a-z0-9 \t_]+)\n[ \t]*\(/i';
$root_replace_pattern = '<struct name="root" type="\\1">';
$out = preg_replace($root_pattern, $root_replace_pattern, $out, 1);
// Replace array and object items structs
// MATCH : '[element] => <newline> ('
$struct_pattern = '/[ \t]*\[([^\]]+)\][ \t]*\=\>[ \t]*([a-z0-9 \t_]+)\n[ \t]*\(/miU';
$struct_replace_pattern = '<struct name="\\1" type="\\2">';
$out = preg_replace($struct_pattern, $struct_replace_pattern, $out);
// replace ')' on its own on a new line (surrounded by whitespace is ok) with '</struct>'
$out = preg_replace('/^\s*\)\s*$/m', '</struct>', $out);
// Replace simple key=>values with vars
// MATCH : '[element] => value<newline>'
$var_pattern = '/[ \t]*\[([^\]]+)\][ \t]*\=\>[ \t]*([a-z0-9 \t_\S]+)/i';
$var_replace_pattern = '<var name="\\1">\\2</var>';
$out = preg_replace($var_pattern, $var_replace_pattern, $out);
$out = trim($out);
$out='<?xml version="1.0"?><data>'.$out.'</data>';
return $out;
}
In my application, I passed all of the $_POST variables to it:
$handle = fopen("data.xml", "w+");
$content = print_r_xml($_POST);
fwrite($handle, $content);
fclose($handle);
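For the postback listener itself, a minimal sketch that stays with DOMDocument might look like the following. The function name dumpXmlPost and the idea of returning the written path are assumptions made for illustration; the raw body is passed in as a string so the function can be tried outside a web request.

```php
<?php
// Hypothetical helper: validates the raw XML and saves it to a timestamped file.
// Returns the path that was written, or null if the body is empty or not XML.
function dumpXmlPost(string $rawXml, string $dir): ?string
{
    if ($rawXml === '') {
        return null; // nothing was posted
    }

    $doc = new DOMDocument();
    if (!@$doc->loadXML($rawXml)) {
        return null; // body is not well-formed XML
    }

    $path = rtrim($dir, '/') . '/' . time() . '.sample.xml';
    return $doc->save($path) !== false ? $path : null;
}

// In the listener this would be: dumpXmlPost(file_get_contents('php://input'), 'test2');
$saved = dumpXmlPost('<order><id>1</id></order>', sys_get_temp_dir());
```

The target directory has to exist and be writable by the web server user, which is worth checking if nothing appears on disk.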
