How to prevent XMLWriter from appending blank line to outputted XML file? - php

The following code creates an XML file, but the last line is blank which causes problems when validated.
How can I change the following code so that the outputted file does not have a blank line at the end of it?
<?php
$xmlFileName = 'testoutput.xml';
$xml = new XMLWriter;
$xml->openURI($xmlFileName);
$xml->startDocument('1.0', 'UTF-8');
$xml->setIndent(1);
$xml->startElement('name');
$xml->text('jim');
$xml->endElement();
$xml->endDocument();
$xml->flush();
?>
@DavidRR, the validation problem appears when I validate the XML file with the following code; it tells me that there is "extra content at the end of the document":
$schema = 'test.xsd';
$files[] = 'test1.xml';
$files[] = 'test2.xml';
foreach ($files as $file) {
validateXml($file, $schema);
}
function validateXml($xmlFile, $xsdFile) {
$dom = new DOMDocument;
$dom->load($xmlFile);
libxml_use_internal_errors(true); // enable user error handling
echo "Validating <b>$xmlFile</b> with <b>$xsdFile</b>:";
if ($dom->schemaValidate($xsdFile)) {
echo '<div style="margin-left:20px">ok</div>';
} else {
$errors = libxml_get_errors();
if (count($errors) > 0) {
echo '<ul style="color:red">';
foreach ($errors as $error) {
//var_dump($error);
echo '<li>' . $error->message . '</li>';
}
echo '</ul>';
}
libxml_clear_errors();
echo '</span>';
libxml_use_internal_errors(false); // restore default libxml error handling
}
}

Reported problem: Because of the presence of a blank line at the end of an XML file, a schema validation attempt on the file results in the error:
"Extra content at the end of the document"
I'm not able to reproduce your stated problem at codepad, PHP version 5.4-dev, or any of the earlier versions of PHP on that site. I'm including my edited version of your code here as well. (My version includes functions to create the simple XSD and XML files under examination.)
Possibility: Could your problem be related to the version of PHP that you are using?
If I haven't accurately tested your scenario with my adaptation of your code, please further modify my code to precipitate the problem.
<?php
$xsdFile = sys_get_temp_dir() . '/test1.xsd';
$xmlFile = sys_get_temp_dir() . '/test1.xml';
createXsdFile($xsdFile);
createXmlFile($xmlFile);
$files[] = $xmlFile;
foreach ($files as $file) {
validateXml($file, $xsdFile);
}
function validateXml($xmlFile, $xsdFile) {
$dom = new DOMDocument;
$dom->load($xmlFile);
libxml_use_internal_errors(true); // enable user error handling
echo "Validating <b>$xmlFile</b> with <b>$xsdFile</b>:";
if ($dom->schemaValidate($xsdFile)) {
echo '<div style="margin-left:20px">ok</div>';
} else {
$errors = libxml_get_errors();
if (count($errors) > 0) {
echo '<ul style="color:red">';
foreach ($errors as $error) {
//var_dump($error);
echo '<li>' . $error->message . '</li>';
}
echo '</ul>';
}
libxml_clear_errors();
echo '</span>';
libxml_use_internal_errors(false); // restore default libxml error handling
}
}
function createXsdFile($xsdFile) {
$file = fopen($xsdFile, 'w');
fwrite($file, "<?xml version='1.0' encoding='utf-8'?>\n");
fwrite($file, "<schema xmlns='http://www.w3.org/2001/XMLSchema'>\n");
fwrite($file, "<element name='name' type='string' />\n");
fwrite($file, "</schema>\n");
fclose($file);
}
//
// Appends a blank line at the end of the XML file.
// Does this cause a schema validation problem?
//
function createXmlFile($xmlFile) {
$xml = new XMLWriter;
$xml->openURI($xmlFile);
$xml->startDocument('1.0', 'UTF-8');
$xml->setIndent(1);
$xml->startElement('name');
$xml->text('jim');
$xml->endElement();
$xml->endDocument();
$xml->flush();
}
?>
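To illustrate why a trailing blank line alone is unlikely to be the culprit, here is a small sketch. The XML grammar allows whitespace after the root element, so a parser should accept it, whereas a second top-level element is rejected. The `parseErrors` helper and both sample strings are invented for this illustration, and a second root element is only one possible cause of the reported message, not something confirmed by the question.

```php
<?php
// Hypothetical helper: parse a string and return any libxml error messages.
function parseErrors(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->loadXML($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting
    return $messages;
}

// Sample 1: well-formed document followed by a blank line.
$withBlankLine = "<?xml version=\"1.0\"?>\n<name>jim</name>\n\n";

// Sample 2: two top-level elements, which is not well-formed XML.
$twoRoots = "<?xml version=\"1.0\"?>\n<name>jim</name>\n<name>bob</name>\n";

$blankLineErrors = parseErrors($withBlankLine); // expected to be empty
$twoRootErrors   = parseErrors($twoRoots);      // expected to contain at least one error

print_r($blankLineErrors);
print_r($twoRootErrors);
```

If the second case matches what is in the failing files, the fix belongs in the code that writes the elements rather than in trimming the file.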

I have found no way to change the behavior of XMLWriter in that regard. A possible fix would be to read the file, trim it, and then write it back to the file, e.g.
file_put_contents($xmlFileName, trim(file_get_contents($xmlFileName)));
An alternative would be to ftruncate the file
ftruncate(fopen($xmlFileName, 'r+'), filesize($xmlFileName) - strlen(PHP_EOL));
The latter assumes the file ends with a platform-dependent newline. If it doesn't, truncating will likely break the file. The trim version is more robust in that regard, as it will not damage the file if there isn't a newline, but it has to read the entire file into memory in order to trim the content.
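Another option, not part of the answer above, is to build the document in memory and trim it before it reaches the disk. This is a minimal sketch that assumes the document is small enough to hold in memory; the `$xmlFileName` path in the temp directory is only illustrative.

```php
<?php
$xmlFileName = sys_get_temp_dir() . '/testoutput.xml'; // illustrative path

$xml = new XMLWriter();
$xml->openMemory();      // buffer the output in memory instead of writing to a URI
$xml->setIndent(true);
$xml->startDocument('1.0', 'UTF-8');
$xml->startElement('name');
$xml->text('jim');
$xml->endElement();
$xml->endDocument();

// outputMemory() returns the buffered XML; rtrim() removes the trailing newline.
$contents = rtrim($xml->outputMemory());
file_put_contents($xmlFileName, $contents);
```

This avoids writing the file and then reading it back, at the cost of holding the whole document in a string.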

If you are on a Linux/Unix system, you can do:
$test = `head -n -1 < $xmlFileName > $xmlFileName`;


PHP gzread, gzfile, gzopen, etc.. all strip tags off of XML and return only the values [duplicate]

I have .gz files that contain XML files. I've tried every combination of the different things shown in the code below. Any time one of the gz* functions "works", it returns the values contained inside the XML files with all the tags and metadata gone. For example, if the XML file looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<tag1>
<taga>
This
</taga>
<tagb>
is the stuff
</tagb>
</tag1>
<tag2>
<taga>
I get but only
</taga>
<tagb>
This
</tagb>
</tag2>
What I get is:
This is the stuff I get but only This
Here's the code:
<?php
$mailfileObj->zipfile = 'path/to/gzfile.gz'; //ignore the fact that it says zipfile, it is a .gz file
try{
$opengzfile = gzopen($mailfileObj->zipfile, "r");
$contents = gzread($opengzfile, filesize($mailfileObj->zipfile));
gzclose($opengzfile);
var_dump($contents);
echo '<br>';
//$opengzfile = fopen($mailfileObj->zipfile, "r");
//$contents = fread($opengzfile, filesize($mailfileObj->zipfile));
//fclose($opengzfile);
//$contents = file_get_contents($mailfileObj->zipfile);
$contents2 = '';
$lines = gzfile($mailfileObj->zipfile);
foreach ($lines as $line) {
echo $line;
$contents2 = $contents2.$line;
}
//var_dump($contents);
//echo '<br>';
//var_dump($contents);
//echo $contents . '<br><br>';
//$xmlfilegz = $mailfileObj->filename.'.xml';
//$openxmlfile = fopen($xmlfilegz, "w");
//fwrite($openxmlfile, $contents);
//fclose($openxmlfile);
$opengzfile = fopen($mailfileObj->zipfile, "r");
$contents2 = fread($opengzfile, filesize($mailfileObj->zipfile));
fclose($opengzfile);
//$contents2 = file_get_contents($mailfileObj->zipfile);
//$contents2 = gzdecode($contents);
$contents2 = gzinflate($contents);
//$contents2 = gzuncompress($contents);
var_dump($contents2);
}
catch(Exception $e){
echo 'Caught exception: ' . $e->getMessage() . '<br>';
}
?>
What is wrong here? What am I missing?
Thank you.
You're putting the XML in an HTML web page, so the browser is interpreting the XML tags as HTML tags.
Use htmlentities() to encode them so they'll be rendered literally.
foreach ($lines as $line) {
echo htmlentities($line);
$contents2 = $contents2.$line;
}
You might want to show this in a <pre> block so the newlines and indentation will be preserved.
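Putting both suggestions together, a minimal sketch might look like the following. It assumes the zlib extension is available; the temporary file and the sample XML are created here only so the example is self-contained, and `htmlspecialchars()` is used as a lighter alternative to `htmlentities()` since only the markup characters need escaping.

```php
<?php
// Create a small gzip file to read back (illustrative sample data).
$gzPath = tempnam(sys_get_temp_dir(), 'xmlgz');
file_put_contents($gzPath, gzencode("<tag1>\n<taga>This</taga>\n</tag1>\n"));

// gzfile() decompresses the file and returns its lines as an array.
$raw = implode('', gzfile($gzPath));

// Escape <, >, & and quotes so the browser shows the tags instead of parsing them.
$escaped = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// <pre> keeps the newlines and indentation of the original XML.
echo '<pre>' . $escaped . '</pre>';

unlink($gzPath);
```

The unescaped `$raw` string still holds the real XML, so it can be passed on to a parser or written to a file unchanged.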

How can I get the content of a div by php from another link

I would like to get the contents of a div with ID content using PHP, and write the contents to a text file.
Here is some code that I tried:
<?php
$html = file_get_content('www.example.com');
$divContent = $html->find('div#contentArea', 0)->plaintext;
$file = fopen("newfile.txt", w);
fwrite($file, $divContent);
fclose($file);
?>
This code isn't working; it reports an error about file_get_content.
I also tried this one:
<?php
$html = file_get_html('http://www.example.com/')->plaintext;
$divContent = $html->find('div#contentArea', 0)->plaintext;
$file = fopen("newfile.txt", w);
fwrite($file, $divContent);
fclose($file);
?>
I have needed to do this on many occasions due to site maintenance and error logging. PHP Manual explains further http://php.net/manual/en/domdocument.getelementbyid.php
BASIC EXAMPLE
<?php
$page = file_get_contents('example.html');
$doc = new DOMDocument();
$doc->loadHTML($page);
$node = $doc->getElementById('thisone');
echo $doc->saveHtml($node), PHP_EOL;
?>
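Applied to the question, the same approach can be extended to write the div's text to a file. This is a sketch under assumptions: the inline HTML string stands in for the downloaded page, `contentArea` is the id taken from the question's code, and the output path in the temp directory is illustrative.

```php
<?php
// Stand-in for the page that would normally come from file_get_contents($url).
$page = '<html><body><div id="contentArea"><p>Hello <b>world</b></p></div></body></html>';

$previous = libxml_use_internal_errors(true); // real-world HTML often triggers parser warnings
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$outFile = sys_get_temp_dir() . '/newfile.txt'; // illustrative path
$node = $doc->getElementById('contentArea');

if ($node !== null) {
    // textContent is the text of the element and all its descendants, without tags.
    file_put_contents($outFile, trim($node->textContent));
} else {
    echo "No element with id contentArea found", PHP_EOL;
}
```

To keep the markup instead of the plain text, `$doc->saveHTML($node)` could be written to the file in place of `textContent`.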
What error is thrown by file_get_content? That may be the source of your problem (note that the PHP function is named file_get_contents).
For selecting elements I used the phpQuery library:
<?php
$code = file_get_contents('http://some-url-here');
$document = phpQuery::newDocument($code);
$inner = $document->find('div.hentry')->html();
?>

how to find line number for DOM elements in php?

I want to check whether an <img> tag has alt text, and I also need to find the line number at which that img tag appears in the document. So far I have written the following code, but I am stuck on finding the line number.
For example:
$doc = new DOMDocument();
$doc->loadHTMLFile('http://www.google.com');
$htmlElement = $doc->getElementsByTagName('html');
$tags = $doc->getElementsByTagName('img');
echo $tags->item(0)->getLineNo();
foreach ($tags as $image) {
// Get sizes of elements via width and height attributes
$alt = $image->getAttribute('alt');
if($alt == ""){
$src = $image->getAttribute('src');
echo "No alt text ";
echo '<img src="http://google.com/'.$src.'" alt=""/>'. '<br>';
}
else{
$src = $image->getAttribute('src');
echo '<img src="http://google.com/'.$src.'" alt=""/>'. '<br>';
}
}
With the above code I currently get the images, with the text "No alt text" beside each image that lacks one, but I also want the line number on which each img tag appears.
For example, here the line number is 57:
56. <div class="work_item">
57. <p class="pich"><img src="images/works/1.jpg" alt=""></p>
58. </div>
Use DOMNode::getLineNo(), e.g. $line = $image->getLineNo().
HTML has no real concept of line numbers, since they are just whitespace.
With that in mind, you might be able to count how many newlines there are in all the text nodes preceding the target node. You might be able to do this with DOMXPath:
$xpath = new DOMXPath($doc);
$node = /* your target node */;
$textnodes = $xpath->query("./preceding::*[contains(text(),'\n')]",$node);
$line = 1;
foreach($textnodes as $textnode) $line += substr_count($textnode->textContent,"\n");
// $line is now the line number of the node.
Please note that I have not tested this, nor have I ever used axes in xpath.
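The newline-counting idea can also be applied to the raw source instead of the DOM. The sketch below is a swapped-in technique, not the XPath approach above: it scans the HTML text with a regular expression and counts the newlines before each match. The `findImagesWithoutAlt` function and the sample markup are invented for illustration, and regex matching of HTML is only a rough approximation that can miss unusual markup.

```php
<?php
// Hypothetical helper: report <img> tags whose alt attribute is missing or empty,
// together with the 1-based line number of the tag in the source text.
function findImagesWithoutAlt(string $html): array
{
    $found = [];
    preg_match_all('/<img\b[^>]*>/i', $html, $matches, PREG_OFFSET_CAPTURE);

    foreach ($matches[0] as [$tag, $offset]) {
        // Look for a quoted alt value inside this tag.
        $hasAlt = preg_match('/\balt\s*=\s*(["\'])(.*?)\1/i', $tag, $m) === 1
            && trim($m[2]) !== '';

        if (!$hasAlt) {
            // Line number = newlines before the tag's byte offset, plus one.
            $line = substr_count(substr($html, 0, $offset), "\n") + 1;
            $found[] = ['line' => $line, 'tag' => $tag];
        }
    }
    return $found;
}

// Sample markup (illustrative): the first image has an empty alt, the second does not.
$html = "<div class=\"work_item\">\n"
      . "<p class=\"pich\"><img src=\"images/works/1.jpg\" alt=\"\"></p>\n"
      . "<p><img src=\"images/works/2.jpg\" alt=\"Second work\"></p>\n"
      . "</div>\n";

$results = findImagesWithoutAlt($html);
foreach ($results as $result) {
    echo $result['line'], '. ', htmlspecialchars($result['tag']), PHP_EOL;
}
```

Because the count is taken from the original text, the reported numbers match what an editor shows for the source file, which is not guaranteed for line numbers derived from re-serialized DOM output.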
I think I have figured out what I was trying to achieve, but I am not sure it is the right way. It does the job. Please leave comments or any other ideas on how I can improve it.
If you go to the following site and type in any URL, it will produce a report of the accessibility issues in that web page. It is an accessibility checker tool.
http://valet.webthing.com/page/
All I am trying to do is achieve that kind of layout. The code below will produce the DOM of the supplied URL and find any image tag that does not have alternative text.
<html>
<body>
<?php
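$dom = new DOMDocument;
// load the html into the object ($yourURLAddress must hold the URL to check;
// it must not be wrapped in single quotes, or the literal text is used as the path)
$dom->loadHTMLFile($yourURLAddress);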
// keep white space
$dom->preserveWhiteSpace = true;
// nicely format output
$dom->formatOutput = true;
$new = htmlspecialchars($dom->saveHTML(), ENT_QUOTES);
$lines = preg_split('/\r\n|\r|\n/', $new); //split the string on new lines
echo "<pre>";
//find 'alt=""' and print the line number and html tag
foreach ($lines as $lineNumber => $line) {
if (strpos($line, htmlspecialchars('alt=""')) !== false) {
echo "\r\n" . $lineNumber . ". " . $line;
}
}
echo "\n\n\nBelow is the whole DOM\n\n\n";
//print out the whole DOM including line numbers
foreach ($lines as $lineNumber => $line) {
echo "\r\n" . $lineNumber . ". " . $line;
}
echo "</pre>";
?>
</body>
</html>
I'd like to thank everyone who helped, especially "chwagssd" and Mike Johnson.

XMLReader and doctype

I need to parse an XML file, and I also need to parse the doctype. I've tried with XMLReader, but when I find a node of type 10 (doctype), I can't get its value.
Is there a way to extract the doctype from an XML file with XMLReader?
Edit: as asked, here is some sample code. It is nothing more than a dump right now.
$reader = new XMLReader( );
$filename = 'test.xhtml';
$reader->open($filename);
while( $reader->read( ) )
{
$nodeType = $reader->nodeType;
$nodeName = $reader->name;
$nodeValue = $reader->value;
if( $nodeType == 10 )
{
echo $nodeType ."\n";
echo $nodeName ."\n";
echo $nodeValue ."\n";
echo $reader->localName ."\n";
echo $reader->namespaceURI ."\n";
echo $reader->prefix ."\n";
echo $reader->xmlLang ."\n";
echo $reader->readString() . "\n";
echo $reader->readInnerXML() . "\n";
while( $reader->moveToNextAttribute( ) )
{
echo $reader->name . "=" . $reader->value;
}
}
}
You can use DOM to read the DOCTYPE data:
$doc = new DOMDocument();
$doc->loadXML($xmlData);
var_dump($doc->doctype->publicId);
var_dump($doc->doctype->systemId);
var_dump($doc->doctype->name);
var_dump($doc->doctype->entities);
var_dump($doc->doctype->notations);
I have not found a way to do this with XMLReader despite a lot of looking. However you can use DOMDocument to read the doctype quite easily, then revert to XMLReader to read the rest of the stream. For example, to get the system ID part of the doctype before processing the rest of the XML file:
$doc = new DOMDocument();
$doc->load($xmlfile);
$systemId = $doc->doctype->systemId;
unset($doc);
// Then proceed with XMLReader:
$reader = new XMLReader();
$reader->open($xmlfile);
while($reader->read())
{
// etc
}
I suppose that this may not be practical in all circumstances but it worked for me while processing very large XML files for which I needed to read the system ID from the doctype.
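A self-contained version of this two-step approach is sketched below. The XHTML sample string is illustrative and replaces the file used above; `loadXML()` and `XMLReader::XML()` are used so that no file is needed. With default options the external DTD is not fetched, so only the identifiers in the DOCTYPE declaration are read.

```php
<?php
// Illustrative sample document with a DOCTYPE declaration.
$xmlData = '<?xml version="1.0"?>'
    . '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    . '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body/></html>';

// Step 1: read the doctype with DOM.
$doc = new DOMDocument();
$doc->loadXML($xmlData);

$doctypeName = null;
$publicId = null;
$systemId = null;
if ($doc->doctype !== null) {       // doctype is null when the document has no DOCTYPE
    $doctypeName = $doc->doctype->name;
    $publicId    = $doc->doctype->publicId;
    $systemId    = $doc->doctype->systemId;
}
unset($doc);

// Step 2: process the rest of the document with XMLReader.
$reader = new XMLReader();
$reader->XML($xmlData);
$firstElement = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        $firstElement = $reader->name; // stop at the first element for this sketch
        break;
    }
}
$reader->close();

echo $doctypeName, ' | ', $publicId, ' | ', $systemId, PHP_EOL;
```

Note that step 1 parses the whole document into memory, which is the trade-off mentioned above for very large files.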

Problem editing word file in PHP

So I need to edit some text in a Word document. I created a Word document and saved it as XML. It is saved correctly (I can open the XML file in MS Word and it looks exactly like the docx original).
So then I use PHP DOM to edit some text in the file (just two lines) (EDIT: below is the already fixed, working version):
<?php
$firstName = 'Richard';
$lastName = 'Knop';
$xml = file_get_contents('template.xml');
$doc = new DOMDocument();
$doc->loadXML($xml);
$doc->preserveWhiteSpace = false;
$wts = $doc->getElementsByTagNameNS('http://schemas.openxmlformats.org/wordprocessingml/2006/main', 't');
$c1 = 0; $c2 = 0;
foreach ($wts as $wt) {
if (1 === $c1) {
$wt->nodeValue .= ' ' . $firstName;
$c1++;
}
if (1 === $c2) {
$wt->nodeValue .= ' ' . $lastName;
$c2++;
}
if ('First Name' === substr($wt->nodeValue, 0, 10)) {
$c1++;
}
if ('Last Name' === substr($wt->nodeValue, 0, 9)) {
$c2++;
}
}
$xml = str_replace("\n", "\r\n", $xml);
$fp = fopen('final-xml.xml', 'w');
fwrite($fp, $xml);
fclose($fp);
This gets executed properly (no errors). These two lines:
<w:t>First Name:</w:t>
<w:t>Last Name:</w:t>
Get replaced with these:
<w:t>First Name: Richard</w:t>
<w:t>Last Name: Knop</w:t>
However, when I try to open the final-xml.xml file in MS Word, it doesn't open (Word freezes). Any suggestions?
EDIT:
I tried using levenshtein():
$xml = file_get_contents('template.xml');
$xml2 = file_get_contents('final-xml.xml');
$str = str_split($xml, 255);
$str2 = str_split($xml2, 255);
$i = 0;
foreach ($str as $s) {
$dist = levenshtein($s, $str2[$i]);
if (0 <> $dist) {
echo $dist, '<br />';
}
$i++;
}
This output nothing, which is weird: when I open the final-xml.xml file in Notepad, I can clearly see that those two lines have changed.
EDIT2:
Here is the template.xml file: http://uploading.com/files/61b2922b/template.xml/
This is a problem related to DOS vs. UNIX line endings. Word 2007 does not tolerate a \n line ending; it requires \r\n, whereas Word 2010 is more tolerant and accepts both.
To fix the problem, make sure that you replace all UNIX line breaks with DOS ones before saving the output file:
$xml = str_replace("\n", "\r\n", $xml);
Full sample:
<?php
$firstName = 'Richard';
$lastName = 'Knop';
$xml = file_get_contents('template.xml');
$doc = new DOMDocument();
$doc->loadXML($xml);
$doc->preserveWhiteSpace = false;
$wts = $doc->getElementsByTagNameNS('http://schemas.openxmlformats.org/wordprocessingml/2006/main', 't');
foreach ($wts as $wt) {
echo $wt->nodeValue;
if ('First Name:' === $wt->nodeValue) {
$wt->nodeValue = 'First Name: ' . $firstName;
}
if ('Last Name:' === substr($wt->nodeValue, 0, 10)) {
$wt->nodeValue = 'Last Name: ' . $lastName;
}
}
$xml = $doc->saveXML();
// Replace UNIX with DOS line endings
$xml = str_replace("\n", "\r\n", $xml);
$fp = fopen('final-xml.xml', 'w');
fwrite($fp, $xml);
fclose($fp);
?>
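One caveat with a plain `str_replace("\n", "\r\n", ...)` is that any line already ending in \r\n becomes \r\r\n. A hedged alternative is to normalize every kind of line break in one pass; the `toCrlf` function name below is invented for this sketch.

```php
<?php
// Hypothetical helper: convert \r\n, lone \r and lone \n to \r\n in a single pass,
// so that running it on text that already uses DOS line endings changes nothing.
function toCrlf(string $text): string
{
    // \r\n is listed first so an existing pair is matched as one unit.
    return preg_replace('/\r\n|\r|\n/', "\r\n", $text);
}

$mixed  = "a\nb\r\nc\rd";
$result = toCrlf($mixed);

// Show the result with the control characters made visible.
echo addcslashes($result, "\r\n"), PHP_EOL;
```

In the full sample this would replace the `str_replace` call, i.e. `$xml = toCrlf($doc->saveXML());`.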
XML Word files have certain checksums stored near the top of the DOM (to my recollection). You may have to change these, such as the size or the general checksum itself.
I know this was my problem when I was dumb enough to make an HTML file in Word and save it; it had thousands of useless things in it that only served to make editing worse.
