PHP and XMLWriter - php

I was successfully using the following code to merge multiple large XML files into a new (larger) XML file. Found at least part of this on StackOverflow
$docList = new DOMDocument();
$root = $docList->createElement('documents');
$docList->appendChild($root);
$doc = new DOMDocument();
foreach(xmlFilenames as $xmlfilename) {
$doc->load($xmlfilename);
$xmlString = $doc->saveXML($doc->documentElement);
$xpath = new DOMXPath($doc);
$query = self::getQuery(); // this is the name of the ROOT element
$nodelist = $xpath->evaluate($query, $doc->documentElement);
if( $nodelist->length > 0 ) {
$node = $docList->importNode($nodelist->item(0), true);
$xmldownload = $docList->createElement('document');
if (self::getShowFileName())
$xmldownload->setAttribute("filename", $filename);
$xmldownload->appendChild($node);
$root->appendChild($xmldownload);
}
}
$newXMLFile = self::getNewXMLFile();
$docList->save($newXMLFile);
I started running into OUT OF MEMORY issues when the number of files grew as did the size of them.
I found an article here which explained the issue and recommended using XMLWriter
So, now trying to use PHP XMLWriter to merge multiple large XML files together into a new (larger) XML file. Later, I will execute xpath against the new file.
Code:
$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();
$xmlWriter->openUri('mynewFile.xml');
$xmlWriter->setIndent(true);
$xmlWriter->startDocument('1.0', 'UTF-8');
$xmlWriter->startElement('documents');
$doc = new DOMDocument();
foreach($xmlfilenames as $xmlfilename)
{
$fileContents = file_get_contents($xmlfilename);
$xmlWriter->writeElement('document',$fileContents);
}
$xmlWriter->endElement();
$xmlWriter->endDocument();
$xmlWriter->flush();
Well, the resultant (new) xml file is no longer correct since elements are escaped - i.e.
<?xml version="1.0" encoding="UTF-8"?>
<CONFIRMOWNX>
<Confirm>
<LglVeh id="GLE">
<AddrLine1>GLEACHER &amp; COMPANY</AddrLine1>
<AddrLine2>DESCAP DIVISION</AddrLine2>
Can anyone explain how to take the content from the XML file and write them properly to new file?
I'm burnt on this and I KNOW it'll be something simple I'm missing.
Thanks.
Robert

See, the problem is that XMLWriter::writeElement is intended to, well, write a complete XML element. That's why it automatically sanitize (replace & with &, for example) the contents of what's been passed to it as the second param.
One possible solution is to use XMLWriter::writeRaw method instead, as it writes the contents as is - without any sanitizing. Obviously it doesn't validate its inputs, but in your case it does not seem to be a problem (as you're working with already checked source).

Hmm, Not sure why it's converting it to HTML Characters, but you can decode it like so
htmlspecialchars_decode($data);
It converts special HTML entities back to characters.

Related

Replace element value in DOM

I want to save DOM tags value to exist XML, I found replace function but it is in js and I need the function in PHP
I tried save and saveXML function, but this didn't worked. I have tags in XML with colon "iaiext:auction_title". I used getElement and it's work good, next i cut title to 50 characters function work too, but how i can replace old title to this new title if i dont use path like simple_load_file. How to show in my script this path?
$dom = new DOMDocument;
$dom->load('p.xml');
$i = 0;
$tytuly = $dom->getElementsByTagName('auction_title');
foreach ($tytuly as $tytul){
$title = $tytul->nodeValue;
$end_title = doTitleCut($title);
//echo "<pre>";
//echo($end_title);
//echo "<pre>";
$i = $i+1;
}
In your loop, you can update a particular nodes value the same way you fetch it - with nodeValue. So in your loop, just update it each time...
$tytul->nodeValue = doTitleCut($title);
Then after your loop, you can just echo the new XML out using
echo $dom->saveXML();
or save it using
$dom->save("3.xml");
It is the same basic API in PHP. However browsers implement more or other parts of the API. Here are 5 revisions of the API (DOM Level 1 to 4 and DOM LS). DOM 3 added a property to read/write the text content of a node: https://www.w3.org/TR/DOM-Level-3-Core/core.html#Node3-textContent
The following example prefixes the titles:
$xml = <<<'XML'
<auctions>
<auction_title>World!</auction_title>
<auction_title>World & Universe!</auction_title>
</auctions>
XML;
$document = new DOMDocument();
$document->loadXML($xml);
$titleNodes = $document->getElementsByTagName('auction_title');
foreach ($titleNodes as $titleNode) {
$title = $titleNode->textContent;
$titleNode->textContent = 'Hello '.$title;
}
echo $document->saveXML();
Output:
<?xml version="1.0"?>
<auctions>
<auction_title>Hello World!</auction_title>
<auction_title>Hello World & Universe!</auction_title>
</auctions>
PHPs DOMNode::$nodeValue implementation does not match the W3C API definition. It behaves the same as DOMNode::$textContent for reads and does not fully escape on write.

php export xml CDATA escaped

I am trying to export xml with CDATA tags. I use the following code:
$xml_product = $xml_products->addChild('product');
$xml_product->addChild('mychild', htmlentities("<![CDATA[" . $mytext . "]]>"));
The problem is that I get CDATA tags < and > escaped with < and > like following:
<mychild><![CDATA[My some long long long text]]></mychild>
but I need:
<mychild><![CDATA[My some long long long text]]></mychild>
If I use htmlentities() I get lots of errors like tag raquo is not defined etc... though there are no any such tags in my text. Probably htmlentities() tries to parse my text inside CDATA and convert it, but I dont want it either.
Any ideas how to fix that? Thank you.
UPD_1 My function which saves xml to file:
public static function saveFormattedXmlFile($simpleXMLElement, $output_file) {
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadXML(urldecode($simpleXMLElement->asXML()));
$dom->save($output_file);
}
A short example of how to add a CData section, note the way it skips into using DOMDocument to add the CData section in. The code builds up a <product> element, $xml_product has a new element <mychild> created in it. This newNode is then imported into a DOMElement using dom_import_simplexml. It then uses the DOMDocument createCDATASection method to properly create the appropriate bit and adds it back into the node.
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><Products />');
$xml_product = $xml->addChild('product');
$newNode = $xml_product->addChild('mychild');
$mytext = "<html></html>";
$node = dom_import_simplexml($newNode);
$cdata = $node->ownerDocument->createCDATASection($mytext);
$node->appendChild($cdata);
echo $xml->asXML();
This example outputs...
<?xml version="1.0" encoding="UTF-8"?>
<Products><product><mychild><![CDATA[<html></html>]]></mychild></product></Products>

PHP: Keeping HTML inside XML node without CDATA

I've got an xml like this:
<father>
<son>Text with <b>HTML</b>.</son>
</father>
I'm using simplexml_load_string to parse it into SimpleXmlElement. Then I get my node like this
$xml->father->son->__toString(); //output: "Text with .", but expected "Text with <b>HTML</b>."
I need to handle simple HTML such as:
<b>text</b> or <br/> inside the xml which is sent by many users.
Me problem is that I can't just ask them to use CDATA because they won't be able to handle it properly, and they are already use to do without.
Also, if it's possible I don't want the file to be edited because the information need to be the one sent by the user.
The function simplexml_load_string simply erase anything inside HTML node and the HTML node itself.
How can I keep the information ?
SOLUTION
To handle the problem I used the asXml as explained by #ThW:
$tmp = $xml->father->son->asXml(); //<son>Text with <b>HTML</b>.</son>
I just added a preg_match to erase the node.
A CDATA section is a character node, just like a text node. But it does less encoding/decoding. This is mostly a downside, actually. On the upside something in a CDATA section might be more readable for a human and it allows for some BC in special cases. (Think HTML script tags.)
For an XML API they are nearly the same. Here is a small DOM example (SimpleXML abstracts to much).
$document = new DOMDocument();
$father = $document->appendChild(
$document->createElement('father')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createTextNode('With <b>HTML</b><br>It\'s so nice.')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createCDataSection('With <b>HTML</b><br>It\'s so nice.')
);
$document->formatOutput = TRUE;
echo $document->saveXml();
Output:
<?xml version="1.0"?>
<father>
<son>With <b>HTML</b><br>It's so nice.</son>
<son><![CDATA[With <b>HTML</b><br>It's so nice.]]></son>
</father>
As you can see they are serialized very differently - but from the API view they are basically exchangeable. If you're using an XML parser the value you get back should be the same in both cases.
So the first possibility is just letting the HTML fragment be stored in a character node. It is just a string value for the outer XML document itself.
The other way would be using XHTML. XHTML is XML compatible HTML. You can mix an match different XML formats, so you could add the XHTML fragment as part of the outer XML.
That seems to be what you're receiving. But SimpleXML has some problems with mixed nodes. So here is an example how you can read it in DOM.
$xml = <<<'XML'
<father>
<son>With <b>HTML</b><br/>It's so nice.</son>
</father>
XML;
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$result = '';
foreach ($xpath->evaluate('/father/son[1]/node()') as $child) {
$result .= $document->saveXml($child);
}
echo $result;
Output:
With <b>HTML</b><br/>It's so nice.
Basically you need to save each child of the son element as XML.
SimpleXML is based on the same DOM library internally. That allows you to convert a SimpleXMLElement into a DOM node. From there you can again save each child as XML.
$father = new SimpleXMLElement($xml);
$sonNode = dom_import_simplexml($father->son);
$document = $sonNode->ownerDocument;
$result = '';
foreach ($sonNode->childNodes as $child) {
$result .= $document->saveXml($child);
}
echo $result;

Add item to xml everytime form is submitted - xmlwriter PHP

I have an html form when submitted post values to a php file. those values are then read into xmlwriter like below
<?php
$pastor= $_POST["Speaker"];
$title= $_POST["Title"];
$link= $_POST["podcasturl"];
$download= $_POST["Download"];
$podcastid= $_POST["Podcastid"];
$pubdate= $_POST ["Date"];
$xml = new XmlWriter();
$xml->openURI('podcast.xml');
$xml->formatOutput = true;
$xml->startDocument('1.0');
$xml->setIndent(4);
$xml->startElement("podcast");
$xml->writeElement('pastor', $pastor);
$xml->writeElement('title', $title);
$xml->writeElement('link', $link);
$xml->writeElement('download', $download);
$xml->writeElement('podcastid', $podcastid);
$xml->writeElement('pubdate', $pubdate);
$xml->endElement();
$xml->endElement();
$xml->endDocument();
?>
This whole system works fine. It creates the xml as I need it to based on the form. What I cant figure out is how to make it add a new item to the xml every time the form is submitted as opposed to it overwriting the same entry each time.
Thanks
You should consider using SimpleXML instead. XMLWriter is not intended to modify existing xml. From the XMLWriter introduction:
This extension represents a writer that provides a non-cached, forward-only means of generating streams or files containing XML data.
Using SimpleXML you can do something like this:
$xml = simplexml_load_file("podcast.xml");
$podcast = $xml->addChild("podcast");
$podcast->addChild("pastor", $pastor);
$podcast->addChild("title", $title);
$podcast->addChild("link", $link);
$podcast->addChild("download", $download);
$podcast->addChild("podcastid", $podcastid);
$podcast->addChild("pubdate", $pubdate);
$dom = new DOMDocument("1.0");
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadXML($xml->asXML());
$dom->save("podcast.xml");
In order for this example to work the podcast xml elements should be wrapped in an element (see below) and the `podcast.xml' file should already exist. Although the latter can be overcome with a check before loading.
<?xml version="1.0"?>
<podcasts>
<podcast>...</podcast>
<podcast>...</podcast>
<podcasts>

PHP Dealing with missing XML data

If I have three sets of data, say:
<note><from>Me</from><to>someone</to><message>hello</message></note>
<note><from>Me</from><to></to><message>Need milk & eggs</message></note>
<note><from>Me</from><message>Need milk & eggs</message></note>
and I'm using simplexml is there a way to have simple xml check that there's an empty/absent tag automatically?
I would like the output to be:
FROM TO MESSAGE
Me someone hello
Me NULL Need milk & eggs
Me NULL Need milk & eggs
Right now I'm doing it manually and I quickly realised that it's going to take a very long time to do it for long xml files.
My current sample code:
$xml = simplexml_load_string($string);
if ($xml->from != "") {$out .= $xml->from."\t"} else {$out .= "NULL\t";}
//repeat for all children, checking by name
Sometimes the order is different as well, there might be a xml with:
<note><message>pick up cd</message><from>me</from></note>
so iterating through the children and checking by index count doesn't work.
The actual xml files I'm working with are thousands of lines each, so I obviously can't just code in every tag.
It sounds like you need a DTD (Document Type Definition), which will define the required format of the XML file, and specify which elements are required, optional, what they can contain, etc.
DTDs can be used to validate an XML file before you do any processing with it.
Unfortunately, PHP's simplexml library doesn't do anything with DTD, but the DomDocument library does, so you may want to use that instead.
I'll leave it as a separate excersise for you to research how to create a DTD file. If you need more help with that, I'd suggest asking it as a separate question.
You could use the DOMDocument instead. I have created a quick demo that splits the <note> elements into an array using the XML tag names as keys. You could then iterate the resultant array to create your output.
I corrected the invalid XML by replacing the ampersand with the HTML entity equivalent (&).
<?php
libxml_use_internal_errors(true);
$xml = <<<XML
<notes>
<note><from>Me</from><to>someone</to><message>hello</message></note>
<note><from>Me</from><to></to><message>Need milk & eggs</message></note>
<note><from>Me</from><message>Need milk & eggs</message></note>
<note><message>pick up cd</message><from>me</from></note>
</notes>
XML;
function getNotes($nodelist) {
$notes = array();
foreach ($nodelist as $node) {
$noteParts = array();
foreach ($node->childNodes as $child) {
$noteParts[$child->tagName] = $child->nodeValue;
}
$notes[] = $noteParts;
}
return $notes;
}
$dom = new DOMDocument();
$dom->recover = true;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
$nodelist = $xpath->query("//note");
$notes = getNotes($nodelist);
print_r($notes);
?>
Edit: If you change to $noteParts = array(); to $noteParts = array('from' => null, 'to' => null, 'message' => null); then it will always create the full set of keys.

Categories