How to keep DOMDocument from saving < as &lt - php

I'm using simpleXML to add in a child node within one of my XML documents... when I do a print_r on my simpleXML object, the < is still being displayed as a < in the view source. However, after I save this object back to XML using DOMDocument, the < is converted to < and the > is converted to >
Any ideas on how to change this behavior? I've tried adding dom->substituteEntities = false;, but this did no good.
//Convert SimpleXML element to DOM and save
$dom = new DOMDocument('1.0');
$dom->preserveWhiteSpace = false;
$dom->formatOutput = false;
$dom->substituteEntities = false;
$dom->loadXML($xml->asXML());
$dom->save($filename);
Here is where I'm using the <:
$new_hint = '<![CDATA[' . $value[0] . ']]>';
$PrintQuestion->content->multichoice->feedback->hint->Passage->Paragraph->addChild('TextFragment', $new_hint);
The problem, is I'm using simple XML to iterate through certain nodes in the XML document, and if an attribute matches a given ID, a specific child node is added with CDATA. Then after all processsing, I save the XML back to file using DOMDocument, which is where the < is converted to &lt, etc.
Here is a link to my entire class file, so you can get a better idea on what I'm trying to accomplish. Specifically refer to the hint_insert() method at the bottom.
http://pastie.org/1079562

SimpleXML and php5's DOM module use the same internal representation of the document (facilitated by libxml). You can switch between both apis without having to re-parse the document via simplexml_import_dom() and dom_import_simplexml().
I.e. if you really want/have to perform the iteration with the SimpleXML api once you've found your element you can switch to the DOM api and create the CData section within the same document.
<?php
$doc = new SimpleXMLElement('<a>
<b id="id1">a</b>
<b id="id2">b</b>
<b id="id3">c</b>
</a>');
foreach( $doc->xpath('b[#id="id2"]') as $b ) {
$b = dom_import_simplexml($b);
$cdata = $b->ownerDocument->createCDataSection('0<>1');
$b->appendChild($cdata);
unset($b);
}
echo $doc->asxml();
prints
<?xml version="1.0"?>
<a>
<b id="id1">a</b>
<b id="id2">b<![CDATA[0<>1]]></b>
<b id="id3">c</b>
</a>

The problem is that you're likely adding that as a string, instead of as an element.
So, instead of:
$simple->addChild('foo', '<something/>');
which will be treated as text:
$child = $simple->addChild('foo');
$child->addChild('something');
You can't have a literal < in the body of the XML document unless it's the opening of a tag.
Edit: After what you describe in the comments, I think you're after:
DomDocument::createCDatatSection()
$child = $dom->createCDataSection('your < cdata > body ');
$dom->appendChild($child);
Edit2: After reading your edit, there's only one thing I can say:
You're doing it wrong... You can't add elements as a string value for another element. Sorry, you just can't. That's why it's escaping things, because DOM and SimpleXML are there to make sure you always create valid XML. You need to create the element as an object... So, if you want to create the CDATA child, you'd have to do something like this:
$child = $PrintQuestion.....->addChild('TextFragment');
$domNode = dom_import_simplexml($child);
$cdata = $domNode->ownerDocument->createCDataSection($value[0]);
$domNode->appendChild($cdata);
That's all there should be to it...

Related

PHP: Keeping HTML inside XML node without CDATA

I've got an xml like this:
<father>
<son>Text with <b>HTML</b>.</son>
</father>
I'm using simplexml_load_string to parse it into SimpleXmlElement. Then I get my node like this
$xml->father->son->__toString(); //output: "Text with .", but expected "Text with <b>HTML</b>."
I need to handle simple HTML such as:
<b>text</b> or <br/> inside the xml which is sent by many users.
Me problem is that I can't just ask them to use CDATA because they won't be able to handle it properly, and they are already use to do without.
Also, if it's possible I don't want the file to be edited because the information need to be the one sent by the user.
The function simplexml_load_string simply erase anything inside HTML node and the HTML node itself.
How can I keep the information ?
SOLUTION
To handle the problem I used the asXml as explained by #ThW:
$tmp = $xml->father->son->asXml(); //<son>Text with <b>HTML</b>.</son>
I just added a preg_match to erase the node.
A CDATA section is a character node, just like a text node. But it does less encoding/decoding. This is mostly a downside, actually. On the upside something in a CDATA section might be more readable for a human and it allows for some BC in special cases. (Think HTML script tags.)
For an XML API they are nearly the same. Here is a small DOM example (SimpleXML abstracts to much).
$document = new DOMDocument();
$father = $document->appendChild(
$document->createElement('father')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createTextNode('With <b>HTML</b><br>It\'s so nice.')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createCDataSection('With <b>HTML</b><br>It\'s so nice.')
);
$document->formatOutput = TRUE;
echo $document->saveXml();
Output:
<?xml version="1.0"?>
<father>
<son>With <b>HTML</b><br>It's so nice.</son>
<son><![CDATA[With <b>HTML</b><br>It's so nice.]]></son>
</father>
As you can see they are serialized very differently - but from the API view they are basically exchangeable. If you're using an XML parser the value you get back should be the same in both cases.
So the first possibility is just letting the HTML fragment be stored in a character node. It is just a string value for the outer XML document itself.
The other way would be using XHTML. XHTML is XML compatible HTML. You can mix an match different XML formats, so you could add the XHTML fragment as part of the outer XML.
That seems to be what you're receiving. But SimpleXML has some problems with mixed nodes. So here is an example how you can read it in DOM.
$xml = <<<'XML'
<father>
<son>With <b>HTML</b><br/>It's so nice.</son>
</father>
XML;
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$result = '';
foreach ($xpath->evaluate('/father/son[1]/node()') as $child) {
$result .= $document->saveXml($child);
}
echo $result;
Output:
With <b>HTML</b><br/>It's so nice.
Basically you need to save each child of the son element as XML.
SimpleXML is based on the same DOM library internally. That allows you to convert a SimpleXMLElement into a DOM node. From there you can again save each child as XML.
$father = new SimpleXMLElement($xml);
$sonNode = dom_import_simplexml($father->son);
$document = $sonNode->ownerDocument;
$result = '';
foreach ($sonNode->childNodes as $child) {
$result .= $document->saveXml($child);
}
echo $result;

Parsing inline tags with SimpleXML

I'm using SimpleXML & PHP to parse an XML element in the following form:
<element>
random text with <inlinetag src="http://url.com/">inline</inlinetag> XML to parse
</element>
I know I can reach inlinetag using $element->inlinetag, but I don't know how to reach it in such a way that I can basically replace the inlinetag with a link to the attribute source without using it's location in the text. The result would basically have to look like this:
here is a random text with inline XML
This may be a stupid questions, I hope someone here can help! :)
I found a way to do this using DOMElement.
One way to replace the element is by cloning it with a different name/attributes. Here is is a way to do this, using the accepted answer given on How do you rename a tag in SimpleXML through a DOM object?
function clonishNode(DOMNode $oldNode, $newName, $replaceAttrs = [])
{
$newNode = $oldNode->ownerDocument->createElement($newName);
foreach ($oldNode->attributes as $attr)
{
if (isset($replaceAttrs[$attr->name]))
$newNode->setAttribute($replaceAttrs[$attr->name], $attr->value);
else
$newNode->appendChild($attr->cloneNode());
}
foreach ($oldNode->childNodes as $child)
$newNode->appendChild($child->cloneNode(true));
$oldNode->parentNode->replaceChild($newNode, $oldNode);
}
Now, we use this function to clone the inline element with a new element and attribute name. Here comes the tricky part: iterating over all the nodes will not work as expected. The length of the selected nodes will change as you clone them, as the original node is removed. Therefore, we only select the first element until there are no elements left to clone.
$xml = '<element>
random text with <inlinetag src="http://url.com/">inline</inlinetag> XML to parse
</element>';
$dom = new DOMDocument;
$dom->loadXML($xml);
$nodes= $dom->getElementsByTagName('inlinetag');
echo $dom->saveXML(); //<element>random text with <inlinetag src="http://url.com/">inline</inlinetag> XML to parse</element>
while($nodes->length > 0) {
clonishNode($nodes->item(0), 'a', ['src' => 'href']);
}
echo $dom->saveXML(); //<element>random text with inline XML to parse</element>
That's it! All that's left to do is getting the content of the element tag.
Is this the result you want to achieve?
<?php
$data = '<element>
random text with
<inlinetag src="http://url.com/">inline
</inlinetag> XML to parse
</element>';
$xml = simplexml_load_string($data);
foreach($xml->inlinetag as $resource)
{
echo 'Your SRC attribute = '. $resource->attributes()->src; // e.g. name, price, symbol
}
?>

PHP Xpath: Get all href's that contain "letter"

Say I have an html file that I have loaded, I run this query:
$url = 'http://www.fangraphs.com/players.aspx';
$html = file_get_contents($url);
$myDom = new DOMDocument;
$myDom->formatOutput = true;
#$myDom->loadHTML($html);
$anchor = $xpath->query('//a[contains(#href,"letter")]');
That gives me a list of these anchors that look like the following:
Aa
But I need a way to only get "players.aspx?letter=Aa".
I thought I could try:
$anchor = $xpath->query('//a[contains(#href,"letter")]/#href');
But that gives me a php error saying I couldn't append node when I try the following:
$xpath = new DOMXPath($myDom);
$newDom = new DOMDocument;
$j = 0;
while( $myAnchor = $anchor->item($j++) ){
$node = $newDom->importNode( $myAnchor, true ); // import node
$newDom->appendChild($node);
}
Any idea how to obtain just the value of the href tags that the first query selects?? Thanks!
Use:
//a/#href[contains(., 'letter')]
this selects any href attribute of any a whose string value (of the attribute) contains the string "letter" .
Your XPath query is returning attributes themselves (i.e., DOMAttr objects) rather than elements (i.e., DOMElement objects). That's fine, and that seems to be what you want, but appending them to the document is the problem. A DOMAttr is not a standalone node in the document tree; it's associated with a DOMElement but is not a child in the usual sense. Thus, directly appending a DOMAttr to the document is invalid.
From the W3C specs:
Attr objects inherit the Node interface, but since they are not actually child nodes of the element they describe, the DOM does not consider them part of the document tree. . . . The DOM takes the view that attributes are properties of elements rather than having a separate identity from the elements they are associated with
Either associate the DOMAttr with a DOMElement and append that element, or pull out the DOMAttr's value and use that as you wish.
To just append its plain text value, use its value in a DOMText node and append that. For example, change this line:
$newDom->appendChild($node);
to this:
$newDom->appendChild(new DOMText($node->value));
try this..
$xml_string = 'your xml string';
$xml = simplexml_load_string($xml_string);
foreach($xml->a[0]->attributes() as $href => $value) {
$myAnchorsValues[] = $value;
}
var_dump($myAnchorsValues);

How to get a specific node text using php DOM

I am trying to get the value (text) of a specific node from an xml document using php DOM classes but I cannot do it right because I get the text content of that node merged with its descendants.
Let's suppose that I need to get the trees from this document:
<?xml version="1.0"?>
<trees>
LarchRedwoodChestnutBirch
<trimmed>Larch</trimmed>
<trimmed>Redwood</trimmed>
</trees>
And I get:
LarchRedwoodChestnutBirchLarchRedwood
You can see that I cannot remove the substring LarchRedwood made by the trimmed trees from the whole text because I would get only ChestnutBirch and it is not what I need.
Any suggest? (Thanx)
I got it. This works:
function specificNodeValue($node, $implode = true) {
$value = array();
if ($node->childNodes) {
for ($i = 0; $i < $node->childNodes->length; $i++) {
if (!(#$node->childNodes->item($i)->tagName)) {
$value[] = $node->childNodes->item($i)->nodeValue;
}
}
}
return (is_string($implode) ? implode($implode, $value) : ($implode === true ? implode($value) : $value));
}
A given node is like a root, if you get no tagName when you parse its child nodes then it is itself, so the value of that child node it is its own value.
Inside a bad formed xml document a node could have many pieces of value, put them all into an array to get the whole value of the node.
Use the function above to get needed node value without subnode values merged within.
Parameters are:
$node (required) must be a DOMElement object
$implode (optional) if you want to get a string (true by default) or an array (false) made up by many pieces of value. (Set a string instead of a boolean value if you wish to implode the array using a "glue" string).
You can try this to remove the trimmed node
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadXML($xml);
$xpath = new DOMXpath($doc);
$trees = $doc->getElementsByTagName('trees')->item(0);
foreach ($xpath->query('/trees/*') as $node)
{
$trees->removeChild($node);
}
echo $trees->textContent;
echo $trees->nodeValue;
Use $node->nodeValue to get a node's text content. If you use $node->textContent, you get all text from the current node and all child nodes.
Ideally, the XML should be:
<?xml version="1.0"?>
<trees>
<tree>Larch</tree>
<tree>Redwood</tree>
<tree>Chestnut</tree>
<tree>Birch</tree>
</trees>
To split "LarchRedwoodChestnutBirch" into separate words (by capital letter), you'll need to use PHP's "PCRE" functions:
http://www.php.net/manual/en/book.pcre.php
'Hope that helps!

How to get values inside <![CDATA[values]] > using php DOM?

How can i get values inside <![CDATA[values]] > using php DOM.
This is few code from my xml.
<Destinations>
<Destination>
<![CDATA[Aghia Paraskevi, Skiatos, Greece]]>
<CountryCode>GR</CountryCode>
</Destination>
<Destination>
<![CDATA[Amettla, Spain]]>
<CountryCode>ES</CountryCode>
</Destination>
<Destination>
<![CDATA[Amoliani, Greece]]>
<CountryCode>GR</CountryCode>
</Destination>
<Destination>
<![CDATA[Boblingen, Germany]]>
<CountryCode>DE</CountryCode>
</Destination>
</Destinations>
Working with PHP DOM is fairly straightforward, and is very similar to Javascript's DOM.
Here are the important classes:
DOMNode — The base class for anything that can be traversed inside an XML/HTML document, including text nodes, comment nodes, and CDATA nodes
DOMElement — The base class for tags.
DOMDocument — The base class for documents. Contains the methods to load/save XML, as well as normal DOM document methods (see below).
There are a few staple methods and properties:
DOMDocument->load() — After creating a new DOMDocument, use this method on that object to load from a file.
DOMDocument->getElementsByTagName() — this method returns a node list of all elements in the document with the given tag name. Then you can iterate (foreach) on this list.
DOMNode->childNodes — A node list of all children of a node. (Remember, a CDATA section is a node!)
DOMNode->nodeType — Get the type of a node. CDATA nodes have type XML_CDATA_SECTION_NODE, which is a constant with the value 4.
DOMNode->textContent — get the text content of any node.
Note: Your CDATA sections are malformed. I don't know why there is an extra ]] in the first one, or an unclosed CDATA section at the end of the line, but I think it should simply be:
<![CDATA[Aghia Paraskevi, Skiatos, Greece]]>
Putting this all together we:
Create a new document object and load the XML
Get all Destination elements by tag name and iterate over the list
Iterate over all child nodes of each Destination element
Check if the node type is XML_CDATA_SECTION_NODE
If it is, echo the textContent of that node.
Code:
$doc = new DOMDocument();
$doc->load('test.xml');
$destinations = $doc->getElementsByTagName("Destination");
foreach ($destinations as $destination) {
foreach($destination->childNodes as $child) {
if ($child->nodeType == XML_CDATA_SECTION_NODE) {
echo $child->textContent . "<br/>";
}
}
}
Result:
Aghia Paraskevi, Skiatos, Greece
Amettla, Spain
Amoliani, Greece
Boblingen, Germany
Use this:
$parseFile = simplexml_load_file($myXML,'SimpleXMLElement', LIBXML_NOCDATA)
and next :
foreach ($parseFile->yourNode as $node ){
etc...
}
Best and easy way
$xml = simplexml_load_string($xmlData, 'SimpleXMLElement', LIBXML_NOCDATA);
$xmlJson = json_encode($xml);
$xmlArr = json_decode($xmlJson, 1); // Returns associative array
Use replace CDATA before parsing PHP DOM element after that you can get the innerXml or innerHtml:
str_replace(array('<\![CDATA[',']]>'), '', $xml);
I use following code.
Its not only read all xml data with
<![CDATA[values]] >
but also convert xml object to php associative array. So we can apply loop on the data.
$xml_file_data = json_decode(json_encode(simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA),true), true);
Hope this will work for you.
function inBetweenOf(string $here, string $there, string $content) : string {
$left_over = strlen(substr($content, strpos($content, $there)));
return substr($content, strpos($content, $here) + strlen($here), -$left_over);
}
Iterate over "Destination" tags and then call inBetweenOf on each iteration.
$doc = inBetweenOf('<![CDATA[', ']]>', $xml);

Categories