How to convert empty XML node into empty string instead of SimpleXMLElement - php

I have an XML string that sometimes has empty nodes. When parsing this with simplexml_load_string the parser interprets any empty nodes (example <node></node>) to be an empty SimpleXMLElement. I actually would prefer these come through as an empty string, or are just omitted entirely.
I've tried using LIBXML_NOBLANKS as shown below, but it seems to have no effect. Here's some code that demonstrates the situation. The node "p2" is empty:
$xml = "<xml><p1>1</p1><p2></p2><p3>3</p3></xml>";
$obj = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOBLANKS);
header("Content-type: text/plain");
echo "STRING\n-----\n" . $xml;
echo "\n\nOBJ\n---\n" . print_r($obj,1);
echo "\n\nJSON\n----\n" . json_encode($obj);

Here is a working example that removes empty nodes:
$nodes = $rootNode->xpath("//*[text()='']");
foreach ($nodes as $node) {
unset($node->{0});
}
unset($node->{0}) is a trick that destroys the node and removes it from its parent node.
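A self-contained variant of the same idea is sketched below. It makes two assumptions that are not in the answer above: it selects elements with the predicate not(node()), meaning elements with no child nodes at all (which is how <p2></p2> is parsed), and it uses the array-style unset($node[0]) form of the self-removal trick. The variable names are illustrative.

```php
<?php
// Sample document from the question; <p2> has no child nodes.
$xml = '<xml><p1>1</p1><p2></p2><p3>3</p3></xml>';
$root = simplexml_load_string($xml);

// Assumption: "empty" means no child nodes at all (no text, no elements).
foreach ($root->xpath('//*[not(node())]') as $empty) {
    // Unsetting index 0 of the element removes that element from its parent.
    unset($empty[0]);
}

// The empty node is gone, so it no longer shows up in the JSON output.
$json = json_encode($root);
echo $json, "\n";
```

If empty nodes should come through as empty strings rather than be dropped, the same loop could collect their names instead of unsetting them.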

Related

PHP: Keeping HTML inside XML node without CDATA

I've got an xml like this:
<father>
<son>Text with <b>HTML</b>.</son>
</father>
I'm using simplexml_load_string to parse it into SimpleXmlElement. Then I get my node like this
$xml->father->son->__toString(); //output: "Text with .", but expected "Text with <b>HTML</b>."
I need to handle simple HTML such as:
<b>text</b> or <br/> inside the xml which is sent by many users.
My problem is that I can't just ask them to use CDATA, because they won't be able to handle it properly and they are already used to doing without it.
Also, if possible, I don't want the file to be edited, because the information needs to be exactly what the user sent.
The function simplexml_load_string simply erases the HTML node and anything inside it.
How can I keep the information ?
SOLUTION
To handle the problem I used asXml, as explained by @ThW:
$tmp = $xml->father->son->asXml(); //<son>Text with <b>HTML</b>.</son>
I just added a preg_match to erase the node.
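The step of erasing the wrapper node could look roughly like the sketch below. It is not the asker's actual code: it uses preg_replace rather than preg_match, assumes the wrapper is a plain <son> element, and reads the element as $xml->son because <father> is the document root in this sample.

```php
<?php
$xml = simplexml_load_string('<father><son>Text with <b>HTML</b>.</son></father>');

// asXML() serializes the element including its own <son> wrapper.
$outer = trim($xml->son->asXML());

// Assumption: strip only the outermost <son ...> opening tag and </son> closing tag.
$inner = preg_replace('~^<son\b[^>]*>|</son>$~', '', $outer);

echo $inner, "\n";
```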
A CDATA section is a character node, just like a text node. But it does less encoding/decoding. This is mostly a downside, actually. On the upside something in a CDATA section might be more readable for a human and it allows for some BC in special cases. (Think HTML script tags.)
For an XML API they are nearly the same. Here is a small DOM example (SimpleXML abstracts too much).
$document = new DOMDocument();
$father = $document->appendChild(
$document->createElement('father')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createTextNode('With <b>HTML</b><br>It\'s so nice.')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createCDataSection('With <b>HTML</b><br>It\'s so nice.')
);
$document->formatOutput = TRUE;
echo $document->saveXml();
Output:
<?xml version="1.0"?>
<father>
<son>With &lt;b&gt;HTML&lt;/b&gt;&lt;br&gt;It's so nice.</son>
<son><![CDATA[With <b>HTML</b><br>It's so nice.]]></son>
</father>
As you can see they are serialized very differently - but from the API view they are basically exchangeable. If you're using an XML parser the value you get back should be the same in both cases.
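To illustrate that claim, here is a minimal sketch that parses both serializations and compares the values read back. The sample string and variable names are made up for the illustration.

```php
<?php
// First <son> holds an escaped text node, second <son> holds a CDATA section.
$xml = '<father>'
     . '<son>With &lt;b&gt;HTML&lt;/b&gt;</son>'
     . '<son><![CDATA[With <b>HTML</b>]]></son>'
     . '</father>';

$document = new DOMDocument();
$document->loadXML($xml);

$sons = $document->getElementsByTagName('son');
$fromText  = $sons->item(0)->textContent;
$fromCdata = $sons->item(1)->textContent;

// Both reads yield the same decoded string.
var_dump($fromText === $fromCdata);
```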
So the first possibility is just letting the HTML fragment be stored in a character node. It is just a string value for the outer XML document itself.
The other way would be using XHTML. XHTML is XML-compatible HTML. You can mix and match different XML formats, so you could add the XHTML fragment as part of the outer XML.
That seems to be what you're receiving. But SimpleXML has some problems with mixed nodes. So here is an example of how you can read it in DOM.
$xml = <<<'XML'
<father>
<son>With <b>HTML</b><br/>It's so nice.</son>
</father>
XML;
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$result = '';
foreach ($xpath->evaluate('/father/son[1]/node()') as $child) {
$result .= $document->saveXml($child);
}
echo $result;
Output:
With <b>HTML</b><br/>It's so nice.
Basically you need to save each child of the son element as XML.
SimpleXML is based on the same DOM library internally. That allows you to convert a SimpleXMLElement into a DOM node. From there you can again save each child as XML.
$father = new SimpleXMLElement($xml);
$sonNode = dom_import_simplexml($father->son);
$document = $sonNode->ownerDocument;
$result = '';
foreach ($sonNode->childNodes as $child) {
$result .= $document->saveXml($child);
}
echo $result;

DOM in PHP: Decoded entities and setting nodeValue

I want to perform certain manipulations on a XML document with PHP using the DOM part of its standard library. As others have already discovered, one has to deal with decoded entities then. To illustrate what bothers me, I give a quick example.
Suppose we have the following code
$doc = new DOMDocument();
$doc->loadXML(<XML data>);
$xpath = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);
foreach($node_list as $node) {
//do something
}
If the code in the loop is something like
$attr = "<some string>";
$val = $node->getAttribute($attr);
//do something with $val
$node->setAttribute($attr, $val);
it works fine. But if it's more like
$text = $node->textContent;
//do something with $text
$node->nodeValue = $text;
and $text contains some decoded &, it doesn't get encoded, even if one does nothing with $text at all.
At the moment, I apply htmlspecialchars on $text before I set $node->nodeValue to it. Now I want to know
if that is sufficient,
if not, what would suffice,
and if there are more elegant solutions for this, as in the case of attribute manipulation.
The XML documents I have to deal with are mostly feeds, so a solution should be pretty general.
EDIT
It turned out that my original question had the wrong scope, sorry for that. Here I provide an example where the described behaviour actually happens.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
$doc->loadXML($output);
$xpath = new DOMXPath($doc);
$node_list = $xpath->query('//item/link');
foreach($node_list as $node) {
$node->nodeValue = $node->textContent;
}
echo $doc->saveXML();
If I execute this code on the CLI with
php beeb.php |egrep 'link|Warning'
I get results like
<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss</link>
which should be
<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</link>
(and is, if the loop is omitted), along with corresponding warnings
Warning: main(): unterminated entity reference ns_source=PublicRSS20-sa in /private/tmp/beeb.php on line 15
When I apply htmlspecialchars to $node->textContent, it works fine, but I feel very uncomfortable doing that.
Your question is basically whether DOMText::nodeValue expects an XML-encoded string or a verbatim string.
So let's just try that out: set it to & and then to &amp; and see what happens:
$doc = new DOMDocument();
$doc->loadXML('<root>*</root>');
$text = $doc->documentElement->childNodes->item(0);
echo "Before Edit: ", $doc->saveXML($text), "\n";
$text->nodeValue = "&";
echo "After Edit 1: ", $doc->saveXML($text), "\n";
$text->nodeValue = "&amp;";
echo "After Edit 2: ", $doc->saveXML($text), "\n";
The output is as follows (PHP 5.0.0 - 5.5.0):
Before Edit: *
After Edit 1: &amp;
After Edit 2: &amp;amp;
This shows that setting the nodeValue of a DOMText-node expects a UTF-8 encoded string and the DOM library encodes the XML reserved characters automatically.
So you should not apply htmlspecialchars() onto any text you add this way. That would create a double-encoding.
Since you write that you experience the opposite, I suggest you run an isolated PHP example on the command line or within your IDE so that you can see the exact output. Otherwise your browser may render it as HTML, and you might conclude that the reserved XML characters have not been encoded.
As you have pointed out, you're not editing a DOMText but a DOMElement node. That works a bit differently: here the & character needs to be passed as the entity &amp; instead of verbatim, but only this character.
So this needs a little bit more work:
Read out the text-content and turn it into a DOMText node. Everything will be perfectly encoded.
Remove the node-value of the element node so it's empty.
Append the DOMText node from the first step as a child.
And done. Here is your inner foreach, modified to show this:
foreach($node_list as $node) {
$text = $doc->createTextNode($node->textContent);
$node->nodeValue = "";
$node->appendChild($text);
}
Admittedly, in your concrete example I don't see why you do this at all: reassigning the same text does not change the value, so the loop isn't needed.
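As a self-contained check of the three steps, the sketch below applies them to a small inline document instead of the live feed. The sample URL and element names are placeholders.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<item><link>http://example.com/?a=1&amp;b=2</link></item>');
$link = $doc->getElementsByTagName('link')->item(0);

// Step 1: wrap the decoded text in a DOMText node (encoding is handled on save).
$text = $doc->createTextNode($link->textContent);
// Step 2: empty the element.
$link->nodeValue = '';
// Step 3: append the text node as the only child.
$link->appendChild($text);

// The ampersand survives the round trip and is serialized as an entity again.
$serialized = $doc->saveXML($link);
echo $serialized, "\n";
```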
Tip: In PHP DOMDocument can open this feed directly, you don't need curl here:
$doc = new DOMDocument();
$doc->load("http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
As hakre explained, the problem is that in PHP's DOM library, the behaviour of setting nodeValue w.r.t. entities depends on the class of the node, in particular DOMText and DOMElement differ in this regard.
To illustrate this, an example:
$doc = new DOMDocument();
$doc->formatOutput = True;
$doc->loadXML('<root/>');
$s = 'text &amp;&lt;<"\'&text;&text';
$root = $doc->documentElement;
$node = $doc->createElement('tag1', $s); #line 10
$root->appendChild($node);
$node = $doc->createElement('tag2');
$text = $doc->createTextNode($s);
$node->appendChild($text);
$root->appendChild($node);
$node = $doc->createElement('tag3');
$text = $doc->createCDATASection($s);
$node->appendChild($text);
$root->appendChild($node);
echo $doc->saveXML();
outputs
Warning: DOMDocument::createElement(): unterminated entity reference text in /tmp/DOMtest.php on line 10
<?xml version="1.0"?>
<root>
<tag1>text &amp;&lt;&lt;"'&text;</tag1>
<tag2>text &amp;amp;&amp;lt;&lt;"'&amp;text;&amp;text</tag2>
<tag3><![CDATA[text &amp;&lt;<"'&text;&text]]></tag3>
</root>
In this particular case, it is appropriate to alter the nodeValue of DOMText nodes. Combining hakre's two answers one gets a quite elegant solution.
$doc = new DOMDocument();
$doc->loadXML(<XML data>);
$xpath = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);
$visitTextNode = function (DOMText $node) {
$text = $node->textContent;
/*
do something with $text
*/
$node->nodeValue = $text;
};
foreach ($node_list as $node) {
if ($node->nodeType == XML_TEXT_NODE) {
$visitTextNode($node);
} else {
foreach ($node->childNodes as $child) {
if ($child->nodeType == XML_TEXT_NODE) {
$visitTextNode($child);
}
}
}
}

PHP Xpath: Get all href's that contain "letter"

Say I have an html file that I have loaded, I run this query:
$url = 'http://www.fangraphs.com/players.aspx';
$html = file_get_contents($url);
$myDom = new DOMDocument;
$myDom->formatOutput = true;
@$myDom->loadHTML($html);
$anchor = $xpath->query('//a[contains(@href,"letter")]');
That gives me a list of these anchors that look like the following:
<a href="players.aspx?letter=Aa">Aa</a>
But I need a way to only get "players.aspx?letter=Aa".
I thought I could try:
$anchor = $xpath->query('//a[contains(@href,"letter")]/@href');
But that gives me a php error saying I couldn't append node when I try the following:
$xpath = new DOMXPath($myDom);
$newDom = new DOMDocument;
$j = 0;
while( $myAnchor = $anchor->item($j++) ){
$node = $newDom->importNode( $myAnchor, true ); // import node
$newDom->appendChild($node);
}
Any idea how to obtain just the value of the href tags that the first query selects?? Thanks!
Use:
//a/@href[contains(., 'letter')]
this selects any href attribute of any a whose string value (of the attribute) contains the string "letter" .
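A minimal sketch of using that expression to collect the attribute values as plain strings follows. The inline HTML stands in for the downloaded page and is an assumption, as are the variable names.

```php
<?php
// Stand-in for the fetched page; two links contain "letter", one does not.
$html = '<html><body>'
      . '<a href="players.aspx?letter=Aa">Aa</a>'
      . '<a href="players.aspx?letter=Ab">Ab</a>'
      . '<a href="teams.aspx">Teams</a>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$hrefs = [];
// Each result is a DOMAttr; its value property holds the href string.
foreach ($xpath->query('//a/@href[contains(., "letter")]') as $attr) {
    $hrefs[] = $attr->value;
}

print_r($hrefs);
```

Reading the value directly avoids importing attribute nodes into a second document.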
Your XPath query is returning attributes themselves (i.e., DOMAttr objects) rather than elements (i.e., DOMElement objects). That's fine, and that seems to be what you want, but appending them to the document is the problem. A DOMAttr is not a standalone node in the document tree; it's associated with a DOMElement but is not a child in the usual sense. Thus, directly appending a DOMAttr to the document is invalid.
From the W3C specs:
Attr objects inherit the Node interface, but since they are not actually child nodes of the element they describe, the DOM does not consider them part of the document tree. . . . The DOM takes the view that attributes are properties of elements rather than having a separate identity from the elements they are associated with
Either associate the DOMAttr with a DOMElement and append that element, or pull out the DOMAttr's value and use that as you wish.
To just append its plain text value, use its value in a DOMText node and append that. For example, change this line:
$newDom->appendChild($node);
to this:
$newDom->appendChild(new DOMText($node->value));
try this..
$xml_string = 'your xml string';
$xml = simplexml_load_string($xml_string);
foreach($xml->a[0]->attributes() as $href => $value) {
$myAnchorsValues[] = $value;
}
var_dump($myAnchorsValues);

How to get a specific node text using php DOM

I am trying to get the value (text) of a specific node from an xml document using php DOM classes but I cannot do it right because I get the text content of that node merged with its descendants.
Let's suppose that I need to get the trees from this document:
<?xml version="1.0"?>
<trees>
LarchRedwoodChestnutBirch
<trimmed>Larch</trimmed>
<trimmed>Redwood</trimmed>
</trees>
And I get:
LarchRedwoodChestnutBirchLarchRedwood
You can see that I cannot remove the substring LarchRedwood made by the trimmed trees from the whole text because I would get only ChestnutBirch and it is not what I need.
Any suggestions? Thanks.
I got it. This works:
function specificNodeValue($node, $implode = true) {
$value = array();
if ($node->childNodes) {
for ($i = 0; $i < $node->childNodes->length; $i++) {
if (!(@$node->childNodes->item($i)->tagName)) {
$value[] = $node->childNodes->item($i)->nodeValue;
}
}
}
return (is_string($implode) ? implode($implode, $value) : ($implode === true ? implode($value) : $value));
}
Treat the given node as a root: any child node without a tagName is not an element, so it is part of the node's own text and its value belongs to the node itself.
In a badly formed XML document a node can have several separate pieces of text, so the function collects them all into an array to get the node's whole value.
Use the function above to get a node's value without the values of its subnodes merged in.
Parameters are:
$node (required) must be a DOMElement object
$implode (optional) if you want to get a string (true by default) or an array (false) made up by many pieces of value. (Set a string instead of a boolean value if you wish to implode the array using a "glue" string).
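The same result can also be sketched with XPath, selecting only the text nodes that are direct children of the element. This is an alternative to the function above, not a rewrite of it, and the variable names are illustrative.

```php
<?php
$xml = '<trees>LarchRedwoodChestnutBirch'
     . '<trimmed>Larch</trimmed><trimmed>Redwood</trimmed>'
     . '</trees>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

$ownText = '';
// text() relative to the context node matches only its direct text children.
foreach ($xpath->query('text()', $doc->documentElement) as $textNode) {
    $ownText .= $textNode->nodeValue;
}

echo trim($ownText), "\n";
```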
You can try this to remove the trimmed node
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadXML($xml);
$xpath = new DOMXpath($doc);
$trees = $doc->getElementsByTagName('trees')->item(0);
foreach ($xpath->query('/trees/*') as $node)
{
$trees->removeChild($node);
}
echo $trees->textContent;
echo $trees->nodeValue;
Note that for an element node, PHP's $node->nodeValue returns the same as $node->textContent: all text from the current node and all of its child nodes. To get only the node's own text, read its child text nodes individually.
Ideally, the XML should be:
<?xml version="1.0"?>
<trees>
<tree>Larch</tree>
<tree>Redwood</tree>
<tree>Chestnut</tree>
<tree>Birch</tree>
</trees>
To split "LarchRedwoodChestnutBirch" into separate words (by capital letter), you'll need to use PHP's "PCRE" functions:
http://www.php.net/manual/en/book.pcre.php
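As a rough sketch of that split, assuming every tree name starts with exactly one ASCII capital letter, preg_split with a lookahead would do:

```php
<?php
$joined = 'LarchRedwoodChestnutBirch';

// Split at every position that is followed by a capital letter;
// PREG_SPLIT_NO_EMPTY drops the empty piece before the first name.
$trees = preg_split('/(?=[A-Z])/', $joined, -1, PREG_SPLIT_NO_EMPTY);

print_r($trees);
```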
Hope that helps!

How to get values inside <![CDATA[values]] > using php DOM?

How can I get the values inside <![CDATA[values]] > using PHP DOM?
Here is part of my XML:
<Destinations>
<Destination>
<![CDATA[Aghia Paraskevi, Skiatos, Greece]]>
<CountryCode>GR</CountryCode>
</Destination>
<Destination>
<![CDATA[Amettla, Spain]]>
<CountryCode>ES</CountryCode>
</Destination>
<Destination>
<![CDATA[Amoliani, Greece]]>
<CountryCode>GR</CountryCode>
</Destination>
<Destination>
<![CDATA[Boblingen, Germany]]>
<CountryCode>DE</CountryCode>
</Destination>
</Destinations>
Working with PHP DOM is fairly straightforward, and is very similar to Javascript's DOM.
Here are the important classes:
DOMNode — The base class for anything that can be traversed inside an XML/HTML document, including text nodes, comment nodes, and CDATA nodes
DOMElement — The base class for tags.
DOMDocument — The base class for documents. Contains the methods to load/save XML, as well as normal DOM document methods (see below).
There are a few staple methods and properties:
DOMDocument->load() — After creating a new DOMDocument, use this method on that object to load from a file.
DOMDocument->getElementsByTagName() — this method returns a node list of all elements in the document with the given tag name. Then you can iterate (foreach) on this list.
DOMNode->childNodes — A node list of all children of a node. (Remember, a CDATA section is a node!)
DOMNode->nodeType — Get the type of a node. CDATA nodes have type XML_CDATA_SECTION_NODE, which is a constant with the value 4.
DOMNode->textContent — get the text content of any node.
Note: Your CDATA sections are malformed. I don't know why there is an extra ]] in the first one, or an unclosed CDATA section at the end of the line, but I think it should simply be:
<![CDATA[Aghia Paraskevi, Skiatos, Greece]]>
Putting this all together we:
Create a new document object and load the XML
Get all Destination elements by tag name and iterate over the list
Iterate over all child nodes of each Destination element
Check if the node type is XML_CDATA_SECTION_NODE
If it is, echo the textContent of that node.
Code:
$doc = new DOMDocument();
$doc->load('test.xml');
$destinations = $doc->getElementsByTagName("Destination");
foreach ($destinations as $destination) {
foreach($destination->childNodes as $child) {
if ($child->nodeType == XML_CDATA_SECTION_NODE) {
echo $child->textContent . "<br/>";
}
}
}
Result:
Aghia Paraskevi, Skiatos, Greece
Amettla, Spain
Amoliani, Greece
Boblingen, Germany
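A self-contained variant that loads the XML from a string and pairs each destination name with its country code might look like the sketch below. The $byCountry array and the shortened sample data are assumptions made for the illustration.

```php
<?php
$xml = <<<'XML'
<Destinations>
  <Destination><![CDATA[Amettla, Spain]]><CountryCode>ES</CountryCode></Destination>
  <Destination><![CDATA[Amoliani, Greece]]><CountryCode>GR</CountryCode></Destination>
</Destinations>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$byCountry = [];
foreach ($doc->getElementsByTagName('Destination') as $destination) {
    $name = '';
    foreach ($destination->childNodes as $child) {
        // Only CDATA sections contribute to the name; whitespace and elements are skipped.
        if ($child->nodeType === XML_CDATA_SECTION_NODE) {
            $name .= $child->textContent;
        }
    }
    $code = $destination->getElementsByTagName('CountryCode')->item(0)->textContent;
    $byCountry[$code] = $name;
}

print_r($byCountry);
```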
Use this:
$parseFile = simplexml_load_file($myXML, 'SimpleXMLElement', LIBXML_NOCDATA);
and next:
foreach ($parseFile->yourNode as $node) {
// etc...
}
Best and easy way
$xml = simplexml_load_string($xmlData, 'SimpleXMLElement', LIBXML_NOCDATA);
$xmlJson = json_encode($xml);
$xmlArr = json_decode($xmlJson, 1); // Returns associative array
Strip the CDATA markers before parsing with PHP DOM; after that you can get the inner XML or inner HTML:
$xml = str_replace(array('<![CDATA[', ']]>'), '', $xml);
I use the following code.
It not only reads all XML data inside
<![CDATA[values]]>
sections but also converts the XML object to a PHP associative array, so we can loop over the data.
$xml_file_data = json_decode(json_encode(simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA)), true);
Hope this will work for you.
function inBetweenOf(string $here, string $there, string $content) : string {
$left_over = strlen(substr($content, strpos($content, $there)));
return substr($content, strpos($content, $here) + strlen($here), -$left_over);
}
Iterate over "Destination" tags and then call inBetweenOf on each iteration.
$doc = inBetweenOf('<![CDATA[', ']]>', $xml);
