How do I iterate through DOM elements in PHP? - php

I have an XML file loaded into a DOM document,
I wish to iterate through all 'foo' tags, getting values from every tag below it. I know I can get values via
$element = $dom->getElementsByTagName('foo')->item(0);
foreach($element->childNodes as $node){
$data[$node->nodeName] = $node->nodeValue;
}
However, what I'm trying to do, is from an XML like,
<stuff>
<foo>
<bar></bar>
<value/>
<pub></pub>
</foo>
<foo>
<bar></bar>
<pub></pub>
</foo>
<foo>
<bar></bar>
<pub></pub>
</foo>
</stuff>
iterate over every foo tag, and get specific bar or pub, and get values from there.
Now, how do I iterate over foo so that I can still access specific child nodes by name?

Not tested, but what about:
$elements = $dom->getElementsByTagName('foo');
$data = array();
foreach($elements as $node){
foreach($node->childNodes as $child) {
$data[] = array($child->nodeName => $child->nodeValue);
}
}

It's generally much better to use XPath to query a document than it is to write code that depends on knowledge of the document's structure. There are two reasons. First, there's a lot less code to test and debug. Second, if the document's structure changes it's a lot easier to change an XPath query than it is to change a bunch of code.
Of course, you have to learn XPath, but (most of) XPath isn't rocket science.
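In PHP 5 and later, the DOM extension runs XPath queries through the DOMXPath class (its query() and evaluate() methods); xpath_eval belonged to the older PHP 4 DOM XML extension. DOMXPath is documented in the PHP manual, and the user notes there include some pretty good examples.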
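As a rough sketch of the XPath approach applied to the question, the following visits every foo element and then reads bar and pub by name relative to it. It assumes PHP 5 or later with the DOMXPath class; the sample values (a1, p1 and so on) are invented for illustration.

```php
<?php
// Sample document modelled on the question; the text values are invented.
$xml = <<<'XML'
<stuff>
  <foo><bar>a1</bar><value/><pub>p1</pub></foo>
  <foo><bar>a2</bar><pub>p2</pub></foo>
</stuff>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$data = [];
// Visit every <foo>, then query relative to it for the named children.
foreach ($xpath->query('//foo') as $foo) {
    $data[] = [
        // string(...) yields an empty string when the child is missing.
        'bar' => $xpath->evaluate('string(bar)', $foo),
        'pub' => $xpath->evaluate('string(pub)', $foo),
    ];
}

print_r($data);
```

Passing the foo node as the second argument makes each inner expression relative to that node, so the same child names can be looked up on every iteration.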

Here's another (lazy) way to do it.
$data[][$node->nodeName] = $node->nodeValue;

With FluidXML you can query and iterate XML very easily.
$data = [];
$store_child = function($i, $fooChild) use (&$data) {
$data[] = [ $fooChild->nodeName => $fooChild->nodeValue ];
};
fluidxml($dom)->query('//foo/*')->each($store_child);
https://github.com/servo-php/fluidxml

Related

PHP: Keeping HTML inside XML node without CDATA

I've got an xml like this:
<father>
<son>Text with <b>HTML</b>.</son>
</father>
I'm using simplexml_load_string to parse it into SimpleXmlElement. Then I get my node like this
$xml->father->son->__toString(); //output: "Text with .", but expected "Text with <b>HTML</b>."
I need to handle simple HTML such as:
<b>text</b> or <br/> inside the xml which is sent by many users.
My problem is that I can't just ask them to use CDATA, because they won't be able to handle it properly and they are already used to doing without it.
Also, if possible, I don't want the file to be edited, because the information needs to be exactly what the user sent.
The function simplexml_load_string simply erases anything inside an HTML node, along with the HTML node itself.
How can I keep the information ?
SOLUTION
To handle the problem I used asXml, as explained by @ThW:
$tmp = $xml->father->son->asXml(); //<son>Text with <b>HTML</b>.</son>
I just added a preg_match to erase the node.
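One possible way to do that last step is sketched below. The regular expression is an assumption (the post does not show the one actually used), and the sample treats father as the root element, so the son node is reached as $xml->son.

```php
<?php
// Hypothetical input shaped like the question's example.
$xml = simplexml_load_string('<father><son>Text with <b>HTML</b>.</son></father>');

// asXML() serialises the element including its own <son> tag.
$outer = $xml->son->asXML();

// Strip the opening and closing <son> tags (this pattern is an assumption).
$inner = preg_replace('~^<son[^>]*>|</son>$~', '', $outer);

echo $inner;
```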
A CDATA section is a character node, just like a text node. But it does less encoding/decoding. This is mostly a downside, actually. On the upside something in a CDATA section might be more readable for a human and it allows for some BC in special cases. (Think HTML script tags.)
For an XML API they are nearly the same. Here is a small DOM example (SimpleXML abstracts too much).
$document = new DOMDocument();
$father = $document->appendChild(
$document->createElement('father')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createTextNode('With <b>HTML</b><br>It\'s so nice.')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createCDataSection('With <b>HTML</b><br>It\'s so nice.')
);
$document->formatOutput = TRUE;
echo $document->saveXml();
Output:
<?xml version="1.0"?>
<father>
<son>With &lt;b&gt;HTML&lt;/b&gt;&lt;br&gt;It's so nice.</son>
<son><![CDATA[With <b>HTML</b><br>It's so nice.]]></son>
</father>
As you can see they are serialized very differently - but from the API view they are basically exchangeable. If you're using an XML parser the value you get back should be the same in both cases.
So the first possibility is just letting the HTML fragment be stored in a character node. It is just a string value for the outer XML document itself.
The other way would be using XHTML. XHTML is XML-compatible HTML. You can mix and match different XML formats, so you could add the XHTML fragment as part of the outer XML.
That seems to be what you're receiving. But SimpleXML has some problems with mixed nodes. So here is an example how you can read it in DOM.
$xml = <<<'XML'
<father>
<son>With <b>HTML</b><br/>It's so nice.</son>
</father>
XML;
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$result = '';
foreach ($xpath->evaluate('/father/son[1]/node()') as $child) {
$result .= $document->saveXml($child);
}
echo $result;
Output:
With <b>HTML</b><br/>It's so nice.
Basically you need to save each child of the son element as XML.
SimpleXML is based on the same DOM library internally. That allows you to convert a SimpleXMLElement into a DOM node. From there you can again save each child as XML.
$father = new SimpleXMLElement($xml);
$sonNode = dom_import_simplexml($father->son);
$document = $sonNode->ownerDocument;
$result = '';
foreach ($sonNode->childNodes as $child) {
$result .= $document->saveXml($child);
}
echo $result;

How to get values inside <![CDATA[values]] > using php DOM?

How can I get the values inside <![CDATA[values]] > using PHP DOM?
Here is part of my XML.
<Destinations>
<Destination>
<![CDATA[Aghia Paraskevi, Skiatos, Greece]]>
<CountryCode>GR</CountryCode>
</Destination>
<Destination>
<![CDATA[Amettla, Spain]]>
<CountryCode>ES</CountryCode>
</Destination>
<Destination>
<![CDATA[Amoliani, Greece]]>
<CountryCode>GR</CountryCode>
</Destination>
<Destination>
<![CDATA[Boblingen, Germany]]>
<CountryCode>DE</CountryCode>
</Destination>
</Destinations>
Working with PHP DOM is fairly straightforward, and is very similar to Javascript's DOM.
Here are the important classes:
DOMNode — The base class for anything that can be traversed inside an XML/HTML document, including text nodes, comment nodes, and CDATA nodes
DOMElement — The base class for tags.
DOMDocument — The base class for documents. Contains the methods to load/save XML, as well as normal DOM document methods (see below).
There are a few staple methods and properties:
DOMDocument->load() — After creating a new DOMDocument, use this method on that object to load from a file.
DOMDocument->getElementsByTagName() — this method returns a node list of all elements in the document with the given tag name. Then you can iterate (foreach) on this list.
DOMNode->childNodes — A node list of all children of a node. (Remember, a CDATA section is a node!)
DOMNode->nodeType — Get the type of a node. CDATA nodes have type XML_CDATA_SECTION_NODE, which is a constant with the value 4.
DOMNode->textContent — get the text content of any node.
Note: Your CDATA sections are malformed. I don't know why there is an extra ]] in the first one, or an unclosed CDATA section at the end of the line, but I think it should simply be:
<![CDATA[Aghia Paraskevi, Skiatos, Greece]]>
Putting this all together we:
Create a new document object and load the XML
Get all Destination elements by tag name and iterate over the list
Iterate over all child nodes of each Destination element
Check if the node type is XML_CDATA_SECTION_NODE
If it is, echo the textContent of that node.
Code:
$doc = new DOMDocument();
$doc->load('test.xml');
$destinations = $doc->getElementsByTagName("Destination");
foreach ($destinations as $destination) {
foreach($destination->childNodes as $child) {
if ($child->nodeType == XML_CDATA_SECTION_NODE) {
echo $child->textContent . "<br/>";
}
}
}
Result:
Aghia Paraskevi, Skiatos, Greece
Amettla, Spain
Amoliani, Greece
Boblingen, Germany
Use this:
$parseFile = simplexml_load_file($myXML, 'SimpleXMLElement', LIBXML_NOCDATA);
and next :
foreach ($parseFile->yourNode as $node ){
etc...
}
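A minimal sketch of the LIBXML_NOCDATA approach on two entries from the question's sample follows. It uses simplexml_load_string on an inline string rather than a file so that it is self-contained; the variable names are illustrative.

```php
<?php
// Two entries copied from the question's sample.
$xml = <<<'XML'
<Destinations>
<Destination><![CDATA[Amettla, Spain]]><CountryCode>ES</CountryCode></Destination>
<Destination><![CDATA[Amoliani, Greece]]><CountryCode>GR</CountryCode></Destination>
</Destinations>
XML;

// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
$parsed = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);

$names = [];
foreach ($parsed->Destination as $destination) {
    // The string cast returns the element's own text, not its child elements.
    $names[(string) $destination->CountryCode] = trim((string) $destination);
}

print_r($names);
```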
The best and easiest way:
$xml = simplexml_load_string($xmlData, 'SimpleXMLElement', LIBXML_NOCDATA);
$xmlJson = json_encode($xml);
$xmlArr = json_decode($xmlJson, 1); // Returns associative array
Strip the CDATA markers before parsing with PHP DOM; after that you can get the inner XML or inner HTML. Note that this only works if the CDATA content is itself well-formed XML:
$xml = str_replace(array('<![CDATA[', ']]>'), '', $xml);
I use the following code.
Not only does it read all the XML data inside
<![CDATA[values]] >
but it also converts the XML object to a PHP associative array, so we can loop over the data.
$xml_file_data = json_decode(json_encode(simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA)), true);
Hope this will work for you.
function inBetweenOf(string $here, string $there, string $content) : string {
$left_over = strlen(substr($content, strpos($content, $there)));
return substr($content, strpos($content, $here) + strlen($here), -$left_over);
}
Iterate over "Destination" tags and then call inBetweenOf on each iteration.
$doc = inBetweenOf('<![CDATA[', ']]>', $xml);

Simple xpath question that drives me crazy

Below is the structure of a feed. I managed to print its content using this XPath:
$xml->xpath('/rss/channel//item')
the structure
<rss><channel><item><pubDate></pubDate><title></title><description></description><link></link><author></author></item></channel></rss>
However some of my files follow this structure
<feed xmlns="http://www.w3.org/2005/Atom" .....><entry><published></published><title></title><description></description><link></link><author></author></entry></feed>
and I guessed that this should be the xpath to get the content of entry
$xml->xpath('/feed//entry')
That guess turned out to be wrong.
My question is: what is the right XPath to use? Am I missing something else?
This is the code
<?php
$feeds = array('http://feeds.feedburner.com/blogspot/wSuKU');
$entries = array();
foreach ($feeds as $feed) {
$xml = simplexml_load_file($feed);
$entries = array_merge($entries, $xml->xpath('/feed//entry'));
}
echo "<pre>"; print_r($entries); echo"</pre>";
?>
try this:
$xml->registerXPathNamespace('f', 'http://www.w3.org/2005/Atom');
$xml->xpath('/f:feed/f:entry');
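To illustrate why the prefix is needed, here is a small sketch on an inline Atom document (the entry titles are invented). The unprefixed path matches nothing because the elements live in the Atom namespace, while the prefixed path finds both entries.

```php
<?php
// Minimal Atom document; the entry titles are invented for illustration.
$atom = <<<'XML'
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First</title></entry>
  <entry><title>Second</title></entry>
</feed>
XML;

$xml = simplexml_load_string($atom);
// Bind an arbitrary prefix ("f") to the Atom namespace URI.
$xml->registerXPathNamespace('f', 'http://www.w3.org/2005/Atom');

$unprefixed = $xml->xpath('/feed//entry');    // names without a namespace
$prefixed   = $xml->xpath('/f:feed/f:entry'); // names in the Atom namespace

echo count($unprefixed), ' vs ', count($prefixed), "\n";
```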
If you want a single XPath expression that will work when applied to either an RSS or an ATOM feed, you could use either of the following XPath expressions:
This one is the most precise, but also the most verbose:
(/rss/channel/item
| /*[local-name()='feed' and namespace-uri()='http://www.w3.org/2005/Atom']
/*[local-name()='entry' and namespace-uri()='http://www.w3.org/2005/Atom'])
This one ignores the namespace of the ATOM elements and just matches on their local-name():
(/rss/channel/item | /*[local-name()='feed']/*[local-name()='entry'])
This one is the most simple, but the least precise and the least efficient:
/*//*[local-name()='item' or local-name()='entry']
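As a hedged sketch, the second expression can be run unchanged against both feed types with DOMXPath. The two one-item feeds below are hypothetical stand-ins for real feeds.

```php
<?php
// Hypothetical one-item samples of each feed type.
$feeds = [
    '<rss><channel><item><title>RSS item</title></item></channel></rss>',
    '<feed xmlns="http://www.w3.org/2005/Atom"><entry><title>Atom entry</title></entry></feed>',
];

$query = "(/rss/channel/item | /*[local-name()='feed']/*[local-name()='entry'])";

$titles = [];
foreach ($feeds as $feed) {
    $doc = new DOMDocument();
    $doc->loadXML($feed);
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query($query) as $node) {
        // local-name() sidesteps the Atom namespace for the child lookup too.
        $titles[] = $xpath->evaluate("string(*[local-name()='title'])", $node);
    }
}

print_r($titles);
```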

Get child elements in xml with PHP

I have an XML file that I need to parse to get values. Below is a snippet of the XML:
<?xml version="1.0"?>
<mobile>
<userInfo>
</userInfo>
<CATALOG>
<s0>
<SUB0>
<DESCR>Paranormal Studies</DESCR>
<SUBJECT>147</SUBJECT>
</SUB0>
</s0>
<sA>
<SUB0>
<DESCR>Accounting</DESCR>
<SUBJECT>ACCT</SUBJECT>
</SUB0>
<SUB1>
<DESCR>Accounting</DESCR>
<SUBJECT>ACCTG</SUBJECT>
</SUB1>
<SUB2>
<DESCR>Anatomy</DESCR>
<SUBJECT>ANATOMY</SUBJECT>
</SUB2>
<SUB3>
<DESCR>Anthropology</DESCR>
<SUBJECT>ANTHRO</SUBJECT>
</SUB3>
<SUB4>
<DESCR>Art</DESCR>
<SUBJECT>ART</SUBJECT>
</SUB4>
<SUB5>
<DESCR>Art History</DESCR>
<SUBJECT>ARTHIST</SUBJECT>
</SUB5>
</sA>
So, I need to grab all the child elements of <sA>, and there are more elements called <sB>, etc.
But I do not know how to get all of the child elements of <sA>, <sB>, etc.
How about this:
$xmlstr = LoadTheXMLFromSomewhere();
$xml = simplexml_load_string($xmlstr);
$result = $xml->xpath('//sA');
foreach ($result as $node){
//do something with node
}
PHP does have a nice class for accessing XML, called SimpleXML for a reason. Consider using it heavily if your code only needs to access part of the XML (that is, to query it). Also consider running those queries with XPath, which is the best way to do it.
Notice that I did the example with sA nodes only, but you can configure your code for other node types really easily.
Hope I can help!
You should look into simplexml_load_string(), as I'm pretty sure it would make your life a lot easier. It returns a SimpleXMLElement that you can use like so:
$xml = simplexml_load_string(<your huge xml string>);
// $xml is already the root element, so start from CATALOG.
foreach ($xml->CATALOG->sA->children() as $value){
// do things with sA children
}
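$xml = new DOMDocument();
$xml->load('path_to_xml');
// Root element name as shown in the question's snippet.
$root = $xml->getElementsByTagName('mobile')->item(0);
$catalog = $root->getElementsByTagName('CATALOG')->item(0);
// getElementsByTagName() returns a list, so pick an element before reading childNodes.
$nodes = $catalog->getElementsByTagName('sA')->item(0)->childNodes;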
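Since the group names (s0, sA, sB and so on) vary, one option is a wildcard XPath that does not name them at all. This is only a sketch: it uses a trimmed copy of the question's document, closed so that it parses, and the row layout of $subjects is an illustrative choice.

```php
<?php
// Trimmed copy of the question's document, closed so that it parses.
$xmlstr = <<<'XML'
<mobile>
  <CATALOG>
    <s0><SUB0><DESCR>Paranormal Studies</DESCR><SUBJECT>147</SUBJECT></SUB0></s0>
    <sA>
      <SUB0><DESCR>Accounting</DESCR><SUBJECT>ACCT</SUBJECT></SUB0>
      <SUB1><DESCR>Anatomy</DESCR><SUBJECT>ANATOMY</SUBJECT></SUB1>
    </sA>
  </CATALOG>
</mobile>
XML;

$xml = simplexml_load_string($xmlstr);

$subjects = [];
// CATALOG/*/* matches every SUBn element under every sX group.
foreach ($xml->xpath('/mobile/CATALOG/*/*') as $sub) {
    $subjects[] = [
        'group'   => $sub->xpath('..')[0]->getName(), // parent element name
        'subject' => (string) $sub->SUBJECT,
        'descr'   => (string) $sub->DESCR,
    ];
}

print_r($subjects);
```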

How can I find text nodes in an HTML snippet?

I'm trying to parse an HTML snippet, using the PHP DOM functions. I have stripped out everything apart from paragraph, span and line break tags, and now I want to retrieve all the text, along with its accompanying styles.
So, I'd like to get each piece of text, one by one, and for each one I can then go back up the tree to get the values of particular attributes (I'm only interested in some specific ones, like color etc.).
How can I do this? Or am I thinking about it the wrong way?
Suppose you have a DOMDocument here:
$doc = new DOMDocument();
$doc->loadHTMLFile('http://stackoverflow.com/');
You can find all text nodes using a simple Xpath.
$xpath = new DOMXpath($doc);
$textNodes = $xpath->query('//text()');
Just foreach over it to iterate over all textnodes:
foreach ($textNodes as $textNode) {
echo $textNode->data . "\n";
}
From that, you can go up the DOM tree by using ->parentNode.
Hope that this can give you a good start.
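A small sketch of that upward walk: for each text node, climb through parentNode until an ancestor element carries a style attribute. The HTML snippet and the choice of the style attribute are assumptions made for illustration.

```php
<?php
// Hypothetical snippet limited to p, span and br, as described in the question.
$html = '<p style="color:red">Hello <span style="color:blue">blue</span><br>world</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$found = [];
foreach ($xpath->query('//text()') as $textNode) {
    // Walk up until an ancestor element carries a style attribute.
    $style = '';
    for ($node = $textNode->parentNode; $node instanceof DOMElement; $node = $node->parentNode) {
        if ($node->hasAttribute('style')) {
            $style = $node->getAttribute('style');
            break;
        }
    }
    $found[] = [$textNode->data, $style];
}

print_r($found);
```

The loop stops at the document node because it is not a DOMElement, so text without any styled ancestor simply ends up with an empty style.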
For those who are more comfortable with CSS3 selectors, and are willing to include a single extra PHP class in their project, I would suggest using PHP Simple HTML DOM Parser. The solution would look something like the following:
$html = file_get_html('http://www.example.com/');
$ret = $html->find('p, span');
$store = array();
foreach($ret as $element) {
$store[] = array($element->tag => array('text' => $element->innertext,
'color' => $element->color,
'style' => $element->style));
}
print_r($store);
