Simple xpath question that drives me crazy - php

Below is the structure of a feed whose content I managed to print using this XPath:
$xml->xpath('/rss/channel//item')
The structure:
<rss><channel><item><pubDate></pubDate><title></title><description></description><link></link><author></author></item></channel></rss>
However, some of my files follow this structure:
<feed xmlns="http://www.w3.org/2005/Atom" .....><entry><published></published><title></title><description></description><link></link><author></author></entry></feed>
and I guessed that this would be the XPath to get the content of entry:
$xml->xpath('/feed//entry')
but that proved me wrong.
My question is: what is the right XPath to use? Am I missing something else?
This is the code:
<?php
$feeds = array('http://feeds.feedburner.com/blogspot/wSuKU');
$entries = array();
foreach ($feeds as $feed) {
    $xml = simplexml_load_file($feed);
    $entries = array_merge($entries, $xml->xpath('/feed//entry'));
}
echo "<pre>"; print_r($entries); echo "</pre>";
?>

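Try this. The Atom elements live in the http://www.w3.org/2005/Atom namespace, so the XPath needs a prefix bound to that namespace: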
$xml->registerXPathNamespace('f', 'http://www.w3.org/2005/Atom');
$xml->xpath('/f:feed/f:entry');
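A minimal sketch of that fix end to end, assuming a small inline Atom document in place of the remote feed (the sample titles are made up for illustration). It also shows why the original '/feed/entry' path finds nothing: XPath 1.0 treats unprefixed names as belonging to no namespace.

```php
<?php
// Hypothetical inline Atom feed standing in for the remote one.
$atom = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>
XML;

$xml = simplexml_load_string($atom);

// The unprefixed path matches nothing, because the elements are namespaced.
$unprefixed = $xml->xpath('/feed/entry');

// Bind the (arbitrary) prefix "f" to the Atom namespace URI, then query with it.
$xml->registerXPathNamespace('f', 'http://www.w3.org/2005/Atom');
$entries = $xml->xpath('/f:feed/f:entry');

foreach ($entries as $entry) {
    echo $entry->title, PHP_EOL;
}
```

The prefix only has to match between registerXPathNamespace() and the query; it does not need to appear in the feed itself.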

If you want a single XPath expression that will work when applied to either an RSS or an Atom feed, you could use any of the following XPath expressions:
This one is the most precise, but also the most verbose:
(/rss/channel/item
| /*[local-name()='feed' and namespace-uri()='http://www.w3.org/2005/Atom']
/*[local-name()='entry' and namespace-uri()='http://www.w3.org/2005/Atom'])
This one ignores the namespace of the Atom elements and just matches on their local-name():
(/rss/channel/item | /*[local-name()='feed']/*[local-name()='entry'])
This one is the simplest, but the least precise and the least efficient:
/*//*[local-name()='item' or local-name()='entry']
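As a rough illustration, the second expression can be run unchanged against both formats with SimpleXML. The two inline feeds and their titles below are assumptions made up for the example.

```php
<?php
// Hypothetical minimal feeds, one per format.
$rss  = '<rss><channel><item><title>RSS item</title></item></channel></rss>';
$atom = '<feed xmlns="http://www.w3.org/2005/Atom"><entry><title>Atom entry</title></entry></feed>';

// Match RSS items by path and Atom entries by local-name(), ignoring namespaces.
$query = "(/rss/channel/item | /*[local-name()='feed']/*[local-name()='entry'])";

$titles = array();
foreach (array($rss, $atom) as $source) {
    $xml = simplexml_load_string($source);
    foreach ($xml->xpath($query) as $node) {
        $titles[] = (string) $node->title;
    }
}

print_r($titles);
```

No call to registerXPathNamespace() is needed here, since local-name() sidesteps the namespace entirely.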

How to extract the text in a SimpleXmlElement object? [duplicate]

Given the php code:
$xml = <<<EOF
<articles>
<article>
This is a link
<link>Title</link>
with some text following it.
</article>
</articles>
EOF;
function traverse($xml) {
    $result = "";
    foreach ($xml->children() as $x) {
        if ($x->count()) {
            $result .= traverse($x);
        }
        else {
            $result .= $x;
        }
    }
    return $result;
}
$parser = new SimpleXMLElement($xml);
traverse($parser);
I expected the function traverse() to return:
This is a link Title with some text following it.
However, it returns only:
Title
Is there a way to get the expected result using SimpleXML (obviously for the purpose of consuming the data rather than just returning it as in this simple example)?
There might be ways to achieve what you want using only SimpleXML, but in this case, the simplest way to do it is to use DOM. The good news is if you're already using SimpleXML, you don't have to change anything as DOM and SimpleXML are basically interchangeable:
// either
$articles = simplexml_load_string($xml);
echo dom_import_simplexml($articles)->textContent;
// or
$dom = new DOMDocument;
$dom->loadXML($xml);
echo $dom->documentElement->textContent;
Assuming your task is to iterate over each <article/> and get its content, your code will look like this:
$articles = simplexml_load_string($xml);
foreach ($articles->article as $article)
{
    $articleText = dom_import_simplexml($article)->textContent;
}
$node->asXML(); // I think this is the simplest solution. Note that it returns the markup, tags included.
So, the simple answer to my question was: SimpleXML can't easily process this kind of mixed-content XML. Use DOMDocument instead.
This example shows how to traverse the entire XML. It seems that DOMDocument will work with any XML, whereas SimpleXML requires the XML to be simple.
// Serialize a list of attributes as name='value' pairs.
function attrs($list) {
    $result = "";
    foreach ($list as $attr) {
        $result .= " $attr->name='$attr->value'";
    }
    return $result;
}
// Rebuild the markup below $xml, visiting child nodes in document order.
function parseTree($xml) {
    $result = "";
    foreach ($xml->childNodes as $item) {
        if ($item->nodeType == XML_ELEMENT_NODE) {
            $result .= "<$item->nodeName" . attrs($item->attributes) . ">" . parseTree($item) . "</$item->nodeName>";
        }
        else {
            $result .= $item->nodeValue;
        }
    }
    return $result;
}
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xml);
print parseTree($xmlDoc->documentElement);
You could also load the XML using SimpleXML and then convert it to DOM using dom_import_simplexml(), as Josh said. This would be useful if you are using SimpleXML to filter nodes for parsing, e.g. using XPath.
However, I don't actually use SimpleXML, so for me that would be taking the long way around.
$simpleXml = new SimpleXMLElement($xml);
$xmlDom = dom_import_simplexml($simpleXml);
print parseTree($xmlDom);
Thank you for all the help!
You can get the text node of an element with SimpleXML just by treating it like a string:
foreach ($xml->children() as $x) {
    $result .= "$x";
}
However, this prints out:
This is a link
with some text following it.
TitleTitle
...because the text node is treated as one block and there is no way to tell where the child fits inside the text node. The child node is also added twice because of the other else {}, but you can just take that out.
Sorry if I didn't help much, but I don't think there's any way to find out where the child node fits in the text node unless the XML is consistent (but then, why not use tags?). If you know which element you want to strip the text out of, strip_tags() will work great.
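That limitation applies to SimpleXML's string cast. With DOM, by contrast, childNodes lists text nodes and elements in document order, so the position of the child can be recovered. A minimal sketch using the XML from the question (the bracketed output format is just an illustrative choice):

```php
<?php
$xml = <<<EOF
<articles>
<article>
This is a link
<link>Title</link>
with some text following it.
</article>
</articles>
EOF;

$dom = new DOMDocument;
$dom->loadXML($xml);
$article = $dom->getElementsByTagName('article')->item(0);

// Walk the children in document order, keeping text and elements apart.
$parts = array();
foreach ($article->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        $parts[] = trim($child->nodeValue);
    } elseif ($child->nodeType === XML_ELEMENT_NODE) {
        $parts[] = '[' . $child->nodeName . ': ' . $child->textContent . ']';
    }
}

echo implode(' ', $parts), PHP_EOL;
```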
This has already been answered, but casting to string (i.e. $sString = (string) $oSimpleXMLNode->TagName) always worked for me.
Try this:
$parser = new SimpleXMLElement($xml);
echo html_entity_decode(strip_tags($parser->asXML()));
That's pretty much equivalent to:
$parser = simplexml_load_string($xml);
echo dom_import_simplexml($parser)->textContent;
Like @tandu said, it's not possible, but if you can modify your XML, this will work:
$xml = <<<EOF
<articles>
<article>
This is a link
</article>
<link>Title</link>
<article>
with some text following it.
</article>
</articles>
EOF;

can't access xml node PHP

I have a PHP page in which I have to parse some XML.
For example, I have done this:
$hotelNodes = $xml_data->getElementsByTagName('Hotel');
foreach ($hotelNodes as $hotel) {
    $supplementsNodes2 = $hotel->getElementsByTagName('BoardBase');
    foreach ($supplementsNodes2 as $suppl2) {
        echo '<p>HERE</p>'; // never gets here
    }
}
In this code I access each hotel in my XML, and for each hotel I would like to find the BoardBase tags, but the inner loop is never entered.
This is my XML (with many parts cut out):
<hotel desc="DESC" name="Hotel">
<selctedsupplements>
<boardbases>
<boardbase bbpublishprice="0" bbprice="0" bbname="Colazione Continentale" bbid="1"></boardbase>
</boardbases>
</selctedsupplements>
</occupancy></occupancies>
</hotel>
Many of the nodes don't have a BoardBase, but even when one is present the loop is not entered.
Is it possible that this node isn't accessible?
The XML is received from a server through a SoapClient.
If I inspect the XML printed in Firebug, I can see the node displayed with reduced opacity.
I have also tried this:
$supplementsNodes2 = $hotel->getElementsByTagName('boardbase');
but without success.
I can see two issues from the get-go. First, XML names are case-sensitive, hence:
$hotelNodes = $xml_data->getElementsByTagName('Hotel');
can't work, because your XML node looks like:
<hotel desc="DESC" name="Hotel">
hotel => lower-case!
As you can see here:
[...] names for such things as elements, while XML is explicitly case sensitive.
The official specs specify tag names as case-sensitive, so getElementsByTagName('FOO') won't return the same elements as getElementsByTagName('foo')...
Secondly, you seem to have some tag-soup going on:
</occupancy></occupancies>
<!-- tag names don't match, both are closing tags -->
This is just plain invalid markup, it should read either:
<occupancy></occupancy>
or
<occupancies></occupancies>
That would be the first 2 ports of call.
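One way to confirm the second issue is to let libxml report it. The sketch below assumes a cut-down fragment modelled on the stray closing tags (not the full SOAP response) and collects the parser errors instead of letting them surface as warnings.

```php
<?php
// Hypothetical fragment modelled on the stray closing tags in the question.
$broken = '<hotel><boardbases></boardbases></occupancy></occupancies></hotel>';

// Collect parser errors instead of letting them surface as PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$loaded = $dom->loadXML($broken);   // false when the markup is not well-formed
$errors = libxml_get_errors();
libxml_clear_errors();

if (!$loaded) {
    foreach ($errors as $error) {
        echo 'Line ', $error->line, ': ', trim($error->message), PHP_EOL;
    }
}
```

If loadXML() fails like this on the real response, any later getElementsByTagName() call is working on an empty document, regardless of letter case.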
I've set up a quick codepad using this code, which you can see here:
$xml = '<hotel desc="DESC" name="Hotel">
<selctedsupplements>
<boardbases>
<boardbase bbpublishprice="0" bbprice="0" bbname="Colazione Continentale" bbid="1"></boardbase>
</boardbases>
</selctedsupplements>
<occupancy></occupancy>
</hotel>';
$dom = new DOMDocument;
$dom->loadXML($xml);
$badList = $dom->getElementsByTagName('Hotel');
$correctList = $dom->getElementsByTagName('hotel');
echo sprintf("%d", $badList->length),
' compared to ',
$correctList->length, PHP_EOL;
The output was "0 compared to 1", meaning that using a lower-case selector returned 1 element, the one with the upper-case H returned an empty list.
To get to the boardbase tags for each hotel tag, you just have to write this:
$hotels = $dom->getElementsByTagName('hotel');
foreach ($hotels as $hotel)
{
    $supplementsNodes2 = $hotel->getElementsByTagName('boardbase');
    foreach ($supplementsNodes2 as $node)
    {
        var_dump($node); // you _will_ get here now
    }
}
As you can see on this updated codepad.
Alessandro, your XML is a mess (= un casino); you really need to get that straight. Elias' answer pointed out some very basic things to consider.
I built on the codepad Elias set up, and it works perfectly for me:
$dom = new DOMDocument;
$dom->loadXML($xml);
$hotels = $dom->getElementsByTagName('hotel');
foreach ($hotels as $hotel) {
$bbs = $hotel->getElementsByTagName('boardbase');
foreach ($bbs as $bb) echo $bb->getAttribute('bbname');
}
see http://codepad.org/I6oxkEOC

get tagname in xml using php

Here is my XML:
<news_item>
<title>TITLE</title>
<content>CONTENT.</content>
<date>DATE</date>
</news_item>
I want to get the names of the tags inside of news_item.
Here is what I have so far:
$dom = new DOMDocument();
$dom->load($file_name);
$results = $dom->getElementsByTagName('news_item');
Without using other PHP libraries like SimpleXML, can I get the names (not the values) of all the child tags?
Example result:
title, content, date
I don't know the names of the tags inside news_item, only the name of the container tag, 'news_item'.
Thanks guys!
Try this:
foreach($results as $node)
{
if($node->childNodes->length)
{
foreach($node->childNodes as $child)
{
echo $child->nodeName,', ';
}
}
}
This should work. I'm currently using something similar, though for HTML rather than XML.
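One caveat with iterating childNodes: whitespace between the tags shows up as text nodes named "#text". A minimal sketch that keeps only element names, assuming an inline copy of the document in place of $file_name:

```php
<?php
// Hypothetical inline document standing in for the file loaded from $file_name.
$dom = new DOMDocument();
$dom->loadXML('<news_item>
<title>TITLE</title>
<content>CONTENT.</content>
<date>DATE</date>
</news_item>');

$names = array();
foreach ($dom->getElementsByTagName('news_item') as $node) {
    foreach ($node->childNodes as $child) {
        // Skip whitespace text nodes (reported as "#text") and comments.
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $names[] = $child->nodeName;
        }
    }
}

echo implode(', ', $names), PHP_EOL;
```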
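// $results is a DOMNodeList, so pick an element before calling getElementsByTagName().
// Note that '*' matches every descendant element, not just the direct children.
$nodelist = $results->item(0)->getElementsByTagName('*');
for ($i = 0; $i < $nodelist->length; $i++)
    echo $nodelist->item($i)->nodeName;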
[Previous incorrect answer redacted]
For what it's worth though, there's no cost to using simplexml_import_dom() to make a SimpleXMLElement out of a DOMElement. Both are just object interfaces into an underlying libxml2 data structure. You can even make a change to the DOMElement and see it reflected in the SimpleXMLElement or vice versa. So it doesn't have to be an either/or choice.
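A small sketch of that shared structure, assuming a made-up two-child document: the child names are read through SimpleXML, then a change made through DOM is read back through the SimpleXMLElement.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<news_item><title>TITLE</title><date>DATE</date></news_item>');

// Wrap the same underlying node in a SimpleXMLElement.
$sx = simplexml_import_dom($dom->documentElement);

// Collect the child tag names through the SimpleXML interface.
$names = array();
foreach ($sx->children() as $child) {
    $names[] = $child->getName();
}

// Change the title through DOM, then read it back through SimpleXML.
$dom->getElementsByTagName('title')->item(0)->nodeValue = 'CHANGED';
$titleViaSimpleXml = (string) $sx->title;

echo implode(', ', $names), ' / ', $titleViaSimpleXml, PHP_EOL;
```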

How do I iterate through DOM elements in PHP?

I have an XML file loaded into a DOM document.
I wish to iterate through all 'foo' tags, getting values from every tag below them. I know I can get values via
$element = $dom->getElementsByTagName('foo')->item(0);
foreach ($element->childNodes as $node) {
    $data[$node->nodeName] = $node->nodeValue;
}
However, what I'm trying to do is, from XML like this,
<stuff>
<foo>
<bar></bar>
<value/>
<pub></pub>
</foo>
<foo>
<bar></bar>
<pub></pub>
</foo>
<foo>
<bar></bar>
<pub></pub>
</foo>
</stuff>
iterate over every foo tag, and get specific bar or pub, and get values from there.
Now, how do I iterate over foo so that I can still access specific child nodes by name?
Not tested, but what about:
$elements = $dom->getElementsByTagName('foo');
$data = array();
foreach ($elements as $node) {
    foreach ($node->childNodes as $child) {
        $data[] = array($child->nodeName => $child->nodeValue);
    }
}
It's generally much better to use XPath to query a document than it is to write code that depends on knowledge of the document's structure. There are two reasons. First, there's a lot less code to test and debug. Second, if the document's structure changes it's a lot easier to change an XPath query than it is to change a bunch of code.
Of course, you have to learn XPath, but (most of) XPath isn't rocket science.
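In PHP 5 and later, the DOM extension performs XPath queries through the DOMXPath class and its query() and evaluate() methods (xpath_eval belongs to the older PHP 4 DOM XML extension). It's documented in the PHP manual, and the user notes include some pretty good examples.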
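As a rough sketch of the XPath approach, assuming PHP 5 or later where the DOMXPath class is available: select every foo, then evaluate relative expressions against each one to reach bar and pub by name. The element values b1, p1 and so on are placeholders added for the example.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<stuff>
<foo><bar>b1</bar><value/><pub>p1</pub></foo>
<foo><bar>b2</bar><pub>p2</pub></foo>
</stuff>');

$xpath = new DOMXPath($dom);

$data = array();
foreach ($xpath->query('/stuff/foo') as $foo) {
    // Passing $foo as the context node makes each expression relative to it.
    $data[] = array(
        'bar' => $xpath->evaluate('string(bar)', $foo),
        'pub' => $xpath->evaluate('string(pub)', $foo),
    );
}

print_r($data);
```

If the document layout changes, only the two query strings need to be adjusted; the loop stays the same.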
Here's another (lazy) way to do it.
$data[][$node->nodeName] = $node->nodeValue;
With FluidXML you can query and iterate XML very easily.
$data = [];
$store_child = function($i, $fooChild) use (&$data) {
$data[] = [ $fooChild->nodeName => $fooChild->nodeValue ];
};
fluidxml($dom)->query('//foo/*')->each($store_child);
https://github.com/servo-php/fluidxml
