How to get text node from xml - php

I want to extract some data from xml.
I have this xml:
<root>
<p>Some text</p>
<p>Even more text</p>
<span class="bla bla">
<span class="currency">EUR</span> 19.95
</span>
</root>
and then I run this php code
$xml = simplexml_load_string($xmlString);
$json = json_encode($xml);
$obj = json_decode($json);
print_r($obj);
and the result is:
stdClass Object
(
[p] => Array
(
[0] => Some text
[1] => Even more text
)
[span] => stdClass Object
(
[@attributes] => stdClass Object
(
[class] => bla bla
)
[span] => EUR
)
)
How do I get the missing string "19.95"?

Don't convert XML into JSON or an array. Doing so means that you lose information and features.
SimpleXML is limited: it works with basic XML, but it has problems with things like mixed-content nodes. DOM allows for easier handling in this case.
$xml = <<<'XML'
<root>
<p>Some text</p>
<p>Even more text</p>
<span class="bla bla">
<span class="currency">EUR</span> 19.95
</span>
</root>
XML;
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
foreach ($xpath->evaluate('/root/span[@class="bla bla"]') as $span) {
    var_dump(
        $xpath->evaluate('string(span[@class="currency"][1])', $span),
        $xpath->evaluate(
            'number(span[@class="currency"][1]/following-sibling::text()[1])',
            $span
        )
    );
}
XPath is an expression language for fetching parts of a DOM (think SQL for XML). PHP has several methods to access it. SimpleXMLElement::xpath() fetches nodes as an array of SimpleXMLElement objects. DOMXpath::query() fetches node lists. Only DOMXpath::evaluate() can fetch both node lists and scalar values.
In the example, /root/span[@class="bla bla"] fetches all span element nodes that have the given class attribute. For each of those nodes, the second expression fetches the span with the class currency as a string. The third expression fetches the first following sibling text node of the currency span as a number.
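The three access methods mentioned above can be compared side by side. This is only a minimal sketch using a shortened copy of the XML from the question; the variable names are illustrative and not part of the original answer.

```php
<?php
$xml = <<<'XML'
<root>
<p>Some text</p>
<span class="bla bla">
<span class="currency">EUR</span> 19.95
</span>
</root>
XML;

// SimpleXMLElement::xpath() returns an array of SimpleXMLElement objects
$sxml = simplexml_load_string($xml);
$elements = $sxml->xpath('/root/span/span[@class="currency"]');
$currencyFromSimpleXml = (string)$elements[0];

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// DOMXPath::query() returns a DOMNodeList
$nodes = $xpath->query('/root/span/span[@class="currency"]');
$currencyFromQuery = $nodes->item(0)->textContent;

// DOMXPath::evaluate() can also return scalars such as strings and numbers
$currency = $xpath->evaluate('string(/root/span/span[@class="currency"])');
$amount = $xpath->evaluate(
    'number(/root/span/span[@class="currency"]/following-sibling::text()[1])'
);

var_dump($currencyFromSimpleXml, $currencyFromQuery, $currency, $amount);
```

The number() call ignores the whitespace around the text node, so the amount arrives as a float rather than a padded string.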

Don't trust the debug output, don't convert to JSON or an array, and don't overthink the problem.
Outputting this string is as simple as navigating to the element and echoing it:
echo $xml->span;
Or to get it into a variable, explicitly cast to string:
$foo = (string)$xml->span;
Or if you want to use XPath like in ThW's answer, you could find the span using //span[@class="bla bla"] and echo that (note that ->xpath() returns an array, so you want element 0 of that array):
echo $xml->xpath('//span[@class="bla bla"]')[0];
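As a rough illustration of the casting approach, the sketch below reads both the currency and the amount from a shortened copy of the question's XML. The trim() call is an addition made here, on the assumption that the whitespace surrounding the text is not wanted.

```php
<?php
$xml = simplexml_load_string('<root>
<span class="bla bla">
<span class="currency">EUR</span> 19.95
</span>
</root>');

// Casting the outer span returns its own text, without the nested span's "EUR"
$amount = trim((string)$xml->span);

// The nested span is reached by ordinary property access
$currency = (string)$xml->span->span;

echo $currency, ' ', $amount, "\n";
```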

Related

Namespaces and XPath

I'm exploring XML and PHP, mostly XPath and other parsers.
Here be the xml:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:foo="http://www.foo.org/" xmlns:bar="http://www.bar.org">
<actors>
<actor id="1">Christian Bale</actor>
<actor id="2">Liam Neeson</actor>
<actor id="3">Michael Caine</actor>
</actors>
<foo:singers>
<foo:singer id="4">Tom Waits</foo:singer>
<foo:singer id="5">B.B. King</foo:singer>
<foo:singer id="6">Ray Charles</foo:singer>
</foo:singers>
<items>
<item id="7">Pizza</item>
<item id="8">Cheese</item>
<item id="9">Cane</item>
</items>
</root>
Here be my path & code:
$xml = simplexml_load_file('xpath.xml');
$result = $xml -> xpath('/root/actors');
echo '<pre>'.print_r($result,1).'</pre>';
Now, said path returns:
Array
(
[0] => SimpleXMLElement Object
(
[actor] => Array
(
[0] => Christian Bale
[1] => Liam Neeson
[2] => Michael Caine
)
)
)
Whereas a seemingly similar line of code, which I would have thought would return the singers, doesn't. Meaning:
$result = $xml -> xpath('/root/foo:singers');
Results in:
Array
(
[0] => SimpleXMLElement Object
(
)
)
Now I would've thought the foo: namespace in this case is a non-issue and both paths should result in the same sort of array of singers/actors respectively? How come that is not the case?
Thank-you!
Note: As you can probably gather I'm quite new to xml so please be gentle.
Edit: When I go /root/foo:singers/foo:singer I get results, but not before. Also with just /root I only get actors and items as results, foo:singers are completely omitted.
SimpleXML is, for a number of reasons, simply a bad API.
For most purposes I suggest PHP's DOM extension. (Or for very large documents a combination of it along with XMLReader.)
For using namespaces in xpath you'll want to register those you'd like to use, and the prefix you want to use them with, with your xpath processor.
Example:
$dom = new DOMDocument();
$dom->load('xpath.xml');
$xpath = new DOMXPath($dom);
// The prefix *can* match that used in the document, but it's not necessary.
$xpath->registerNamespace("ns", "http://www.foo.org/");
foreach ($xpath->query("/root/ns:singers") as $node) {
echo $dom->saveXML($node);
}
Output:
<foo:singers>
<foo:singer id="4">Tom Waits</foo:singer>
<foo:singer id="5">B.B. King</foo:singer>
<foo:singer id="6">Ray Charles</foo:singer>
</foo:singers>
DOMXPath::query returns a DOMNodeList containing matched nodes. You can work with it essentially the same way you would in any other language with a DOM implementation.
You can use a // expression like:
$xml -> xpath( '//foo:singer' );
to select all foo:singer elements no matter where they are.
EDIT:
SimpleXMLElement is selected, you just can't see the child nodes with print_r(). Use SimpleXMLElement methods like SimpleXMLElement::children to access them.
// example 1
$result = $xml->xpath( '/root/foo:singers' );
foreach( $result as $value ) {
print_r( $value->children( 'foo', TRUE ) );
}
// example 2
print_r( $result[0]->children( 'foo', TRUE )->singer );
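If you would rather stay with SimpleXML, a namespace prefix can also be registered for XPath there. A minimal sketch, assuming a shortened copy of the document from the question; the prefix "s" is an arbitrary choice made for this example, and only the namespace URI has to match the document.

```php
<?php
$xml = simplexml_load_string('<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:foo="http://www.foo.org/">
<foo:singers>
<foo:singer id="4">Tom Waits</foo:singer>
<foo:singer id="5">B.B. King</foo:singer>
</foo:singers>
</root>');

// Bind the prefix "s" to the namespace URI used by foo: in the document
$xml->registerXPathNamespace('s', 'http://www.foo.org/');

$names = [];
foreach ($xml->xpath('/root/s:singers/s:singer') as $singer) {
    // Cast to string to get the text content, which print_r() may not show
    $names[] = (string)$singer;
}

print_r($names);
```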

How to get name of very first tag of XML with php's SimpleXML?

I am parsing XML strings using simplexml_load_string(), but I noticed that I don't get the name of the very first tag.
For example, I have these two xml strings:
$s = '<?xml version="1.0" encoding="UTF-8"?>
<ParentTypeABC>
<chidren1>
<children2>1000</children2>
</chidren1>
</ParentTypeABC>
';
$t = '<?xml version="1.0" encoding="UTF-8"?>
<ParentTypeDEF>
<chidren1>
<children2>1000</children2>
</chidren1>
</ParentTypeDEF>
';
Notice that they are nearly identical, the only difference being that one has <ParentTypeABC> as its first node and the other has <ParentTypeDEF>.
Then I just convert them to SimpleXML objects:
$o = simplexml_load_string($s);
$p = simplexml_load_string($t);
But then I have two equal objects, neither of them showing the name of the "top" node, ParentTypeABC or ParentTypeDEF (I examine the objects using print_r()):
// with top node "ParentTypeABC"
SimpleXMLElement Object
(
[chidren1] => SimpleXMLElement Object
(
[children2] => 1000
)
)
// with top node "ParentTypeDEF"
SimpleXMLElement Object
(
[chidren1] => SimpleXMLElement Object
(
[children2] => 1000
)
)
So how am I supposed to know the top node's name? If I parse unknown XML and need to know the name of the top node, what can I do?
Is there an option in simplexml_load_string() I could use?
I know there are many ways to parse XML with PHP, but I'd like it to be as simple as possible, and to get a simple object or array I can navigate easily.
I made a simple example here to fiddle with.
SimpleXML has a getName() method.
echo $xml->getName();
This should return the name of the respective node, no matter if root or not.
http://php.net/manual/en/simplexmlelement.getname.php
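Applied to the first string from the question, this might look like the following sketch (the variable names $rootName and $childName are made up for the example):

```php
<?php
$s = '<?xml version="1.0" encoding="UTF-8"?>
<ParentTypeABC>
<chidren1>
<children2>1000</children2>
</chidren1>
</ParentTypeABC>';

$o = simplexml_load_string($s);

// getName() on the loaded object gives the name of the document element
$rootName = $o->getName();

// It works the same way on nested elements
$childName = $o->chidren1->getName();

echo $rootName, "\n", $childName, "\n";
```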

Json Encode or Serialize an XML

I have some xml, this is a simple version of it.
<xml>
<items>
<item abc="123">item one</item>
<item abc="456">item two</item>
</items>
</xml>
Using SimpleXML on the content,
$obj = simplexml_load_string( $xml );
I can use $obj->xpath( '//items/item' ); and get access to the @attributes.
I need an array result, so I have tried the json_decode(json_encode($obj),true) trick, but that looks to be removing access to the @attributes (i.e. abc="123").
Is there another way of doing this, that provides access to the attributes and leaves me with an array?
You need to call the attributes() function.
Sample code:
$xmlString = '<xml>
<items>
<item abc="123">item one</item>
<item abc="456">item two</item>
</items>
</xml>';
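$xml = new SimpleXMLElement($xmlString);
$my_array = array();
foreach ($xml->items->item as $value) {
    // Read the abc attribute explicitly rather than casting the whole attribute list
    $my_array[] = (string)$value->attributes()->abc;
}
print_r($my_array);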
You can go the route with json_encode and json_decode and you can add the stuff you're missing because that json_encode-ing follows some specific rules with SimpleXMLElement.
If you're interested into the rules and their details, I have written two blog-posts about it:
SimpleXML and JSON Encode in PHP – Part I
SimpleXML and JSON Encode in PHP – Part II
Perhaps more interesting for you is the third part, which shows how you can modify the JSON serialization and provide your own format (e.g. to preserve the attributes):
SimpleXML and JSON Encode in PHP – Part III and End
It ships with a full blown example, here is an excerpt in code:
$xml = '<xml>
<items>
<item abc="123">item one</item>
<item abc="456">item two</item>
</items>
</xml>';
$obj = simplexml_load_string($xml, 'JsonXMLElement');
echo $json = json_encode($obj, JSON_PRETTY_PRINT), "\n";
print_r(json_decode($json, TRUE));
Output of JSON and the array is as following, note that the attributes are part of it:
{
"items": {
"item": [
{
"@attributes": {
"abc": "123"
},
"@text": "item one"
},
{
"@attributes": {
"abc": "456"
},
"@text": "item two"
}
]
}
}
Array
(
[items] => Array
(
[item] => Array
(
[0] => Array
(
[@attributes] => Array
(
[abc] => 123
)
[@text] => item one
)
[1] => Array
(
[@attributes] => Array
(
[abc] => 456
)
[@text] => item two
)
)
)
)
$xml = new SimpleXMLElement($xmlString);
$xml is now an object. To get the value of an attribute:
$xml->something['id'];
Where 'id' is the name of the attribute.
While it's theoretically possible to write a generic conversion from XML to PHP or JSON structures, it is very hard to capture all the subtleties that might be present - the distinction between child elements and attributes, text content alongside attributes (as you have here) or even alongside child elements, multiple child nodes with the same name, whether order of child elements and text nodes is important (e.g. in XHTML or DocBook), etc, etc.
If you have a specific format you need to produce, it will generally be much easier to use an API - like SimpleXML - to loop over the XML and produce the structure you need.
You don't specify the structure you want to achieve, but the general approach given your input would be to loop over each item, and either access known attributes, or loop over each attribute:
$sxml = simplexml_load_string( $xml );
$final_array = array();
foreach ( $sxml->items->item as $xml_item )
{
$formatted_item = array();
// Text content of item
$formatted_item['content'] = (string)$xml_item;
// Specifically get 'abc' attribute
$formatted_item['abc'] = (string)$xml_item['abc'];
// Maybe one of the attributes is an integer
$formatted_item['foo_id'] = (int)$xml_item['foo_id'];
// Or maybe you want to loop over lots of possible attributes
foreach ( $xml_item->attributes() as $attr_name => $attr_value )
{
$formatted_item['attrib:' . $attr_name] = (string)$attr_value;
}
// Add it to a final list
$final_array[] = $formatted_item;
// Or maybe you want that array to be keyed on one of the attributes
$final_array[ (string)$xml_item['key'] ] = $formatted_item;
}
Here is a class I've found that is able to process XML into array very nicely: http://outlandish.com/blog/xml-to-json/ (backup). Converting to json is a matter of a json_encode() call.

Getting cdata content while parsing xml file

I have an xml file
<?xml version="1.0" encoding="utf-8"?>
<xml>
<events date="01-10-2009" color="0x99CC00" selected="true">
<event>
<title>You can use HTML and CSS</title>
<description><![CDATA[This is the description ]]></description>
</event>
</events>
</xml>
I used XPath and XQuery for parsing the XML.
$xml_str = file_get_contents('xmlfile');
$xml = simplexml_load_string($xml_str);
if(!empty($xml))
{
$nodes = $xml->xpath('//xml/events');
}
I am getting the title properly, but I am not getting the description. How can I get the data inside the CDATA?
SimpleXML has a bit of a problem with CDATA, so use:
$xml = simplexml_load_file('xmlfile', 'SimpleXMLElement', LIBXML_NOCDATA);
if(!empty($xml))
{
$nodes = $xml->xpath('//xml/events');
}
print_r( $nodes );
This will give you:
Array
(
[0] => SimpleXMLElement Object
(
[@attributes] => Array
(
[date] => 01-10-2009
[color] => 0x99CC00
[selected] => true
)
[event] => SimpleXMLElement Object
(
[title] => You can use HTML and CSS
[description] => This is the description
)
)
)
You are probably being misled into thinking that the CDATA is missing by using print_r or one of the other "normal" PHP debugging functions. These cannot see the full content of a SimpleXML object, as it is not a "real" PHP object.
If you run echo $nodes[0]->event->description, you'll find your CDATA comes out fine. What's happening is that PHP knows that echo expects a string, so it asks SimpleXML for one; SimpleXML responds with all the string content, including CDATA.
To get at the full string content reliably, simply tell PHP that what you want is a string using the (string) cast operator, e.g. $description = (string)$nodes[0]->event->description.
To debug SimpleXML objects and not be fooled by quirks like this, use a dedicated debugging function such as one of these: https://github.com/IMSoP/simplexml_debug
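A short sketch of the cast in action, using the XML from the question loaded without LIBXML_NOCDATA. The path ->event->description follows the structure of that sample file; the variable names are only illustrative.

```php
<?php
$xml = simplexml_load_string('<?xml version="1.0" encoding="utf-8"?>
<xml>
<events date="01-10-2009" color="0x99CC00" selected="true">
<event>
<title>You can use HTML and CSS</title>
<description><![CDATA[This is the description ]]></description>
</event>
</events>
</xml>');

$nodes = $xml->xpath('//xml/events');

// The string cast returns the CDATA content even though print_r() hides it
$description = (string)$nodes[0]->event->description;

// Attributes are read with array syntax and cast the same way
$date = (string)$nodes[0]['date'];

var_dump($description, $date);
```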
Another option is to strip the CDATA markers from the raw XML string before parsing it. This only works when the wrapped text contains no characters such as < or & that would otherwise need escaping.
$xml = str_replace("<![CDATA[", "", $xml);
$xml = str_replace("]]>", "", $xml);

XPath Node to String

How can I select the string contents of the following nodes:
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
I have tried a few things
//span/text()
Doesn't get the bold tag
//span/string(.)
is invalid
string(//span)
only selects 1 node
I am using SimpleXML in PHP, and the only other option I can think of is to use //span, which returns:
Array
(
[0] => SimpleXMLElement Object
(
[@attributes] => Array
(
[class] => url
)
[b] => test
)
[1] => SimpleXMLElement Object
(
[@attributes] => Array
(
[class] => url
)
[b] => test2
)
)
*Note that it is also dropping the "more words" text from the second span.
So I guess I could then flatten the items in the array using PHP somehow? XPath is preferred, but any other ideas would help too.
$xml = '<foo>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</foo>';
$dom = new DOMDocument();
$dom->loadXML($xml); //or load an HTML document with loadHTML()
$x= new DOMXpath($dom);
foreach ($x->query("//span[@class='url']") as $node) echo $node->textContent;
You don't even need XPath for this:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('span') as $span) {
if(in_array('url', explode(' ', $span->getAttribute('class')))) {
$span->nodeValue = $span->textContent;
}
}
echo $dom->saveHTML();
EDIT after comment below
If you just want to fetch the string, you can do echo $span->textContent; instead of replacing the nodeValue. I understood you wanted to have one string for the span, instead of the nested structure. In this case, you should also consider whether simply running strip_tags on the span snippet wouldn't be the faster and easier alternative.
With PHP 5.3 you can also register arbitrary PHP functions for use as callbacks in XPath queries. The following would fetch the content of all span elements and their child nodes and return it as a single string.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions();
echo $xp->evaluate('php:function("nodeTextJoin", //span)');
// Custom Callback function
function nodeTextJoin($nodes)
{
$text = '';
foreach($nodes as $node) {
$text .= $node->textContent;
}
return $text;
}
Using XMLReader:
$xmlr = new XMLReader;
$xmlr->xml($doc);
while ($xmlr->read()) {
if (($xmlr->nodeType == XMLReader::ELEMENT) && ($xmlr->name == 'span')) {
echo $xmlr->readString();
}
}
Output:
word
test
word
test2
more words
SimpleXML doesn't like mixing text nodes with other elements, that's why you're losing some content there. The DOM extension, however, handles that just fine. Luckily, DOM and SimpleXML are two faces of the same coin (libxml) so it's very easy to juggle them. For instance:
foreach ($yourSimpleXMLElement->xpath('//span') as $span)
{
// will not work as expected
echo $span;
// will work as expected
echo textContent($span);
}
function textContent(SimpleXMLElement $node)
{
return dom_import_simplexml($node)->textContent;
}
//span//text()
This may be the best you can do. You'll get multiple text nodes because the text is stored in separate nodes in the DOM. If you want a single string you'll have to just concatenate the text nodes yourself since I can't think of a way to get the built-in XPath functions to do it.
Using string() or concat() won't work because these functions expect string arguments. When you pass a node-set to a function expecting a string, the node-set is converted to a string by taking the text content of the first node in the node-set. The rest of the nodes are discarded.
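The first-node behaviour described above, and one way around it from PHP, can be sketched with the DOM extension. This assumes the sample markup wrapped in a hypothetical <foo> root element; normalize-space() is used here only to tidy the whitespace.

```php
<?php
$xml = '<foo>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</foo>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// string() applied to a node-set only uses the first node in document order
$first = $xpath->evaluate('normalize-space(string(//span))');

// Evaluating relative to each span yields one string per element instead
$all = [];
foreach ($xpath->query('//span[@class="url"]') as $span) {
    $all[] = $xpath->evaluate('normalize-space(.)', $span);
}

var_dump($first, $all);
```

Looping in PHP and evaluating a relative expression per node keeps each span's text together, which a single XPath 1.0 expression cannot do.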
How can I select the string contents
of the following nodes:
First, I think your question is not clear.
You could select the descendant text nodes, as John Kugelman has answered, with
//span//text()
I recommend using an absolute path (not starting with //).
But with this you would need to process the text nodes to find out which parent span each one is a child of. So it would be better to just select the span elements (for example, //span) and then process their string values.
With XPath 2.0 you could use:
string-join(//span, '.')
Result:
word test. word test2 more words
With XSLT 1.0, this input:
<div>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</div>
With this stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="span[@class='url']">
<xsl:value-of select="concat(substring('.',1,position()-1),normalize-space(.))"/>
</xsl:template>
</xsl:stylesheet>
Output:
word test.word test2 more words
Along the lines of Alejandro's XSLT 1.0 "but any other ideas would help too" answer...
XML:
<?xml version="1.0" encoding="UTF-8"?>
<div>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</div>
XSL:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="span">
<xsl:value-of select="normalize-space(data(.))"/>
</xsl:template>
</xsl:stylesheet>
OUTPUT:
word test
word test2 more words
