PHP SimpleXML Element parsing issue - php

I've come across a weird but apparently valid XML string that an API is returning to me. I've been parsing XML with SimpleXML because it's really easy to pass the result to a function and convert it into a handy array.
The following is parsed incorrectly by SimpleXML:
<?xml version="1.0" standalone="yes"?>
<Response>
<CustomsID>010912-1
<IsApproved>NO</IsApproved>
<ErrorMsg>Electronic refunds...</ErrorMsg>
</CustomsID>
</Response>
Simple XML results in:
SimpleXMLElement Object ( [CustomsID] => 010912-1 )
Is there a way to parse this in XML? Or another XML library that returns an object that reflects the XML structure?

That is an odd response with the text along with other nodes. If you manually traverse it (not as an array, but as an object) you should be able to get inside:
<?php
$xml = '<?xml version="1.0" standalone="yes"?>
<Response>
<CustomsID>010912-1
<IsApproved>NO</IsApproved>
<ErrorMsg>Electronic refunds...</ErrorMsg>
</CustomsID>
</Response>';
$sObj = new SimpleXMLElement( $xml );
var_dump( $sObj->CustomsID );
exit;
?>
Results in second object:
object(SimpleXMLElement)#2 (2) {
["IsApproved"]=>
string(2) "NO"
["ErrorMsg"]=>
string(21) "Electronic refunds..."
}
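As a minimal sketch of that manual traversal, the element's own text and its child elements can be read separately and collected into an array. The `$result` array and its key names are illustrative assumptions, not something the API defines; the sketch relies on the string cast of an element returning only its own text, not the text of its children.

```php
<?php
$xml = '<?xml version="1.0" standalone="yes"?>
<Response>
<CustomsID>010912-1
<IsApproved>NO</IsApproved>
<ErrorMsg>Electronic refunds...</ErrorMsg>
</CustomsID>
</Response>';

$response = new SimpleXMLElement($xml);
$customs  = $response->CustomsID;

// Casting the element to string yields only its direct text nodes,
// so the surrounding whitespace has to be trimmed away.
$result = [
    'CustomsID'  => trim((string) $customs),
    'IsApproved' => (string) $customs->IsApproved,
    'ErrorMsg'   => (string) $customs->ErrorMsg,
];

print_r($result);
```

Reading each value with an explicit `(string)` cast avoids carrying SimpleXMLElement objects into the final array.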

You already parse the XML with SimpleXML. I guess you want to parse it into a handy array, which you do not define any further.
The problem with the XML you have is that its structure is not very distinct. If it does not change much, you can convert it into an array using a SimpleXMLIterator instead of a SimpleXMLElement:
$it = new SimpleXMLIterator($xml);
$mode = RecursiveIteratorIterator::SELF_FIRST;
$rit = new RecursiveIteratorIterator($it, $mode);
$array = array_map('trim', iterator_to_array($rit));
print_r($array);
For the XML-string in question this gives:
Array
(
[CustomsID] => 010912-1
[IsApproved] => NO
[ErrorMsg] => Electronic refunds...
)
See as well the online demo and How to parse and process HTML/XML with PHP?.

Related

PHP Converting from XML to JSON with a SimpleXML object. Array with <items> tag causing issues

We are using SimpleXML to try to convert XML to JSON, and in turn convert that to a PHP object, so that we can compare our SOAP API with our REST API. We have a request that returns quite a lot of data, but the part in question is where we have a nested array.
The array is returned with each entry wrapped in an <item> tag in the XML; however, we do not want this tag carried over into the JSON.
The XML that we get is as follows:
<apns>
<item>
<apn>apn</apn>
</item>
</apns>
So when it is translated into JSON it looks like this:
{"apns":{"item":{"apn":"apn"}}}
In reality, we want SimpleXML to convert to the same JSON as in our Rest API, which looks like the following:
{"apns":[{"apn":"apn"}]}
The array could contain more than one thing, for example:
<apns>
<item>
<apn>apn</apn>
</item>
<item>
<apn>apn2</apn>
</item>
</apns>
Which I'm assuming will just error in JSON or have the first one overwritten.
I'd expect SimpleXML to be able to handle this natively, but if not has anyone got a fix that doesn't involve janky string manipulation?
TIA :)
A generic conversion has no possibility to know that a single element should be an array in JSON.
SimpleXMLElement properties can be treated as an iterable to traverse siblings with the same name, so the same property works as a list or as a single value.
This allows you to build up your own array/object structure and serialize it to JSON.
$xml = <<<'XML'
<apns>
<item>
<apn>apn1</apn>
</item>
<item>
<apn>apn2</apn>
</item>
</apns>
XML;
$apns = new SimpleXMLElement($xml);
$json = [
'apns' => []
];
foreach ($apns->item as $item) {
$json['apns'][] = ['apn' => (string)$item->apn];
}
echo json_encode($json, JSON_PRETTY_PRINT);
This still allows you to read and convert parts of the document in a general way. Take a closer look at the SimpleXMLElement class: it has methods to iterate over all children and to get the name of the current node.
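A rough sketch of that more general approach, using `children()` and `getName()` so that field names are not hard-coded. The helper name `listToArray` is hypothetical, and the sketch assumes every child of the list node is a wrapper (such as `<item>`) whose own children are simple text fields.

```php
<?php
// Hypothetical helper: turns each wrapper child of $list into one associative array.
function listToArray(SimpleXMLElement $list): array
{
    $rows = [];
    foreach ($list->children() as $wrapper) {
        $row = [];
        foreach ($wrapper->children() as $field) {
            // getName() returns the element name, e.g. "apn"
            $row[$field->getName()] = (string) $field;
        }
        $rows[] = $row;
    }
    return $rows;
}

$apns = new SimpleXMLElement('<apns><item><apn>apn</apn></item></apns>');

// The wrapper name is dropped, so a single <item> still becomes a JSON array.
$json = json_encode([$apns->getName() => listToArray($apns)]);
echo $json, "\n"; // {"apns":[{"apn":"apn"}]}
```

Because the result is always built as a list, one `<item>` and many `<item>` elements serialize to the same JSON shape.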
I hope this code is useful as a template for what you're after; the difficulty is knowing whether this is the only instance of what you're trying to do...
It first uses XPath to find any nodes that have an item/apn structure underneath (//*[item/apn] matches any node, //*, that has an <item> child containing an <apn>).
It then loops through these items and, for each <item>, adds a new <apn> node with its value under the start node (the <apns> node in this case) using $list->addChild("apn", (string)$item->apn);.
Once the nodes are copied, it removes all of the <item> nodes with unset($list->item);.
$input = '<apns>
<item>
<apn>apn</apn>
</item>
<item>
<apn>apn2</apn>
</item>
</apns>';
$xml = simplexml_load_string($input);
$itemList = $xml->xpath("//*[item/apn]");
foreach ( $itemList as $list ) {
foreach ( $list->item as $item ) {
$list->addChild("apn", (string)$item->apn);
}
unset($list->item);
}
echo $xml->asXML();
gives...
<?xml version="1.0"?>
<apns>
<apn>apn</apn><apn>apn2</apn></apns>
and
echo json_encode($xml);
gives...
{"apn":["apn","apn2"]}
If you just want the last value, then you can just keep track of the last value and set the new element outside the inner loop...
$itemList = $xml->xpath("//*[item/apn]");
foreach ( $itemList as $list ) {
foreach ( $list->item as $item ) {
$apn = (string)$item->apn;
}
$list->addChild("apn", $apn);
unset($list->item);
}

Json Encode or Serialize an XML

I have some xml, this is a simple version of it.
<xml>
<items>
<item abc="123">item one</item>
<item abc="456">item two</item>
</items>
</xml>
Using SimpleXML on the content,
$obj = simplexml_load_string( $xml );
I can use $obj->xpath( '//items/item' ); and get access to the @attributes.
I need an array result, so I have tried the json_decode(json_encode($obj),true) trick, but that appears to remove access to the @attributes (i.e. abc="123").
Is there another way of doing this, that provides access to the attributes and leaves me with an array?
You need to call attributes() function.
Sample code:
$xmlString = '<xml>
<items>
<item abc="123">item one</item>
<item abc="456">item two</item>
</items>
</xml>';
$xml = new SimpleXMLElement($xmlString);
$my_array = [];
foreach ($xml->items->item as $value) {
    // Read the abc attribute by name; attributes() returns every attribute of the element
    $my_array[] = (string) $value['abc'];
}
print_r($my_array);
You can go the route with json_encode and json_decode and you can add the stuff you're missing because that json_encode-ing follows some specific rules with SimpleXMLElement.
If you're interested in the rules and their details, I have written two blog posts about them:
SimpleXML and JSON Encode in PHP – Part I
SimpleXML and JSON Encode in PHP – Part II
Perhaps more interesting for you is the third part, which shows how you can modify the JSON serialization and provide your own format (e.g. to preserve the attributes):
SimpleXML and JSON Encode in PHP – Part III and End
It ships with a full blown example, here is an excerpt in code:
$xml = '<xml>
<items>
<item abc="123">item one</item>
<item abc="456">item two</item>
</items>
</xml>';
$obj = simplexml_load_string($xml, 'JsonXMLElement');
echo $json = json_encode($obj, JSON_PRETTY_PRINT), "\n";
print_r(json_decode($json, TRUE));
The JSON output and the resulting array are as follows; note that the attributes are part of both:
{
    "items": {
        "item": [
            {
                "@attributes": {
                    "abc": "123"
                },
                "@text": "item one"
            },
            {
                "@attributes": {
                    "abc": "456"
                },
                "@text": "item two"
            }
        ]
    }
}
Array
(
    [items] => Array
        (
            [item] => Array
                (
                    [0] => Array
                        (
                            [@attributes] => Array
                                (
                                    [abc] => 123
                                )
                            [@text] => item one
                        )
                    [1] => Array
                        (
                            [@attributes] => Array
                                (
                                    [abc] => 456
                                )
                            [@text] => item two
                        )
                )
        )
)
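The excerpt above does not include the JsonXMLElement class itself. The following is only a sketch of how such a class could look, not the code from the blog post: it assumes PHP 8.0 or later (for the `mixed` return type), and the `@attributes` and `@text` key names are a choice made here to mirror PHP's own `@attributes` convention.

```php
<?php
// Hypothetical sketch of a JSON-aware SimpleXMLElement subclass.
class JsonXMLElement extends SimpleXMLElement implements JsonSerializable
{
    public function jsonSerialize(): mixed
    {
        $data = [];

        // Keep attributes under an "@attributes" key.
        foreach ($this->attributes() ?? [] as $name => $value) {
            $data['@attributes'][$name] = (string) $value;
        }

        // Serialize children; elements sharing a name are grouped into a list.
        $seen = [];
        foreach ($this->children() as $name => $child) {
            $value = $child->jsonSerialize();
            if (!isset($seen[$name])) {
                $data[$name] = $value;
                $seen[$name] = 1;
            } else {
                if ($seen[$name] === 1) {
                    $data[$name] = [$data[$name]]; // promote first value to a list
                }
                $data[$name][] = $value;
                $seen[$name]++;
            }
        }

        // Non-whitespace text becomes the value itself, or "@text" next to other keys.
        $text = trim((string) $this);
        if ($text !== '') {
            if ($data === []) {
                return $text;
            }
            $data['@text'] = $text;
        }

        return $data === [] ? null : $data;
    }
}

$xml = '<xml><items><item abc="123">item one</item><item abc="456">item two</item></items></xml>';
$json = json_encode(simplexml_load_string($xml, 'JsonXMLElement'));
echo $json, "\n";
```

Passing the class name as the second argument to simplexml_load_string makes every node an instance of that class, which is what lets the recursion call `jsonSerialize()` on each child.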
$xml = new SimpleXMLElement($xmlString);
$xml is now an object. To get the value of an attribute:
$xml->something['id'];
Where 'id' is the name of the attribute.
While it's theoretically possible to write a generic conversion from XML to PHP or JSON structures, it is very hard to capture all the subtleties that might be present - the distinction between child elements and attributes, text content alongside attributes (as you have here) or even alongside child elements, multiple child nodes with the same name, whether order of child elements and text nodes is important (e.g. in XHTML or DocBook), etc, etc.
If you have a specific format you need to produce, it will generally be much easier to use an API - like SimpleXML - to loop over the XML and produce the structure you need.
You don't specify the structure you want to achieve, but the general approach given your input would be to loop over each item, and either access known attributes, or loop over each attribute:
$sxml = simplexml_load_string( $xml );
$final_array = array();
foreach ( $sxml->items->item as $xml_item )
{
$formatted_item = array();
// Text content of item
$formatted_item['content'] = (string)$xml_item;
// Specifically get 'abc' attribute
$formatted_item['abc'] = (string)$xml_item['abc'];
// Maybe one of the attributes is an integer
$formatted_item['foo_id'] = (int)$xml_item['foo_id'];
// Or maybe you want to loop over lots of possible attributes
foreach ( $xml_item->attributes() as $attr_name => $attr_value )
{
$formatted_item['attrib:' . $attr_name] = (string)$attr_value;
}
// Add it to a final list
$final_array[] = $formatted_item;
// Or maybe you want that array to be keyed on one of the attributes
$final_array[ (string)$xml_item['key'] ] = $formatted_item;
}
Here is a class I've found that is able to process XML into array very nicely: http://outlandish.com/blog/xml-to-json/ (backup). Converting to json is a matter of a json_encode() call.

Parse CDATA from a SOAP Response with PHP

I'm trying to parse out the CDATA from a SOAP response using SimpleXML and XPath. I get the output that I'm looking for, but it is returned as one continuous line of data with no separators that would allow me to parse it.
I appreciate any help!
Here is the SOAP response containing the CDATA that I need to parse:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
<ns1:getIPServiceDataResponse xmlns:ns1="http://ws.icontent.idefense.com/V3/2">
<ns1:return xsi:type="ns1:IPServiceDataResponse" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:status>Success</ns1:status>
<ns1:serviceType>IPservice_TIIncremental_ALL_xml_v1</ns1:serviceType>
<ns1:ipserviceData><![CDATA[<?xml version="1.0" encoding="utf-8"?><threat_indicators><tidata><indicator>URL</indicator><format>STRING</format><value>http://update.lflink.com/aspnet_vil/debug.swf</value><role>EXPLOIT</role><sample_md5/><last_observed>2012-11-02 18:13:43.587000</last_observed><comment>APT Blade2009 - CVE-2012-5271</comment><ref_id/></tidata><tidata><indicator>URL</indicator><format>STRING</format><value>http://update.lflink.com/crossdomain.xml</value><role>EXPLOIT</role><sample_md5/><last_observed>2012-11-02 18:14:04.108000</last_observed><comment>APT Blade2009 - CVE-2012-5271</comment><ref_id/></tidata><tidata><indicator>DOMAIN</indicator><format>STRING</format><value>update.lflink.com</value><role>EXPLOIT</role><sample_md5/><last_observed>2012-11-02 18:15:10.445000</last_observed><comment>APT Blade2009 - CVE-2012-5271</comment><ref_id/></tidata></threat_indicators>]]></ns1:ipserviceData>
</ns1:return>
</ns1:getIPServiceDataResponse>
</soapenv:Body>
</soapenv:Envelope>
Here is PHP code I'm using to try to parse the CDATA:
<?php
$xml = simplexml_load_string($soap_response);
$xml->registerXPathNamespace('ns1', 'http://ws.icontent.idefense.com/V3/2');
foreach ($xml->xpath("//ns1:ipserviceData") as $item)
{
echo '<pre>';
print_r($item);
echo '</pre>';
}
?>
Here's the print_r output:
SimpleXMLElement Object
(
[0] => URLSTRINGhttp://update.lflink.com/aspnet_vil/debug.swfEXPLOIT2012-11-02 18:13:43.587000APT Blade2009 - CVE-2012-5271URLSTRINGhttp://update.lflink.com/crossdomain.xmlEXPLOIT2012-11-02 18:14:04.108000APT Blade2009 - CVE-2012-5271DOMAINSTRINGupdate.lflink.comEXPLOIT2012-11-02 18:15:10.445000APT Blade2009 - CVE-2012-5271
)
Any ideas what I can do to make the output usable? For example, parsing out each element of the CDATA output such as: <indicator></indicator>, <value></value>, <role></role>, etc.
FYI - Also tried using LIBXML_NOCDATA with no change in output.
You get it as a single string because that is what you have asked for: just the string.
If you want to be able to parse that string as XML, create a new SimpleXML object out of it.
You then have a second parser working on the string, which can parse the embedded XML (yes, it is that simple; Demo):
$soap = simplexml_load_string($soapXML);
$soap->registerXPathNamespace('ns1', 'http://ws.icontent.idefense.com/V3/2');
// Cast the matched node to a string so its CDATA content is parsed as a new document
$ipserviceData = simplexml_load_string((string) $soap->xpath('//ns1:ipserviceData')[0]);
// <threat_indicators><tidata><indicator>URL</indicator>
echo $ipserviceData->tidata->indicator, "\n"; # URL
By the way, the LIBXML_NOCDATA flag only controls whether the <![CDATA[...]]> parts are preserved as CDATA nodes or merged into text nodes.
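Once the CDATA payload has been parsed as its own document, each <tidata> entry can be looped over and turned into an array. This is a sketch only: the SOAP envelope below is a shortened, made-up stand-in for the real response (the example.com values are placeholders), and `$rows` is an illustrative variable name.

```php
<?php
$soapXml = <<<'XML'
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
<ns1:getIPServiceDataResponse xmlns:ns1="http://ws.icontent.idefense.com/V3/2">
<ns1:return>
<ns1:ipserviceData><![CDATA[<?xml version="1.0" encoding="utf-8"?><threat_indicators><tidata><indicator>URL</indicator><value>http://example.com/a</value><role>EXPLOIT</role></tidata><tidata><indicator>DOMAIN</indicator><value>example.com</value><role>EXPLOIT</role></tidata></threat_indicators>]]></ns1:ipserviceData>
</ns1:return>
</ns1:getIPServiceDataResponse>
</soapenv:Body>
</soapenv:Envelope>
XML;

$soap = simplexml_load_string($soapXml);
$soap->registerXPathNamespace('ns1', 'http://ws.icontent.idefense.com/V3/2');

// The CDATA content is plain text to the outer parser, so parse it a second time.
$nodes = $soap->xpath('//ns1:ipserviceData');
$inner = simplexml_load_string((string) $nodes[0]);

$rows = [];
foreach ($inner->tidata as $tidata) {
    $row = [];
    foreach ($tidata->children() as $field) {
        $row[$field->getName()] = (string) $field;
    }
    $rows[] = $row;
}

print_r($rows);
```

Each entry in `$rows` then holds one indicator with its fields keyed by element name.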

How to get a simple SimpleXMLElement value

I'm aware of how to drill down into the nodes of an xml document as described here:
http://www.php.net/manual/en/simplexml.examples-basic.php
but am at a loss on how to extract the value in the following example
$xmlStr = '<Error>Hello world. There is an Error</Error>';
$xml = simplexml_load_string($xmlStr);
simplexml_load_string returns an object of type SimpleXMLElement whose properties will have the data of the XML string.
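In your case <Error> is itself the root element, so the object returned by simplexml_load_string already represents it; there is no child property to drill into, and casting the object to a string gives its text.
If <Error> were instead nested inside another root element, you would read it as a property of that root: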
$xmlStr = '<xml><Error>Hello world. There is an Error</Error></xml>';
$xml = simplexml_load_string($xmlStr);
echo $xml->Error; // prints "Hello world. There is an Error"
What do you know. The value of the tag is just:
$error = $xml;
Thanks for looking :)
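A short sketch of that root-element case: the object returned by simplexml_load_string is the <Error> element itself, so an explicit `(string)` cast gives a plain PHP string rather than a SimpleXMLElement. The variable name `$error` follows the snippet above.

```php
<?php
$xmlStr = '<Error>Hello world. There is an Error</Error>';
$xml = simplexml_load_string($xmlStr);

// $xml is the root <Error> element; cast it to get its text as a string.
$error = (string) $xml;

echo $xml->getName(), ': ', $error, "\n"; // Error: Hello world. There is an Error
```

Without the cast, `$error` would still be an object, which matters for strict comparisons or for storing the value in a session.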

Getting cdata content while parsing xml file

I have an xml file
<?xml version="1.0" encoding="utf-8"?>
<xml>
<events date="01-10-2009" color="0x99CC00" selected="true">
<event>
<title>You can use HTML and CSS</title>
<description><![CDATA[This is the description ]]></description>
</event>
</events>
</xml>
I used XPath and XQuery for parsing the XML.
$xml_str = file_get_contents('xmlfile');
$xml = simplexml_load_string($xml_str);
if(!empty($xml))
{
$nodes = $xml->xpath('//xml/events');
}
I am getting the title properly, but I am not getting the description. How can I get the data inside the CDATA?
SimpleXML has a bit of a problem with CDATA, so use:
$xml = simplexml_load_file('xmlfile', 'SimpleXMLElement', LIBXML_NOCDATA);
if(!empty($xml))
{
$nodes = $xml->xpath('//xml/events');
}
print_r( $nodes );
This will give you:
Array
(
    [0] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [date] => 01-10-2009
                    [color] => 0x99CC00
                    [selected] => true
                )
            [event] => SimpleXMLElement Object
                (
                    [title] => You can use HTML and CSS
                    [description] => This is the description
                )
        )
)
You are probably being misled into thinking that the CDATA is missing by using print_r or one of the other "normal" PHP debugging functions. These cannot see the full content of a SimpleXML object, as it is not a "real" PHP object.
If you run echo $nodes[0]->event->description, you'll find your CDATA comes out fine. What's happening is that PHP knows that echo expects a string, so it asks SimpleXML for one; SimpleXML responds with all the string content, including CDATA.
To get at the full string content reliably, simply tell PHP that what you want is a string using the (string) cast operator, e.g. $description = (string)$nodes[0]->event->description.
To debug SimpleXML objects and not be fooled by quirks like this, use a dedicated debugging function such as one of these: https://github.com/IMSoP/simplexml_debug
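A small sketch of the cast described above, without LIBXML_NOCDATA. The XML is a shortened version of the file from the question, with a `<b>` tag added inside the CDATA purely to show that markup in the section comes through as text.

```php
<?php
$xmlStr = '<xml><events date="01-10-2009"><event>'
        . '<title>You can use HTML and CSS</title>'
        . '<description><![CDATA[This is the <b>description</b> ]]></description>'
        . '</event></events></xml>';

$xml   = simplexml_load_string($xmlStr);
$nodes = $xml->xpath('//xml/events');

// The string cast returns the element's text content, CDATA included.
$description = (string) $nodes[0]->event->description;

echo $description, "\n";
```

The same cast works whether or not the document was loaded with LIBXML_NOCDATA, so it is the safer habit when reading values.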
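Another option is to strip the CDATA markers from the string before parsing it, although this only produces well-formed XML when the CDATA content contains no characters such as < or & that would need escaping: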
$xml = str_replace("<![CDATA[", "", $xml);
$xml = str_replace("]]>", "", $xml);
