Parsing xml-like data

Parsing xml-like data - php

I have a string with xml-like data:
<header>Article header</header>
<description>This article is about you</description>
<text>some <b>html</b> text</text>
I need to parse it into variables/object/array "header", "description", "text".
What is the best way to do this? I tried $vars = simplexml_load_string($content), but it does not work, because it is not 100% pure xml (no <?xml...).
So, should I use preg_match? Is it the only way?

Your XML string looks like (though may or may not be) an XML document fragment. PHP can work with this using the DOMDocumentFragment class.
$doc = new DOMDocument;
$frag = $doc->createDocumentFragment();
$frag->appendXML($content);
$parsed = array();
foreach ($frag->childNodes as $element) {
if ($element->nodeType === XML_ELEMENT_NODE) {
$parsed[$element->nodeName] = $element->textContent;
}
}
echo $parsed['description']; // This article is about you

With a string like that, simlexml_load_string should work.
Because of the 3rd tag, if you try to get that it will fail, and not return the correct value (because there is a sub part within the tag.
Try something like this, which might work for you:
$xml = simplexml_load_string($content)
$text = $xml->text->asXML();
You should also take a look at this documentation: http://www.php.net/manual/en/simplexmlelement.asxml.php. They also do the same thing with the string. You might wanna use this option instead of simplexml_load_string too
$xml = new SimpleXMLElement($string);

Related

How to extract the text in a SimpleXmlElement object? [duplicate]

Given the php code:
$xml = <<<EOF
<articles>
<article>
This is a link
<link>Title</link>
with some text following it.
</article>
</articles>
EOF;
function traverse($xml) {
$result = "";
foreach($xml->children() as $x) {
if ($x->count()) {
$result .= traverse($x);
}
else {
$result .= $x;
}
}
return $result;
}
$parser = new SimpleXMLElement($xml);
traverse($parser);
I expected the function traverse() to return:
This is a link Title with some text following it.
However, it returns only:
Title
Is there a way to get the expected result using simpleXML (obviously for the purpose of consuming the data rather than just returning it as in this simple example)?

There might be ways to achieve what you want using only SimpleXML, but in this case, the simplest way to do it is to use DOM. The good news is if you're already using SimpleXML, you don't have to change anything as DOM and SimpleXML are basically interchangeable:
// either
$articles = simplexml_load_string($xml);
echo dom_import_simplexml($articles)->textContent;
// or
$dom = new DOMDocument;
$dom->loadXML($xml);
echo $dom->documentElement->textContent;
Assuming your task is to iterate over each <article/> and get its content, your code will look like
$articles = simplexml_load_string($xml);
foreach ($articles->article as $article)
{
$articleText = dom_import_simplexml($article)->textContent;
}

node->asXML();// It's the simple solution i think !!

So, the simple answer to my question was: Simplexml can't process this kind of XML. Use DomDocument instead.
This example shows how to traverse the entire XML. It seems that DomDocument will work with any XML whereas SimpleXML requires the XML to be simple.
function attrs($list) {
$result = "";
foreach ($list as $attr) {
$result .= " $attr->name='$attr->value'";
}
return $result;
}
function parseTree($xml) {
$result = "";
foreach ($xml->childNodes AS $item) {
if ($item->nodeType == 1) {
$result .= "<$item->nodeName" . attrs($item->attributes) . ">" . parseTree($item) . "</$item->nodeName>";
}
else {
$result .= $item->nodeValue;
}
}
return $result;
}
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xml);
print parseTree($xmlDoc->documentElement);
You could also load the xml using simpleXML and then convert it to DOM using dom_import_simplexml() as Josh said. This would be useful, if you are using simpleXml to filter nodes for parsing, e.g. using XPath.
However, I don't actually use simpleXML, so for me that would be taking the long way around.
$simpleXml = new SimpleXMLElement($xml);
$xmlDom = dom_import_simplexml($simpleXml);
print parseTree($xmlDom);
Thank you for all the help!

You can get the text node of a DOM element with simplexml just by treating it like a string:
foreach($xml->children() as $x) {
$result .= "$x"
However, this prints out:
This is a link
with some text following it.
TitleTitle
..because the text node is treated as one block and there is no way to tell where the child fits in inside the text node. The child node is also added twice because of the other else {}, but you can just take that out.
Sorry if I didn't help much, but I don't think there's any way to find out where the child node fits in the text node unless the xml is consistent (but then, why not use tags). If you know what element you want to strip the text out of, strip_tags() will work great.

This has already been answered, but CASTING TO STRING ( i.e. $sString = (string) oSimpleXMLNode->TagName) always worked for me.

Try this:
$parser = new SimpleXMLElement($xml);
echo html_entity_decode(strip_tags($parser->asXML()));
That's pretty much equivalent to:
$parser = simplexml_load_string($xml);
echo dom_import_simplexml($parser)->textContent;

Like #tandu said, it's not possible, but if you can modify your XML, this will work:
$xml = <<<EOF
<articles>
<article>
This is a link
</article>
<link>Title</link>
<article>
with some text following it.
</article>
</articles>

PHP: Keeping HTML inside XML node without CDATA

I've got an xml like this:
<father>
<son>Text with <b>HTML</b>.</son>
</father>
I'm using simplexml_load_string to parse it into SimpleXmlElement. Then I get my node like this
$xml->father->son->__toString(); //output: "Text with .", but expected "Text with <b>HTML</b>."
I need to handle simple HTML such as:
<b>text</b> or <br/> inside the xml which is sent by many users.
Me problem is that I can't just ask them to use CDATA because they won't be able to handle it properly, and they are already use to do without.
Also, if it's possible I don't want the file to be edited because the information need to be the one sent by the user.
The function simplexml_load_string simply erase anything inside HTML node and the HTML node itself.
How can I keep the information ?
SOLUTION
To handle the problem I used the asXml as explained by #ThW:
$tmp = $xml->father->son->asXml(); //<son>Text with <b>HTML</b>.</son>
I just added a preg_match to erase the node.

A CDATA section is a character node, just like a text node. But it does less encoding/decoding. This is mostly a downside, actually. On the upside something in a CDATA section might be more readable for a human and it allows for some BC in special cases. (Think HTML script tags.)
For an XML API they are nearly the same. Here is a small DOM example (SimpleXML abstracts to much).
$document = new DOMDocument();
$father = $document->appendChild(
$document->createElement('father')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createTextNode('With <b>HTML</b><br>It\'s so nice.')
);
$son = $father->appendChild(
$document->createElement('son')
);
$son->appendChild(
$document->createCDataSection('With <b>HTML</b><br>It\'s so nice.')
);
$document->formatOutput = TRUE;
echo $document->saveXml();
Output:
<?xml version="1.0"?>
<father>
<son>With <b>HTML</b><br>It's so nice.</son>
<son><![CDATA[With <b>HTML</b><br>It's so nice.]]></son>
</father>
As you can see they are serialized very differently - but from the API view they are basically exchangeable. If you're using an XML parser the value you get back should be the same in both cases.
So the first possibility is just letting the HTML fragment be stored in a character node. It is just a string value for the outer XML document itself.
The other way would be using XHTML. XHTML is XML compatible HTML. You can mix an match different XML formats, so you could add the XHTML fragment as part of the outer XML.
That seems to be what you're receiving. But SimpleXML has some problems with mixed nodes. So here is an example how you can read it in DOM.
$xml = <<<'XML'
<father>
<son>With <b>HTML</b><br/>It's so nice.</son>
</father>
XML;
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$result = '';
foreach ($xpath->evaluate('/father/son[1]/node()') as $child) {
$result .= $document->saveXml($child);
}
echo $result;
Output:
With <b>HTML</b><br/>It's so nice.
Basically you need to save each child of the son element as XML.
SimpleXML is based on the same DOM library internally. That allows you to convert a SimpleXMLElement into a DOM node. From there you can again save each child as XML.
$father = new SimpleXMLElement($xml);
$sonNode = dom_import_simplexml($father->son);
$document = $sonNode->ownerDocument;
$result = '';
foreach ($sonNode->childNodes as $child) {
$result .= $document->saveXml($child);
}
echo $result;

can't access xml node PHP

I have a page in php where I have to parse an xml.
I have done this for example:
$hotelNodes = $xml_data->getElementsByTagName('Hotel');
foreach($hotelNodes as $hotel){
$supplementsNodes2 = $hotel->getElementsByTagName('BoardBase');
foreach($supplementsNodes2 as $suppl2) {
echo'<p>HERE</p>'; //not enter here
}
}
}
In this code I access to each hotel of my xml, and foreach hotel I would like to search the tag BoardBase but it doesn0t enter inside it.
This is my xml (cutted of many parts!!!!!)
<hotel desc="DESC" name="Hotel">
<selctedsupplements>
<boardbases>
<boardbase bbpublishprice="0" bbprice="0" bbname="Colazione Continentale" bbid="1"></boardbase>
</boardbases>
</selctedsupplements>
</occupancy></occupancies>
</hotel>
I have many nodes that doesn't have BoardBase but sometimes there is but not enter.
Is possible that this node isn't accessible?
This xml is received by a server with a SoapClient.
If I inspect the XML printed in firebug I can see the node with opacity like this:
I have also tried this:
$supplementsNodes2 = $hotel->getElementsByTagName('boardbase');
but without success

2 issues I can see from the get-go: XML names are case-sensitive, hence:
$hotelNodes = $xml_data->getElementsByTagName('Hotel');
Can't work, because your xml node looks like:
<hotel desc="DESC" name="Hotel">
hotel => lower-case!
As you can see here:
[...] names for such things as elements, while XML is explicitly case sensitive.
The official specs specify tag names as case-sensitive, so getElementsByTagName('FOO') won't return the same elements as getElementsByTagName('foo')...
Secondly, you seem to have some tag-soup going on:
</occupancy></occupancies>
<!-- tag names don't match, both are closing tags -->
This is just plain invalid markup, it should read either:
<occupancy></occupancy>
or
<occupancies></occupancies>
That would be the first 2 ports of call.
I've set up a quick codepad using this code, which you can see here:
$xml = '<hotel desc="DESC" name="Hotel">
<selctedsupplements>
<boardbases>
<boardbase bbpublishprice="0" bbprice="0" bbname="Colazione Continentale" bbid="1"></boardbase>
</boardbases>
</selctedsupplements>
<occupancy></occupancy>
</hotel>';
$dom = new DOMDocument;
$dom->loadXML($xml);
$badList = $dom->getElementsByTagName('Hotel');
$correctList = $dom->getElementsByTagName('hotel');
echo sprintf("%d",$badList->lenght),
' compared to ',
$correctList->length, PHP_EOL;
The output was "0 compared to 1", meaning that using a lower-case selector returned 1 element, the one with the upper-case H returned an empty list.
To get to the boardbase tags for each hotel tag, you just have to write this:
$hotels = $dom->getElementsByTagName('html');
foreach($hotels as $hotel)
{
$supplementsNodes2 = $hotel->getElementsByTagName('boardbase');
foreach($supplementsNodes2 as $node)
{
var_dump($node);//you _will_ get here now
}
}
As you can see on this updated codepad.

Alessandro, your XML is a mess (=un casino), you really need to get that straight. Elias' answer pointed out some very basic stuff to consider.
I built on the code pad Elias has been setting up, it is working perfectly with me:
$dom = new DOMDocument;
$dom->loadXML($xml);
$hotels = $dom->getElementsByTagName('hotel');
foreach ($hotels as $hotel) {
$bbs = $hotel->getElementsByTagName('boardbase');
foreach ($bbs as $bb) echo $bb->getAttribute('bbname');
}
see http://codepad.org/I6oxkEOC

PHP Dealing with missing XML data

If I have three sets of data, say:
<note><from>Me</from><to>someone</to><message>hello</message></note>
<note><from>Me</from><to></to><message>Need milk & eggs</message></note>
<note><from>Me</from><message>Need milk & eggs</message></note>
and I'm using simplexml is there a way to have simple xml check that there's an empty/absent tag automatically?
I would like the output to be:
FROM TO MESSAGE
Me someone hello
Me NULL Need milk & eggs
Me NULL Need milk & eggs
Right now I'm doing it manually and I quickly realised that it's going to take a very long time to do it for long xml files.
My current sample code:
$xml = simplexml_load_string($string);
if ($xml->from != "") {$out .= $xml->from."\t"} else {$out .= "NULL\t";}
//repeat for all children, checking by name
Sometimes the order is different as well, there might be a xml with:
<note><message>pick up cd</message><from>me</from></note>
so iterating through the children and checking by index count doesn't work.
The actual xml files I'm working with are thousands of lines each, so I obviously can't just code in every tag.

It sounds like you need a DTD (Document Type Definition), which will define the required format of the XML file, and specify which elements are required, optional, what they can contain, etc.
DTDs can be used to validate an XML file before you do any processing with it.
Unfortunately, PHP's simplexml library doesn't do anything with DTD, but the DomDocument library does, so you may want to use that instead.
I'll leave it as a separate excersise for you to research how to create a DTD file. If you need more help with that, I'd suggest asking it as a separate question.

You could use the DOMDocument instead. I have created a quick demo that splits the <note> elements into an array using the XML tag names as keys. You could then iterate the resultant array to create your output.
I corrected the invalid XML by replacing the ampersand with the HTML entity equivalent (&).
<?php
libxml_use_internal_errors(true);
$xml = <<<XML
<notes>
<note><from>Me</from><to>someone</to><message>hello</message></note>
<note><from>Me</from><to></to><message>Need milk & eggs</message></note>
<note><from>Me</from><message>Need milk & eggs</message></note>
<note><message>pick up cd</message><from>me</from></note>
</notes>
XML;
function getNotes($nodelist) {
$notes = array();
foreach ($nodelist as $node) {
$noteParts = array();
foreach ($node->childNodes as $child) {
$noteParts[$child->tagName] = $child->nodeValue;
}
$notes[] = $noteParts;
}
return $notes;
}
$dom = new DOMDocument();
$dom->recover = true;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
$nodelist = $xpath->query("//note");
$notes = getNotes($nodelist);
print_r($notes);
?>
Edit: If you change to $noteParts = array(); to $noteParts = array('from' => null, 'to' => null, 'message' => null); then it will always create the full set of keys.

Getting the text portion of a node using php Simple XML

Given the php code:
$xml = <<<EOF
<articles>
<article>
This is a link
<link>Title</link>
with some text following it.
</article>
</articles>
EOF;
function traverse($xml) {
$result = "";
foreach($xml->children() as $x) {
if ($x->count()) {
$result .= traverse($x);
}
else {
$result .= $x;
}
}
return $result;
}
$parser = new SimpleXMLElement($xml);
traverse($parser);
I expected the function traverse() to return:
This is a link Title with some text following it.
However, it returns only:
Title
Is there a way to get the expected result using simpleXML (obviously for the purpose of consuming the data rather than just returning it as in this simple example)?

There might be ways to achieve what you want using only SimpleXML, but in this case, the simplest way to do it is to use DOM. The good news is if you're already using SimpleXML, you don't have to change anything as DOM and SimpleXML are basically interchangeable:
// either
$articles = simplexml_load_string($xml);
echo dom_import_simplexml($articles)->textContent;
// or
$dom = new DOMDocument;
$dom->loadXML($xml);
echo $dom->documentElement->textContent;
Assuming your task is to iterate over each <article/> and get its content, your code will look like
$articles = simplexml_load_string($xml);
foreach ($articles->article as $article)
{
$articleText = dom_import_simplexml($article)->textContent;
}

node->asXML();// It's the simple solution i think !!

So, the simple answer to my question was: Simplexml can't process this kind of XML. Use DomDocument instead.
This example shows how to traverse the entire XML. It seems that DomDocument will work with any XML whereas SimpleXML requires the XML to be simple.
function attrs($list) {
$result = "";
foreach ($list as $attr) {
$result .= " $attr->name='$attr->value'";
}
return $result;
}
function parseTree($xml) {
$result = "";
foreach ($xml->childNodes AS $item) {
if ($item->nodeType == 1) {
$result .= "<$item->nodeName" . attrs($item->attributes) . ">" . parseTree($item) . "</$item->nodeName>";
}
else {
$result .= $item->nodeValue;
}
}
return $result;
}
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xml);
print parseTree($xmlDoc->documentElement);
You could also load the xml using simpleXML and then convert it to DOM using dom_import_simplexml() as Josh said. This would be useful, if you are using simpleXml to filter nodes for parsing, e.g. using XPath.
However, I don't actually use simpleXML, so for me that would be taking the long way around.
$simpleXml = new SimpleXMLElement($xml);
$xmlDom = dom_import_simplexml($simpleXml);
print parseTree($xmlDom);
Thank you for all the help!

You can get the text node of a DOM element with simplexml just by treating it like a string:
foreach($xml->children() as $x) {
$result .= "$x"
However, this prints out:
This is a link
with some text following it.
TitleTitle
..because the text node is treated as one block and there is no way to tell where the child fits in inside the text node. The child node is also added twice because of the other else {}, but you can just take that out.
Sorry if I didn't help much, but I don't think there's any way to find out where the child node fits in the text node unless the xml is consistent (but then, why not use tags). If you know what element you want to strip the text out of, strip_tags() will work great.

This has already been answered, but CASTING TO STRING ( i.e. $sString = (string) oSimpleXMLNode->TagName) always worked for me.

Try this:
$parser = new SimpleXMLElement($xml);
echo html_entity_decode(strip_tags($parser->asXML()));
That's pretty much equivalent to:
$parser = simplexml_load_string($xml);
echo dom_import_simplexml($parser)->textContent;

Like #tandu said, it's not possible, but if you can modify your XML, this will work:
$xml = <<<EOF
<articles>
<article>
This is a link
</article>
<link>Title</link>
<article>
with some text following it.
</article>
</articles>

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Parsing xml-like data - php

Related

How to extract the text in a SimpleXmlElement object? [duplicate]

PHP: Keeping HTML inside XML node without CDATA

can't access xml node PHP

PHP Dealing with missing XML data

Getting the text portion of a node using php Simple XML

Categories

Resources