iterating over unknown XML structure with PHP (DOM)

iterating over unknown XML structure with PHP (DOM) - php

I want to write a function that parses a (theoretically) unknown XML data structure into an equivalent PHP array.
Here is my sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<content>
<title>Sample Text</title>
<introduction>
<paragraph>This is some rudimentary text</paragraph>
</introduction>
<description>
<paragraph>Here is some more text</paragraph>
<paragraph>Even MORE text</paragraph>
<sub_section>
<sub_para>This is a smaller, sub paragraph</sub_para>
<sub_para>This is another smaller, sub paragraph</sub_para>
</sub_section>
</description>
</content>
I modified this DOM iterating function from devarticles:
$data = 'path/to/xmldoc.xml';
$xmlDoc = new DOMDocument(); #create a DOM element
$xmlDoc->load( $data ); #load data into the element
$xmlRoot = $xmlDoc->firstChild; #establish root
function xml2array($node)
{
if ($node->hasChildNodes())
{
$subNodes = $node->childNodes;
foreach ($subNodes as $subNode)
{
#filter node types
if (($subNode->nodeType != 3) || (($subNode->nodeType == 3)))
{
$arraydata[$subNode->nodeName]=$subNode->nodeValue;
}
xml2array($subNode);
}
}
return $arraydata;
}
//The getNodesInfo function call
$xmlarray = xml2array($xmlRoot);
// print the output - with a little bit of formatting for ease of use...
foreach($xmlarray as $xkey)
{
echo"$xkey<br/><br/>";
}
Now, because of the way I'm passing the elements to the array I'm overwriting any elements that share a node name (since I ideally want to give the keys the same names as their originating nodes). My recursion isn't great... However, even if I empty the brackets - the second tier of nodes are still coming in as values on the first tier (see the text of the description node).
Anyone got any ideas how I can better construct this?

You might be better off just snagging some code off the net
http://www.bin-co.com/php/scripts/xml2array/
/**
* xml2array() will convert the given XML text to an array in the XML structure.
* Link: http://www.bin-co.com/php/scripts/xml2array/
* Arguments : $contents - The XML text
* $get_attributes - 1 or 0. If this is 1 the function will get the attributes as well as the tag values - this results in a different array structure in the return value.
* $priority - Can be 'tag' or 'attribute'. This will change the way the resulting array sturcture. For 'tag', the tags are given more importance.
* Return: The parsed XML in an array form. Use print_r() to see the resulting array structure.
* Examples: $array = xml2array(file_get_contents('feed.xml'));
* $array = xml2array(file_get_contents('feed.xml', 1, 'attribute'));
*/
function xml2array($contents, $get_attributes=1, $priority = 'tag') {

You might be interested in SimpleXML or xml_parse_into_struct.
$arraydata is neither passed to subsequent calls to xml2array() nor is the return value used, so yes "My recursion isn't great..." is true ;-)
To append a new element to an existing array you can use empty square brackets, $arr[] = 123; $arr[$x][] = 123;

You might also want to check out XML Unserializer
http://pear.php.net/package/XML_Serializer/redirected

Related

PHP Parse HTML file [duplicate]

I'm parsing some XML with PHP DOM extension in order to store the data in some other form. Quite unsurprisingly, when I parse an element I pretty often need to obtain all children elements of some name. There is the method DOMElement::getElementsByTagName($name), but it returns all descendants with that name, not just immediate children. There is also the property DOMNode::$childNodes but (1) it contains node list, not element list, and even if I managed to turn the list items into elements (2) I'd still need to check all of them for the name. Is there really no elegant solution to get only the children of some specific name or am I missing something in the documentation?
Some illustration:
<?php
DOMDocument();
$document->loadXML(<<<EndOfXML
<a>
<b>1</b>
<b>2</b>
<c>
<b>3</b>
<b>4</b>
</c>
</a>
EndOfXML
);
$bs = $document
->getElementsByTagName('a')
->item(0)
->getElementsByTagName('b');
foreach($bs as $b){
echo $b->nodeValue . "\n";
}
// Returns:
// 1
// 2
// 3
// 4
// I'd like to obtain only:
// 1
// 2
?>

simple iteration process
$parent = $p->parentNode;
foreach ( $parent->childNodes as $pp ) {
if ( $pp->nodeName == 'p' ) {
if ( strlen( $pp->nodeValue ) ) {
echo "{$pp->nodeValue}\n";
}
}
}

An elegant manner I can imagine would be using a FilterIterator that is suitable for the job. Exemplary one that is able to work on such a said DOMNodeList and (optionally) accepting a tagname to filter for as an exemplary DOMElementFilter from the Iterator Garden does:
$a = $doc->getElementsByTagName('a')->item(0);
$bs = new DOMElementFilter($a->childNodes, 'b');
foreach($bs as $b){
echo $b->nodeValue . "\n";
}
This will give the results you're looking for:
1
2
You can find DOMElementFilter in the Development branch now. It's perhaps worth to allow * for any tagname as it's possible with getElementsByTagName("*") as well. But that's just some commentary.
Hier is a working usage example online: https://eval.in/57170

My solution used in a production:
Finds a needle (node) in a haystack (DOM)
function getAttachableNodeByAttributeName(\DOMElement $parent = null, string $elementTagName = null, string $attributeName = null, string $attributeValue = null)
{
$returnNode = null;
$needleDOMNode = $parent->getElementsByTagName($elementTagName);
$length = $needleDOMNode->length;
//traverse through each existing given node object
for ($i = $length; --$i >= 0;) {
$needle = $needleDOMNode->item($i);
//only one DOM node and no attributes specified?
if (!$attributeName && !$attributeValue && 1 === $length) return $needle;
//multiple nodes and attributes are specified
elseif ($attributeName && $attributeValue && $needle->getAttribute($attributeName) === $attributeValue) return $needle;
}
return $returnNode;
}
Usage:
$countryNode = getAttachableNodeByAttributeName($countriesNode, 'country', 'iso', 'NL');
Returns DOM element from parent countries node by specified attribute iso using country ISO code 'NL', basically like a real search would do. Find a certain country by it's name in an array / object.
Another usage example:
$productNode = getAttachableNodeByAttributeName($products, 'partner-products');
Returns DOM node element containing only single (root) node, without searching by any attribute.
Note: for this you must make sure that root nodes are unique by elements' tag name, e.g. countries->country[ISO] - countries node here is unique and parent to all child nodes.

PHP DOM: How to get child elements by tag name in an elegant manner?

I'm parsing some XML with PHP DOM extension in order to store the data in some other form. Quite unsurprisingly, when I parse an element I pretty often need to obtain all children elements of some name. There is the method DOMElement::getElementsByTagName($name), but it returns all descendants with that name, not just immediate children. There is also the property DOMNode::$childNodes but (1) it contains node list, not element list, and even if I managed to turn the list items into elements (2) I'd still need to check all of them for the name. Is there really no elegant solution to get only the children of some specific name or am I missing something in the documentation?
Some illustration:
<?php
DOMDocument();
$document->loadXML(<<<EndOfXML
<a>
<b>1</b>
<b>2</b>
<c>
<b>3</b>
<b>4</b>
</c>
</a>
EndOfXML
);
$bs = $document
->getElementsByTagName('a')
->item(0)
->getElementsByTagName('b');
foreach($bs as $b){
echo $b->nodeValue . "\n";
}
// Returns:
// 1
// 2
// 3
// 4
// I'd like to obtain only:
// 1
// 2
?>

simple iteration process
$parent = $p->parentNode;
foreach ( $parent->childNodes as $pp ) {
if ( $pp->nodeName == 'p' ) {
if ( strlen( $pp->nodeValue ) ) {
echo "{$pp->nodeValue}\n";
}
}
}

An elegant manner I can imagine would be using a FilterIterator that is suitable for the job. Exemplary one that is able to work on such a said DOMNodeList and (optionally) accepting a tagname to filter for as an exemplary DOMElementFilter from the Iterator Garden does:
$a = $doc->getElementsByTagName('a')->item(0);
$bs = new DOMElementFilter($a->childNodes, 'b');
foreach($bs as $b){
echo $b->nodeValue . "\n";
}
This will give the results you're looking for:
1
2
You can find DOMElementFilter in the Development branch now. It's perhaps worth to allow * for any tagname as it's possible with getElementsByTagName("*") as well. But that's just some commentary.
Hier is a working usage example online: https://eval.in/57170

My solution used in a production:
Finds a needle (node) in a haystack (DOM)
function getAttachableNodeByAttributeName(\DOMElement $parent = null, string $elementTagName = null, string $attributeName = null, string $attributeValue = null)
{
$returnNode = null;
$needleDOMNode = $parent->getElementsByTagName($elementTagName);
$length = $needleDOMNode->length;
//traverse through each existing given node object
for ($i = $length; --$i >= 0;) {
$needle = $needleDOMNode->item($i);
//only one DOM node and no attributes specified?
if (!$attributeName && !$attributeValue && 1 === $length) return $needle;
//multiple nodes and attributes are specified
elseif ($attributeName && $attributeValue && $needle->getAttribute($attributeName) === $attributeValue) return $needle;
}
return $returnNode;
}
Usage:
$countryNode = getAttachableNodeByAttributeName($countriesNode, 'country', 'iso', 'NL');
Returns DOM element from parent countries node by specified attribute iso using country ISO code 'NL', basically like a real search would do. Find a certain country by it's name in an array / object.
Another usage example:
$productNode = getAttachableNodeByAttributeName($products, 'partner-products');
Returns DOM node element containing only single (root) node, without searching by any attribute.
Note: for this you must make sure that root nodes are unique by elements' tag name, e.g. countries->country[ISO] - countries node here is unique and parent to all child nodes.

Comparing 2 XML Files using PHP

I want to compare 2 big xml files and retrieve the differences. Like ExamXML and DiffDog do. The solution I found was cycling through all child nodes of each file simultaneously and check if they are equal. But I have no idea how to achieve that... How can I loop through all child nodes and their properties? How can I check if the first element of the first file is equal to the first element of the second file, the second element of the first file is equal to the second element of the second file and so on?
Do yo have a better idea to compare 2 xml files?

I was looking for something to compare two XML like you, and I found this solution that works very well.
http://www.jevon.org/wiki/Comparing_Two_SimpleXML_Documents
I hope that helps to someone.

Have you looked at using XPath at all? Seems like an easy way to grab all of the child nodes. Then you'd be able to loop through the nodes and compare the attributes/textContent.

This might be a very alternative solution for you but this is how I would do it.
First, I'd try to get the format into something much more manageable like an array so I would convert the XML to an array.
http://www.bytemycode.com/snippets/snippet/445/
This is some simple code to do just that.
Then PHP has an array_diff() function that can show you the differences.
http://www.php.net/manual/en/function.array-diff.php
This may or may not work for you considering what you need to do with the differences but if you're looking to just identify and act upon them this might be a very quick solution to your problem.

Try the xmldiff extension
http://pecl.php.net/xmldiff
It's based on the same library as the perl module DifferenceMarkup, you'll get a diff XML document and can even merge then.

//Child by Child XML files comparison in PHP
//Returns an array of non matched children in variable &$reasons
$reasons = array();
$xml1 = new SimpleXMLElement(file_get_contents($xmlFile1));
$xml2 = new SimpleXMLElement(file_get_contents($xmlFile2));
$result = XMLFileComparison($xml1, $xml2, $reasons);
/**
* XMLFileComparison
* Discription :- This function compares XML files. Returns array
* of nodes do not match in pass by reference parameter
* #param $xml1 Object Node Object
* #param $xml2 Object Node Object
* #param &$reasons Array pass by reference
* returns array of nodes do not match
* #param $strict_comparison Bool default False
* #return bool <b>TRUE</b> on success or array of strings on failure.
*/
function XMLFileComparison(SimpleXMLElement $xml1, SimpleXMLElement $xml2, &$reasons, $strict_comparison = false)
{
static $str;
// compare text content
if ($strict_comparison) {
if ("$xml1" != "$xml2") return "Values are not equal (strict)";
} else {
if (trim("$xml1") != trim("$xml2"))
{
return " Values are not equal";
}
}
// get all children
$XML1ChildArray = array();
$XML2ChildArray = array();
foreach ($xml1->children() as $b) {
if (!isset($XML1ChildArray[$b->getName()]))
$XML1ChildArray[$b->getName()] = array();
$XML1ChildArray[$b->getName()][] = $b;
}
foreach ($xml2->children() as $b) {
if (!isset($XML2ChildArray[$b->getName()]))
$XML2ChildArray[$b->getName()] = array();
$XML2ChildArray[$b->getName()][] = $b;
}
//print_r($XML1ChildArray);
//print_r($XML2ChildArray);
// cycle over children
if (count($XML1ChildArray) != count($XML2ChildArray)) return "mismatched children count";// Second File has less or more children names (we don't have to search through Second File's children too)
foreach ($XML1ChildArray as $child_name => $children) {
if (!isset($XML2ChildArray[$child_name])) return "Second file does not have child $child_name"; // Second file has none of this child
if (count($XML1ChildArray[$child_name]) != count($XML2ChildArray[$child_name])) return "mismatched $child_name children count"; // Second file has less or more children
print_r($child_name);
foreach ($children as $child) {
// do any of search2 children match?
$found_match = false;
//$reasons = array();
foreach ($XML2ChildArray[$child_name] as $id => $second_child) {
$str = $str.$child_name.($id+1)."/"; // Adding 1 to $id to match with XML data nodes numbers
//print_r($child, $second_child);
// recursive function call until reach to the end of node
if (($r = XMLFileComparison($child, $second_child, $reasons, $strict_comparison)) === true) {
// found a match: delete second
$found_match = true;
unset($XML2ChildArray[$child_name][$id]);
$str = str_replace($child_name.($id+1)."/", "", $str);
break;
}
else {
unset($XML2ChildArray[$child_name][$id]);
$reasons[$str] = $r;
$str = str_replace($child_name.($id+1)."/", "", $str);
break;
}
}
}
}
return True;
}

PHP: How to convert array to XML with support to attributes (DOMi ?)

I'm using DOMi ( http://domi.sourceforge.net ) to create XML from arrays.
But I don't know how to create attributes in these XML (in arrays, so these attributes appear in the XML). How can I construct these arrays so I can get some tags with attributes after the convertion?
Thank you!

Looking at the source code, apparently you pass the second argument "attributes" to attachToXml:
public function attachToXml($data, $prefix, &$parentNode = false) {
if(!$parentNode) {
$parentNode = &$this->mainNode;
}
// i don't like how this is done, but i can't see an easy alternative
// that is clean. if the prefix is attributes, instead of creating
// a node, just put all of the data onto the parent node as attributes
if(strtolower($prefix) == 'attributes') {
// set all of the attributes onto the node
foreach($data as $key=>$val)
$parentNode->setAttribute($key, $val);
$node = &$parentNode;
}
//...
}

php and XML range into array

The code below helps be to get the WHOLE XML and put it into an array. What I'm wondering is, what would be a good way to get the XML only from item 3 - 6 or any arbitrary range instead of the whole document.
$d = new DOMDocument();
$d->load('http://news.google.com/?output=rss');
foreach ($d->getElementsByTagName('item') as $t) {
$list = array ( 'title' => $t->getElementsByTagName('title')->item(0)->nodeValue);
array_push($mt_arr, $list);
}
Thanks

You can use Xpath.
You can either use DOMXpath or use the xpath method to create an Xpath query that will return the subset of nodes.
$d->xpath('/SOME/XPATH/STATEMENT');

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

iterating over unknown XML structure with PHP (DOM) - php

You might also want to check out XML Unserializer http://pear.php.net/package/XML_Serializer/redirected

Related

PHP Parse HTML file [duplicate]

PHP DOM: How to get child elements by tag name in an elegant manner?

Comparing 2 XML Files using PHP

PHP: How to convert array to XML with support to attributes (DOMi ?)

php and XML range into array

Categories

Resources