Get xpath of xml node within recursive function

Get xpath of xml node within recursive function - php

Lets say i have some code to iterate through an XML file recursively like this:
$xmlfile = new SimpleXMLElement('http://www.domain.com/file.xml',null,true);
xmlRecurse($xmlfile,0);
function xmlRecurse($xmlObj,$depth) {
foreach($xmlObj->children() as $child) {
echo str_repeat('-',$depth).">".$child->getName().": ".$subchild."\n";
foreach($child->attributes() as $k=>$v){
echo "Attrib".str_repeat('-',$depth).">".$k." = ".$v."\n";
}
xmlRecurse($child,$depth+1);
}
}
How would i calculate the xpath of each node so i can store it for mapping to other code?

The obvious way to do it is to pass the XPath as a third parameter and build it as you dig deeper. You have to account for siblings having the same name, so you have to keep track of the number of precedent siblings with the same name as current child while iterating.
Working example:
function xmlRecurse($xmlObj,$depth=0,$xpath=null) {
if (!isset($xpath)) {
$xpath='/'.$xmlObj->getName().'/';
}
$position = array();
foreach($xmlObj->children() as $child) {
$name = $child->getName();
if(isset($position[$name])) {
++$position[$name];
}
else {
$position[$name]=1;
}
$path=$xpath.$name.'['.$position[$name].']';
echo str_repeat('-',$depth).">".$name.": $path\n";
foreach($child->attributes() as $k=>$v){
echo "Attrib".str_repeat('-',$depth).">".$k." = ".$v."\n";
}
xmlRecurse($child,$depth+1,$path.'/');
}
}
Attention though, the whole idea of mapping a whole document and storing XPath along the way seems weird. You might actually be working on the wrong solution to a totally different problem.

You can pass to your xmlRecurse third param called $xpath (with current node xPath representation) and add xpath representation of the children on each iteration:
function xmlRecurse($xmlObj,$depth,$xpath) {
$i=0;
foreach($xmlObj->children() as $child) {
echo str_repeat('-',$depth).">".$child->getName().": ".$subchild."\n";
foreach($child->attributes() as $k=>$v){
echo "Attrib".str_repeat('-',$depth).">".$k." = ".$v."\n";
}
xmlRecurse($child,$depth+1,$xpath.'/'.$child->getName().'['.$i++.']');
}
}

With SimpleXML, I think you can only do it as others have pointed out: by recursing the node path as a string argument.
With DOMDocument, you could use the $node->parentNode property to crawl back to the document element and construct it for an arbitrary node (for example if you had a reference to a node and wanted to discover where in the tree it was without prior knowledge of how you got to that node).

$domNode = dom_import_simplexml($node);
$xpath = $domNode->getNodePath();
You need PHP 5 >= 5.2.0 for this to work.

Following up on MightyE's idea about backtracking:
function whereami($node)
{
if ($node instanceof SimpleXMLElement)
{
$node = dom_import_simplexml($node);
}
elseif (!$node instanceof DOMNode)
{
die('Not a node?');
}
$q = new DOMXPath($node->ownerDocument);
$xpath = '';
do
{
$position = 1 + $q->query('preceding-sibling::*[name()="' . $node->nodeName . '"]', $node)->length;
$xpath = '/' . $node->nodeName . '[' . $position . ']' . $xpath;
$node = $node->parentNode;
}
while (!$node instanceof DOMDocument);
return $xpath;
}
I wouldn't recommend it for the case at hand (mapping a whole document, as opposed to a single given node) but it might be useful for future reference.

Related

How to get a list of all html elements in PHP?

According to the documentation for DOMDocument::getElementsByTagName, I can call the function with "*" argument, and get a list of all HTML elements from some HTML code.
However, with the following code:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>hello</div><div>bye</div></body></html>");
$nodes = $dom->getElementsByTagName("*");
foreach ($nodes as $node) {
$new_text= new DOMText($node->textContent."MODIFIED");
$node->removeChild($node->firstChild);
$node->appendChild($new_text);
}
$content = $dom->saveHTML();
echo $content;
?>
I get a list of only one element, and the result of execution of the code above is:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>hellobyeMODIFIED</html>
while I would expect something like this:
<html><body><div>helloMODIFIED</div><div>byeMODIFIED</div></body></html>
Shouldn't DOMDocument::getElementsByTagName method return a list of as many HTML elements as available in the HTML code?
Note: I need to create DOMText instances explicitly, because I need this to work in PHP 5.4. DOMNode::textContent is accessible for writing only from PHP 5.6

The DOMDocument::getElementsByTagName method actually returns all the tags, if the first argument is '*'. But your code replaces <body> tag (including all child nodes) with a text node at the first iteration.
Iterate the nodes, and modify only the nodes with nodeType property equal to XML_TEXT_NODE:
$nodes = $dom->getElementsByTagName('*');
foreach ($nodes as $node) {
for ($child = $node->firstChild; $child; $child = $child->nextSibling) {
if (! ($child->nodeType === XML_TEXT_NODE && trim($child->textContent))) {
continue;
}
// The textContent is writable since PHP 5.6.1
if (PHP_VERSION_ID >= 50601) {
$child->textContent .= 'MODIFIED';
continue;
}
// For older versions, create DOMText explicitly
$text = new DOMText($child->textContent . 'MODIFIED');
try {
if ($child->parentNode->replaceChild($text, $child))
$child = $text;
} catch (Exception $e) {
trigger_error("Failed to modify text '$child->textContent': "
. $e->getMessage(), E_USER_WARNING);
}
}
}
echo $dom->saveHTML();
Note, for PHP versions 5.6.1 and newer, you don't need to create DOMText instances explicitly, since the DOMNode::textContent property is accessible for read and write. So you can simply modify the text by assigning a string value to this property. Only make sure that the node has no child nodes other than XML_TEXT_NODE.
The code above checks if trim($child->textContent) is not empty, because the document may contain extra space characters (including newline), e.g.:
<div><!-- newline/spaces -->
<span>text</span><!-- newline/spaces -->
</div><!-- newline/spaces -->

This function 'DOMDocument::getElementsByTagName' returns a new instance of class DOMNodeList containing all the elements.
And it works fine:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>hello</div><div>bye</div></body></html>");
$nodes = $dom->getElementsByTagName("*");
foreach ($nodes as $node) {
echo $node->tagName."<br />";
}
?>
it output all tags of your document.
Probably you need smth like:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>hello</div><div>bye</div></body></html>");
$nodes = $dom->getElementsByTagName("*");
foreach ($nodes as $node) {
if ($node->tagName=='div'){
$node->nodeValue .= "new content";
}
}
$content = $dom->saveHTML();
echo htmlspecialchars($content);
?>

Try this:-
foreach($dom->getElementsByTagName('*') as $element ){
}

php XML DOM - updating XML dom elements

I have an XML that looks like this:
<nitf:body.content>
<nitf:block>
<nitf:p style="#style1">Contents of paragraph1.</nitf:p>
<nitf:p style="#style2">Contents of paragraph2.</nitf:p>
<nitf:p style="#style1"><nitf:em class="#bold">This is bold</nitf:em> This is not bold</nitf:p>
<nitf:p style="#style1"><nitf:em class="#italic">This is italic</nitf:em> This is not italic</nitf:p>
</nitf:block>
</nitf:body.content>
And I made a loop to update the text of all nitf:em tags as following:
foreach($this->doc->getElementsByTagNameNS($this->nitfNS, 'em') as $em) {
$class = $em->getAttribute('class');
if ($class == '#italic') {
$em->nodeValue = '<i>' . $em->nodeValue . '</i>';
}
elseif (strpos($class, 'bold') !== FALSE) {
$em->nodeValue = '<b>' . $em->nodeValue . '</b>';
}
$this->doc->saveXML($em);
}
Now when I loop again through the paragraph elements, the paragraphs that should be updated by the previous loop are all empty.
foreach ($this->doc->getElementsByTagNameNS($this->nitfNS, 'p') as $element) {
$textnode = $element->childNodes->item(0);
$txt = $textnode->wholeText; // this is EMPTY now
}
I read somewhere that"<>" characters might mess up the DOM parser. If that is the case here how can I update the em elements with the desired html tags (italic & bold).
Thanks in advance

You have made 2 mistakes. One is the property $textnode->wholeText - it does not exists. If you like to fetch the text content use $textnode->textContent.
The other mistake is setting DOMElement::$nodeValue with some XML fragment. That will not work. The property does contain only text, not the tags. In fact you should never set it to anything else then an empty string (to delete all child nodes). The escaping is broken.
For your problem create a new node, move all child nodes from the em to it and append the new node back to the em.
$document = new DOMDocument();
$document->loadXml($xml);
foreach($document->getElementsByTagNameNS($nitfNS, 'em') as $em) {
$class = $em->getAttribute('class');
$newNode = FALSE;
if ($class == '#italic') {
$newNode = $document->createElement('i');
} elseif (strpos($class, 'bold') !== FALSE) {
$newNode = $document->createElement('b');
}
if ($newNode) {
while ($em->firstChild) {
$newNode->appendChild($em->firstChild);
}
$em->appendChild($newNode);
}
echo $document->saveXML($em), "\n\n";
}
Output:
<nitf:em class="#bold"><b>This is bold</b></nitf:em>
<nitf:em class="#italic"><i>This is italic</i></nitf:em>

DomNode get value of item

Hello I'm new with domnode and i'm trying to check the values from an xml tree which loads ok.
Here is my code but I dont understand why is not working.
private function createCSV($xml, $f)
{
foreach ($xml->getElementsByTagName('*') as $item)
{
$hasChild = $item->hasChildNodes() ? true : false;
if(!$hasChild)
{
//echo 'Doesn\'t have children';
echo 'Value: ' . $item->nodeValue;
}
else
{
//echo 'Has children';
$this->createCSV($item, $f);
}
}
}
$item->nodeValue doesnt print anything to the browser.
I read the documentation but I can't see any mistake.
PS. $item->tagname doesnt work either.
UPDATE
whe using this: echo $item->ownerDocument->saveHTML($item);
I get the tags listed but i dont get the data inside(between the tags) like innerHTML in javascript.
UPDATE
sample xml data : http://pastebin.com/dkuUUC0Q

Text nodes are also considered child nodes, but you're only iterating element nodes (get Elements ByTagName). Because of this you're almost never getting into the 2nd condition.
Try this:
if(!$xml->hasChildNodes()){
printf('Value: %s', $xml->nodeValue);
return;
}
foreach($xml->childNodes as $item)
$this->createCSV($item, $f);
XPath version:
$xpath = new DOMXPath($xml);
$text = $xpath->query('//text()[normalize-space()]');
foreach($text as $node)
printf('Value: %s', $node->nodeValue);

How to remove invalid element from DOM?

We have the following code that lists the xpaths where $value is found.
We have detected for a given URL (see on picture) a non standard tag td1 which in addition doesn't have a closing tag. Probably the site developers have put that there intentionally, as you see in the screen shot below.
This element creates problems identifying the corect XPath for nodes.
A broken Xpath example :
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/**td1**/td[2]/span/u[1]
(as you see td1 is identified and chained in the Xpath)
We think by removing this element it helps us to build the valid XPath we are after.
A valid example is
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/td[2]/span/u[1]
How can we remove prior loading in DOMXpath? Do you have some other approach?
We would like to remove all the invalid tags which may be other than td1, as h8, diw, etc...
private function extract($url, $value) {
$dom = new DOMDocument();
$file = 'content.txt';
//$current = file_get_contents($url);
$current = CurlTool::downloadFile($url, $file);
//file_put_contents($file, $current);
#$dom->loadHTMLFile($current);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom);
$elements = $dom_xpath->query("//*[text()[contains(., '" . $value . "')]]");
var_dump($elements);
if (!is_null($elements)) {
foreach ($elements as $element) {
var_dump($element);
echo "\n1.[" . $element->nodeName . "]\n";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
if( ($node->nodeValue != null) && ($node->nodeValue === $value) ) {
echo '2.' . $node->nodeValue . "\n";
$xpath = preg_replace("/\/text\(\)/", "", $node->getNodePath());
echo '3.' . $xpath . "\n";
}
}
}
}
}

You could use XPath to find the offending nodes and remove them, while promoting its children into its place in the DOM. Then your paths will be correct.
$dom_xpath = new DOMXpath($dom);
$results = $dom_xpath->query('//td1'); // (or any offending element)
foreach ($results as $invalidNode)
{
$parentNode = $invalidNode->parentNode;
while ($invalidNode->childNodes)
{
$firstChild = $invalidNode->firstChild;
$parentNode->insertBefore($firstChild,$invalidNode);
}
$parentNode->removeChild($invalidNode);
}
EDIT:
You could also build a list of offending elements by using a list of valid elements and negating it.
// Build list manually from the HTML spec:
// See: http://www.w3.org/TR/html5/section-index.html#elements-1
$validTags = array();
// Convert list to XPath:
$validTagsStr = '';
foreach ($validTags as $tag)
{
if ($validTagsStr)
{ $validTagsStr .= ' or '; }
$validTagsStr .= 'self::'.$tag;
}
$results = $dom_xpath->query('//*[not('.$validTagsStr.')');

Sooo... perhaps str_replace($current, "<td1 va-laign=\"top\">", "") could do the trick?

ignoring nested elements when parsing xml with php

probably a simple question to answer for someone:::
xml:
<foobar>
<foo>i am a foo</foo>
<bar>i am a bar</bar>
<foo>i am a <bar>bar</bar></foo>
</foobar>
In the above, I want to display all elements that are <foo>. When the script gets to the line with the nested < bar > the result is "i am a bar" .. which isn't the result I had hoped for.
Is it not possible to print out the entire contents of that element as it is, so that i see: "i am a <bar>bar</bar>"
php:
$xml = file_get_contents('sample');
$dom = new DOMDocument;
#$dom->loadHTML($xml);
$resources= $dom->getElementsByTagName('foo');
foreach ($resources as $resource){
echo $resource->nodeValue . "\n";
}

After some trolling and trying to do what I needed with SimpleXML, I arrived at the following conclusion. My issue with SimpleXML was where the elements are. If the xml is structured, and the hierarchy is standard ... I have no problem.
If the XML is a web page for example, and the <foo> element is anywhere, SimpleXML doesn't have a good facility like getElementsByTagName to pull out the element wherever it may be....
<?php
$doc = new DOMDocument();
$doc->load('sample');
$element_name = 'foo';
if ($doc->getElementsByTagName($element_name)->length > 0) {
$resources = $doc->getElementsByTagName($element_name);
foreach ($resources as $resource) {
$id = null;
if (!$resource->hasAttribute('id')) {
$resource->setAttribute('id', gen_uuid());
}
$innerHTML = null;
$children = $resource->childNodes;
foreach ($children as $child) {
$tmp_doc = new DOMDocument();
$tmp_doc->appendChild($tmp_doc->importNode($child,true));
$innerHTML .= rtrim($tmp_doc->saveHTML());
}
$resource->nodevalue = $innerHTML;
}
}
echo $doc->saveHTML();
?>

Rather than writing all that code, you might try XPath. That expression would be "//foo", which would get a list of all the elements in the document named "foo".
http://php.net/manual/en/simplexmlelement.xpath.php

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Get xpath of xml node within recursive function - php

$domNode = dom_import_simplexml($node); $xpath = $domNode->getNodePath(); You need PHP 5 >= 5.2.0 for this to work.

Related

How to get a list of all html elements in PHP?

php XML DOM - updating XML dom elements

DomNode get value of item

How to remove invalid element from DOM?

ignoring nested elements when parsing xml with php

Categories

Resources