PHP XML DOM - updating XML DOM elements

I have an XML that looks like this:
<nitf:body.content>
<nitf:block>
<nitf:p style="#style1">Contents of paragraph1.</nitf:p>
<nitf:p style="#style2">Contents of paragraph2.</nitf:p>
<nitf:p style="#style1"><nitf:em class="#bold">This is bold</nitf:em> This is not bold</nitf:p>
<nitf:p style="#style1"><nitf:em class="#italic">This is italic</nitf:em> This is not italic</nitf:p>
</nitf:block>
</nitf:body.content>
And I made a loop to update the text of all nitf:em tags as follows:
foreach($this->doc->getElementsByTagNameNS($this->nitfNS, 'em') as $em) {
$class = $em->getAttribute('class');
if ($class == '#italic') {
$em->nodeValue = '<i>' . $em->nodeValue . '</i>';
}
elseif (strpos($class, 'bold') !== FALSE) {
$em->nodeValue = '<b>' . $em->nodeValue . '</b>';
}
$this->doc->saveXML($em);
}
Now when I loop again through the paragraph elements, the paragraphs that should be updated by the previous loop are all empty.
foreach ($this->doc->getElementsByTagNameNS($this->nitfNS, 'p') as $element) {
$textnode = $element->childNodes->item(0);
$txt = $textnode->wholeText; // this is EMPTY now
}
I read somewhere that "<>" characters might mess up the DOM parser. If that is the case here, how can I update the em elements with the desired HTML tags (italic and bold)?
Thanks in advance

You have made two mistakes. One is reading $textnode->wholeText: that property exists only on DOMText nodes, and in the affected paragraphs the first child is the nitf:em element (a DOMElement), which does not have it. If you want to fetch the text content, use $textnode->textContent.
The other mistake is setting DOMElement::$nodeValue to an XML fragment. That will not work. The property contains only text, not tags. In fact you should never set it to anything other than an empty string (to delete all child nodes), because its escaping is broken.
To solve your problem, create a new node, move all child nodes from the em into it, and append the new node back to the em.
$document = new DOMDocument();
$document->loadXml($xml);
foreach($document->getElementsByTagNameNS($nitfNS, 'em') as $em) {
$class = $em->getAttribute('class');
$newNode = FALSE;
if ($class == '#italic') {
$newNode = $document->createElement('i');
} elseif (strpos($class, 'bold') !== FALSE) {
$newNode = $document->createElement('b');
}
if ($newNode) {
while ($em->firstChild) {
$newNode->appendChild($em->firstChild);
}
$em->appendChild($newNode);
}
echo $document->saveXML($em), "\n\n";
}
Output:
<nitf:em class="#bold"><b>This is bold</b></nitf:em>
<nitf:em class="#italic"><i>This is italic</i></nitf:em>
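For reference, a self-contained version of the same approach might look like the sketch below. The namespace URI urn:example:nitf and the trimmed-down input are assumptions made for illustration; substitute the namespace your document actually declares. The matched elements are copied into an array first because the list returned by getElementsByTagNameNS is live; that is a precaution, not something the answer above requires.

```php
<?php
// Assumption: a placeholder namespace URI, not necessarily the real NITF one.
$nitfNS = 'urn:example:nitf';
$xml = '<nitf:block xmlns:nitf="urn:example:nitf">'
     . '<nitf:p><nitf:em class="#bold">This is bold</nitf:em> This is not bold</nitf:p>'
     . '<nitf:p><nitf:em class="#italic">This is italic</nitf:em> This is not italic</nitf:p>'
     . '</nitf:block>';

$document = new DOMDocument();
$document->loadXML($xml);

// Snapshot the live node list before changing the tree.
$ems = iterator_to_array($document->getElementsByTagNameNS($nitfNS, 'em'));

foreach ($ems as $em) {
    $class = $em->getAttribute('class');
    $newNode = null;
    if ($class === '#italic') {
        $newNode = $document->createElement('i');
    } elseif (strpos($class, 'bold') !== false) {
        $newNode = $document->createElement('b');
    }
    if ($newNode !== null) {
        // Move every existing child of the em into the wrapper element.
        while ($em->firstChild) {
            $newNode->appendChild($em->firstChild);
        }
        $em->appendChild($newNode);
    }
}

// The text of each paragraph is still readable through textContent.
$paragraphTexts = array();
foreach ($document->getElementsByTagNameNS($nitfNS, 'p') as $p) {
    $paragraphTexts[] = $p->textContent;
}
print_r($paragraphTexts);
```

Reading textContent on the paragraph element itself avoids having to know whether its first child is a text node or an element.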

Related

How to get a list of all html elements in PHP?

According to the documentation for DOMDocument::getElementsByTagName, I can call the function with "*" argument, and get a list of all HTML elements from some HTML code.
However, with the following code:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>hello</div><div>bye</div></body></html>");
$nodes = $dom->getElementsByTagName("*");
foreach ($nodes as $node) {
$new_text= new DOMText($node->textContent."MODIFIED");
$node->removeChild($node->firstChild);
$node->appendChild($new_text);
}
$content = $dom->saveHTML();
echo $content;
?>
I get a list of only one element, and the result of execution of the code above is:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>hellobyeMODIFIED</html>
while I would expect something like this:
<html><body><div>helloMODIFIED</div><div>byeMODIFIED</div></body></html>
Shouldn't DOMDocument::getElementsByTagName method return a list of as many HTML elements as available in the HTML code?
Note: I need to create DOMText instances explicitly, because I need this to work in PHP 5.4. DOMNode::textContent is accessible for writing only from PHP 5.6
The DOMDocument::getElementsByTagName method does return all the elements if the first argument is '*'. But your code replaces the <body> element (including all of its child nodes) with a text node on the first iteration, when $node is the <html> element.
Iterate over the child nodes instead, and modify only those whose nodeType property equals XML_TEXT_NODE:
$nodes = $dom->getElementsByTagName('*');
foreach ($nodes as $node) {
for ($child = $node->firstChild; $child; $child = $child->nextSibling) {
if (! ($child->nodeType === XML_TEXT_NODE && trim($child->textContent))) {
continue;
}
// The textContent is writable since PHP 5.6.1
if (PHP_VERSION_ID >= 50601) {
$child->textContent .= 'MODIFIED';
continue;
}
// For older versions, create DOMText explicitly
$text = new DOMText($child->textContent . 'MODIFIED');
try {
if ($child->parentNode->replaceChild($text, $child))
$child = $text;
} catch (Exception $e) {
trigger_error("Failed to modify text '$child->textContent': "
. $e->getMessage(), E_USER_WARNING);
}
}
}
echo $dom->saveHTML();
Note, for PHP versions 5.6.1 and newer, you don't need to create DOMText instances explicitly, since the DOMNode::textContent property is accessible for read and write. So you can simply modify the text by assigning a string value to this property. Only make sure that the node has no child nodes other than XML_TEXT_NODE.
The code above checks if trim($child->textContent) is not empty, because the document may contain extra space characters (including newline), e.g.:
<div><!-- newline/spaces -->
<span>text</span><!-- newline/spaces -->
</div><!-- newline/spaces -->
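As an alternative sketch (not the approach used in the answer above), an XPath query can select only the non-blank text nodes directly, and DOMCharacterData::appendData() can extend them without creating new DOMText objects. The input is the HTML string from the question.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><div>hello</div><div>bye</div></body></html>');

$xpath = new DOMXPath($dom);
// Select text nodes that contain something other than whitespace.
foreach ($xpath->query('//text()[normalize-space()]') as $text) {
    $text->appendData('MODIFIED');
}

$divs = $dom->getElementsByTagName('div');
echo $divs->item(0)->textContent, "\n"; // helloMODIFIED
echo $divs->item(1)->textContent, "\n"; // byeMODIFIED
```

Because the XPath result is a snapshot rather than a live list, editing the text nodes while iterating does not disturb the loop.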
DOMDocument::getElementsByTagName returns a new instance of class DOMNodeList containing all the matched elements.
And it works fine:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>hello</div><div>bye</div></body></html>");
$nodes = $dom->getElementsByTagName("*");
foreach ($nodes as $node) {
echo $node->tagName."<br />";
}
?>
It outputs all tags of your document.
You probably need something like:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>hello</div><div>bye</div></body></html>");
$nodes = $dom->getElementsByTagName("*");
foreach ($nodes as $node) {
if ($node->tagName=='div'){
$node->nodeValue .= "new content";
}
}
$content = $dom->saveHTML();
echo htmlspecialchars($content);
?>
Try this:
foreach($dom->getElementsByTagName('*') as $element ){
}

How to pull a specific tag from a node in xml with PHP?

Here is where I set up the basic variables: creating the new DOMDocument and loading some of the tags. This all works fine at the moment.
<?php
if (isset($_GET['edit'])&& $_GET['edit']=='delete' && isset($_GET['id'])&&!empty($_GET['id'])){
$dom = new DomDocument();
$dom->preserveWhiteSpace = false;
$dom->load("data.xml");
$root = $dom->documentElement;
$record = $root->getElementsByTagName("data");
$ID=$root->getElementsByTagName("ID");
$nodetoremove = null;
//$namenode=$root->getElementsByTagName("own_name");
//$name="";
//$datenode=$root->getElementsByTagName("sign_in");
//$date="";
$newid=$_GET['id'];
foreach($ID as $node){
$pid =$node->textContent;
Here I check whether the node's ID matches $newid and, if it does, do the following:
if ($pid == $newid)
{
$nodetoremove=$node->parentNode;
}
}
The issue is here. I can get the node I wish to delete ($nodetoremove), and I want to select a specific element inside it (sign_in), but I am unsure how to do so. Right now all I can do is loop through and print all of the elements within $nodetoremove. Is there a way to get just the element I want from the XML this way?
//Prints all information within $nodetoremove
foreach ($nodetoremove->childNodes AS $item){
print $item->nodeName . "=" . $item->nodeValue . "<br>";
}
foreach ($nodetoremove as $node) {
}
//Sets $name to the first Child of $nodetoremove
$name=$nodetoremove->firstChild->nodeValue;
//If $nodetoremove is not null, removes it
if($nodetoremove!=null){
$root->removeChild($nodetoremove);
}
}
?>
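One possible way to read a single child such as sign_in is to evaluate a relative XPath expression with $nodetoremove as the context node. The sketch below is only an illustration: the XML is a made-up stand-in for data.xml (the records root element and the sample values are assumptions), while the ID, own_name and sign_in element names are taken from the question.

```php
<?php
// Assumption: a made-up document standing in for data.xml.
$xml = '<records>'
     . '<data><ID>1</ID><own_name>Ann</own_name><sign_in>09:00</sign_in></data>'
     . '<data><ID>2</ID><own_name>Bob</own_name><sign_in>10:30</sign_in></data>'
     . '</records>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$newid = '2';
$nodetoremove = null;
foreach ($dom->getElementsByTagName('ID') as $node) {
    if ($node->textContent == $newid) {
        $nodetoremove = $node->parentNode;
    }
}

$signIn = null;
if ($nodetoremove !== null) {
    // Evaluate a relative XPath with $nodetoremove as the context node.
    $signIn = $xpath->evaluate('string(sign_in)', $nodetoremove);
}
echo $signIn, "\n"; // 10:30
```

The string() wrapper returns an empty string rather than failing when the record has no sign_in child.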

How to remove invalid element from DOM?

We have the following code that lists the XPaths where $value is found.
For a given URL (see the picture) we have detected a non-standard tag, td1, which in addition has no closing tag. The site developers probably put it there intentionally, as you can see in the screenshot below.
This element causes problems when identifying the correct XPath for nodes.
A broken XPath example:
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/**td1**/td[2]/span/u[1]
(as you can see, td1 is identified and chained into the XPath)
We think that removing this element would help us build the valid XPath we are after.
A valid example is
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/td[2]/span/u[1]
How can we remove it prior to loading the document into DOMXPath? Is there some other approach?
We would like to remove all invalid tags, which may be something other than td1, such as h8, diw, etc.
private function extract($url, $value) {
$dom = new DOMDocument();
$file = 'content.txt';
//$current = file_get_contents($url);
$current = CurlTool::downloadFile($url, $file);
//file_put_contents($file, $current);
@$dom->loadHTMLFile($current);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom);
$elements = $dom_xpath->query("//*[text()[contains(., '" . $value . "')]]");
var_dump($elements);
if (!is_null($elements)) {
foreach ($elements as $element) {
var_dump($element);
echo "\n1.[" . $element->nodeName . "]\n";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
if( ($node->nodeValue != null) && ($node->nodeValue === $value) ) {
echo '2.' . $node->nodeValue . "\n";
$xpath = preg_replace("/\/text\(\)/", "", $node->getNodePath());
echo '3.' . $xpath . "\n";
}
}
}
}
}
You could use XPath to find the offending nodes and remove them, while promoting their children into their place in the DOM. Then your paths will be correct.
$dom_xpath = new DOMXpath($dom);
$results = $dom_xpath->query('//td1'); // (or any offending element)
foreach ($results as $invalidNode)
{
$parentNode = $invalidNode->parentNode;
while ($invalidNode->firstChild)
{
$firstChild = $invalidNode->firstChild;
$parentNode->insertBefore($firstChild,$invalidNode);
}
$parentNode->removeChild($invalidNode);
}
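To see the effect on the paths, here is a small self-contained sketch of the same unwrapping step. It assumes a simplified, well-formed XML stand-in for the real page (the actual HTML with an unclosed td1 may be parsed differently by loadHTML), so treat it as an illustration of the technique only.

```php
<?php
// Assumption: a simplified, well-formed stand-in for the real page markup.
$dom = new DOMDocument();
$dom->loadXML('<table><tr><td1><td>a</td><td><span>b</span></td></td1></tr></table>');
$dom_xpath = new DOMXPath($dom);

foreach ($dom_xpath->query('//td1') as $invalidNode) {
    $parentNode = $invalidNode->parentNode;
    // Promote each child to sit just before the invalid element.
    while ($invalidNode->firstChild) {
        $parentNode->insertBefore($invalidNode->firstChild, $invalidNode);
    }
    // The invalid element is now empty and can be dropped.
    $parentNode->removeChild($invalidNode);
}

$remaining = $dom_xpath->query('//td1')->length;
$span = $dom_xpath->query('/table/tr/td[2]/span')->item(0);
echo $remaining, "\n";
echo $span->getNodePath(), "\n";
```

After the loop no td1 element is left, and the span is reachable through a path that no longer contains td1.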
EDIT:
You could also build a list of offending elements by using a list of valid elements and negating it.
// Build list manually from the HTML spec:
// See: http://www.w3.org/TR/html5/section-index.html#elements-1
$validTags = array();
// Convert list to XPath:
$validTagsStr = '';
foreach ($validTags as $tag)
{
if ($validTagsStr)
{ $validTagsStr .= ' or '; }
$validTagsStr .= 'self::'.$tag;
}
$results = $dom_xpath->query('//*[not('.$validTagsStr.')]');
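A compact way to build the same expression is with implode(). The sketch below uses a deliberately short whitelist and a tiny XML document, both of which are assumptions for illustration; a real list would come from the HTML specification as noted above.

```php
<?php
// Assumption: a deliberately short whitelist, for illustration only.
$validTags = array('table', 'tr', 'td', 'span');

$conditions = array();
foreach ($validTags as $tag) {
    $conditions[] = 'self::' . $tag;
}
// Matches every element whose name is not in the whitelist.
$query = '//*[not(' . implode(' or ', $conditions) . ')]';
echo $query, "\n";

$dom = new DOMDocument();
$dom->loadXML('<table><tr><td1><td><span>b</span></td></td1></tr></table>');
$dom_xpath = new DOMXPath($dom);

$invalidNames = array();
foreach ($dom_xpath->query($query) as $node) {
    $invalidNames[] = $node->nodeName;
}
print_r($invalidNames);
```

Each node found this way can then be unwrapped in the same manner as td1.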
Sooo... perhaps str_replace("<td1 va-laign=\"top\">", "", $current) could do the trick?

Retrieving single node value from a nodelist

I'm having difficulty extracting a single node value from a nodelist.
My code takes an xml file which holds several fields, some containing text, file paths and full image names with extensions.
I run an XPath query over it, looking for the item node with a certain id. The matched node is then stored as $oldnode.
Now my problem is extracting a value from $oldnode. I have tried var_dump($oldnode) and print_r($oldnode), but they return the following: "object(DOMElement)#8 (0) { } "
I'm guessing the $oldnode variable is an object, but how do I access it?
I am able to echo out the whole node list by using: echo $oldnode->nodeValue;
This displays all the nodes in the list.
Here is the code which handles the XML file. Line 6 is the line in question...
$xpathexp = "//item[@id=". $updateID ."]";
$xpath = new DOMXpath($xml);
$nodelist = $xpath->query($xpathexp);
if((is_null($nodelist)) || (! is_numeric($nodelist))) {
$oldnode = $nodelist->item(0);
echo $oldnode->nodeValue;
//$imgUpload = strchr($oldnode->nodeValue, ' ');
//$imgUpload = strrchr($imgUpload, '/');
//explode('/',$imgUpload);
//$imgUpload = trim($imgUpload);
$newItem = new DomDocument;
$item_node = $newItem ->createElement('item');
//Create attribute on the node as well
$item_node ->setAttribute("id", $updateID);
$largeImageText = $newItem->createElement('largeImgText');
$largeImageText->appendChild( $newItem->createCDATASection($largeImgText));
$item_node->appendChild($largeImageText);
$urlANode = $newItem->createElement('urlA');
$urlANode->appendChild( $newItem->createCDATASection($urlA));
$item_node->appendChild($urlANode);
$largeImg = $newItem->createElement('largeImg');
$largeImg->appendChild( $newItem->createCDATASection($imgUpload));
$item_node->appendChild($largeImg);
$thumbnailTextNode = $newItem->createElement('thumbnailText');
$thumbnailTextNode->appendChild( $newItem->createCDATASection($thumbnailText));
$item_node->appendChild($thumbnailTextNode);
$urlB = $newItem->createElement('urlB');
$urlB->appendChild( $newItem->createCDATASection($urlA));
$item_node->appendChild($urlB);
$thumbnailImg = $newItem->createElement('thumbnailImg');
$thumbnailImg->appendChild( $newItem->createCDATASection(basename($_FILES['thumbnailImg']['name'])));
$item_node->appendChild($thumbnailImg);
$newItem->appendChild($item_node);
$newnode = $xml->importNode($newItem->documentElement, true);
// Replace
$oldnode->parentNode->replaceChild($newnode, $oldnode);
// Display
$xml->save($xmlFileData);
//header('Location: index.php?a=112&id=5');
Any help would be great.
Thanks
Wasn't it supposed to be echo $oldnode->firstChild->nodeValue;? I remember this because technically you need the value from the text node, but I might be mistaken; it's been a while. You could give it a try.
After our discussion in the comments on this answer, I came up with this solution. It can perhaps be done more cleanly, but it should work.
$nodelist = $xpath->query($xpathexp);
if ($nodelist !== false && $nodelist->length > 0) {
$oldnode = $nodelist->item(0);
$largeImg = null;
$thumbnailImg = null;
foreach( $oldnode->childNodes as $node ) {
if( $node->nodeName == "largeImg" ) {
$largeImg = $node->nodeValue;
} else if( $node->nodeName == "thumbnailImg" ) {
$thumbnailImg = $node->nodeValue;
}
}
var_dump($largeImg);
var_dump($thumbnailImg);
}
You could also call getElementsByTagName on $oldnode and check whether it found anything; if a node was found, read $oldnode->getElementsByTagName("thumbnailImg")->item(0)->nodeValue. That might be cleaner than looping through the child nodes.
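The getElementsByTagName variant could be sketched as follows. The XML here is a small made-up document that only borrows the element names from the question, and the file names in it are placeholders.

```php
<?php
// Assumption: a small made-up document using the element names from the question.
$xml = new DOMDocument();
$xml->loadXML(
    '<items>'
    . '<item id="5"><largeImg><![CDATA[big.jpg]]></largeImg>'
    . '<thumbnailImg><![CDATA[small.jpg]]></thumbnailImg></item>'
    . '</items>'
);

$updateID = 5;
$xpath = new DOMXPath($xml);
$nodelist = $xpath->query('//item[@id=' . (int) $updateID . ']');

$thumbnailImg = null;
if ($nodelist !== false && $nodelist->length > 0) {
    $oldnode = $nodelist->item(0);
    // Search only inside the matched item.
    $matches = $oldnode->getElementsByTagName('thumbnailImg');
    if ($matches->length > 0) {
        $thumbnailImg = $matches->item(0)->nodeValue;
    }
}
var_dump($thumbnailImg);
```

Checking length before calling item(0) avoids reading a property on null when the element is missing.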

Get xpath of xml node within recursive function

Let's say I have some code to iterate through an XML file recursively like this:
$xmlfile = new SimpleXMLElement('http://www.domain.com/file.xml',null,true);
xmlRecurse($xmlfile,0);
function xmlRecurse($xmlObj,$depth) {
foreach($xmlObj->children() as $child) {
echo str_repeat('-',$depth).">".$child->getName().": ".$subchild."\n";
foreach($child->attributes() as $k=>$v){
echo "Attrib".str_repeat('-',$depth).">".$k." = ".$v."\n";
}
xmlRecurse($child,$depth+1);
}
}
How would I calculate the XPath of each node so I can store it for mapping to other code?
The obvious way to do it is to pass the XPath as a third parameter and build it as you dig deeper. You have to account for siblings having the same name, so while iterating you have to keep track of the number of preceding siblings with the same name as the current child.
Working example:
function xmlRecurse($xmlObj,$depth=0,$xpath=null) {
if (!isset($xpath)) {
$xpath='/'.$xmlObj->getName().'/';
}
$position = array();
foreach($xmlObj->children() as $child) {
$name = $child->getName();
if(isset($position[$name])) {
++$position[$name];
}
else {
$position[$name]=1;
}
$path=$xpath.$name.'['.$position[$name].']';
echo str_repeat('-',$depth).">".$name.": $path\n";
foreach($child->attributes() as $k=>$v){
echo "Attrib".str_repeat('-',$depth).">".$k." = ".$v."\n";
}
xmlRecurse($child,$depth+1,$path.'/');
}
}
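If the goal is to store the paths rather than print them, the same per-name counter can feed an array. The following is a sketch under that assumption; collectPaths is a hypothetical helper name and the sample XML is made up.

```php
<?php
// Hypothetical helper: returns the XPath of $node and of every descendant element.
function collectPaths(SimpleXMLElement $node, $xpath = null)
{
    if ($xpath === null) {
        $xpath = '/' . $node->getName();
    }
    $paths = array($xpath);
    $position = array(); // per-name counters, since XPath indexes same-name siblings
    foreach ($node->children() as $child) {
        $name = $child->getName();
        $position[$name] = isset($position[$name]) ? $position[$name] + 1 : 1;
        $childPath = $xpath . '/' . $name . '[' . $position[$name] . ']';
        $paths = array_merge($paths, collectPaths($child, $childPath));
    }
    return $paths;
}

$xml = new SimpleXMLElement('<root><a><b/><b/></a><c/></root>');
$paths = collectPaths($xml);
print_r($paths);

// Each stored path can be fed back into SimpleXMLElement::xpath().
$found = $xml->xpath('/root/a[1]/b[2]');
echo count($found), "\n";
```

Like the answer's function, this only covers elements in no namespace; namespaced children would need children() to be called with the namespace.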
Note, though, that the whole idea of mapping a whole document and storing XPaths along the way seems odd. You might actually be working on the wrong solution to a totally different problem.
You can pass xmlRecurse a third parameter called $xpath (the XPath of the current node) and append the representation of each child on every iteration. Note that the simple counter below counts all children, so the index is only right when the siblings share one name; use a per-name counter if names are mixed:
function xmlRecurse($xmlObj,$depth,$xpath) {
$i=1; // XPath positions start at 1
foreach($xmlObj->children() as $child) {
$path=$xpath.'/'.$child->getName().'['.$i++.']';
echo str_repeat('-',$depth).">".$child->getName().": ".$path."\n";
foreach($child->attributes() as $k=>$v){
echo "Attrib".str_repeat('-',$depth).">".$k." = ".$v."\n";
}
xmlRecurse($child,$depth+1,$path);
}
}
With SimpleXML, I think you can only do it as others have pointed out: by recursing the node path as a string argument.
With DOMDocument, you could use the $node->parentNode property to crawl back to the document element and construct it for an arbitrary node (for example if you had a reference to a node and wanted to discover where in the tree it was without prior knowledge of how you got to that node).
$domNode = dom_import_simplexml($node);
$xpath = $domNode->getNodePath();
You need PHP 5 >= 5.2.0 for this to work.
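A short sketch of how this might be used and checked is shown below. The sample XML is an assumption, and the exact form of the returned path (for example whether unique elements get an index) is left to libxml, so the sketch verifies the path by resolving it rather than by comparing strings.

```php
<?php
$xml = new SimpleXMLElement('<root><a><b/><b/></a><c/></root>');
$node = $xml->a->b[1]; // second <b> inside <a>

// Switch to the DOM view of the same node and ask for its path.
$domNode = dom_import_simplexml($node);
$xpath = $domNode->getNodePath();
echo $xpath, "\n";

// The generated path should resolve back to exactly one element.
$matches = $xml->xpath($xpath);
echo count($matches), "\n";
```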
Following up on MightyE's idea about backtracking:
function whereami($node)
{
if ($node instanceof SimpleXMLElement)
{
$node = dom_import_simplexml($node);
}
elseif (!$node instanceof DOMNode)
{
die('Not a node?');
}
$q = new DOMXPath($node->ownerDocument);
$xpath = '';
do
{
$position = 1 + $q->query('preceding-sibling::*[name()="' . $node->nodeName . '"]', $node)->length;
$xpath = '/' . $node->nodeName . '[' . $position . ']' . $xpath;
$node = $node->parentNode;
}
while (!$node instanceof DOMDocument);
return $xpath;
}
I wouldn't recommend it for the case at hand (mapping a whole document, as opposed to a single given node) but it might be useful for future reference.
