PHP DOM Document - get everything between two nodes - php

I have this as a part of my XML that I am loading in a DOM Document:
<error n='\Author'/>
Some Text 1
<formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><msup><mrow/> <mrow><mn>1</mn><mo>,</mo></mrow> </msup></math></formula>
Some Text 2
<formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><msup><mrow/> <mn>2</mn> </msup></math></formula>
<error n='\address' />
My goal is to get everything as nodeValue between the
<error n='\Author' />
And
<error n='\address' />
How can this be done?
I tested this:
$author_node = $xpath_xml->query("//error[@n='\Author']/following-sibling::*[1]")->item(0);
if ($author_node != null) {
    $i = 1;
    $nextNodeName = "";
    $author = "";
    while ($nextNodeName != "error" && $i < 20) {
        $nextNode = $xpath_xml->query("//error[@n='\Author']/following-sibling::*[$i]")->item(0);
        $i++;
        if ($nextNode === null)
            break;
        $nextNodeName = $nextNode->tagName;
        if ($nextNodeName == "error")
            continue;
        $author .= $nextNode->nodeValue;
    }
}
But I am getting only the formula content, not the text between the formulas.
Thank you.

The * only selects element nodes, not text nodes, so only the <formula> elements are selected. You need to use node() instead. You can also use XPath directly to select the needed nodes; look for an explanation of the Kayessian method, which computes the intersection of two node sets.
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
$nodes = $xpath->evaluate(
    '//error[@n="\\Author"][1]
        /following-sibling::node()
        [
            count(
                .|
                //error[@n="\\Author"][1]
                    /following-sibling::error[@n="\\address"][1]
                    /preceding-sibling::node()
            )
            =
            count(
                //error[@n="\\Author"][1]
                    /following-sibling::error[@n="\\address"][1]
                    /preceding-sibling::node()
            )
        ]'
);
$result = '';
foreach ($nodes as $node) {
$result .= $node->nodeValue;
}
var_dump($result);
Demo: https://eval.in/125494
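If the single XPath expression is hard to follow, the same result can be approximated by walking the siblings in PHP. This is only a sketch of a different technique (not the Kayessian method): the <root> wrapper and the shortened <formula> contents are assumptions made so the example is self-contained.

```php
<?php
// Hypothetical sample document; the <root> wrapper and formula bodies are assumptions.
$xml = <<<'XML'
<root><error n='\Author'/>Some Text 1<formula>F1</formula>Some Text 2<formula>F2</formula><error n='\address'/></root>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Locate the opening marker, then walk its siblings (text nodes included).
$start = $xpath->query('//error[@n="\\Author"]')->item(0);
$between = '';
for ($node = $start->nextSibling; $node !== null; $node = $node->nextSibling) {
    // Stop as soon as the closing marker is reached.
    if ($node instanceof DOMElement
        && $node->tagName === 'error'
        && $node->getAttribute('n') === '\\address') {
        break;
    }
    $between .= $node->nodeValue;
}
echo $between, PHP_EOL;
```

Because nextSibling visits every node type, the text between the formulas is collected along with the formula content.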
If you want to save not only the text content, but the XML fragment, you can use DOMDocument::saveXml() with the node as argument.
$result = '';
foreach ($nodes as $node) {
$result .= $node->ownerDocument->saveXml($node);
}
var_dump($result);

Related

Why does not display the attribute html via xpath php

<?php
$content = '<div class="keep-me">Keep this div</div><div class="remove-me" id="test">Remove this div</div>';
$badClasses = array('');
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();
$xPath = new DOMXpath($dom);
foreach($badClasses as $badClass){
$domNodeList = $xPath->query('//div[@class="remove-me"]/@id');
$domElemsToRemove = ''; // container of deleted elements
foreach ( $domNodeList as $domElement ) {
$domElemsToRemove .= $dom->saveHTML($domElement); // concat them
$domElement->parentNode->removeChild($domElement); // then remove
}
}
$content = $dom->saveHTML();
echo htmlentities($domElemsToRemove);
?>
Works: //div[@class="remove-me"] or //div[@class="remove-me"]/text()
Not working: //div[@class="remove-me"]/@id
Maybe there is an easier way?
The XPath //div[@class="remove-me"]/@id is correct, but you just need to loop over the returned attribute nodes and add each nodeValue to a list of matching IDs:
$xPath = new DOMXpath($dom);
$domNodeList = $xPath->query('//div[@class="remove-me"]/@id');
$ids = []; // container of deleted elements
foreach ( $domNodeList as $domElement ) {
$ids[] = $domElement->nodeValue;
}
print_r($ids);
If the aim is to fetch the ID of any element with the class "remove-me", which is how I interpret the question, then perhaps you can try something like this (untested):
.... other code before
$xp=new DOMXpath( $dom );
$col= $xp->query( '//*[@class="remove-me"]' );
if( $col->length > 0 ){
foreach($col as $node){
$id=$node->hasAttribute('id') ? $node->getAttribute('id') : 'banana';
echo $id;
}
}
However, the code in the question suggests that you wish to delete nodes. In that case, build an array of nodes (a node list) and iterate through it from the end to the front, i.e. backwards.
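A minimal sketch of that backwards removal, reusing the markup from the question. The variable $removedIds is a name introduced here for illustration. Note that the result of DOMXPath::query() is not a live list, so the reverse order is a precaution rather than a strict requirement in this case.

```php
<?php
$content = '<div class="keep-me">Keep this div</div><div class="remove-me" id="test">Remove this div</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@class="remove-me"]');

$removedIds = []; // hypothetical container for the IDs of removed elements
// Iterate from the last match to the first so removals never shift pending items.
for ($i = $nodes->length - 1; $i >= 0; $i--) {
    $node = $nodes->item($i);
    $removedIds[] = $node->getAttribute('id');
    $node->parentNode->removeChild($node);
}

print_r($removedIds);
echo $dom->saveHTML();
```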

Getting link tag via DOMDocument

I convert an atom feed into RSS using atom2rss.xsl. Works fine.
Then, using DOMDocument, I try to get the post title and URL:
$feed = new DOMDocument();
$feed->loadHTML('<?xml encoding="utf-8" ?>' . $html);
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
echo 'url: '. $item->getElementsByTagName("link")->item(0)->nodeValue;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
But the post URL is empty.
See this eval which contains HTML. What am I doing wrong? I suspect I am not getting the link tag properly via $item->getElementsByTagName("link")->item(0)->nodeValue.
I think the problem is that there are several <link> elements in each item, and the one (I think) you're interested in is the one with rel="self" as an attribute. The quickest way (without messing around with XPath) is to loop over each <link> element, check for the right rel value, and then take the href attribute from it:
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
$url = "";
// Look for the 'right' link tag and extract URL from that
foreach ( $item->getElementsByTagName("link") as $link ) {
if ( $link->getAttribute("rel") == "self" ) {
$url = $link->getAttribute("href");
break;
}
}
echo 'url: '. $url;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
which gives...
url: https://www.blogger.com/feeds/2984353310628523257/posts/default/1947782625877709813titleExtraordinary Genius - Cp274
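For comparison, the XPath route could look roughly like the sketch below. The feed fragment, its URLs, and the use of loadXML() instead of loadHTML() are assumptions made to keep the example self-contained; the real converted feed may be shaped differently.

```php
<?php
// Hypothetical feed fragment; element layout and URLs are assumptions.
$xml = <<<'XML'
<rss><channel><item>
  <title>Example post</title>
  <link rel="edit" href="https://example.com/edit/1"/>
  <link rel="self" href="https://example.com/posts/1"/>
</item></channel></rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($xml);
$xpath = new DOMXPath($feed);

$urls = [];
foreach ($feed->getElementsByTagName('item') as $item) {
    // Evaluate relative to the current item; string() yields '' when nothing matches.
    $urls[] = $xpath->evaluate('string(link[@rel="self"]/@href)', $item);
}
print_r($urls);
```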
function get_links($link)
{
    $ret = array();
    $dom = new DOMDocument();
    // @ suppresses the warnings libxml raises on malformed HTML
    @$dom->loadHTML(file_get_contents($link));
    $dom->preserveWhiteSpace = false;
    $links = $dom->getElementsByTagName('a');
    foreach ($links as $tag){
        $ret[$tag->getAttribute('href')] = $tag->childNodes->item(0)->nodeValue;
    }
    return $ret;
}
print_r(get_links('http://www.google.com'));
Or you can use DOMXPath:
$html = file_get_contents('http://www.google.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// take all links
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url . PHP_EOL;
}

How to get element xml using xpath php

I have an issue with extracting XML elements using XPath in PHP. I have already created a PHP file that extracts the "attributes" (the element names) using XPath.
What I want is to extract the value of every element in the XML using XPath.
test.xml
<?xml version="1.0" encoding="UTF-8"?>
<InvoicingData>
<CreationDate> 2014-02-02 </CreationDate>
<OrderNumber> XXXX123 </OrderNumber>
<InvoiceDetails>
<InvoiceDetail>
<SalesCode> XX1A </SalesCode>
<SalesName> JohnDoe </SalesName>
</InvoiceDetail>
</InvoiceDetails>
</InvoicingData>
read.php
<?php
$doc = new DOMDocument();
$doc->loadXML(file_get_contents("test.xml"));
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//*');
$names = array();
foreach ($nodes as $node)
{
$names[] = $node->nodeName;
}
echo join(PHP_EOL, ($names));
?>
The code above prints:
CreationDate OrderNumber InvoiceDetails InvoiceDetail SalesCode
SalesName
So the problem is how to get the value inside each element. Basically, this is what I want to print:
2014-02-02 XXXX123 XX1A JohnDoe
You use $node->textContent to get the textual value of the node (and its descendants, if any).
In response to your first comment:
You didn't use $node->textContent. Try this:
$doc = new DOMDocument();
$doc->loadXML(file_get_contents("test.xml"));
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//*');
$names = array();
$values = array(); // created a separate array for the values
foreach ($nodes as $node)
{
$names[] = $node->nodeName;
$values[] = $node->textContent; // push to $values array
}
echo join(PHP_EOL, ($values));
However, if you only want to push the textual values when they're a direct child of an element and still want to collect all node names as well, you could do something like:
foreach ($nodes as $node)
{
$names[] = $node->nodeName;
// check that this node only contains one text node
if( $node->childNodes->length == 1 && $node->firstChild instanceof DOMText ) {
$values[] = $node->textContent;
}
}
echo join(PHP_EOL, ($values));
And if you only care about the nodes that directly contain textual values, you could do something like this:
// this XPath query only selects those nodes that directly contain non-whitespace text
$nodes = $xpath->query('//*[./text()[normalize-space()]]');
$values = array();
foreach ($nodes as $node)
{
// add nodeName as key
// (only works reliably if there's never a duplicate nodeName in your XML)
// and add textContent as value
$values[ $node->nodeName ] = trim( $node->textContent );
}
var_dump( $values );

How can I get all attributes with PHP xpath?

Given the following HTML string:
<div
class="example-class"
data-caption="Example caption"
data-link="https://www.example.com"
data-image-url="https://example.com/example.jpg">
</div>
How can I use PHP with xpath to output / retrieve an array with all attributes as key / value pairs?
Hoping for output like:
Array
(
[data-caption] => Example caption
[data-link] => https://www.example.com
[data-image-url] => https://example.com/example.jpg
)
// etc etc...
I know how to get individual attributes, but I'm hoping to do it in one fell swoop. Here's what I currently have:
function get_data($html = '') {
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div/@data-link');
foreach ($nodes as $node) {
var_dump($node);
}
}
Thanks!
In XPath, you can use @* to reference attributes of any name, for example:
$nodes = $xpath->query('//div/@*');
foreach ($nodes as $node) {
echo $node->nodeName ." : ". $node->nodeValue ."<br>";
}
eval.in demo
output :
class : example-class
data-caption : Example caption
data-link : https://www.example.com
data-image-url : https://example.com/example.jpg
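To get the key / value array the question asks for, the same @* query can fill an associative array instead of echoing. This is a small sketch using the markup from the question; $attributes is a name introduced here, and the class attribute is included because @* matches every attribute.

```php
<?php
$html = '<div class="example-class" data-caption="Example caption" '
      . 'data-link="https://www.example.com" '
      . 'data-image-url="https://example.com/example.jpg"></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$attributes = []; // attribute name => attribute value
foreach ($xpath->query('//div/@*') as $attr) {
    $attributes[$attr->nodeName] = $attr->nodeValue;
}
print_r($attributes);
```

Narrowing the query to '//div/@*[starts-with(name(), "data-")]' would leave out class and keep only the data attributes.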
I think this should do what you want - or at least, give you the basis to proceed.
define('BR','<br />');
$strhtml='<div
class="example-class"
data-caption="Example caption"
data-link="https://www.example.com"
data-image-url="https://example.com/example.jpg">
</div>';
$dom=new DOMDocument;
$dom->loadHTML( $strhtml );
$xpath=new DOMXPath( $dom );
$col=$xpath->query('//div');
if( $col ){
foreach( $col as $node ) if( $node->nodeType==XML_ELEMENT_NODE ) {
foreach( $node->attributes as $attr ) echo $attr->nodeName.' '.$attr->nodeValue.BR;
}
}
$dom = $col = $xpath = null;

Remove empty tags from a XML with PHP

Question
How can I remove empty xml tags in PHP?
Example:
$value1 = "2";
$value2 = "4";
$value3 = "";
$xml = '<parentnode>
<tag1> ' .$value1. '</tag1>
<tag2> ' .$value2. '</tag2>
<tag3> ' .$value3. '</tag3>
</parentnode>';
XML Result:
<parentnode>
<tag1>2</tag1>
<tag2>4</tag2>
<tag3></tag3> // <- Empty tag
</parentnode>
What I want!
<parentnode>
<tag1>2</tag1>
<tag2>4</tag2>
</parentnode>
The XML without the empty tags like "tag3"
Thanks!
You can use XPath with the predicate not(node()) to select all elements that do not have child nodes.
<?php
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->loadxml('<parentnode>
<tag1>2</tag1>
<tag2>4</tag2>
<tag3></tag3>
<tag2>4</tag2>
<tag3></tag3>
<tag2>4</tag2>
<tag3></tag3>
</parentnode>');
$xpath = new DOMXPath($doc);
foreach( $xpath->query('//*[not(node())]') as $node ) {
$node->parentNode->removeChild($node);
}
$doc->formatOutput = true;
echo $doc->savexml();
prints
<?xml version="1.0"?>
<parentnode>
<tag1>2</tag1>
<tag2>4</tag2>
<tag2>4</tag2>
<tag2>4</tag2>
</parentnode>
This works recursively and removes nodes that:
contain only whitespace
do not have attributes
do not have child nodes
// not(*) does not have child elements
// not(@*) does not have attributes
// not(text()[normalize-space()]) does not have text nodes containing non-whitespace text
while (($node_list = $xpath->query('//*[not(*) and not(@*) and not(text()[normalize-space()])]')) && $node_list->length) {
foreach ($node_list as $node) {
$node->parentNode->removeChild($node);
}
}
$dom = new DOMDocument;
$dom->loadXML($xml);
// getElementsByTagName() returns a live list, so copy it before removing nodes
$elements = iterator_to_array($dom->getElementsByTagName('*'));
foreach($elements as $element) {
if ( ! $element->hasChildNodes() OR $element->nodeValue == '') {
$element->parentNode->removeChild($element);
}
}
echo $dom->saveXML();
CodePad.
The solution that worked with my production PHP SimpleXMLElement object code, by using Xpath, was:
/*
* Remove empty (no children) and blank (no text) XML element nodes, but not an empty root element (/child::*).
* This does not work recursively; meaning after empty child elements are removed, parents are not reexamined.
*/
foreach( $this->xml->xpath('/child::*//*[not(*) and not(text()[normalize-space()])]') as $emptyElement ) {
unset( $emptyElement[0] );
}
Note that it is not required to use PHP DOM, DOMDocument, DOMXPath, or dom_import_simplexml().
// this is a recursive option
do {
$removed = false;
foreach( $this->xml->xpath('/child::*//*[not(*) and not(text()[normalize-space()])]') as $emptyElement ) {
unset( $emptyElement[0] );
$removed = true;
}
} while ($removed) ;
If you're going to be doing a lot of this, just do something like:
$value[] = "2";
$value[] = "4";
$value[] = "";
$xml = '<parentnode>';
for($i=1,$m=count($value); $i<$m+1; $i++)
$xml .= !empty($value[$i-1]) ? "<tag{$i}>{$value[$i-1]}</tag{$i}>" : null;
$xml .= '</parentnode>';
echo $xml;
Ideally though, you should probably use DOMDocument.
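A rough sketch of what the DOMDocument route might look like, assuming the same three values. Empty strings are skipped before any element is created, so there is nothing to remove afterwards. The strict comparison with '' is a design choice made here: unlike empty(), it keeps a literal "0".

```php
<?php
$values = ['2', '4', ''];

$doc = new DOMDocument('1.0', 'UTF-8');
$parent = $doc->appendChild($doc->createElement('parentnode'));

foreach ($values as $index => $value) {
    // Skip empty strings only, so a literal "0" would still be written.
    if ($value === '') {
        continue;
    }
    $tag = $doc->createElement('tag' . ($index + 1));
    // createTextNode() escapes special characters such as & and <.
    $tag->appendChild($doc->createTextNode($value));
    $parent->appendChild($tag);
}

// Passing a node to saveXML() serializes just that subtree, without the XML declaration.
$xml = $doc->saveXML($parent);
echo $xml, PHP_EOL;
```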
