Iterate through elements with DOMDocument & DOMXPath - php

I am trying to iterate through every child element of the containing div:
$html = ' <div id="roothtml">
<h1>
Introduction</h1>
<p>text</p>
<h2>
text</h2>
<p>
test</p>
</div>';
And I have this PHP:
$dom = new DOMDocument();
$dom->loadHTML($html);
$dom->preserveWhitespace = false;
$xpath = new DOMXPath($dom);
$els = $xpath->query("/div");
print_r($els);
All I get though is DOMNodeList Object ( )
According to the IBM tutorial I looked at, I should be getting an array. What am I doing wrong?
Any help is appreciated.

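You're using the wrong query string; you should be using //div. loadHTML() wraps the fragment in <html> and <body> elements, so the root element is html and /div matches nothing. Also note that query() returns a DOMNodeList, not an array.
Iterate over the list like this: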
$els = $xpath->query("//div");
foreach( $els as $el) {
echo $el->textContent;
}
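To reach every child element of the containing div, as the question asks, the query can select the element children directly. The following is a minimal sketch, assuming the markup from the question (whitespace condensed); the $children array is only an illustrative name.

```php
<?php
$html = '<div id="roothtml"><h1>Introduction</h1><p>text</p><h2>text</h2><p>test</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// "*" matches element nodes only, so whitespace text nodes are skipped.
$children = [];
foreach ($xpath->query('//div[@id="roothtml"]/*') as $el) {
    $children[] = $el->nodeName . ': ' . trim($el->textContent);
}

print_r($children);
```

Because the result is a DOMNodeList, print_r() on the list itself shows very little; looping over it, or reading its length property, is the usual way to inspect it.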

Related

How to parse data using DOM Parser in php and getting empty array

I'm trying to pull out some data using the DOM Parser technique.
My code :
<?php
// create new DOMDocument
$document = new \DOMDocument('1.0', 'UTF-8');
// set error level
$internalErrors = libxml_use_internal_errors(true);
$data = '<div id="show">
<ul class="browse_in_widget_col">
<li>
<a href="accounting/">
Accounting
</a>
<span>
(7420)
</span>
</li>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$makes = $xp->query('//ul[@class="browse_in_widget_col"]/ul');
$makeList = [];
foreach ( $makes as $make ) {
$makeList[] = $make->textContent;
}
print_r($makeList);
?>
Here I want to pull out the text inside the <a> element; in this example I need "Accounting".
How can I do that? Please help me get the values of all the <a> tags. Right now I'm getting an empty array.
In your XPath expression, you are looking for a <ul> nested inside the <ul>, and there isn't one. If you just want the contents of the <a> tags, you can change the query to //ul[@class="browse_in_widget_col"]//a.
$xp = new DOMXPath($dom);
$makes = $xp->query('//ul[@class="browse_in_widget_col"]//a');
$makeList = [];
foreach ( $makes as $make ) {
$makeList[] = trim($make->textContent);
}
I've also added trim() to the output to remove any whitespace.
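If the link target is needed as well as the link text, the same query can feed both. This is a rough sketch, assuming the markup from the question with the missing </ul> added; keying the array by href is an illustrative choice, not something the question requires.

```php
<?php
$data = '<div id="show"><ul class="browse_in_widget_col"><li>'
      . '<a href="accounting/"> Accounting </a><span>(7420)</span>'
      . '</li></ul></div>';

$dom = new DOMDocument();
// Collect parser messages internally instead of emitting warnings.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($data);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xp = new DOMXPath($dom);
$makeList = [];
foreach ($xp->query('//ul[@class="browse_in_widget_col"]//a') as $a) {
    // getAttribute() returns an empty string when the attribute is absent.
    $makeList[$a->getAttribute('href')] = trim($a->textContent);
}

print_r($makeList);
```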

PHP DOMDocument: Delete elements by class

I'm trying to delete every node with a given class.
To find the elements I use:
$xpath = new DOMXPath($dom);
foreach( $xpath->query('//div[contains(attribute::class, "foo")]') as $e ) {
// Delete this node
}
But how can I delete the elements in this foreach-loop?
Edit: By the way, how can I check first whether there is an element with the class "foo" in the DOM (before starting the loop)?
Update:
This is my HTML:
<div class="main">
<div class="delete_this" contenteditable="true">Target</div>
<div class="class1"></div>
<div class="content"><p>Anything</p></div>
</div>
This doesn't work for the example above:
$xpath = new DOMXPath($dom);
foreach( $xpath->query('//div[contains(attribute::class, "delete_this")]') as $e ) {
$e->parentNode->removeChild($e);
}
You need to use the removeChild() method of the parent element:
$xpath = new DOMXPath($dom);
foreach($xpath->query('//div[contains(attribute::class, "foo")]') as $e ) {
// Delete this node
$e->parentNode->removeChild($e);
}
Btw, about your second question, if there are no elements found, the loop won't iterate at all.
Here is a working example:
$html = <<<EOF
<div class="main">
<div class="delete_this" contenteditable="true">Target</div>
<div class="class1"></div>
<div class="content"><p>Anything</p></div>
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);
foreach($selector->query('//div[contains(attribute::class, "delete_this")]') as $e ) {
$e->parentNode->removeChild($e);
}
echo $doc->saveHTML($doc->documentElement);
For the second part of the question, the result of the query has a length property which you can use to see if anything was matched:
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[contains(attribute::class, "foo")]');
printf('Removing %d nodes', $nodes->length);
This removes all divs with that class.
To actually remove all the elements by class use *:
$selector = new \DOMXPath( $doc );
foreach ( $selector->query( '//*[contains(attribute::class, "' . $class . '")]' ) as $e ) {
$e->parentNode->removeChild( $e );
}
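One caveat with contains(attribute::class, ...) is that it is a substring test, so "foo" also matches class="foobar". A common XPath 1.0 idiom pads the attribute with spaces and looks for the padded token. The sketch below assumes a small made-up document and a $class value of "foo"; both are illustrative only.

```php
<?php
// Hypothetical sample markup: two elements carry the class "foo", one has "foobar".
$html = '<div class="main"><div class="foo bar">A</div>'
      . '<div class="foobar">B</div><p class="foo">C</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$class = 'foo';
// Pad the normalized class list with spaces so only whole tokens match.
$query = sprintf(
    '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
    $class
);

// Copy the matches into an array before modifying the tree.
$matches = iterator_to_array($xpath->query($query));
$removed = count($matches);
foreach ($matches as $e) {
    $e->parentNode->removeChild($e);
}

$out = $doc->saveHTML($doc->documentElement);
echo $out;
```

Note that $class is interpolated into the query as-is, so this form is only suitable for class names that contain no quote characters.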

Php DOM and Xpath - Replace node but keep children of old node

Consider the following html:
<html>
<title>Xyz</title>
<body>
<div>
<div class='mycls'>
<div>1 Books</div>
<div>2 Papers</div>
<div>3 Pencils</div>
</div>
</div>
<body>
</html>
$dom = new DOMDocument();
$dom->loadHTML([loaded html of remote url through curl]);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('html/body/div[@class="mycls"]');
Up to here it works fine. I need to replace the node to get the following:
<body>
<div>
<span>
<div>1 Books</div>
<div>2 Papers</div>
<div>3 Pencils</div>
</span>
</div>
<body>
Something like the following should work for you:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$oldNode = $xpath->query('//div[@class="mycls"]')->item(0);
$span = $dom->createElement('span');
if ($oldNode->hasChildNodes()) {
$children = [];
foreach ($oldNode->childNodes as $child) {
$children[] = $child;
}
foreach ($children as $child) {
$span->appendChild($child->parentNode->removeChild($child));
}
}
$oldNode->parentNode->replaceChild($span, $oldNode);
echo htmlspecialchars($dom->saveHTML());
Demo: http://codepad.viper-7.com/WNTrR5
Note that in the demo I have also fixed your HTML, which was utterly broken :-)
If your demo really is the HTML you are getting back from the cURL call and you cannot change it (you have no control over it), you can do:
$libxmlErrors = libxml_use_internal_errors(true); // at the start
and
libxml_use_internal_errors($libxmlErrors); // at the end
This prevents the libxml parse errors from popping up.
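Put together, the error-suppression pattern might look like the following sketch. It assumes a shortened copy of the malformed HTML from the question; the $errors and $found variables are illustrative names.

```php
<?php
// Malformed on purpose: the second <body> should be </body>, as in the question.
$html = "<html><title>Xyz</title><body><div><div class='mycls'>"
      . "<div>1 Books</div></div></div><body></html>";

// Switch to internal error handling and remember the previous setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Parser messages are collected here instead of being emitted as warnings.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever setting was active before.
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$found = $xpath->query('//div[@class="mycls"]')->length;

printf("%d parse message(s), %d matching node(s)\n", count($errors), $found);
```

Calling libxml_clear_errors() keeps the internal buffer from growing when many documents are parsed in one process.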

PHP: Fetch content from a html page using xpath()

I'm trying to fetch the content of a div in a html page using xpath and domdocument. This is the structure of the page:
<div id="content">
<div class="div1"></div>
<span class="span1"></span>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<div class="div2"></div>
</div>
I want to get only the content of the p elements, not the spans and divs. I came across the XPath expression .//*[@id='content']/p, but I guess something's not right because I'm getting only the first p. I tried other expressions with following-sibling and node(), but they all return the first p only.
.//*[@id='content']/span/following-sibling::p
.//*[@id='content']/node()[self::p]
This is how I use XPath:
$domDocument=new DOMDocument();
$domDocument->encoding = 'UTF-8';
$domDocument->loadHTML($page);
$domXPath = new DOMXPath($domDocument);
$domNodeList = $domXPath->query($this->xpath);
$content = $this->GetHTMLFromDom($domNodeList);
And this is how i get html from nodes:
private function GetHTMLFromDom($domNodeList){
$domDocument = new DOMDocument();
$node = $domNodeList->item(0);
foreach($node->childNodes as $childNode)
$domDocument->appendChild($domDocument->importNode($childNode, true));
return $domDocument->saveHTML();
}
This XPath expression:
//div[@id='content']/p
selects the wanted node set (five p elements).
EDIT: Now it's clear what your problem is. GetHTMLFromDom() only reads item(0), so you need to iterate over the whole NodeList:
private function GetHTMLFromDom($domNodeList){
$domDocument = new DOMDocument();
foreach ($domNodeList as $node) {
$domDocument->appendChild($domDocument->importNode($node, true));
}
return $domDocument->saveHTML();
}
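An alternative that avoids the second document is to pass each matched node to saveHTML(), which accepts a node argument. This is a minimal sketch; the page markup and the paragraph texts "one" and "two" are made up for illustration.

```php
<?php
// Hypothetical page: only the <p> elements under #content are wanted.
$page = '<div id="content"><div class="div1"></div><span class="span1"></span>'
      . '<p>one</p><p>two</p><div class="div2"></div></div>';

$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);

$content = '';
foreach ($xpath->query("//div[@id='content']/p") as $p) {
    // Serialize just this node (including its own tags) from the source document.
    $content .= $dom->saveHTML($p);
}

echo $content;
```

The serializer may insert line breaks between block elements, so trimming or normalizing whitespace afterwards can be useful if exact output matters.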

How to get nodes in first level using PHP DOMDocument?

I'm new to the PHP DOM objects and have a problem I can't find a solution for. I have a DOMDocument with the following HTML:
<div id="header">
</div>
<div id="content">
<div id="sidebar">
</div>
<div id="info">
</div>
</div>
<div id="footer">
</div>
I need to get all the nodes that are on the first level (header, content, footer). hasChildNodes() does not help, because a first-level node may not have children (header, footer).
For now my code looks like:
$dom = new DOMDocument();
$dom -> preserveWhiteSpace = false;
$dom -> loadHTML($html);
$childs = $dom -> getElementsByTagName('div');
But this gets me all the divs. Any advice?
You may have to go outside of DOMDocument - maybe convert to SimpleXML or DOMXpath
$file = $DOCUMENT_ROOT. "test.html";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("/");
Here's how I grab the first-level elements (in this case, the top-level TD elements in a table row):
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadHTML( $tr_element );
$xpath = new DOMXPath( $doc );
$td = $xpath->query("//tr/td[1]")->item(0);
$arr = []; // collects the innerHTML of each top-level cell
do {
if( $innerHTML = self::DOMinnerHTML( $td ) )
array_push( $arr, $innerHTML );
$td = $td->nextSibling;
} while( $td !== null );
$arr now contains the top TD elements, but not nested table TDs which you would get from
$dom->getElementsByTagName( 'td' );
The DOMinnerHTML function is something I snagged somewhere to get the innerHTML of an element/node:
public static function DOMinnerHTML( $element, $deep=true )
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild( $tmp_dom->importNode( $child, $deep ) );
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
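For the markup in the question, an XPath query can also select the first-level elements directly. Since loadHTML() wraps the fragment in html and body elements, the top-level nodes of the fragment become the children of body. A minimal sketch, assuming the HTML from the question with whitespace removed; $ids is an illustrative name.

```php
<?php
$html = '<div id="header"></div>'
      . '<div id="content"><div id="sidebar"></div><div id="info"></div></div>'
      . '<div id="footer"></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// "/html/body/*" selects only the direct element children of <body>.
$ids = [];
foreach ($xpath->query('/html/body/*') as $node) {
    $ids[] = $node->getAttribute('id');
}

print_r($ids);
```

The nested sidebar and info divs are skipped because the single "/" step does not descend below the first level.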
