How to get nodes in first level using PHP DOMDocument?

I'm new to the PHP DOM extension and have a problem I can't find a solution for. I have a DOMDocument with the following HTML:
<div id="header">
</div>
<div id="content">
<div id="sidebar">
</div>
<div id="info">
</div>
</div>
<div id="footer">
</div>
I need to get all nodes that are on the first level (header, content, footer). hasChildNodes() does not help, because a first-level node may not have children (header, footer).
For now my code looks like:
$dom = new DOMDocument();
$dom -> preserveWhiteSpace = false;
$dom -> loadHTML($html);
$childs = $dom -> getElementsByTagName('div');
But this gets me all divs, including the nested ones. Any advice?

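You may have to go beyond the plain DOMDocument methods, for example with SimpleXML or DOMXPath. Note that loadHTML() and loadHTMLFile() wrap the markup in html and body elements, so the first-level nodes are the element children of body:
$file = $_SERVER['DOCUMENT_ROOT'] . "/test.html";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXPath($doc);
// "/" alone would only return the document node itself
$elements = $xpath->query("/html/body/*");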
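As a minimal sketch of the same idea applied to the markup from the question (the variable name $firstLevelIds is only illustrative), the query below collects the id of each element directly under body and skips the nested divs:

```php
<?php
// Sample markup from the question; loadHTML() wraps it in <html><body>.
$html = '<div id="header"></div>'
      . '<div id="content"><div id="sidebar"></div><div id="info"></div></div>'
      . '<div id="footer"></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// "/html/body/*" selects only the element children of <body>,
// so nested divs such as #sidebar and #info are not returned.
$firstLevelIds = [];
foreach ($xpath->query('/html/body/*') as $node) {
    $firstLevelIds[] = $node->getAttribute('id');
}

print_r($firstLevelIds);
```

The same list can also be obtained without XPath by walking the childNodes of the body element and keeping only nodes whose nodeType is XML_ELEMENT_NODE.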

Here's how I grab the first level elements (in this case, the top level TD elements in a table row):
$arr = array(); // collects the innerHTML of each top-level TD
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadHTML( $tr_element );
$xpath = new DOMXPath( $doc );
$td = $xpath->query("//tr/td[1]")->item(0);
do{
    if( $innerHTML = self::DOMinnerHTML( $td ) )
        array_push( $arr, $innerHTML );
    $td = $td->nextSibling;
} while( $td !== null );
$arr now contains the top TD elements, but not nested table TDs which you would get from
$dom->getElementsByTagName( 'td' );
The DOMinnerHTML function is something I snagged somewhere to get the innerHTML of an element/node:
public static function DOMinnerHTML( $element, $deep=true )
{
    $innerHTML = "";
    $children = $element->childNodes;
    foreach ($children as $child)
    {
        $tmp_dom = new DOMDocument();
        $tmp_dom->appendChild( $tmp_dom->importNode( $child, $deep ) );
        $innerHTML.=trim($tmp_dom->saveHTML());
    }
    return $innerHTML;
}

Related

Why is an HTML attribute not displayed via XPath in PHP?

<?php
$content = '<div class="keep-me">Keep this div</div><div class="remove-me" id="test">Remove this div</div>';
$badClasses = array('');
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();
$xPath = new DOMXpath($dom);
foreach($badClasses as $badClass){
    $domNodeList = $xPath->query('//div[@class="remove-me"]/@id');
    $domElemsToRemove = ''; // container of deleted elements
    foreach ( $domNodeList as $domElement ) {
        $domElemsToRemove .= $dom->saveHTML($domElement); // concat them
        $domElement->parentNode->removeChild($domElement); // then remove
    }
}
$content = $dom->saveHTML();
echo htmlentities($domElemsToRemove);
?>
Works: //div[@class="remove-me"] or //div[@class="remove-me"]/text()
Not working: //div[@class="remove-me"]/@id
Maybe there is an easier way?
The XPath //div[@class="remove-me"]/@id is correct, but it returns attribute nodes, so you just need to loop over the returned nodes and add each nodeValue to a list of matching IDs:
$xPath = new DOMXpath($dom);
$domNodeList = $xPath->query('//div[@class="remove-me"]/@id');
$ids = []; // container of matching ids
foreach ( $domNodeList as $domElement ) {
    $ids[] = $domElement->nodeValue;
}
print_r($ids);
If the aim is to fetch the ID of any element with the class "remove-me", which is how I interpret the question, then perhaps you can try something like this (untested, by the way):
// ... other code before
$xp = new DOMXpath( $dom );
$col = $xp->query( '//*[@class="remove-me"]' );
if( $col->length > 0 ){
    foreach( $col as $node ){
        $id = $node->hasAttribute('id') ? $node->getAttribute('id') : 'banana';
        echo $id;
    }
}
However, the code in the question suggests that you want to delete nodes. In that case, build an array of nodes (or take the node list) and iterate through it from the end to the front, i.e. backwards.
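A rough sketch of that backwards iteration, assuming the two-div markup from the question and a live list obtained from getElementsByTagName() (the name $removedIds is only illustrative):

```php
<?php
$content = '<div class="keep-me">Keep this div</div>'
         . '<div class="remove-me" id="test">Remove this div</div>';

$dom = new DOMDocument();
$dom->loadHTML($content);

// getElementsByTagName() returns a live list: removing a node shifts the
// remaining indexes, so walk it from the last item to the first.
$divs = $dom->getElementsByTagName('div');
$removedIds = [];
for ($i = $divs->length - 1; $i >= 0; $i--) {
    $div = $divs->item($i);
    if ($div->getAttribute('class') === 'remove-me') {
        $removedIds[] = $div->getAttribute('id'); // read the id before removal
        $div->parentNode->removeChild($div);
    }
}

print_r($removedIds);
echo $dom->saveHTML($dom->documentElement);
```

Walking backwards matters for live lists; a list returned by DOMXPath::query() is a static snapshot, so it can also be iterated forwards while removing nodes.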

How to web-scrape inside divs with DOMparser

I am trying to get the content of a div, and to repeat this for other pages inside a foreach loop.
But I am running into some trouble with this markup:
<div class="article_info">
<ul class="c-result_box">
<li>
<div class="inner cf">
<div class="c-header">
<div class="c-logo">
<img src="/e/designs/31sumai/common/img/logo_08.png" alt="#">
</div>
<p class="c-supplier">三井のマンション</p>
<p class="c-name">
パークリュクス大阪天満
</p>
I'm trying to get the text inside the <a> element. Here is my code; what am I missing?
$start_id = 1501;
while(true){
$url = 'https://www.31sumai.com/mfr/K'.$start_id.'/outline.html';
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$xpath = new \DOMXPath($DOMParser);
$classname="c-name";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
$MyTable = false;
$insertData = [];
foreach($nodes as $node){
$allNames = [];
foreach($node->getElementsByTagName('a') as $a){
$name = $a->getElementsByTagName('a');
$allProperties[] = [
'names' => $name];
}
}
Thank you for helping!
You can rely on your XPath query to pull all the text nodes that you want, and then just read the nodeValue property within your loop:
$start_id = "1501";
$url = "https://www.31sumai.com/mfr/K$start_id/outline.html";
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$xpath = new \DOMXPath($DOMParser);
$classname="c-name";
$nodes = $xpath->query("//*[contains(@class, '$classname')]/a/text()");
foreach($nodes as $node){
echo $node->nodeValue;
}
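To repeat this for several pages, the query can be wrapped in a small function. The sketch below is only an illustration: extractNames() is a hypothetical helper, the in-memory strings stand in for downloaded pages, and the <a> inside the c-name paragraph is an assumption based on the question's description rather than on the markup shown.

```php
<?php
// Hypothetical helper: returns the trimmed link text of every element whose
// class attribute contains "c-name". The <a> child is an assumption.
function extractNames(string $html): array
{
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $names = [];
    foreach ($xpath->query("//*[contains(@class, 'c-name')]/a/text()") as $text) {
        $names[] = trim($text->nodeValue);
    }
    return $names;
}

// Stand-ins for downloaded pages; real code would fetch each URL and
// skip the page when file_get_contents() returns false.
$pages = [
    '<p class="c-name"><a href="#"> Example Residence A </a></p>',
    '<p class="c-name"><a href="#">Example Residence B</a></p>',
];

$allNames = [];
foreach ($pages as $html) {
    $allNames = array_merge($allNames, extractNames($html));
}
print_r($allNames);
```

Keeping the parsing in one function means the page loop only has to deal with fetching and with deciding when to stop.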

PHP DOMDocument: Delete elements by class

I'm trying to delete every node with a given class.
To find the elements I use:
$xpath = new DOMXPath($dom);
foreach( $xpath->query('//div[contains(attribute::class, "foo")]') as $e ) {
// Delete this node
}
But how can I delete the elements in this foreach-loop?
Edit: By the way, how can I first check whether there is an element with the class "foo" in the DOM (before starting the loop)?
Update:
This is my HTML:
<div class="main">
<div class="delete_this" contenteditable="true">Target</div>
<div class="class1"></div>
<div class="content"><p>Anything</p></div>
</div>
This doesn't work for the example above:
$xpath = new DOMXPath($dom);
foreach( $xpath->query('//div[contains(attribute::class, "delete_this")]') as $e ) {
$e->parentNode->removeChild($e);
}
You need to use the removeChild() method of the parent element:
$xpath = new DOMXPath($dom);
foreach($xpath->query('//div[contains(attribute::class, "foo")]') as $e ) {
// Delete this node
$e->parentNode->removeChild($e);
}
Btw, about your second question, if there are no elements found, the loop won't iterate at all.
Here comes a working example:
$html = <<<EOF
<div class="main">
<div class="delete_this" contenteditable="true">Target</div>
<div class="class1"></div>
<div class="content"><p>Anything</p></div>
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);
foreach($selector->query('//div[contains(attribute::class, "delete_this")]') as $e ) {
$e->parentNode->removeChild($e);
}
echo $doc->saveHTML($doc->documentElement);
For the second part of the question, the result of the query has a length property which you can use to see if anything was matched:
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[contains(attribute::class, "foo")]');
printf('Removing %d nodes', $nodes->length);
This removes all divs with that class.
To actually remove all the elements by class use *:
$selector = new \DOMXPath( $doc );
foreach ( $selector->query( '//*[contains(attribute::class, "' . $class . '")]' ) as $e ) {
$e->parentNode->removeChild( $e );
}
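One caveat with contains(): it matches substrings, so "foo" also matches class="foobar". A common XPath 1.0 idiom for matching a whole class token pads the normalized attribute with spaces. A minimal sketch, using made-up sample markup and assuming $class holds a plain class name without quotes or spaces:

```php
<?php
// Made-up sample markup: one exact match, one substring match, one multi-class match.
$html = '<div class="main">'
      . '<div class="foo">exact</div>'
      . '<div class="foobar">substring only</div>'
      . '<p class="bar foo">second token</p>'
      . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$class = 'foo';
// Pad the class list with spaces so only whole tokens can match.
$query = sprintf(
    '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
    $class
);

$nodes = $xpath->query($query);
$removed = $nodes->length;            // number of matches before removal
foreach ($nodes as $e) {
    $e->parentNode->removeChild($e);
}

printf("Removed %d nodes\n", $removed);
echo $doc->saveHTML($doc->documentElement);
```

Here the element with class "foobar" is left in place, while both elements that carry "foo" as a separate token are removed.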

How to Remove the Parent Div using PHP DOMDocument

$html_string = '<div class="quote" post_id="57"
style="border:1px solid #000;padding:15px;margin:15px;"
user_id="1" user_name="david_cameron"><strong><span
style="font-size:200%;">My Name is Rashid Farooq</span></strong></div>';
I want to remove the Parent Div and get only the following output
<strong><span style="font-size:200%;">My Name is Rashid Farooq</span></strong>
I have tried
$dom = new DOMDocument;
$dom->loadHTML($html_string);
$divs = $dom->getElementsByTagName('div');
$innerHTML_contents = $divs->item(0)->textContent;
echo $innerHTML_contents;
But it gives me only 'My Name is Rashid Farooq' and strips all the tags.
How can I remove only the parent div and keep all the other HTML contents inside it?
Try using this function:
function DOMinnerHTML($element)
{
    $innerHTML = "";
    $children = $element->childNodes;
    foreach ($children as $child)
    {
        $tmp_dom = new DOMDocument();
        $tmp_dom->appendChild($tmp_dom->importNode($child, true));
        $innerHTML.=trim($tmp_dom->saveHTML());
    }
    return $innerHTML;
}
Use it like this:
$dom = new DOMDocument;
$dom->loadHTML($html_string);
$divs = $dom->getElementsByTagName('div');
$innerHTML_contents = DOMinnerHTML($divs->item(0));
echo $innerHTML_contents;
Output:
<strong><span style="font-size:200%;">My Name is Rashid Farooq</span></strong>
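As an alternative sketch, DOMDocument::saveHTML() accepts an optional node argument, so each child can be serialized directly without building a temporary document. The function name innerHtml() is only illustrative, and the sample string is a shortened version of the one in the question:

```php
<?php
// Illustrative helper: concatenates the serialized children of $element.
function innerHtml(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes just that node and its descendants.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

$html_string = '<div class="quote" post_id="57">'
             . '<strong><span style="font-size:200%;">My Name is Rashid Farooq</span></strong>'
             . '</div>';

libxml_use_internal_errors(true);   // keep any parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html_string);
libxml_clear_errors();

$inner = innerHtml($dom->getElementsByTagName('div')->item(0));
echo $inner;
```

This keeps the strong and span tags and drops only the wrapping div.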

Iterate through elements with DOMDocument & DOMXPath

I am trying to iterate through every child element of the containing div:
$html = ' <div id="roothtml">
<h1>
Introduction</h1>
<p>text</p>
<h2>
text</h2>
<p>
test</p>
</div>';
And I have this PHP:
$dom = new DOMDocument();
$dom->loadHTML($html);
$dom->preserveWhitespace = false;
$xpath = new DOMXPath($dom);
$els = $xpath->query("/div");
print_r($els);
All I get though is DOMNodeList Object ( )
Having looked at the IBM tutorial, I expected to get an array. What am I doing wrong?
Any help is appreciated.
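You're using the wrong query string. loadHTML() wraps the fragment in html and body elements, so the div is not the root element and /div matches nothing; you should be using //div.
Iterate over the list like this: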
$els = $xpath->query("//div");
foreach( $els as $el) {
echo $el->textContent;
}
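Since the question asks for every child element of the containing div, a slightly narrower query may be closer to the goal. A minimal sketch, using a compacted version of the question's markup (the $tags array is only there to show what was visited):

```php
<?php
$html = '<div id="roothtml"><h1>Introduction</h1><p>text</p><h2>text</h2><p>test</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select only the direct element children of the div with id "roothtml".
$tags = [];
foreach ($xpath->query('//div[@id="roothtml"]/*') as $el) {
    $tags[] = $el->nodeName;
    echo $el->nodeName, ': ', trim($el->textContent), PHP_EOL;
}
```

Note that print_r() on a DOMNodeList shows little or nothing useful even when the list has matches; checking its length property or looping over it is a more reliable way to inspect the result.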
