DomDocument get all divs and put inside an array - php

I have some divs with the same id and the same class, as you can see below:
<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>
<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>
....
I want to save all of them in an array for later use, in this format:
[0] => '<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>',
[1] => '<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>',
....
For that I'm using this code below:
$dom = new DOMDocument(); // Create DOMDocument object.
$dom->loadHTMLFile($htmlOut); // Load target file.
$div =$dom->getElementById('results_information'); // Take all div elements.
But it doesn't work. How can I solve this problem and put my divs inside an array?

To solve your problem, follow the steps below.
First, select the elements by class rather than by ID: an id is supposed to be unique within a document, and getElementById() returns at most one element.
Assume you have the following HTML in a variable called $htmlOut:
<div id="results_information" class="control_results">
<span style="background:black; color:white">
hellow world
</span>
<strong>2</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
<img src="hello.png" />
</div>
We need to extract the HTML inside the two elements with class control_results and put it into an array. For this we use DomDocument and DomXPath:
$array = array();
$dom = new DomDocument();
$dom->loadHtml($htmlOut);
$finder = new DomXPath($dom);
$classname = "control_results";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
That code selects every element whose class attribute contains control_results and stores the result in the variable $nodes.
Now we need to walk $nodes (a DOMNodeList) and extract the HTML of each matched element. I created a function to handle this:
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
This function serializes every child node (all the HTML inside the element with class control_results) and returns the result as a string.
Now you only need to loop over $nodes with foreach and call that function, like this:
foreach ($nodes as $rowNode) {
$array[] = get_inner_html($rowNode);
}
var_dump($array);
Below is the complete code:
$htmlOut = '
<div id="results_information" class="control_results">
<span style="background:black; color:white">
hellow world
</span>
<strong>2</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
<img src="hello.png" />
</div>
';
$array = array();
$dom = new DomDocument();
$dom->loadHtml($htmlOut);
$finder = new DomXPath($dom);
$classname = "control_results";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $rowNode) {
$array[] = get_inner_html($rowNode);
}
var_dump($array);
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
But this code has a small problem. If you check the results, the array is:
0 => string '<span style="background:black; color:white">hellow world</span><strong>2</strong>',
1 => string '<strong>2</strong><img src="hello.png"/>'
instead of:
0 => string '<div id="results_information" class="control_results"><span style="background:black; color:white">hellow world</span><strong>2</strong></div>',
1 => string '<div id="results_information" class="control_results"><strong>2</strong><img src="hello.png"/></div>'
In this case you can loop over the array and wrap each entry in the opening and closing div tags, then save the array again. Alternatively, serialize the matched node itself with $dom->saveHTML($rowNode) instead of its children; that output includes the outer div.
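A minimal sketch of the second option, assuming PHP 5.3.6 or later (where saveHTML() accepts a node argument). The sample markup is shortened and the duplicate ids are left out for illustration; exact whitespace in the serialized strings may vary.

```php
<?php
// Shortened sample markup (an assumption for illustration).
$htmlOut = '<div class="control_results"><strong>1</strong></div>'
         . '<div class="control_results"><strong>2</strong></div>';

$dom = new DOMDocument();
$dom->loadHTML($htmlOut);

$finder = new DOMXPath($dom);
$nodes = $finder->query("//div[contains(@class, 'control_results')]");

$array = array();
foreach ($nodes as $node) {
    // Passing the node to saveHTML() serializes the element itself,
    // so each entry keeps its wrapping <div> tag.
    $array[] = $dom->saveHTML($node);
}

print_r($array);
```

With this approach the get_inner_html() helper is not needed, since nothing has to be re-wrapped afterwards.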

You will need to use XPath and get the elements by class name.
$dom = new DOMDocument();
$dom->loadHTMLFile($htmlOut); // load the document before querying it
$xpath = new DOMXpath($dom);
$div = $xpath->query('//div[contains(@class, "control_results")]');

Related

Need to get divs from string based on matching class

I have a variable $company_id = 8; and a block of HTML content stored as a string called all_content:
<div class="company-id-8">
Content One
</div>
<div class="company-id-9">
Content Two
</div>
<div class="company-id-8">
Content Three
</div>
<div class="company-id-3">
Content Four
</div>
I need to remove all of the divs from all_content that don't match the current company ID class. So, once filtered, the above html should become:
<div class="company-id-8">
Content One
</div>
<div class="company-id-8">
Content Three
</div>
I have the following code to filter out divs that don't belong to the current company:
$dom = new DomDocument();
$dom->loadHTML( $full_message );
$finder = new DomXPath($dom);
$classname = "company-id-" . $company_id;
$nodes = $finder->query("//div[contains(@class, '$classname')]");
foreach ( $nodes as $node ) {
$filtered_content .= ;
}
I can't seem to work out how to get my filtered div nodes back into the filtered_content string though?
How can I tidy this up and get it working?
Solution is to do the following:
$filtered_content = "";
foreach ( $nodes as $node ) {
$tmp_doc = new DOMDocument();
$tmp_doc->appendChild($tmp_doc->importNode($node,true));
$filtered_content .= $tmp_doc->saveHTML();
}
filtered_content ends up being a usable HTML string with the correct content.
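One thing to watch with contains(@class, ...): it is a substring test, so company-id-8 would also match company-id-80. A hedged sketch of a stricter match follows, padding the class attribute with spaces so only whole class names match. The sample content, including the company-id-80 and featured classes, is made up for illustration, and saveHTML($node) is used in place of the temporary document.

```php
<?php
$company_id = 8;
// Hypothetical sample content, not taken from the question.
$all_content = '<div class="company-id-8">Content One</div>'
             . '<div class="company-id-80">Content Two</div>'
             . '<div class="company-id-8 featured">Content Three</div>';

$dom = new DOMDocument();
$dom->loadHTML($all_content);
$finder = new DOMXPath($dom);

$classname = 'company-id-' . $company_id;
// Surround the normalized class list with spaces, then look for " name ".
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";

$filtered_content = '';
foreach ($finder->query($query) as $node) {
    // Serialize each matching div, including its own tag.
    $filtered_content .= $dom->saveHTML($node);
}

echo $filtered_content;
```

Here the company-id-80 div is skipped, while the div carrying an extra class is still kept.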

Replace content specific HTML tag using PHP

I have HTML code:
<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>
Using PHP, I want to replace all < symbols located inside code elements. For example, the code above should be converted to:
<div>
<h1>Header</h1>
<code>&lt;p>First code&lt;/p></code>
<p>Next example</p>
<code>&lt;b>Second example&lt;/b></code>
</div>
I tried using the PHP DomDocument class, but it didn't work. Below is my code:
$dom = new DOMDocument();
$dom->loadHTML($content);
$innerHTML= '';
$tmp = '';
if(count($dom->getElementsByTagName('*'))){
foreach ($dom->getElementsByTagName('*') as $child) {
if($child->tagName == 'code'){
$tmp = $child->ownerDocument->saveXML( $child);
$innerHTML .= htmlentities($tmp);
}
else{
$innerHTML .= $child->ownerDocument->saveXML($child);
}
}
}
So, you're iterating over the markup properly, and your use of saveXML() was close to what you want, but nowhere in your code do you try to actually change the contents of the element. This should work:
<?php
$content='<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($content, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
foreach ($dom->getElementsByTagName('code') as $child) {
// get the markup of the children
$html = implode(array_map([$child->ownerDocument,"saveHTML"], iterator_to_array($child->childNodes)));
// create a node from the string
$text = $dom->createTextNode($html);
// remove existing child nodes (childNodes is a live list, so remove the first child until none remain)
while ($child->firstChild) {
$child->removeChild($child->firstChild);
}
// append the new text node - escaping is done automatically
$child->appendChild($text);
}
echo $dom->saveHTML();

Xpath DOMDocument not returning data from node

I am scanning a site using a DOMDocument and retrieving the @src attribute from the inner img element.
The html is:
<div class="content">
<div class="post">
<img src="abc"/>
</div>
<div class="post">
<img src="abc1"/>
</div>
<div class="post">
</div>
</div>
Note: I am specifically not using @src in my xpath. Also, this is precisely the way I need it to be coded.
Here is my php:
$document = new DOMDocument();
$document->loadXML($file);
$XPath = new DOMXPath($document);
$Query = '//div[@class="content"]//div[@class="post"]';
$posts = $XPath->query($Query);
foreach ($posts as $post) {
if($XPath->evaluate("@src", $post))
{
$return[] = $XPath->evaluate("@src", $post)->item(0);
}else{
$return[] = "";
}
}
It adds positions to the array $return, but they are all empty.
My question is: how do I make this line output the data?
$return[] = $XPath->evaluate("@src", $post)->item(0);
This doesn't work either:
$return[] = $XPath->evaluate("@src", $post)->item(0)->nodeValue;
Use .//@src[1]:
. => relative to the context node
// => any descendant
@src => the src attribute
[1] => the first one
You can even use string(nodeset) if you only care about the value (of course, leave that out if you need to be able to manipulate the attributes and use your ->item(0) solution).
foreach ($posts as $post){
$return[] = $XPath->evaluate("string(.//@src[1])",$post);
}
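For reference, a self-contained sketch of that loop run against the markup from the question. The $file string is inlined here as an assumption; in the question it comes from the scanned site.

```php
<?php
// Markup from the question, inlined for illustration.
$file = '<div class="content">'
      . '<div class="post"><img src="abc"/></div>'
      . '<div class="post"><img src="abc1"/></div>'
      . '<div class="post"></div>'
      . '</div>';

$document = new DOMDocument();
$document->loadXML($file);
$XPath = new DOMXPath($document);

$return = array();
foreach ($XPath->query('//div[@class="content"]//div[@class="post"]') as $post) {
    // string() yields the value of the first matching attribute,
    // or an empty string when the post has no img.
    $return[] = $XPath->evaluate('string(.//@src[1])', $post);
}

var_dump($return);
```

The third post has no image, so its entry is an empty string, which matches the else branch in the original code.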

How do I extract this value using PHP Dom

I have an HTML file; this is just a part of it:
<div id="result" >
<div class="res_item" id="1" h="63c2c439b62a096eb3387f88465d36d0">
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/gigablast.com.png"
alt="favicon for gigablast.com"
width=16
height=16
/>
<a
href="http://www.gigablast.com/"
rel="nofollow"
>
Gigablast
</a>
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/ask.com.png"
alt="favicon for ask.com"
width=16
height=16
/>
<a
href="http://ask.com/" rel="nofollow"
>
Ask.com - What's Your Question?
</a>....
I want to extract only the URL addresses (for example http://www.gigablast.com and http://ask.com/; there are at least 10 URLs in that HTML) from the above using PHP DOMDocument. I know this much, but I don't know how to move ahead:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$data = $doc->getElementById('result');
then what?? this is inside tag hence I cant use $data->getElementsByTagName() here!!
Use XPath to narrow the field down to the <a> elements inside the <div class="res_main"> elements:
$doc = new DomDocument();
$doc->loadHTMLFile('urllist.html');
$xpath = new DomXpath($doc);
$query = '//div[@class="res_main"]//a';
$nodes = $xpath->query($query);
$urls = array();
foreach ($nodes as $node) {
$href = $node->getAttribute('href');
if (!empty($href)) {
$urls[] = $href;
}
}
This solves the problem of picking up all the <a> elements inside of the document, since it allows you to filter only the ones you want (since you don't care about navigation links, etc)...
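A runnable sketch of the same idea on an inline string, so it can be tried without urllist.html. The markup is a simplified, well-formed stand-in for the file, and the nav link is invented to show the filtering. Because real-world HTML is often malformed, the sketch also silences parser warnings with libxml_use_internal_errors().

```php
<?php
// Simplified stand-in for urllist.html (an assumption for illustration).
$html = '<div id="nav"><a href="/about">About</a></div>'
      . '<div id="result">'
      . '<div class="res_main"><h2 class="res_main_top">'
      . '<a href="http://www.gigablast.com/" rel="nofollow">Gigablast</a></h2></div>'
      . '<div class="res_main"><h2 class="res_main_top">'
      . '<a href="http://ask.com/" rel="nofollow">Ask.com</a></h2></div>'
      . '</div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$urls = array();
// Only anchors that sit inside a res_main div and carry an href.
foreach ($xpath->query('//div[@class="res_main"]//a[@href]') as $a) {
    $urls[] = $a->getAttribute('href');
}

print_r($urls);
```

Adding [@href] to the query does the same job as the !empty($href) check for anchors that have no href attribute at all.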
You can call getElementsByTagName on a DOMElement object:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$result = $doc->getElementById('result');
$anchors = $result->getElementsByTagName('a');
$urls = array();
foreach ($anchors as $a) {
$urls[] = $a->getAttribute('href');
}
If you want to get image sources as well, that would be easy to add.
If you are just trying to extract the href attribute of all a tags in the document (and the <div id="result"> doesn't matter), you could use this:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$anchors = $doc->getElementsByTagName('a');
$urls = array();
foreach($anchors as $anchor) {
$urls[] = $anchor->getAttribute('href'); // read the attribute value from the element
}
// $urls is your collection of urls in the original document.

How to get nodes in first level using PHP DOMDocument?

I'm new to PHP DOM object and have a problem I can't find a solution. I have a DOMDocument with following HTML:
<div id="header">
</div>
<div id="content">
<div id="sidebar">
</div>
<div id="info">
</div>
</div>
<div id="footer">
</div>
I need to get all nodes that are on first level (header, content, footer). hasChildNodes() does not work, because first level node may not have children (header, footer).
For now my code looks like:
$dom = new DOMDocument();
$dom -> preserveWhiteSpace = false;
$dom -> loadHTML($html);
$childs = $dom -> getElementsByTagName('div');
But this gets me all divs. Any advice?
You may have to go outside of DOMDocument - maybe convert to SimpleXML or DOMXpath
$file = $DOCUMENT_ROOT. "test.html";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("/");
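A hedged sketch of the DOMXPath route for the markup in the question. It relies on loadHTML() wrapping a fragment in html and body elements, so the first-level nodes are the element children of body. The HTML is inlined here and collecting the ids is only for demonstration.

```php
<?php
// Markup from the question, inlined for illustration.
$html = '<div id="header"></div>'
      . '<div id="content"><div id="sidebar"></div><div id="info"></div></div>'
      . '<div id="footer"></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$ids = array();
// "/html/body/*" selects only the direct element children of <body>,
// so the nested sidebar and info divs are not included.
foreach ($xpath->query('/html/body/*') as $node) {
    $ids[] = $node->getAttribute('id');
}

print_r($ids);
```

If the input is a full document rather than a fragment, the same path still applies as long as the elements of interest are direct children of body.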
Here's how I grab the first level elements (in this case, the top level TD elements in a table row):
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadHTML( $tr_element );
$xpath = new DOMXPath( $doc );
$td = $xpath->query("//tr/td[1]")->item(0);
do{
if( $innerHTML = self::DOMinnerHTML( $td ) )
array_push( $arr, $innerHTML );
$td = $td->nextSibling;
} while( $td != null );
$arr now contains the top TD elements, but not nested table TDs which you would get from
$dom->getElementsByTagName( 'td' );
The DOMinnerHTML function is something I snagged somewhere to get the innerHTML of an element/node:
public static function DOMinnerHTML( $element, $deep=true )
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild( $tmp_dom->importNode( $child, $deep ) );
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
