XPATH Get Attribute of Current Node - php

Having trouble getting the attribute of the current node in PHP and making a condition based on that attribute...
Example XML
<div class='parent'>
<div class='title'>A Title</div>
<div class='child'>some text</div>
<div class='child'>some text</div>
<div class='title'>A Title</div>
<div class='child'>some text</div>
<div class='child'>some text</div>
</div>
What I am trying to do is traverse the XML in PHP and do different things based on the class of each element/node.
E.g.
$doc->loadHTML($xml_string);
$xpath = new DOMXpath($doc);
$nodeLIST = $xpath->query("//div[@class='parent']/div");
foreach ($nodeLIST as $node) {
if (CURRENT DIV NODE ATTRIBUTE EQUALS TITLE) {
SET $TITLE VARIABLE TO THE TEXT() OF THE CURRENT NODE
}
ELSEIF(CURRENT DIV NODE ATTRIBUTE EQUALS CHILD){
SET $CHILD VARIABLE TO THE TEXT() OF THE CURRENT NODE
}
}
I've tried all kind of things like the following...
if ($xpath->query("./[@class='title']/text()",$node)->length > 0) { }
But all I keep getting is PHP errors saying that my XPath syntax is not valid. Can anyone help me?

You can achieve this with the getAttribute() method. Example:
foreach($nodeLIST as $node) {
$attribute = $node->getAttribute('class');
if($attribute == 'title') {
// do something
} elseif ($attribute == 'child') {
// do something
}
}

$node->getAttribute('class') gives you the attribute value, $node->textContent the string contents of the node. I wouldn't dive into XPath to read out the string value.
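As a minimal sketch of how the two properties combine, the loop below groups each child's text under the most recent title. The sample texts and the $grouped array are assumptions made for illustration; they are not part of the original question.

```php
<?php
// Sample markup modelled on the question (texts changed so the groups are distinguishable).
$xml_string = "<div class='parent'>
<div class='title'>First Title</div>
<div class='child'>some text</div>
<div class='child'>more text</div>
<div class='title'>Second Title</div>
<div class='child'>other text</div>
</div>";

$doc = new DOMDocument();
$doc->loadHTML($xml_string);
$xpath = new DOMXPath($doc);

$grouped = [];   // hypothetical result shape: title => list of child texts
$title = null;   // most recently seen title

foreach ($xpath->query("//div[@class='parent']/div") as $node) {
    $class = $node->getAttribute('class');
    if ($class === 'title') {
        $title = trim($node->textContent);
        $grouped[$title] = [];
    } elseif ($class === 'child' && $title !== null) {
        $grouped[$title][] = trim($node->textContent);
    }
}

print_r($grouped);
```

Note that getAttribute() returns an empty string when the attribute is missing, so elements without a class simply fall through both branches.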

You can filter the 'title' and 'child' sets in different nodelists:
$titles = $xpath->query("//div[@class='parent']/div[@class='title']");
$children = $xpath->query("//div[@class='parent']/div[@class='child']");
And then process them separately:
foreach ($titles as $title) {
echo $title->textContent."\n";
}
foreach ($children as $child) {
echo $child->textContent."\n";
}
See: http://codepad.viper-7.com/x4LA50
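If you do want to stay in XPath for the per-node test, the expression has to be valid relative to the context node. The attempt "./[@class='title']" fails because a predicate must follow a node test. A sketch of two valid alternatives, using a shortened version of the sample markup (the $results array is only there for illustration):

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML("<div class='parent'><div class='title'>A Title</div><div class='child'>some text</div></div>");
$xpath = new DOMXPath($doc);

$results = [];
foreach ($xpath->query("//div[@class='parent']/div") as $node) {
    // string(@class) evaluates to the class attribute of the context node.
    $class = $xpath->evaluate("string(@class)", $node);

    // self::*[...] tests the context node itself against the predicate.
    $isTitle = $xpath->query("self::*[@class='title']", $node)->length > 0;

    $results[] = [$class, $isTitle, $node->textContent];
}

print_r($results);
```

For a plain attribute comparison, getAttribute() is still the simpler choice; the XPath form is mainly useful when the condition is more complex than a single attribute.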

Related

Replace content specific HTML tag using PHP

I have HTML code:
<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>
Using PHP, I want to replace every < symbol inside the code elements with its HTML entity (&lt;). For example, the code above should be converted to:
<div>
<h1>Header</h1>
<code>&lt;p>First code&lt;/p></code>
<p>Next example</p>
<code>&lt;b>Second example&lt;/b></code>
</div>
I tried using PHP's DOMDocument class, but it did not work. Below is my code:
$dom = new DOMDocument();
$dom->loadHTML($content);
$innerHTML= '';
$tmp = '';
if(count($dom->getElementsByTagName('*'))){
foreach ($dom->getElementsByTagName('*') as $child) {
if($child->tagName == 'code'){
$tmp = $child->ownerDocument->saveXML( $child);
$innerHTML .= htmlentities($tmp);
}
else{
$innerHTML .= $child->ownerDocument->saveXML($child);
}
}
}
So, you're iterating over the markup properly, and your use of saveXML() was close to what you want, but nowhere in your code do you try to actually change the contents of the element. This should work:
<?php
$content='<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($content, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
foreach ($dom->getElementsByTagName('code') as $child) {
// get the markup of the children
$html = implode(array_map([$child->ownerDocument,"saveHTML"], iterator_to_array($child->childNodes)));
// create a node from the string
$text = $dom->createTextNode($html);
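// remove existing child nodes; loop on firstChild because childNodes is a live list
// and removing inside a foreach over it can skip nodes
while ($child->firstChild) {
$child->removeChild($child->firstChild);
}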
// append the new text node - escaping is done automatically
$child->appendChild($text);
}
echo $dom->saveHTML();

How to get a list of all html elements in PHP?

According to the documentation for DOMDocument::getElementsByTagName, I can call the function with "*" argument, and get a list of all HTML elements from some HTML code.
However, with the following code:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>hello</div><div>bye</div></body></html>");
$nodes = $dom->getElementsByTagName("*");
foreach ($nodes as $node) {
$new_text= new DOMText($node->textContent."MODIFIED");
$node->removeChild($node->firstChild);
$node->appendChild($new_text);
}
$content = $dom->saveHTML();
echo $content;
?>
I get a list of only one element, and the result of execution of the code above is:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>hellobyeMODIFIED</html>
while I would expect something like this:
<html><body><div>helloMODIFIED</div><div>byeMODIFIED</div></body></html>
Shouldn't DOMDocument::getElementsByTagName method return a list of as many HTML elements as available in the HTML code?
Note: I need to create DOMText instances explicitly, because I need this to work in PHP 5.4. DOMNode::textContent is writable only from PHP 5.6 onwards.
The DOMDocument::getElementsByTagName method actually returns all the tags, if the first argument is '*'. But your code replaces <body> tag (including all child nodes) with a text node at the first iteration.
Iterate the nodes, and modify only the nodes with nodeType property equal to XML_TEXT_NODE:
$nodes = $dom->getElementsByTagName('*');
foreach ($nodes as $node) {
for ($child = $node->firstChild; $child; $child = $child->nextSibling) {
if (! ($child->nodeType === XML_TEXT_NODE && trim($child->textContent) !== '')) {
continue;
}
// The textContent is writable since PHP 5.6.1
if (PHP_VERSION_ID >= 50601) {
$child->textContent .= 'MODIFIED';
continue;
}
// For older versions, create DOMText explicitly
$text = new DOMText($child->textContent . 'MODIFIED');
try {
if ($child->parentNode->replaceChild($text, $child))
$child = $text;
} catch (Exception $e) {
trigger_error("Failed to modify text '$child->textContent': "
. $e->getMessage(), E_USER_WARNING);
}
}
}
echo $dom->saveHTML();
Note, for PHP versions 5.6.1 and newer, you don't need to create DOMText instances explicitly, since the DOMNode::textContent property is accessible for read and write. So you can simply modify the text by assigning a string value to this property. Only make sure that the node has no child nodes other than XML_TEXT_NODE.
The code above checks if trim($child->textContent) is not empty, because the document may contain extra space characters (including newline), e.g.:
<div><!-- newline/spaces -->
<span>text</span><!-- newline/spaces -->
</div><!-- newline/spaces -->
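An alternative sketch, not taken from the answer above: XPath can select the non-blank text nodes directly, which avoids the nested loop and the whitespace check. It relies on nodeValue being writable on text nodes; whether it behaves identically on PHP 5.4 is an assumption worth verifying on your target version.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>hello</div><div>bye</div></body></html>");
$xpath = new DOMXPath($dom);

// text() selects text nodes; normalize-space() is empty (false) for whitespace-only nodes.
// The list returned by query() is a snapshot, so editing nodes while iterating is safe.
foreach ($xpath->query('//text()[normalize-space()]') as $textNode) {
    $textNode->nodeValue .= 'MODIFIED';
}

$result = $dom->saveHTML();
echo $result;
```

Because only text nodes are touched, element structure such as <body> and <div> is left intact.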
DOMDocument::getElementsByTagName returns a new DOMNodeList instance containing all the matching elements.
And it works fine:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>hello</div><div>bye</div></body></html>");
$nodes = $dom->getElementsByTagName("*");
foreach ($nodes as $node) {
echo $node->tagName."<br />";
}
?>
It outputs all the tags in your document.
You probably need something like:
<?php
$dom = new DOMDocument();
$dom->loadHTML("<html><body><div>hello</div><div>bye</div></body></html>");
$nodes = $dom->getElementsByTagName("*");
foreach ($nodes as $node) {
if ($node->tagName=='div'){
$node->nodeValue .= "new content";
}
}
$content = $dom->saveHTML();
echo htmlspecialchars($content);
?>
Try this:
foreach($dom->getElementsByTagName('*') as $element ){
}

DomDocument get all divs and put inside an array

I have some divs with the same id and the same class, as you can see below:
<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>
<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>
....
I want to save all of them in an array to be used later, in this format:
[0] => '<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>',
[1] => '<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>',
....
For that I'm using this code below:
$dom = new DOMDocument(); // Create DOMDocument object.
$dom->loadHTMLFile($htmlOut); // Load target file.
$div =$dom->getElementById('results_information'); // Take all div elements.
But it doesn't work. How can I solve this problem and put my divs inside an array?
To solve your problem, follow the steps below.
First of all, select by class rather than by ID, because an id is supposed to be unique within a document.
Assume the following HTML is stored in a variable called $htmlOut:
<div id="results_information" class="control_results">
<span style="background:black; color:white">
hellow world
</span>
<strong>2</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
<img src="hello.png" />
</div>
We need to extract all the HTML inside these two control_results elements and put it in an array. For this we use DOMDocument and DOMXPath:
$array = array();
$dom = new DomDocument();
$dom->loadHtml($htmlOut);
$finder = new DomXPath($dom);
$classname = "control_results";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
With that code, $nodes holds every element whose class attribute contains control_results.
Now we need to loop over $nodes (a DOMNodeList, not an array) and extract the HTML of each matching element. I created a function to handle this:
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
This function serializes every child node (all the HTML inside the control_results element) and returns the result as a string.
Now you only need to loop over $nodes with foreach and call that function, like this:
foreach ($nodes as $rowNode) {
$array[] = get_inner_html($rowNode);
}
var_dump($array);
Below is the complete code:
$htmlOut = '
<div id="results_information" class="control_results">
<span style="background:black; color:white">
hellow world
</span>
<strong>2</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
<img src="hello.png" />
</div>
';
$array = array();
$dom = new DomDocument();
$dom->loadHtml($htmlOut);
$finder = new DomXPath($dom);
$classname = "control_results";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $rowNode) {
$array[] = get_inner_html($rowNode);
}
var_dump($array);
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
But this code has a small problem. If you check the array, the result is:
0 => string '<span style="background:black; color:white">hellow world</span><strong>2</strong>',
1 => string '<strong>2</strong><img src="hello.png"/>'
instead of:
0 => string '<div id="results_information" class="control_results"><span style="background:black; color:white">hellow world</span><strong>2</strong></div>',
1 => string '<div id="results_information" class="control_results"><strong>2</strong><img src="hello.png"/></div>'
In that case, you can loop over the array and wrap each entry in the opening and closing div tags, then store the result back in the array.
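Another option, sketched here as an alternative to wrapping the strings by hand: DOMDocument::saveHTML() accepts a node argument and serializes that node including its own tag. The markup below is a shortened stand-in for $htmlOut with the duplicate id attributes left out; exact whitespace in the serialized strings can vary between libxml versions.

```php
<?php
// Shortened, hypothetical stand-in for the $htmlOut markup used above.
$htmlOut = '<div class="control_results"><strong>1</strong></div>'
         . '<div class="control_results"><strong>2</strong><img src="hello.png"></div>';

$dom = new DOMDocument();
$dom->loadHTML($htmlOut);
$finder = new DOMXPath($dom);

$array = [];
foreach ($finder->query("//div[contains(@class, 'control_results')]") as $node) {
    // Passing the node serializes the element itself, not only its children.
    $array[] = $dom->saveHTML($node);
}

var_dump($array);
```

This removes the need for get_inner_html() when the outer element is what you want to keep.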
You will need to use XPath and get the elements by class name:
$dom = new DOMDocument();
$dom->loadHTMLFile($htmlOut); // load the target file, as in the question, before querying
$xpath = new DOMXpath($dom);
$div = $xpath->query('//div[contains(@class, "control_results")]');

How to get the value of special attributes / custom attributes of HTML using PHP DOM Parser?

<li data-docid="thisisthevaluetoget" class="search-results-item">
</li>
How to get the value of "data-docid"?
You can use DOMDocument to get at the attributes:
$html = '<li data-docid="thisisthevaluetoget" class="search-results-item"></li>';
$doc = new DOMDocument;
$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('li');
foreach ($nodes as $node) {
if ($node->hasAttributes()) {
foreach ($node->attributes as $a) {
echo $a->nodeName.': '.$a->nodeValue.'<br/>';
}
}
}
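If only the one attribute is needed, a shorter sketch is to ask for it by name instead of looping over all attributes. The $docIds array is an illustrative name, not part of the original answer.

```php
<?php
$html = '<li data-docid="thisisthevaluetoget" class="search-results-item"></li>';
$doc = new DOMDocument();
$doc->loadHTML($html);

$docIds = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    // hasAttribute() distinguishes a missing attribute from an empty one.
    if ($li->hasAttribute('data-docid')) {
        $docIds[] = $li->getAttribute('data-docid');
    }
}

print_r($docIds);
```

Custom data-* attributes are treated like any other attribute by DOMDocument, so no special handling is required.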
You could also do this with JavaScript and jQuery: read the value on the client and pass it to another PHP file as a query-string parameter, where it is available through $_GET.
an example is here

Remove HTML element from parsed HTML document on a condition

I've parsed a HTML document using Simple PHP HTML DOM Parser. In the parsed document there's a ul-tag with some li-tags in it. One of these li-tags contains one of those dreaded "Add This" buttons which I want to remove.
To make this worse, the list item has no class or id, and it is not always in the same position in the list. So there is no easy way (correct me if I'm wrong) to remove it with the parser.
What I want to do is to search for the string 'addthis.com' in all li-elements and remove any element that contains that string.
<ul>
<li>Foobar</li>
<li>addthis.com</li><!-- How do I remove this? -->
<li>Foobar</li>
</ul>
FYI: This is purely a hobby project in my quest to learn PHP and not a case of content theft for profit.
All suggestions are welcome!
I couldn't find a method that removes nodes explicitly, but you can remove an element by setting its outertext to an empty string.
$html = new simple_html_dom();
$html->load(file_get_contents("test.html"), false, false); // preserve formatting
foreach($html->find('ul li') as $element) {
if (count($element->find('a.addthis_button')) > 0) {
$element->outertext="";
}
}
echo $html;
Well what you can do is use jQuery after the parsing. Something like this:
$('li').each(function(i) {
if($(this).html() == "addthis.com"){
$(this).remove();
}
});
This solution uses the DOMDocument class and the DOMNode::removeChild method:
$str="<ul><li>Foobar</li><li>addthis.com</li><li>Foobar</li></ul>";
$remove='addthis.com';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elements = $doc->getElementsByTagName('li');
$domElemsToRemove = array();
foreach ($elements as $element) {
$pos = strpos($element->textContent, $remove); // or similar $element->nodeValue
if ($pos !== false) {
$domElemsToRemove[] = $element;
}
}
foreach( $domElemsToRemove as $domElement ){
$domElement->parentNode->removeChild($domElement);
}
$str = $doc->saveHTML(); // <ul><li>Foobar</li><li>Foobar</li></ul>
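The same idea can be sketched with DOMXPath, which does the string search inside the query. This is a variation on the answer above, not a replacement for it; it assumes the same sample markup.

```php
<?php
$str = "<ul><li>Foobar</li><li>addthis.com</li><li>Foobar</li></ul>";
$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);

// contains(., '...') tests the string value of each li, including nested elements.
// The query result is a snapshot, so removing nodes inside the loop is safe.
foreach ($xpath->query("//li[contains(., 'addthis.com')]") as $li) {
    $li->parentNode->removeChild($li);
}

$result = $doc->saveHTML();
echo $result;
```

Because the XPath result is not a live list, the separate $domElemsToRemove array is not needed here.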
