I'd like to get the text of the ul > li that immediately follows the h2 with the text ABC. The text in this case would be 123.
<h2>CDE</h2>
<ul>...</ul>
<h2>ABC</h2>
<ul>
<li>
<span>123</span>
</li>
</ul>
This is what I have, but it's not working:
$dom = new DOMDocument();
$dom->loadHTML($html); // $html is the code above
$h2_all = $dom->getElementsByTagName('h2');
foreach($h2_all as $h2) {
$h2_text = $h2->textContent;
if (trim(strtolower($h2_text)) == 'abc') {
var_dump($h2->nextSibling);
}
}
I assume it's because $h2 doesn't contain the ul data I need, but I'm not sure how to get it.
You can use an xpath query:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$qry = '//ul[preceding::h2[1] = "ABC"]/li/span';
$result = $xp->query($qry)->item(0)->nodeValue;
query details:
// # the path can start from anywhere in the dom tree
ul
[preceding::h2[1] = "ABC"] # condition: the first preceding h2 has the value "ABC"
/li/span # continue the path down to the span node
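For comparison, here is a minimal self-contained sketch of a closely related query that starts from the h2 and walks forward with the following-sibling axis. The sample markup (including the 000 placeholder in the first list) and the variable names are assumptions made for illustration, and the query only works when the h2 and the ul are siblings.

```php
<?php
// Sample markup modelled on the question; the first list's content is a placeholder.
$html = '<h2>CDE</h2><ul><li><span>000</span></li></ul>'
      . '<h2>ABC</h2><ul><li><span>123</span></li></ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Find the h2 whose trimmed text is "ABC", then take the first ul sibling after it
// and the first li inside that ul.
$li = $xp->query('//h2[normalize-space(.) = "ABC"]/following-sibling::ul[1]/li[1]')->item(0);

// item(0) returns null when nothing matched, so guard before reading the text.
$text = ($li !== null) ? trim($li->textContent) : null;
echo $text, "\n";
```

Anchoring on the h2 rather than on the ul can be easier to read when the heading is the thing you are actually searching for.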
Check the siblings and find the first ul:
$ul = null;
foreach($dom->getElementsByTagName('h2') as $h2) {
if(trim(strtolower($h2->textContent)) == "abc") {
$obj = $h2->nextSibling;
while($obj != null) {
if($obj->nodeName == "ul") {
$ul = $obj;
break 2;
}
$obj = $obj->nextSibling;
}
}
}
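//make sure ul has at least one li; firstChild may be a whitespace text node, so look up the li explicitly
$li = ($ul !== null) ? $ul->getElementsByTagName('li')->item(0) : null;
if($li !== null) {
echo trim($li->nodeValue);
}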
Related
I have the following:
$node = $doc->getElementsByTagName('img');
if ($node->item(0) == null || $node->item(0) == '') {
// do stuff
} elseif ($node->item(0)->hasAttribute('src')) {
// do other stuff
} else {
// do more other stuff
}
What I want is to only return images from the body tag.
I have tried:
$body = $doc->getElementsByTagName('body');
foreach ($body as $body_node) {
$node = $body_node->getElementsByTagName('img');
}
However, if there is an image in the head, it still seems to be returned by
$node->item(0)->hasAttribute('src')
There should never be an img in the head, but I find that some URLs add one in a noscript tag in the head.
So how do I return only images from the body tag, excluding any found in the head tag?
Do it using DOMXPath:
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//body//img');
$nodes is now a DOMNodeList that you can iterate over.
If you only want img nodes that have a src attribute:
$nodes = $xpath->query('//body//img[@src]');
Edit: Here is a fully working example:
<?php
$contents = file_get_contents('http://stackoverflow.com/');
$doc = new DOMDocument();
$doc->loadHTML($contents);
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//body//img');
foreach ($nodes as $node) {
echo $node->getAttribute('src') . "\n";
}
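A variant that does not need network access may be easier to experiment with. This is only a sketch: the inline markup and the file names a.png and b.png are made up for illustration, and it shows the `[@src]` predicate skipping an img that has no src attribute.

```php
<?php
// Made-up markup: two images with a src attribute, one without.
$html = '<html><body><img src="a.png"><img alt="no source">'
      . '<p><img src="b.png"></p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Collect the src of every img below body that actually has a src attribute.
$srcs = [];
foreach ($xpath->query('//body//img[@src]') as $img) {
    $srcs[] = $img->getAttribute('src');
}
print_r($srcs);
```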
How do I change the innerHTML of a PHP DOMElement?
Another solution:
1) create new DOMDocumentFragment from the HTML string to be inserted;
2) remove old content of our element by deleting its child nodes;
3) append DOMDocumentFragment to our element.
function setInnerHTML($element, $html)
{
$fragment = $element->ownerDocument->createDocumentFragment();
$fragment->appendXML($html);
while ($element->hasChildNodes())
$element->removeChild($element->firstChild);
$element->appendChild($fragment);
}
Alternatively, we can replace our element with its clean copy and then append DOMDocumentFragment to this clone.
function setInnerHTML($element, $html)
{
$fragment = $element->ownerDocument->createDocumentFragment();
$fragment->appendXML($html);
$clone = $element->cloneNode(); // Get element copy without children
$clone->appendChild($fragment);
$element->parentNode->replaceChild($clone, $element);
}
Test:
$doc = new DOMDocument();
$doc->loadXML('<div><span style="color: green">Old HTML</span></div>');
$div = $doc->getElementsByTagName('div')->item(0);
echo $doc->saveHTML();
setInnerHTML($div, '<p style="color: red">New HTML</p>');
echo $doc->saveHTML();
// Output:
// <div><span style="color: green">Old HTML</span></div>
// <div><p style="color: red">New HTML</p></div>
I needed to do this for a project recently and ended up with an extension to DOMElement: http://www.keyvan.net/2010/07/javascript-like-innerhtml-access-in-php/
Here's an example showing how it's used:
<?php
require_once 'JSLikeHTMLElement.php';
$doc = new DOMDocument();
$doc->registerNodeClass('DOMElement', 'JSLikeHTMLElement');
$doc->loadHTML('<div><p>Para 1</p><p>Para 2</p></div>');
$elem = $doc->getElementsByTagName('div')->item(0);
// print innerHTML
echo $elem->innerHTML; // prints '<p>Para 1</p><p>Para 2</p>'
// set innerHTML
$elem->innerHTML = 'FF';
// print document (with our changes)
echo $doc->saveXML();
?>
I think the best thing you can do is come up with a function that will take the DOMElement that you want to change the InnerHTML of, copy it, and replace it.
In very rough PHP:
function replaceElement($el, $newInnerHTML) {
// use the element's own document; note that createElement treats the second argument as text, not markup
$newElement = $el->ownerDocument->createElement($el->nodeName, $newInnerHTML);
$el->parentNode->insertBefore($newElement, $el);
$el->parentNode->removeChild($el);
return $newElement;
}
This doesn't take into account attributes and nested structures, but I think this will get you on your way.
I ended up writing this function by combining a few functions from other people on this page. I mostly changed the one from Joanna Goch in the way Peter Brand suggests, and also added some code from Guest and from other places.
This function does not use an extension and does not use appendXML (which is very picky and breaks on even a single unclosed BR tag), and it seems to work well.
function set_inner_html( $element, $content ) {
$DOM_inner_HTML = new DOMDocument();
$internal_errors = libxml_use_internal_errors( true );
$DOM_inner_HTML->loadHTML( mb_convert_encoding( $content, 'HTML-ENTITIES', 'UTF-8' ) );
libxml_use_internal_errors( $internal_errors );
$content_node = $DOM_inner_HTML->getElementsByTagName('body')->item(0);
while ( $element->hasChildNodes() ) {
$element->removeChild( $element->firstChild );
}
// import the children of body rather than body itself, so no stray body tag ends up inside $element
foreach ( $content_node->childNodes as $child ) {
$element->appendChild( $element->ownerDocument->importNode( $child, true ) );
}
}
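The strictness of appendXML mentioned above can be checked with a small sketch like the following. The sample strings and variable names are assumptions; the point is only that appendXML expects well-formed XML, so an unclosed br is rejected while a self-closed one is accepted.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<div/>');

// Collect libxml errors instead of emitting warnings, so the return value can be inspected.
libxml_use_internal_errors(true);

// HTML-style unclosed tag: not well-formed XML, so appendXML reports failure.
$fragment = $doc->createDocumentFragment();
$strict = @$fragment->appendXML('line one<br>line two');

// XML-style self-closing tag: accepted.
$fragment = $doc->createDocumentFragment();
$closed = @$fragment->appendXML('line one<br/>line two');

libxml_clear_errors();
var_dump($strict, $closed);
```

This is why the loadHTML-based approach tends to be more forgiving for real-world HTML input.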
It seems that appendXML doesn't always work, for example if you try to append XML with 3 levels. Here is the function I wrote that always works (you want to set $content as the innerHTML of $element):
function setInnerHTML($DOM, $element, $content) {
$DOMInnerHTML = new DOMDocument();
$DOMInnerHTML->loadHTML($content);
$contentNode = $DOMInnerHTML->getElementsByTagName('body')->item(0)->firstChild;
$contentNode = $DOM->importNode($contentNode, true);
$element->appendChild($contentNode);
return $element;
}
Have a look at the PHP Simple HTML DOM Parser library: http://simplehtmldom.sourceforge.net/
It looks pretty straightforward. You can change the innertext property of your elements. It might help.
Here is a replace by class function I just wrote:
It will replace the innerHtml of a class. You can also specify the node type eg. div/p/a etc.
function replaceInnerHtmlByClass($html, $replace=null, $class=null, $nodeType=null){
if(!$nodeType){ $nodeType = '*'; }
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//{$nodeType}[contains(concat(' ', normalize-space(@class), ' '), '$class')]");
foreach($nodes as $node) {
while($node->childNodes->length){
$node->removeChild($node->firstChild);
}
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($replace);
$node->appendChild($fragment);
}
return $dom->saveHTML($dom->documentElement);
}
Here is another function I wrote to remove nodes with a specific class but preserving the inner html.
Setting replace to true will discard the inner html.
Setting replace to any other content will replace the inner html with the provided content.
function stripTagsByClass($html, $class=null, $nodeType=null, $replace=false){
if(!$nodeType){ $nodeType = '*'; }
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//{$nodeType}[contains(concat(' ', normalize-space(@class), ' '), '$class')]");
foreach($nodes as $node) {
$innerHTML = '';
$children = $node->childNodes;
foreach($children as $child) {
$tmp = new DOMDocument();
$tmp->appendChild($tmp->importNode($child,true));
$innerHTML .= $tmp->saveHTML();
}
$fragment = $dom->createDocumentFragment();
if($replace !== null && $replace !== false){
if($replace === true){ $replace = ''; }
$innerHTML = $replace;
}
$fragment->appendXML($innerHTML);
$node->parentNode->replaceChild($fragment, $node);
}
return $dom->saveHTML($dom->documentElement);
}
These functions can easily be adapted to use other attributes as the selector.
I only needed it to evaluate the class attribute.
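As a side note on the class test used by both functions, here is a hedged sketch of the common whole-token variant of that XPath expression. It pads the class name with spaces on both sides so that "note" does not also match "notebook". The markup, the class names, and the $matched variable are assumptions for illustration.

```php
<?php
// Made-up markup: "note" appears as a whole class token in A and C, but only as a prefix in B.
$html = '<div class="note big">A</div><div class="notebook">B</div><p class="big note">C</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$class = 'note';
// Pad both the attribute value and the searched name with spaces to match whole tokens only.
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";

$matched = [];
foreach ($xpath->query($query) as $node) {
    $matched[] = $node->textContent;
}
print_r($matched);
```

If $class can come from user input, it would need escaping before being interpolated into the query.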
Developing on from Joanna Goch's answer, this function will insert either a text node or an HTML fragment:
function nodeFromContent($node, $content) {
//creates a text node, or dom node if content contains html
$lt = strpos($content, '<');
$gt = strrpos($content, '>');
if (!($lt === false || $gt === false) && $gt > $lt) {
//< followed by > means potentially contains HTML
$DOMInnerHTML = new DOMDocument();
$DOMInnerHTML->loadHTML($content);
$contentNode = $DOMInnerHTML->getElementsByTagName('body')->item(0);
$newNode = $node->ownerDocument->importNode($contentNode, true);
} else {
$newNode = $node->ownerDocument->createTextNode($content);
}
return $newNode;
}
usage
$newNode = nodeFromContent($node, $content);
$node->parentNode->insertBefore($newNode, $node);
//or $node->appendChild($newNode) depending on what you require
Here is how you do it:
$doc = new DOMDocument('');
$label = $doc->createElement('label');
$label->appendChild($doc->createTextNode('test'));
$li->appendChild($label);
echo $doc->saveHTML();
function setInnerHTML($DOM, $element, $innerHTML) {
$node = $DOM->createTextNode($innerHTML);
$element->appendChild($node);
}
Suppose I have a string containing some HTML. I want to remove every li tag before reaching the first p tag.
How do I achieve something like that?
Example string:
$str = "<img src='something.png'/>some_text_here<li>needs_to_be_removed</li>
<li>also_needs_to_be_removed</li>some_other_text<p>finally</p>more_text_here
<li>this_should_not_be_removed</li>";
The first two li tags need to be removed.
Here is what you need. Simple and effective:
$mystring = "mystringwith<li>toberemovedstring</li><li>againremove</li><p>do not remove me</p>";//the string you provide
$findme = '<li>';//the string you want to search in $mystring
$findpee = '<p>';//haha pee also where to end it
$pos = strpos($mystring, $findme);//first position of <li>
$pospee = strpos($mystring, $findpee);// then position of pee.. get it :)
//Then we remove it
$result=substr_replace ( $mystring ,"" , $pos, ($pospee-$pos));
echo $result;
Edit: PHP sandbox
http://sandbox.onlinephpfunctions.com/code/e534259e2312682a04b64c6e3aae1521422aacd2
you can check the result here as well
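Note that the substr_replace approach removes everything between the first li and the first p, including any text in between. A possible string-based alternative that removes only the li elements is sketched below. The function name removeLiBeforeFirstP is hypothetical, and the regular expression assumes simple, non-nested li tags, so treat it as a sketch rather than a general HTML solution.

```php
<?php
// Hypothetical helper: strip <li>...</li> elements that occur before the first <p> tag.
function removeLiBeforeFirstP(string $str): string
{
    // Locate the first opening p tag (with or without attributes).
    if (!preg_match('~<p[\s>]~i', $str, $m, PREG_OFFSET_CAPTURE)) {
        return $str; // no p tag: leave the string untouched
    }
    $pPos = $m[0][1];

    $head = substr($str, 0, $pPos); // everything before the first p
    $tail = substr($str, $pPos);    // the first p and everything after it

    // Remove simple, non-nested li elements from the head part only.
    $head = preg_replace('~<li\b[^>]*>.*?</li>~is', '', $head);

    return $head . $tail;
}

$str = "<img src='something.png'/>some_text_here<li>needs_to_be_removed</li>\n"
     . "<li>also_needs_to_be_removed</li>some_other_text<p>finally</p>more_text_here\n"
     . "<li>this_should_not_be_removed</li>";

$result = removeLiBeforeFirstP($str);
echo $result, "\n";
```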
You can do it with PHP's DOMDocument using the traversal function below:
$doc = new DOMDocument();
$doc->loadHTML($str);
$foundp = false;
showDOMNode($doc);
//now $doc contains the string you want
$newstr = $doc->saveHTML();
function showDOMNode(DOMNode $domNode) {
global $foundp;
//childNodes is a live list, so loop over a copy; removing nodes while iterating the list itself skips siblings
foreach (iterator_to_array($domNode->childNodes) as $node)
{
if ($node->nodeName == "li" && $foundp==false){
//delete this node
$domNode->removeChild($node);
}
else if ($node->nodeName == "p"){
//stop here
$foundp = true;
return;
}
else if($node->hasChildNodes() && $foundp==false) {
//recursively
showDOMNode($node);
}
}
}
With XPath:
$str = "<img src='something.png'/>some_text_here<li>needs_to_be_removed</li>
<li>also_needs_to_be_removed</li>some_other_text<p>finally</p>more_text_here
<li>this_should_not_be_removed</li>";
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML('<div>' . $str .'</div>', LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
// ^---------------^----- add a root element
$xp = new DOMXPath($dom);
$lis = $xp->query('//p[1]/preceding-sibling::li');
foreach ($lis as $li) {
$li->parentNode->removeChild($li);
}
$result = '';
// add each child node of the root element to the result
foreach ($dom->getElementsByTagName('div')->item(0)->childNodes as $child) {
$result .= $dom->saveHTML($child);
}
I would suggest that using a PHP parser library would be a much better and faster approach. I personally use this one in my projects: https://github.com/paquettg/php-html-parser. It provides APIs like
$child->nextSibling()
$content->innerHtml,
$content->firstChild()
and more which can come in handy.
You can loop over all elements with foreach, keep track of the "li" tags you encounter, and, when you reach a "p" tag after them, delete those li elements via $child->previousSibling();
file.html
<div>
apple
</div>
$html = new DOMDocument();
$html->preserveWhiteSpace = true;
$html->loadHTML( file_get_contents('file.html') );
$nodes = $html->getElementsByTagName('*');
foreach($nodes as $i=>$node) {
if($node->nodeName == 'div')
echo $node->nodeValue;
}
This returns 'apple'. How do I get the child node including the child node's value, as in: apple
You can pass a DOM node to DOMDocument::saveXML and it will output that node's actual markup instead:
$html = new DOMDocument();
$html->preserveWhiteSpace = true;
$html->loadHTML( file_get_contents('file.html') );
$nodes = $html->getElementsByTagName('*');
foreach($nodes as $i=>$node) {
if($node->nodeName == 'div') {
//Navigate to the specific element you want
//then pass it to saveXML
echo $html->saveXML($node->childNodes->item(1));
}
}
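If the index of the wanted child is not known in advance, a sketch like the following collects the markup of all children instead. The span wrapper around apple is an assumption, since the inner tag of file.html is not shown above, and the markup is inlined so the example runs without the file.

```php
<?php
$html = new DOMDocument();
// Assumed content for illustration; the real file.html may use a different inner tag.
$html->loadHTML('<div><span>apple</span></div>');

$div = $html->getElementsByTagName('div')->item(0);

// Serialize each child node in turn to build the equivalent of innerHTML.
$inner = '';
foreach ($div->childNodes as $child) {
    $inner .= $html->saveHTML($child);
}
echo $inner, "\n";
```

saveHTML is used here instead of saveXML so that the output follows HTML serialization rules; either accepts a node argument.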