Remove HTML Tag using DOMDocument

Remove HTML Tag using DOMDocument - php

I'd like to remove <font> tags from my html and am trying to use replaceChild to do so, but it doesn't seem to work properly. Can anyone catch what might be wrong?
$html = '<html><body><br><font class="heading2">Limited Size and Resources</font><p><br><strong>Q: When can a member use the limited size and resources exception?</strong></p></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$font_tags = $dom->GetElementsByTagName('font');
foreach($font_tags as $font_tag) {
foreach($font_tag as $child) {
$child->replaceChild($child->nodeValue, $font_tag);
}
}
echo $dom->saveHTML();
From what I understand, $font_tags is a DOMNodeList, so I need to iterate through it twice in order to use the DOMNode::replaceChild function. I then want to replace the current value with just the content inside of the tags. However, when I output the $html nothing changes. Any ideas what could be wrong?
Here is a PHP Sandbox to test the code.

I'll put my remarks inline
$html = '<html><body><br><font class="heading2">Limited Size and Resources</font><p><br><strong>Q: When can a member use the limited size and resources exception?</strong></p></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$font_tags = $dom->GetElementsByTagName('font');
/* You only need one loop, as it is iterating your collection
You would only need a second loop if each font tag had children of their own
*/
foreach($font_tags as $font_tag) {
/* replaceChild replaces children of the node being called
So, to replace the font tag, call the function on its parent
$prent will be that reference
*/
$prent = $font_tag->parentNode;
/* You can't insert arbitrary text, you have to create a textNode
That textNode must also be a member of your document
*/
$prent->replaceChild($dom->createTextNode($font_tag->nodeValue), $font_tag);
}
echo $dom->saveHTML();
Updated Sandbox: Hopefully I understood your requirements correctly

First, let us find out what wasn't working in your code.
foreach($font_tag as $child) wasn't even iterating once as $font_tag is a single 'font' tag element from font_tags array, and not an array itself.
$child->replaceChild($child->nodeValue, $font_tag); - A child node can't replace its parent ($font_tag), but the reverse is possible.
As replaceChild is a method of the parent node to replace its child.
For more details check the PHP: DOMNode::replaceChild documentation, or the point 2 below my code.
echo $html will output the $html string, but not the updated $dom object that we are modifying.
This would work -
$html = '<html><body><br><font class="heading2">Limited Size and Resources</font><p><br><strong>Q: When can a member use the limited size and resources exception?</strong></p></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$font_tags = $dom->GetElementsByTagName('font');
foreach($font_tags as $font_tag)
{
$new_node = $dom->createTextNode($font_tag->nodeValue);
$font_tag->parentNode->replaceChild($new_node, $font_tag);
}
echo $dom->saveHTML();
I am creating a $new_node directly in the $dom, so the node is live in the DOMDocument and not any local variable.
To replace the child object $font_tag, we have to first traverse to the parent node using the parentNode method.
Finally, we are printing out the modified $dom using saveHTML method, which will convert the DOMDocument into a HTML String.

Remove a specific span tag from HTML while preserving/keeping the inside content using PHP and DOMDocument
<?php
$content = '<span style="font-family: helvetica; font-size: 12pt;"><div>asdf</div><span>TWO</span>Business owners are fearful of leading. They would rather follow the leader than embrace a bold move that challenges their confidence. </span>';
$dom = new DOMDocument();
// Use LIBXML for preventing output of doctype, <html>, and <body> tags
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//span[#style="font-family: helvetica; font-size: 12pt;"]') as $span) {
// Move all span tag content to its parent node just before it.
while ($span->hasChildNodes()) {
$child = $span->removeChild($span->firstChild);
$span->parentNode->insertBefore($child, $span);
}
// Remove the span tag.
$span->parentNode->removeChild($span);
}
// Get the final HTML with span tags stripped
$output = $dom->saveHTML();
print_r($output);

Related

How do i form this replacement with regular expression

I need to remove a particular string using php, the string needs to start with <div class='event'> followed by any possible string but which contains $myVariable, which is then followed by </div>. How do I remove all this using preg_replace()? I have worked out it might be something like this
preg_replace("<div class='event'>(.*)" . $myVariable . "(.*)</div>", "", $content);
But I cant get it to work.
Update:
I need to remove a div and everything inside it, the div contains an event name and date but I can only delete the div based on the events name and so the date needs to be defined as practically any string.

Let's imagine you have a <div> and inside it, there is some text node with a specific word you define with $myVariable.
The task is:
Read the document in
Initialize DOM
Collect the <div> tags with .nodeValue containing $myVariable text
Remove those tags from the DOM
Return updated DOM
The code for that algorithm is below (DOM is initialized with a HTML string in the demo):
$html = "<<YOUR_HTML_STRING>>"
$dom = new DOMDocument; // Declaring the DOM
$dom->loadHTML($html); // Initializing the DOM with an HTML string
$myVariable = "2015-09-12"; // Your dynamic variable
$xpath = new DOMXPath($dom); // Initializing the DOMXpath
$divs = $xpath->query("//div[contains(.,'$myVariable')]"); // Collecting DIVs
// having $myVariable
foreach($divs as $div) {
$div->parentNode->removeChild($div); // Removing the DIVs
}
echo $dom->saveHTML(); // Getting the updated DOM
See IDEONE demo
Note that you can force DOMDocument to omit adding !DOCTYPE using the following to declare and initialize DOM:
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

Strip out div with class and keep all other html within

My html content:
$content = <div class="class-name some-other-class">
<p>ack</p>
</div>
Goal: Remove div with class="class-name so that I'm left with:
<p>ack</p>
I know strip_tags($content, '<p>'); would do the job in this instance but I want to be able to target the divs with a certain class and preserve other divs etc.
And I'm aware that you shouldn't pass html through regex - So whats the best way/proper way to achieving this.

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($content); // loads your HTML
$xpath = new DOMXPath($doc);
// returns a list of all links with class containing class-name
$nlist = $xpath->query("div[contains(#class, 'class-name')]");
// Remove the nodes from the xpath query
foreach($nlist as $node) {
$node->parentNode->removeChild($node);
}
echo $doc->saveHtml();

Maybe with some jQuery? '$(".class-name").remove();'

PHP DOMDocument, retrieve just content of a div, without div tag

I'm using DOMDocument to retrieve on a HTML page a special div.
I just want to retrive the content of this div, without the div tag.
For example :
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML()
Here, i have the result :
<div id="inter">
//SOME THINGS IN MY DIV
</div>
And i just want to have :
//SOME THINGS IN MY DIV
Ideas ? Thanks !

I'm going to go with simple does it. You already have:
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML();
Now, DOMDocument::getElementById() returns one DOMElement which extends DOMNode which has the public stringnodeValue. Since you don't specify if you are expecting anything but text within that div, I'm going to assume that you want anything that may be stored in there as plain text. For that, we are going to remove $dom->saveHTML();, and instead replace it with:
$divString = $main->nodeValue;
With that, $divString will contain //SOME THINGS IN MY DIV, which, from your example, is the desired output.
If, however, you want the HTML of the inside of it and not just a String representation - replace it with the following instead:
$divString = "";
foreach($main->childNodes as $c)
$divString .= $c->ownerDocument->saveXML($c);
What that does is takes advantage of the inherited DOMNode::childNodes which contains a DOMNodeList each containing its own DOMNode (for reference, see above), and we loop through each one getting the ownerDocument which is a DOMDocument and we call the DOMDocument::saveXML() function. The reason we pass the current $c node in to the function is to prevent an entire valid document from being outputted, and because the ownerDocument is what we are looping through - we need to get one child at a time, with no children left behind. (sorry, it's late, couldn't resist.)
Now, after either option, you can do with $divString what you will. I hope this has helped explain the process to you and hopefully you walk away with a better understanding of what is going on instead of rote copying of code just because it works. ^^

you can use my custom function to remove extra div from content
$html_string = '<div id="inter">
SOME THINGS IN MY DIV
</div>';
// custom function
function DOMgetinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
your code will like
$dom = new DOMDocument;
$dom->loadHTML($html_string);
$divs = $dom->getElementsByTagName('div');
$innerHTML_contents = DOMgetinnerHTML($divs->item(0));
echo $innerHTML_contents
and your output will be
SOME THINGS IN MY DIV

you can use xpath
$xpath = new DOMXPath($xml);
foreach($xpath->query('//div[#id="inter"]/*') as $node)
{
$node->nodeValue
}
or simplu you can edit your code. see here
$main = $dom->getElementById('inter');
echo $main->nodeValue

How do I assemble pieces of HTML into a DOMDocument?

It appears that loadHTML and loadHTMLFile for a files representing sections of an HTML document seem to fill in html and body tags for each section, as revealed when I output with the following:
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$elements = $doc->getElementsByTagName('*');
if( !is_null($elements) ) {
foreach( $elements as $element ) {
echo "<br/>". $element->nodeName. ": ";
$nodes = $element->childNodes;
foreach( $nodes as $node ) {
echo $node->nodeValue. "\n";
}
}
}
Since I plan to assemble these parts into the larger document within my own code, and I've been instructed to use DOMDocument to do it, what can I do to prevent this behavior?

This is part of several modifications the HTML parser module of libxml makes to the document in order to work with broken HTML. It only occurs when using loadHTML and loadHTMLFile on partial markup. If you know the partial is valid X(HT)ML, use load and loadXML instead.
You could use
$doc->saveXml($doc->getElementsByTagName('body')->item(0));
to dump the outerHTML of the body element, e.g. <body>anything else</body> and strip the body element with str_replace or extract the inner html with substr.
$html = '<p>I am a fragment</p>';
$dom = new DOMDocument;
$dom->loadHTML($html); // added html and body tags
echo substr(
$dom->saveXml(
$dom->getElementsByTagName('body')->item(0)
),
6, -7
);
// <p>I am a fragment</p>
Note that this will use XHTML compliant markup, so <br> would become <br/>. As of PHP 5.3.5, there is no way to pass a node to saveHTML(). A bug request has been filed.

The closest you can get is to use the DOMDocumentFragment.
Then you can do:
$doc = new DOMDocument();
...
$f = $doc->createDocumentFragment();
$f->appendXML("<foo>text</foo><bar>text2</bar>");
$someElement->appendChild($f);
However, this expects XML, not HTML.
In any case, I think you're creating an artificial problem. Since you know the behavior is to create the html and body tags you can just extract the elements in the file from within the body tag and then import the, to the DOMDocument where you're assembling the final file. See DOMDocument::importNode.

How to remove an HTML element using the DOMDocument class

Is there a way to remove a HTML element by using the DOMDocument class?

In addition to Dave Morgan's answer you can use DOMNode::removeChild to remove child from list of children:
Removing a child by tag name
//The following example will delete the table element of an HTML content.
$dom = new DOMDocument();
//avoid the whitespace after removing the node
$dom->preserveWhiteSpace = false;
//parse html dom elements
$dom->loadHTML($html_contents);
//get the table from dom
if($table = $dom->getElementsByTagName('table')->item(0)) {
//remove the node by telling the parent node to remove the child
$table->parentNode->removeChild($table);
//save the new document
echo $dom->saveHTML();
}
Removing a child by class name
//same beginning
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html_contents);
//use DomXPath to find the table element with your class name
$xpath = new DomXPath($dom);
$classname='MyTableName';
$xpath_results = $xpath->query("//table[contains(#class, '$classname')]");
//get the first table from XPath results
if($table = $xpath_results->item(0)){
//remove the node the same way
$table ->parentNode->removeChild($table);
echo $dom->saveHTML();
}
Resources
http://us2.php.net/manual/en/domnode.removechild.php
How to delete element with DOMDocument?
How to get full HTML from DOMXPath::query() method?

http://us2.php.net/manual/en/domnode.removechild.php
DomDocument is a DomNode.. You can just call remove child and you should be fine.
EDIT: Just noticed you were probably talking about the page you are working with currently. Don't know if DomDocument would work. You may wanna look to use javascript at that point (if its already been served up to the client)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Remove HTML Tag using DOMDocument - php

Related

How do i form this replacement with regular expression

Strip out div with class and keep all other html within

PHP DOMDocument, retrieve just content of a div, without div tag

How do I assemble pieces of HTML into a DOMDocument?

How to remove an HTML element using the DOMDocument class

Categories

Resources