How to saveHTML of DOMDocument without HTML wrapper?

In the function below, I'm struggling to output the DOMDocument without it prepending the XML, HTML, body and p tag wrappers to the content. The suggested fix:
$postarray['post_content'] = $d->saveXML($d->getElementsByTagName('p')->item(0));
Only works when the content has no block level elements inside it. However, when it does, as in the example below with the h1 element, the resulting output from saveXML is truncated to...
<p>If you like</p>
I've been pointed to this post as a possible workaround, but I can't understand how to implement it into this solution (see commented out attempts below).
Any suggestions?
function rseo_decorate_keyword($postarray) {
global $post;
$keyword = "Jasmine Tea";
$content = "If you like <h1>jasmine tea</h1> you will really like it with Jasmine Tea flavors. This is the last ocurrence of the phrase jasmine tea within the content. If there are other instances of the keyword jasmine tea within the text what happens to jasmine tea.";
$d = new DOMDocument();
@$d->loadHTML($content);
$x = new DOMXpath($d);
$count = $x->evaluate("count(//text()[contains(translate(., 'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz'), '$keyword') and (ancestor::b or ancestor::strong)])");
if ($count > 0) return $postarray;
$nodes = $x->query("//text()[contains(translate(., 'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz'), '$keyword') and not(ancestor::h1) and not(ancestor::h2) and not(ancestor::h3) and not(ancestor::h4) and not(ancestor::h5) and not(ancestor::h6) and not(ancestor::b) and not(ancestor::strong)]");
if ($nodes && $nodes->length) {
$node = $nodes->item(0);
// Split just before the keyword
$keynode = $node->splitText(strpos($node->textContent, $keyword));
// Split after the keyword
$node->nextSibling->splitText(strlen($keyword));
// Replace keyword with <strong>keyword</strong>
$replacement = $d->createElement('strong', $keynode->textContent);
$keynode->parentNode->replaceChild($replacement, $keynode);
}
$postarray['post_content'] = $d->saveXML($d->getElementsByTagName('p')->item(0));
// $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('body')->item(1));
// $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('body')->childNodes);
return $postarray;
}

All of these answers are now wrong, because as of PHP 5.4 and Libxml 2.6 loadHTML has an $options parameter which instructs Libxml how it should parse the content.
Therefore, if we load the HTML with these options
$html->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
when doing saveHTML() there will be no doctype, no <html>, and no <body>.
LIBXML_HTML_NOIMPLIED turns off the automatic adding of implied html/body elements
LIBXML_HTML_NODEFDTD prevents a default doctype being added when one is not found.
Full documentation about Libxml parameters is here
(Note that loadHTML docs say that Libxml 2.6 is needed, but LIBXML_HTML_NODEFDTD is only available in Libxml 2.7.8 and LIBXML_HTML_NOIMPLIED is available in Libxml 2.7.7)
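As a minimal sketch of the flags in use: the helper name saveSingleRootFragment is hypothetical, and the example assumes the input already has exactly one root element, since other answers here report that LIBXML_HTML_NOIMPLIED can rearrange fragments with several top-level elements.

```php
<?php
// Hypothetical helper: parse a fragment that has ONE root element and
// serialize it back without doctype, <html> or <body>.
function saveSingleRootFragment(string $fragment): string
{
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // saveHTML() ends the document with a newline; trim it off.
    return trim($dom->saveHTML());
}

echo saveSingleRootFragment('<div><p>First</p><p>Second</p></div>'), "\n";
```

If the content has no single root, wrapping it in a container element before loading is one way to satisfy that assumption.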

Just remove the nodes directly after loading the document with loadHTML():
# remove <!DOCTYPE
$doc->removeChild($doc->doctype);
# remove <html><body></body></html>
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);

The issue with the top answer is that LIBXML_HTML_NOIMPLIED is unstable.
It can reorder elements (particularly, moving the top element's closing tag to the bottom of the document), add random p tags, and perhaps a variety of other issues[1]. It may remove the html and body tags for you, but at the cost of unstable behavior. In production, that's a red flag. In short:
Don't use LIBXML_HTML_NOIMPLIED. Instead, use substr.
Think about it. The lengths of <html><body> and </body></html> are fixed and at both ends of the document - their sizes never change, and neither do their positions. This allows us to use substr to cut them away:
$dom = new domDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
echo substr($dom->saveHTML(), 12, -15); // the star of this operation
(THIS IS NOT THE FINAL SOLUTION HOWEVER! See below for the complete answer, keep reading for context)
We cut 12 away from the start of the document because <html><body> = 12 characters (<<>>+html+body = 4+4+4), and we go backwards and cut 15 off the end because \n</body></html> = 15 characters (\n+//+<<>>+body+html = 1 + 2 + 4 + 4 + 4)
Notice that I still use LIBXML_HTML_NODEFDTD to omit the !DOCTYPE. First, this simplifies the substr removal of the HTML/BODY tags. Second, we don't remove the doctype with substr because we don't know if the 'default doctype' will always be something of a fixed length. But, most importantly, LIBXML_HTML_NODEFDTD stops the DOM parser from applying a non-HTML5 doctype to the document, which at least prevents the parser from treating elements it doesn't recognize as loose text.
We know for a fact that the HTML/BODY tags are of fixed lengths and positions, and we know that constants like LIBXML_HTML_NODEFDTD are never removed without some type of deprecation notice, so the above method should roll well into the future, BUT...
...the only caveat is that the DOM implementation could change the way in which the HTML/BODY tags are placed within the document, for instance by removing the newline at the end of the document, adding spaces between the tags, or adding newlines.
This can be remedied by searching for the positions of the opening and closing body tags and using those offsets as the lengths to trim off. We use strpos and strrpos to find the offsets from the front and back, respectively:
$dom = new domDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
$trim_off_front = strpos($dom->saveHTML(),'<body>') + 6;
// PositionOf<body> + 6 = Cutoff offset after '<body>'
// 6 = Length of '<body>'
$trim_off_end = (strrpos($dom->saveHTML(),'</body>')) - strlen($dom->saveHTML());
// ^ PositionOf</body> - LengthOfDocument = Relative-negative cutoff offset before '</body>'
echo substr($dom->saveHTML(), $trim_off_front, $trim_off_end);
In closing, a repeat of the final, future-proof answer:
$dom = new domDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
$trim_off_front = strpos($dom->saveHTML(),'<body>') + 6;
$trim_off_end = (strrpos($dom->saveHTML(),'</body>')) - strlen($dom->saveHTML());
echo substr($dom->saveHTML(), $trim_off_front, $trim_off_end);
No doctype, no html tag, no body tag. We can only hope the DOM parser will receive a fresh coat of paint soon and we can more directly eliminate these unwanted tags.
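The same idea can be sketched as a small function that serializes the document only once and falls back to the untrimmed output if the body tags are not found. The name bodyInnerHtml is hypothetical, and the sketch assumes the serializer writes the tags exactly as `<body>` and `</body>` (no attributes on the body element).

```php
<?php
// Hypothetical helper: return whatever sits between <body> and </body>
// in the serialized document.
function bodyInnerHtml(string $html): string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $saved = $dom->saveHTML();          // serialize once, reuse the string
    $start = strpos($saved, '<body>');  // first opening tag
    $end   = strrpos($saved, '</body>'); // last closing tag

    if ($start === false || $end === false) {
        return $saved; // unexpected layout: return the output untouched
    }

    $start += strlen('<body>');
    return substr($saved, $start, $end - $start);
}

echo bodyInnerHtml('<p>First</p><p>Second</p>'), "\n";
```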

Use saveXML() instead, and pass the documentElement as an argument to it.
$innerHTML = '';
foreach ($document->getElementsByTagName('p')->item(0)->childNodes as $child) {
$innerHTML .= $document->saveXML($child);
}
echo $innerHTML;
http://php.net/domdocument.savexml

use DOMDocumentFragment
$html = 'what you want';
$doc = new DomDocument();
$fragment = $doc->createDocumentFragment();
$fragment->appendXML($html);
$doc->appendChild($fragment);
echo $doc->saveHTML();

A neat trick is to use loadXML and then saveHTML. The html and body tags are inserted at the load stage, not the save stage.
$dom = new DOMDocument;
$dom->loadXML('<p>My DOMDocument contents are here</p>');
echo $dom->saveHTML();
NB that this is a bit hacky and you should use Jonah's answer if you can get it to work.
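One thing to keep in mind with this trick is that loadXML only accepts well-formed XML with a single root, so typical HTML such as an unclosed `<br>` is rejected. A rough sketch of guarding against that follows; the function name xmlFragmentToHtml and the temporary `<div>` wrapper are assumptions made for illustration.

```php
<?php
// Hypothetical helper: parse a fragment as XML inside a temporary <div>,
// then serialize the wrapper's children as HTML.
// Returns null when the fragment is not well-formed XML.
function xmlFragmentToHtml(string $fragment): ?string
{
    $previous = libxml_use_internal_errors(true); // suppress parse warnings
    $dom = new DOMDocument();
    $ok = $dom->loadXML('<div>' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        return null; // e.g. unclosed tags or undefined entities
    }

    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child); // read-only loop, nodes are not moved
    }
    return $out;
}

var_dump(xmlFragmentToHtml('<p>one</p><p>two</p>'));
var_dump(xmlFragmentToHtml('<p>unclosed<br></p>'));
```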

It's 2017, and for this 2011 Question I don't like any of the answers.
Lots of regex, big classes, loadXML etc...
Easy solution which solves the known problems:
$dom = new DOMDocument();
$dom->loadHTML( '<html><body>'.mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8').'</body></html>' , LIBXML_HTML_NODEFDTD);
$html = substr(trim($dom->saveHTML()),12,-14);
Easy, Simple, Solid, Fast. This code will work regarding HTML tags and encoding like:
$html = '<p>äöü</p><p>ß</p>';
If anybody finds an error , please tell, I will use this myself.
Edit, Other valid options that work without errors (very similar to ones already given):
@$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$saved_dom = trim($dom->saveHTML());
$start_dom = stripos($saved_dom,'<body>')+6;
$html = substr($saved_dom,$start_dom,strripos($saved_dom,'</body>') - $start_dom );
You could add the body tag yourself to prevent any surprises in the future.
Third option:
$mock = new DOMDocument;
$body = $dom->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child){
$mock->appendChild($mock->importNode($child, true));
}
$html = trim($mock->saveHTML());

I'm a bit late to the party, but I didn't want to keep a method I've found to myself. First of all, I've got the right versions for loadHTML() to accept these nice options, but LIBXML_HTML_NOIMPLIED didn't work on my system. Users also report problems with the parser (for example here and here).
The solution I created actually is pretty simple.
HTML to be loaded is put in a <div> element so it has a container containing all nodes to be loaded.
Then this container element is removed from the document (but the DOMElement of it still exists).
Then all direct children from the document are removed. This includes any added <html>, <head> and <body> tags (effectively LIBXML_HTML_NOIMPLIED option) as well as the <!DOCTYPE html ... loose.dtd"> declaration (effectively LIBXML_HTML_NODEFDTD).
Then all direct children of the container are added to the document again and it can be output.
$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';
$doc = new DOMDocument();
$doc->loadHTML("<div>$str</div>");
$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);
while ($doc->firstChild) {
$doc->removeChild($doc->firstChild);
}
while ($container->firstChild ) {
$doc->appendChild($container->firstChild);
}
$htmlFragment = $doc->saveHTML();
XPath works as usual, just take care that there are multiple document elements now, so not a single root node:
$xpath = new DOMXPath($doc);
foreach ($xpath->query('/p') as $element)
{ # ^- note the single slash "/"
# ... each of the two <p> elements
}
PHP 5.4.36-1+deb.sury.org~precise+2 (cli) (built: Dec 21 2014 20:28:53)

Okay I found a more elegant solution, but it's just tedious:
$d = new DOMDocument();
@$d->loadHTML($yourcontent);
...
// do your manipulation, processing, etc of it blah blah blah
...
// then to save, do this
$x = new DOMXPath($d);
$everything = $x->query("//body/*"); // retrieves all elements inside body tag
if ($everything->length > 0) { // check if it retrieved anything in there
$output = '';
foreach ($everything as $thing) {
$output .= $d->saveXML($thing);
}
echo $output; // voila, no more annoying html wrappers or body tag
}
Alright, hopefully this does not omit anything and helps somebody?
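One thing worth checking with an element-only XPath such as body/* is whether text that sits directly inside the body survives, since `*` matches elements only while `node()` also matches text nodes. A small sketch comparing the two; the helper name serializeBody and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: serialize every node matched by an XPath expression.
function serializeBody(DOMDocument $dom, string $xpathExpr): string
{
    $xpath = new DOMXPath($dom);
    $out = '';
    foreach ($xpath->query($xpathExpr) as $node) {
        $out .= $dom->saveHTML($node);
    }
    return $out;
}

libxml_use_internal_errors(true); // keep parser warnings quiet
$dom = new DOMDocument();
// "tail text" is not wrapped in any element, so it becomes a text node in <body>.
$dom->loadHTML('<p>First</p> tail text <h1>Heading</h1>');

$elementsOnly = serializeBody($dom, '//body/*');      // elements only
$allNodes     = serializeBody($dom, '//body/node()'); // elements and text nodes

echo $elementsOnly, "\n";
echo $allNodes, "\n";
```

If the content can contain loose text between block elements, as in the question's example, the `node()` form is probably the safer choice.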

None of the other solutions at the time of this writing (June, 2012) were able to completely meet my needs, so I wrote one which handles the following cases:
Accepts plain-text content which has no tags, as well as HTML content.
Does not append any tags (including <doctype>, <xml>, <html>, <body>, and <p> tags)
Leaves anything wrapped in <p> alone.
Leaves empty text alone.
So here is a solution which fixes those issues:
class DOMDocumentWorkaround
{
/**
* Convert a string which may have HTML components into a DOMDocument instance.
*
* @param string $html - The HTML text to turn into a DOMDocument.
* @return \DOMDocument - A DOMDocument created from the given html.
*/
public static function getDomDocumentFromHtml($html)
{
$domDocument = new DOMDocument();
// Wrap the HTML in <div> tags because loadXML expects everything to be within some kind of tag.
// LIBXML_NOERROR and LIBXML_NOWARNING mean this will fail silently and return an empty DOMDocument if it fails.
$domDocument->loadXML('<div>' . $html . '</div>', LIBXML_NOERROR | LIBXML_NOWARNING);
return $domDocument;
}
/**
* Convert a DOMDocument back into an HTML string, which is reasonably close to what we started with.
*
* @param \DOMDocument $domDocument
* @return string - The resulting HTML string
*/
public static function getHtmlFromDomDocument($domDocument)
{
// Convert the DOMDocument back to a string.
$xml = $domDocument->saveXML();
// Strip out the XML declaration, if one exists
$xmlDeclaration = "<?xml version=\"1.0\"?>\n";
if (substr($xml, 0, strlen($xmlDeclaration)) == $xmlDeclaration) {
$xml = substr($xml, strlen($xmlDeclaration));
}
// If the original HTML was empty, loadXML collapses our <div></div> into <div/>. Remove it.
if ($xml == "<div/>\n") {
$xml = '';
}
else {
// Remove the opening <div> tag we previously added, if it exists.
$openDivTag = "<div>";
if (substr($xml, 0, strlen($openDivTag)) == $openDivTag) {
$xml = substr($xml, strlen($openDivTag));
}
// Remove the closing </div> tag we previously added, if it exists.
$closeDivTag = "</div>\n";
$closeChunk = substr($xml, -strlen($closeDivTag));
if ($closeChunk == $closeDivTag) {
$xml = substr($xml, 0, -strlen($closeDivTag));
}
}
return $xml;
}
}
I also wrote some tests which would live in that same class:
public static function testHtmlToDomConversions($content)
{
// test that converting the $content to a DOMDocument and back does not change the HTML
if ($content !== self::getHtmlFromDomDocument(self::getDomDocumentFromHtml($content))) {
echo "Failed\n";
}
else {
echo "Succeeded\n";
}
}
public static function testAll()
{
self::testHtmlToDomConversions('<p>Here is some sample text</p>');
self::testHtmlToDomConversions('<div>Lots of <div>nested <div>divs</div></div></div>');
self::testHtmlToDomConversions('Normal Text');
self::testHtmlToDomConversions(''); //empty
}
You can check that it works for yourself. DOMDocumentWorkaround::testAll() prints this:
Succeeded
Succeeded
Succeeded
Succeeded

I am struggling with this on RHEL7 running PHP 5.6.25 and LibXML 2.9. (Old stuff in 2018, I know, but that is Red Hat for you.)
I have found that the much upvoted solution suggested by Alessandro Vendruscolo breaks the HTML by rearranging tags. I.e.:
<p>First.</p><p>Second.</p>
becomes:
<p>First.<p>Second.</p></p>
This goes for both the options he suggests you use: LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD.
The solution suggested by Alex goes halfway to solving it, but it does not work if <body> has more than one child node.
The solution that works for me is the following:
First, to load the DOMDocument, I use:
$doc = new DOMDocument();
$doc->loadHTML($content);
To save the document after massaging the DOMDocument, I use:
// remove <!DOCTYPE
$doc->removeChild($doc->doctype);
$content = $doc->saveHTML();
// remove <html><body></body></html>
$content = str_replace('<html><body>', '', $content);
$content = str_replace('</body></html>', '', $content);
I am the first to agree that this is not a very elegant solution, but it works.

Use this function
$layout = preg_replace('~<(?:!DOCTYPE|/?(?:html|head|body))[^>]*>\s*~i', '', $layout);
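A caveat with the pattern as written is that `head` followed by `[^>]*` also matches `<header>`, so HTML5 header tags would be stripped too. The sketch below compares it with a variant that adds a word boundary; the function name stripWrapperTags is hypothetical, and the variant is a suggested tweak rather than part of the original answer.

```php
<?php
// Pattern from the answer above.
const ORIGINAL_PATTERN = '~<(?:!DOCTYPE|/?(?:html|head|body))[^>]*>\s*~i';
// Variant with \b so that only the exact tag names html, head and body match.
const STRICT_PATTERN   = '~<(?:!DOCTYPE|/?(?:html|head|body)\b)[^>]*>\s*~i';

// Hypothetical helper: strip doctype, html, head and body tags from a string.
function stripWrapperTags(string $layout, string $pattern = STRICT_PATTERN): string
{
    return preg_replace($pattern, '', $layout);
}

$layout = "<!DOCTYPE html>\n<html><body><header>Top</header><p>Hi</p></body></html>\n";

echo stripWrapperTags($layout, ORIGINAL_PATTERN), "\n"; // <header> tags are lost
echo stripWrapperTags($layout), "\n";                   // <header> tags are kept
```

Either way, only the tags themselves are removed; anything inside a `<head>` element, such as a title, stays in the string.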

Much like other members, I first revelled in the simplicity and awesome power of @Alessandro Vendruscolo's answer. The ability to simply pass some flag constants to loadHTML() seemed too good to be true. For me it was. I have the correct versions of both LibXML and PHP; however, no matter what, it would still add the HTML tag to the node structure of the Document object.
My solution worked way better than using the...
$html->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
Flags or....
# remove <!DOCTYPE
$doc->removeChild($doc->firstChild);
# remove <html><body></body></html>
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
Node Removal, which gets messy without a structured order in the DOM. Again code fragments have no way to predetermine DOM structure.
I started this journey wanting a simple way to do DOM traversal how JQuery does it or at least in some fashion that had a structured data set either singly linked, doubly linked or tree'd node traversal. I didn't care how as long as I could parse a string the way HTML does and also have the amazing power of the node entity class properties to use along the way.
So far DOMDocument Object has left me wanting... As with many other programmers it seems... I know I have seen a lot of frustration in this question so since I FINALLY.... (after roughly 30 hours of try and fail type testing) I have found a way to get it all. I hope this helps someone...
First off, I am cynical of EVERYTHING... lol...
I would have gone a lifetime before agreeing with anyone that a third party class is in any way needed in this use case. I very much was and am NOT a fan of using any third party class structure; however, I stumbled onto a great parser. (About 30 times in Google before I gave in, so don't feel alone if you avoided it because it looked lame or unofficial in any way...)
If you are using code fragments and need the code clean and unaffected by the parser in any way, without extra tags being added, then use simplePHPParser.
It's amazing and acts a lot like JQuery. I am not often impressed, but this class makes use of a lot of good tools and I have had no parsing errors as of yet. I am a huge fan of being able to do what this class does.
You can find its files to download here, its startup instructions here, and its API here. I highly recommend using this class with its simple methods that can do a .find(".className") the same way a JQuery find method would be used or even familiar methods such as getElementByTagName() or getElementById()...
When you save out a node tree in this class it doesn't add anything at all. You can simply say $doc->save(); and it outputs the entire tree to a string without any fuss.
I will now be using this parser for all non-capped-bandwidth projects in the future.

If the flags solution suggested by Alessandro Vendruscolo doesn't work, you may try this:
$dom = new DOMDocument();
$dom->loadHTML($content);
//do your stuff..
$finalHtml = '';
$bodyTag = $dom->documentElement->getElementsByTagName('body')->item(0);
foreach ($bodyTag->childNodes as $rootLevelTag) {
$finalHtml .= $dom->saveHTML($rootLevelTag);
}
echo $finalHtml;
$bodyTag will contain your fully processed HTML code without all those HTML wrappers, except for the <body> tag, which is the root of your content. You can then use a regex or a trim function to remove it from the final string (after saveHTML) or, as in the code above, iterate over all of its children, appending their content to a temporary variable $finalHtml and returning it (which I believe is safer).

I came across this topic while looking for a way to remove the HTML wrapper. Using LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD works great, but I had a problem with UTF-8. After much effort I found a solution. I post it below for anyone who has the same problem.
The problem is caused by <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
The problem:
$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$dom->saveHTML();
Solution 1:
$dom->loadHTML(mb_convert_encoding($document, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$dom->saveHTML($dom->documentElement);
Solution 2:
$dom->loadHTML($document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
utf8_decode($dom->saveHTML($dom->documentElement));
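On newer PHP versions, note that passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated as of PHP 8.2. A possible alternative, sketched here under the assumption that the input is valid UTF-8, is to turn non-ASCII characters into numeric entities with mb_encode_numericentity before loading. The function name loadUtf8Html is hypothetical.

```php
<?php
// Hypothetical helper: load UTF-8 HTML without relying on a <meta> tag.
function loadUtf8Html(string $html): DOMDocument
{
    // Map every code point from U+0080 upward to a numeric entity (&#NNN;),
    // leaving ASCII markup untouched.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = loadUtf8Html('<p>äöü</p><p>ß</p>');
$paragraphs = $dom->getElementsByTagName('p');

// The DOM stores text as UTF-8 internally, so textContent gives the characters back.
echo $paragraphs->item(0)->textContent, "\n";
echo $paragraphs->item(1)->textContent, "\n";
```

How the characters are written by saveHTML() (raw UTF-8 or entities) can still vary, so decoding the output with html_entity_decode may be needed depending on the use case.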

Adding the <meta> tag will trigger the fixing behavior of DOMDocument. The good part is that you don't need to add that tag at all. If you want to use an encoding of your choosing, just pass it as a constructor argument.
http://php.net/manual/en/domdocument.construct.php
$doc = new DOMDocument('1.0', 'UTF-8');
$node = $doc->createElement('div', 'Hello World');
$doc->appendChild($node);
echo $doc->saveHTML();
Output
<div>Hello World</div>
Thanks to @Bart

I had this requirement, too, and liked the solution posted by Alex above. There are a couple of issues, though: if the <body> element contains more than one child element, the resulting document will contain only the first child element of <body>, not all of them. Also, I needed the stripping to happen conditionally, only when the document actually has the HTML headings. So I refined it as follows. Instead of removing <body>, I transformed it into a <div>, and stripped out the XML declaration and <html>.
function strip_html_headings($html_doc)
{
if (is_null($html_doc))
{
// might be better to issue an exception, but we silently return
return;
}
// remove <!DOCTYPE
if (!is_null($html_doc->firstChild) &&
$html_doc->firstChild->nodeType == XML_DOCUMENT_TYPE_NODE)
{
$html_doc->removeChild($html_doc->firstChild);
}
if (!is_null($html_doc->firstChild) &&
strtolower($html_doc->firstChild->tagName) == 'html' &&
!is_null($html_doc->firstChild->firstChild) &&
strtolower($html_doc->firstChild->firstChild->tagName) == 'body')
{
// we have 'html/body' - replace both nodes with a single "div"
$div_node = $html_doc->createElement('div');
// copy all the child nodes of 'body' to 'div'
foreach ($html_doc->firstChild->firstChild->childNodes as $child)
{
// deep copies each child node, with attributes
$child = $html_doc->importNode($child, true);
// adds node to 'div'
$div_node->appendChild($child);
}
// replace 'html/body' with 'div'
$html_doc->removeChild($html_doc->firstChild);
$html_doc->appendChild($div_node);
}
}

I have PHP 5.3 and the answers here did not work for me.
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild); replaced the whole document with only the first child. I had many paragraphs and only the first was being saved, but the solution gave me a good starting point to write something without regex. I left some comments, and I am pretty sure this can be improved, but if someone has the same problem as me it can be a good starting point.
function extractDOMContent($doc){
# remove <!DOCTYPE
$doc->removeChild($doc->doctype);
// lets get all children inside the body tag
foreach ($doc->firstChild->firstChild->childNodes as $k => $v) {
if($k !== 0){ // don't store the first element since that one will be used to replace the html tag
$doc->appendChild( clone($v) ); // appending element to the root so we can remove the first element and still have all the others
}
}
// replace the body tag with the first children
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
return $doc;
}
Then we could use it like this:
$doc = new DOMDocument();
$doc->encoding = 'UTF-8';
$doc->loadHTML('<p>Some html here</p><p>And more html</p><p>and some html</p>');
$doc = extractDOMContent($doc);
Note that appendChild accepts a DOMNode, so we do not need to create new elements; we can just reuse existing ones that implement DOMNode, such as DOMElement. This can be important to keep code "sane" when manipulating multiple HTML/XML documents.

I faced 3 problems with the DOMDocument class.
1- This class loads HTML with ISO encoding, so UTF-8 characters do not show correctly in the output.
2- Even if we give the LIBXML_HTML_NOIMPLIED flag to the loadHTML method, the input will not be parsed correctly unless it has a single root tag.
3- This class considers HTML5 tags invalid.
So I've overridden this class to solve these problems and changed some of the methods.
class DOMEditor extends DOMDocument
{
/**
* Temporary wrapper tag , It should be an unusual tag to avoid problems
*/
protected $tempRoot = 'temproot';
public function __construct($version = '1.0', $encoding = 'UTF-8')
{
//turn off html5 errors
libxml_use_internal_errors(true);
parent::__construct($version, $encoding);
}
public function loadHTML($source, $options = LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD)
{
// this is a bitwise check if LIBXML_HTML_NOIMPLIED is set
if ($options & LIBXML_HTML_NOIMPLIED) {
// it loads the content with a temporary wrapper tag and utf-8 encoding
parent::loadHTML("<{$this->tempRoot}>" . mb_convert_encoding($source, 'HTML', 'UTF-8') . "</{$this->tempRoot}>", $options);
} else {
// it loads the content with utf-8 encoding and default options
parent::loadHTML(mb_convert_encoding($source, 'HTML', 'UTF-8'), $options);
}
}
private function unwrapTempRoot($output)
{
if ($this->firstChild->nodeName === $this->tempRoot) {
return substr($output, strlen($this->tempRoot) + 2, -strlen($this->tempRoot) - 4);
}
return $output;
}
public function saveHTML(DOMNode $node = null)
{
$html = html_entity_decode(parent::saveHTML($node));
if (is_null($node)) {
$html = $this->unwrapTempRoot($html);
}
return $html;
}
public function saveXML(DOMNode $node = null, $options = null)
{
if (is_null($node)) {
return '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' . PHP_EOL . $this->saveHTML();
}
return parent::saveXML($node);
}
}
Now I'm using DOMEditor instead of DOMDocument and it has worked well for me so far:
$editor = new DOMEditor();
$editor->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// works like a charm!
echo $editor->saveHTML();

My universal solution independent of how the HTML was loaded:
function getNodeHtml(DOMNode $node, $outer = true) {
$doc = new DOMDocument();
$node = $node instanceof DOMDocument ? $node->documentElement : $node;
foreach(($outer ? array($node) : $node->childNodes) as $n) {
$doc->appendChild($doc->importNode($n->cloneNode(true), true));
}
return $doc->saveHTML();
}
Sample results:
<p>foo bar </p> ━▶ <p>foo bar </p>
<p>foo</p><p>bar</p> ━▶ <p>foo</p><p>bar</p>
<p>foo </p> <p> bar</p> ━▶ <p>foo </p> <p> bar</p>
Hello! ━▶ Hello!
<html><body><b>foo</b></body></html> ━▶ <html><body><b>foo</b></body></html>

After reading lots of code about this topic, I ended up with the following solution that works very well for me and is easy to understand.
It fixes unwanted Doctype and <html> and <body> as well as encoding issues.
This code assumes that $htmlContent is encoded in utf-8.
$htmlContent = "<h1>This is a heading</h1><p>This is a paragraph</p>";
// 1.) Load the html
$dom = new DOMDocument();
$dom->loadHTML("<meta http-equiv='Content-Type' content='charset=utf-8' /><div>$htmlContent</div>");
// 2.) Do your logic
$dom->getElementsByTagName('h1')[0]->setAttribute('class', 'happy');
// 3.) Render the html
$wrapperNode = $dom->getElementsByTagName('div')[0];
$renderedHtml = $dom->saveHTML($wrapperNode);
// If you want to keep the wrapper div
echo $renderedHtml;
// Or remove the wrapper <div>
echo substr(trim($renderedHtml), 5, -6);
The key takeaways are:
loadHTML assumes content to be iso-8859-1, if this is not the case, you need to add encoding information.
Wrap your html code in a div and render just this div, you can remove it with substring if you don’t want to keep it.

I may be too late, but maybe somebody (like me) still has this issue. None of the above worked for me, because $dom->loadHTML also closes open tags, in addition to adding the html and body tags.
So adding a <div> element does not work for me, because I sometimes have 3-4 unclosed divs in the HTML piece.
My solution:
1.) Add marker to cut, then load the html piece
$html_piece = "[MARK]".$html_piece."[/MARK]";
$dom->loadHTML($html_piece);
2.) do whatever you want with the document
3.) save html
$new_html_piece = $dom->saveHTML();
4.) before you return it, remove the <p></p> tags from the marker; strangely, they only appear around [MARK] but not around [/MARK]...!?
$new_html_piece = preg_replace( "/<p[^>]*?>(\[MARK\]|\s)*?<\/p>/", "[MARK]" , $new_html_piece );
5.) remove everything before and after marker
$pattern_contents = '{\[MARK\](.*?)\[\/MARK\]}is';
if (preg_match($pattern_contents, $new_html_piece, $matches)) {
$new_html_piece = $matches[1];
}
6.) return it
return $new_html_piece;
It would be a lot easier if LIBXML_HTML_NOIMPLIED worked for me. It should, but it does not. PHP 5.4.17, libxml version 2.7.8.
I find it really strange: I use the HTML DOM parser and then, to fix this "thing", I have to use regex... The whole point was not to use regex ;)

I came upon this issue as well.
Unfortunately, I did not feel comfortable using any of the solutions provided in this thread, so I went looking for one that would satisfy me.
Here's what I came up with, and it works without issues:
$domxpath = new \DOMXPath($domDocument);
/** @var \DOMNodeList $subset */
$subset = $domxpath->query('descendant-or-self::body/*');
$html = '';
foreach ($subset as $domElement) {
/** @var $domElement \DOMElement */
$html .= $domDocument->saveHTML($domElement);
}
In essence it works in a similar way to most of the solutions provided here, but instead of doing manual labor it uses an XPath selector to select all the elements within the body and concatenates their HTML code.

My server has PHP 5.3 and can't be upgraded, so those options
LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
are not for me.
To solve this, I tell the saveXML function to print the body element and then just replace "body" with "div".
Here is my code, hope it helps someone:
<?php
$html = "your html here";
$tabContentDomDoc = new DOMDocument();
$tabContentDomDoc->loadHTML('<?xml encoding="UTF-8">'.$html);
$tabContentDomDoc->encoding = 'UTF-8';
$tabContentDomDocBody = $tabContentDomDoc->getElementsByTagName('body')->item(0);
if(is_object($tabContentDomDocBody)){
echo (str_replace("body","div",$tabContentDomDoc->saveXML($tabContentDomDocBody)));
}
?>
the utf-8 is for Hebrew support.

Alex's answer is correct, but it might cause the following error on empty nodes:
Argument 1 passed to DOMNode::removeChild() must be an instance of
DOMNode
Here comes my little mod:
$output = '';
$doc = new DOMDocument();
$doc->loadHTML($htmlString); //feed with html here
if (isset($doc->firstChild)) {
/* remove doctype */
$doc->removeChild($doc->firstChild);
/* remove html and body */
if (isset($doc->firstChild->firstChild->firstChild)) {
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
$output = trim($doc->saveHTML());
}
}
return $output;
Adding the trim() is also a good idea to remove whitespace.

For anyone using Drupal, there's a built in function to do this:
https://api.drupal.org/api/drupal/modules!filter!filter.module/function/filter_dom_serialize/7.x
Code for reference:
function filter_dom_serialize($dom_document) {
$body_node = $dom_document->getElementsByTagName('body')->item(0);
$body_content = '';
if ($body_node !== NULL) {
foreach ($body_node->getElementsByTagName('script') as $node) {
filter_dom_serialize_escape_cdata_element($dom_document, $node);
}
foreach ($body_node->getElementsByTagName('style') as $node) {
filter_dom_serialize_escape_cdata_element($dom_document, $node, '/*', '*/');
}
foreach ($body_node->childNodes as $child_node) {
$body_content .= $dom_document->saveXML($child_node);
}
return preg_replace('|<([^> ]*)/>|i', '<$1 />', $body_content);
}
else {
return $body_content;
}
}

You can use tidy with show-body-only:
$tidy = new tidy();
$htmlBody = $tidy->repairString($html, [
'indent' => true,
'output-xhtml' => true,
'show-body-only' => true
], 'utf8');
But remember: tidy removes some tags, such as Font Awesome icons (see: Problems Indenting HTML(5) with PHP).

This is the solution that helped me:
$content = str_replace(array('<html>','</html>') , '' , $doc->saveHTML());

#remove doctype tag
$doc->removeChild($doc->doctype);
#remove html & body tags
$html = $doc->getElementsByTagName('html')[0];
$body = $html->getElementsByTagName('body')[0];
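// appendChild moves the node out of <body>, which disturbs a foreach over
// the live childNodes list, so take the first child until none are left
while ($body->firstChild) {
$doc->appendChild($body->firstChild);
}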
$doc->removeChild($html);

This library makes it simple to traverse / modify the DOM and also takes care of removing the doctype / html wrappers for you:
https://github.com/sunra/php-simple-html-dom-parser

Related

How to retrieve string of resulting DOM markup after xpath and DOM operations (PHP)? [duplicate]

I'm the function below, I'm struggling to output the DOMDocument without it appending the XML, HTML, body and p tag wrappers before the output of the content. The suggested fix:
$postarray['post_content'] = $d->saveXML($d->getElementsByTagName('p')->item(0));
Only works when the content has no block level elements inside it. However, when it does, as in the example below with the h1 element, the resulting output from saveXML is truncated to...
<p>If you like</p>
I've been pointed to this post as a possible workaround, but I can't understand how to implement it into this solution (see commented out attempts below).
Any suggestions?
function rseo_decorate_keyword($postarray) {
global $post;
$keyword = "Jasmine Tea";
$content = "If you like <h1>jasmine tea</h1> you will really like it with Jasmine Tea flavors. This is the last ocurrence of the phrase jasmine tea within the content. If there are other instances of the keyword jasmine tea within the text what happens to jasmine tea.";
$d = new DOMDocument();
@$d->loadHTML($content);
$x = new DOMXpath($d);
$count = $x->evaluate("count(//text()[contains(translate(., 'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz'), '$keyword') and (ancestor::b or ancestor::strong)])");
if ($count > 0) return $postarray;
$nodes = $x->query("//text()[contains(translate(., 'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz'), '$keyword') and not(ancestor::h1) and not(ancestor::h2) and not(ancestor::h3) and not(ancestor::h4) and not(ancestor::h5) and not(ancestor::h6) and not(ancestor::b) and not(ancestor::strong)]");
if ($nodes && $nodes->length) {
$node = $nodes->item(0);
// Split just before the keyword
$keynode = $node->splitText(strpos($node->textContent, $keyword));
// Split after the keyword
$node->nextSibling->splitText(strlen($keyword));
// Replace keyword with <b>keyword</b>
$replacement = $d->createElement('strong', $keynode->textContent);
$keynode->parentNode->replaceChild($replacement, $keynode);
}
$postarray['post_content'] = $d->saveXML($d->getElementsByTagName('p')->item(0));
// $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('body')->item(1));
// $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('body')->childNodes);
return $postarray;
}

All of these answers are now wrong, because as of PHP 5.4 and Libxml 2.6, loadHTML has an $options parameter which tells Libxml how it should parse the content.
Therefore, if we load the HTML with these options
$html->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
when doing saveHTML() there will be no doctype, no <html>, and no <body>.
LIBXML_HTML_NOIMPLIED turns off the automatic adding of implied html/body elements
LIBXML_HTML_NODEFDTD prevents a default doctype being added when one is not found.
Full documentation about Libxml parameters is here
(Note that loadHTML docs say that Libxml 2.6 is needed, but LIBXML_HTML_NODEFDTD is only available in Libxml 2.7.8 and LIBXML_HTML_NOIMPLIED is available in Libxml 2.7.7)
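As a minimal sketch of the flags in use: the snippet below assumes PHP 5.4+ with a libxml build that supports both constants, and it deliberately feeds the parser a fragment with a single root element (the outer <div> is an assumption of this example, not something the flags require). The libxml_use_internal_errors() calls only keep parser warnings quiet.

```php
<?php
// Sample input with exactly one root element (assumed for this sketch).
$content = '<div><p>First.</p><p>Second.</p></div>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// With both flags set, no doctype, <html> or <body> should be added.
$output = $dom->saveHTML();
echo $output;
```

Fragments with several top-level elements are where other answers in this thread report trouble, so wrapping the input in one container element is a cautious default.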

Just remove the nodes directly after loading the document with loadHTML():
# remove <!DOCTYPE
$doc->removeChild($doc->doctype);
# remove <html><body></body></html>
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);

The issue with the top answer is that LIBXML_HTML_NOIMPLIED is unstable.
It can reorder elements (particularly, moving the top element's closing tag to the bottom of the document), add random p tags, and perhaps a variety of other issues[1]. It may remove the html and body tags for you, but at the cost of unstable behavior. In production, that's a red flag. In short:
Don't use LIBXML_HTML_NOIMPLIED. Instead, use substr.
Think about it. The lengths of <html><body> and </body></html> are fixed and at both ends of the document - their sizes never change, and neither do their positions. This allows us to use substr to cut them away:
$dom = new domDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
echo substr($dom->saveHTML(), 12, -15); // the star of this operation
(THIS IS NOT THE FINAL SOLUTION HOWEVER! See below for the complete answer, keep reading for context)
We cut 12 away from the start of the document because <html><body> = 12 characters (<<>>+html+body = 4+4+4), and we go backwards and cut 15 off the end because \n</body></html> = 15 characters (\n+//+<<>>+body+html = 1 + 2 + 4 + 4 + 4)
Notice that I still use LIBXML_HTML_NODEFDTD to omit the !DOCTYPE. First, this simplifies the substr removal of the HTML/BODY tags. Second, we don't remove the doctype with substr because we don't know if the 'default doctype' will always be something of a fixed length. But, most importantly, LIBXML_HTML_NODEFDTD stops the DOM parser from applying a non-HTML5 doctype to the document, which at least prevents the parser from treating elements it doesn't recognize as loose text.
We know for a fact that the HTML/BODY tags are of fixed lengths and positions, and we know that constants like LIBXML_HTML_NODEFDTD are never removed without some type of deprecation notice, so the above method should roll well into the future, BUT...
...the only caveat is that the DOM implementation could change the way the HTML/BODY tags are placed within the document, for instance by removing the newline at the end of the document, adding spaces between the tags, or adding newlines.
This can be remedied by searching for the positions of the opening and closing body tags and using those offsets as the lengths to trim off. We use strpos and strrpos to find the offsets from the front and back, respectively:
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
$out = $dom->saveHTML(); // serialize once and reuse the string
$trim_off_front = strpos($out, '<body>') + 6;
// PositionOf<body> + 6 = Cutoff offset after '<body>'
// 6 = Length of '<body>'
$trim_off_end = strrpos($out, '</body>') - strlen($out);
// ^ PositionOf</body> - LengthOfDocument = Relative-negative cutoff offset before '</body>'
echo substr($out, $trim_off_front, $trim_off_end);
In closing, a repeat of the final, future-proof answer:
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
$out = $dom->saveHTML();
$trim_off_front = strpos($out, '<body>') + 6;
$trim_off_end = strrpos($out, '</body>') - strlen($out);
echo substr($out, $trim_off_front, $trim_off_end);
No doctype, no html tag, no body tag. We can only hope the DOM parser will receive a fresh coat of paint soon and we can more directly eliminate these unwanted tags.
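The offset search can also be packaged as a small function. This is only a sketch: stripBodyWrapper is a hypothetical helper name, not part of DOMDocument, and it assumes the serialized string contains at most one real <body> element. Unlike the fixed "+ 6", it skips to the end of the opening tag so a <body> with attributes is handled too, and it returns the input unchanged when no body tags are found.

```php
<?php
/**
 * Hypothetical helper (not part of DOMDocument): return the markup between
 * <body ...> and </body> of a serialized document, or the input unchanged
 * when no body tags are found.
 */
function stripBodyWrapper(string $serialized): string
{
    $open = stripos($serialized, '<body');
    $close = strripos($serialized, '</body>');
    if ($open === false || $close === false) {
        return $serialized;
    }
    // Skip to the end of the opening tag, which may carry attributes.
    $start = strpos($serialized, '>', $open) + 1;
    return substr($serialized, $start, $close - $start);
}

$dom = new DOMDocument();
$dom->loadHTML('<p>First.</p><p>Second.</p>', LIBXML_HTML_NODEFDTD);
$inner = stripBodyWrapper($dom->saveHTML()); // saveHTML() is called only once
echo $inner;
```

Serializing once and passing the string around avoids the repeated saveHTML() calls, which matters on large documents.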

Use saveXML() instead, and pass the documentElement as an argument to it.
$innerHTML = '';
foreach ($document->getElementsByTagName('p')->item(0)->childNodes as $child) {
$innerHTML .= $document->saveXML($child);
}
echo $innerHTML;
http://php.net/domdocument.savexml

Use a DOMDocumentFragment. Note that appendXML() parses its argument as XML, so the markup must be well-formed:
$html = 'what you want';
$doc = new DomDocument();
$fragment = $doc->createDocumentFragment();
$fragment->appendXML($html);
$doc->appendChild($fragment);
echo $doc->saveHTML();

A neat trick is to use loadXML and then saveHTML. The html and body tags are inserted at the load stage, not the save stage.
$dom = new DOMDocument;
$dom->loadXML('<p>My DOMDocument contents are here</p>');
echo $dom->saveHTML();
NB that this is a bit hacky and you should use Jonah's answer if you can get it to work.

It's 2017, and for this 2011 Question I don't like any of the answers.
Lots of regex, big classes, loadXML etc...
Easy solution which solves the known problems:
$dom = new DOMDocument();
$dom->loadHTML( '<html><body>'.mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8').'</body></html>' , LIBXML_HTML_NODEFDTD);
$html = substr(trim($dom->saveHTML()),12,-14);
Easy, simple, solid, fast. This code handles HTML tags and non-ASCII characters, for example:
$html = '<p>äöü</p><p>ß</p>';
If anybody finds an error, please tell me; I will be using this myself.
Edit: other valid options that work without errors (very similar to ones already given):
@$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$saved_dom = trim($dom->saveHTML());
$start_dom = stripos($saved_dom,'<body>')+6;
$html = substr($saved_dom,$start_dom,strripos($saved_dom,'</body>') - $start_dom );
You could add the body tags yourself to guard against anything strange in the future.
Third option:
$mock = new DOMDocument;
$body = $dom->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child){
$mock->appendChild($mock->importNode($child, true));
}
$html = trim($mock->saveHTML());

I'm a bit late in the club but didn't want to not share a method I've found out about. First of all I've got the right versions for loadHTML() to accept these nice options, but LIBXML_HTML_NOIMPLIED didn't work on my system. Also users report problems with the parser (for example here and here).
The solution I created actually is pretty simple.
HTML to be loaded is put in a <div> element so it has a container containing all nodes to be loaded.
Then this container element is removed from the document (but the DOMElement of it still exists).
Then all direct children from the document are removed. This includes any added <html>, <head> and <body> tags (effectively LIBXML_HTML_NOIMPLIED option) as well as the <!DOCTYPE html ... loose.dtd"> declaration (effectively LIBXML_HTML_NODEFDTD).
Then all direct children of the container are added to the document again and it can be output.
$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';
$doc = new DOMDocument();
$doc->loadHTML("<div>$str</div>");
$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);
while ($doc->firstChild) {
$doc->removeChild($doc->firstChild);
}
while ($container->firstChild ) {
$doc->appendChild($container->firstChild);
}
$htmlFragment = $doc->saveHTML();
XPath works as usual, just take care that there are multiple document elements now, so not a single root node:
$xpath = new DOMXPath($doc);
foreach ($xpath->query('/p') as $element)
{ # ^- note the single slash "/"
# ... each of the two <p> elements
}
Tested with PHP 5.4.36-1+deb.sury.org~precise+2 (cli) (built: Dec 21 2014 20:28:53).

Okay I found a more elegant solution, but it's just tedious:
$d = new DOMDocument();
@$d->loadHTML($yourcontent);
...
// do your manipulation, processing, etc of it blah blah blah
...
// then to save, do this
$x = new DOMXPath($d);
$everything = $x->query("body/*"); // retrieves all elements inside body tag
if ($everything->length > 0) { // check if it retrieved anything in there
$output = '';
foreach ($everything as $thing) {
$output .= $d->saveXML($thing);
}
echo $output; // voila, no more annoying html wrappers or body tag
}
Alright, hopefully this does not omit anything and helps somebody?

None of the other solutions at the time of this writing (June, 2012) were able to completely meet my needs, so I wrote one which handles the following cases:
Accepts plain-text content which has no tags, as well as HTML content.
Does not append any tags (including <doctype>, <xml>, <html>, <body>, and <p> tags)
Leaves anything wrapped in <p> alone.
Leaves empty text alone.
So here is a solution which fixes those issues:
class DOMDocumentWorkaround
{
/**
* Convert a string which may have HTML components into a DOMDocument instance.
*
* @param string $html - The HTML text to turn into a DOMDocument.
* @return \DOMDocument - A DOMDocument created from the given html.
*/
public static function getDomDocumentFromHtml($html)
{
$domDocument = new DOMDocument();
// Wrap the HTML in <div> tags because loadXML expects everything to be within some kind of tag.
// LIBXML_NOERROR and LIBXML_NOWARNING mean this will fail silently and return an empty DOMDocument if it fails.
$domDocument->loadXML('<div>' . $html . '</div>', LIBXML_NOERROR | LIBXML_NOWARNING);
return $domDocument;
}
/**
* Convert a DOMDocument back into an HTML string, which is reasonably close to what we started with.
*
* @param \DOMDocument $domDocument
* @return string - The resulting HTML string
*/
public static function getHtmlFromDomDocument($domDocument)
{
// Convert the DOMDocument back to a string.
$xml = $domDocument->saveXML();
// Strip out the XML declaration, if one exists
$xmlDeclaration = "<?xml version=\"1.0\"?>\n";
if (substr($xml, 0, strlen($xmlDeclaration)) == $xmlDeclaration) {
$xml = substr($xml, strlen($xmlDeclaration));
}
// If the original HTML was empty, loadXML collapses our <div></div> into <div/>. Remove it.
if ($xml == "<div/>\n") {
$xml = '';
}
else {
// Remove the opening <div> tag we previously added, if it exists.
$openDivTag = "<div>";
if (substr($xml, 0, strlen($openDivTag)) == $openDivTag) {
$xml = substr($xml, strlen($openDivTag));
}
// Remove the closing </div> tag we previously added, if it exists.
$closeDivTag = "</div>\n";
$closeChunk = substr($xml, -strlen($closeDivTag));
if ($closeChunk == $closeDivTag) {
$xml = substr($xml, 0, -strlen($closeDivTag));
}
}
return $xml;
}
}
I also wrote some tests which would live in that same class:
public static function testHtmlToDomConversions($content)
{
// test that converting the $content to a DOMDocument and back does not change the HTML
if ($content !== self::getHtmlFromDomDocument(self::getDomDocumentFromHtml($content))) {
echo "Failed\n";
}
else {
echo "Succeeded\n";
}
}
public static function testAll()
{
self::testHtmlToDomConversions('<p>Here is some sample text</p>');
self::testHtmlToDomConversions('<div>Lots of <div>nested <div>divs</div></div></div>');
self::testHtmlToDomConversions('Normal Text');
self::testHtmlToDomConversions(''); //empty
}
You can check that it works for yourself. DOMDocumentWorkaround::testAll() prints this:
Succeeded
Succeeded
Succeeded
Succeeded

I am struggling with this on RHEL7 running PHP 5.6.25 and LibXML 2.9. (Old stuff in 2018, I know, but that is Red Hat for you.)
I have found that the much upvoted solution suggested by Alessandro Vendruscolo breaks the HTML by rearranging tags. For example:
<p>First.</p><p>Second.</p>
becomes:
<p>First.<p>Second.</p></p>
This goes for both the options he suggests you use: LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD.
The solution suggested by Alex goes halfway to solving it, but it does not work if <body> has more than one child node.
The solution that works for me is the following:
First, to load the DOMDocument, I use:
$doc = new DOMDocument();
$doc->loadHTML($content);
To save the document after massaging the DOMDocument, I use:
// remove <!DOCTYPE
$doc->removeChild($doc->doctype);
$content = $doc->saveHTML();
// remove <html><body></body></html>
$content = str_replace('<html><body>', '', $content);
$content = str_replace('</body></html>', '', $content);
I am the first to agree that this is not a very elegant solution, but it works.

Use this regular expression to strip the doctype and the html, head, and body tags:
$layout = preg_replace('~<(?:!DOCTYPE|/?(?:html|head|body))[^>]*>\s*~i', '', $layout);

Much like other members, I first revelled in the simplicity and awesome power of @Alessandro Vendruscolo's answer. The ability to simply pass some flag constants to loadHTML() seemed too good to be true. For me it was. I have the correct versions of both LibXML and PHP, but no matter what, it still added the HTML tag to the node structure of the Document object.
My solution worked way better than using the...
$html->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
Flags or....
# remove <!DOCTYPE
$doc->removeChild($doc->firstChild);
# remove <html><body></body></html>
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
Node removal, which gets messy without a structured order in the DOM. Again, with code fragments there is no way to predetermine the DOM structure.
I started this journey wanting a simple way to do DOM traversal the way jQuery does it, or at least in some fashion that offered a structured data set with singly linked, doubly linked, or tree-based node traversal. I didn't care how, as long as I could parse a string the way HTML does and also have the power of the node class properties to use along the way.
So far the DOMDocument object has left me wanting, as it seems to have done for many other programmers. I have seen a lot of frustration in this question, so now that I have FINALLY (after roughly 30 hours of trial-and-error testing) found a way to get it all, I hope this helps someone.
First off, I am cynical of EVERYTHING... lol...
I would have gone a lifetime before agreeing with anyone that a third-party class is in any way needed in this use case. I very much was and am NOT a fan of using any third-party class structure; however, I stumbled onto a great parser (about 30 times in Google before I gave in, so don't feel alone if you avoided it because it looked lame or unofficial in any way).
If you are working with code fragments and need the code clean and unaffected by the parser in any way, without extra tags being added, then use simplePHPParser.
It's amazing and acts a lot like jQuery. I am not often impressed, but this class makes use of a lot of good tools and I have had no parsing errors as of yet. I am a huge fan of being able to do what this class does.
You can find its files to download here, its startup instructions here, and its API here. I highly recommend using this class with its simple methods, which can do a .find(".className") the same way a jQuery find method would be used, or even familiar methods such as getElementByTagName() or getElementById().
When you save out a node tree in this class it doesn't add anything at all. You can simply say $doc->save(); and it outputs the entire tree to a string without any fuss.
I will now be using this parser for all non-capped-bandwidth projects in the future.

If the flags solution from Alessandro Vendruscolo's answer doesn't work, you may try this:
$dom = new DOMDocument();
$dom->loadHTML($content);
//do your stuff..
$finalHtml = '';
$bodyTag = $dom->documentElement->getElementsByTagName('body')->item(0);
foreach ($bodyTag->childNodes as $rootLevelTag) {
$finalHtml .= $dom->saveHTML($rootLevelTag);
}
echo $finalHtml;
$bodyTag will contain your full processed HTML without all those HTML wrappers, except for the <body> tag itself, which is the root of your content. You can then use a regex or a trim function to remove it from the final string (after saveHTML) or, as in the code above, iterate over all of its children, append their markup to a temporary variable $finalHtml, and return that (which I believe is safer).
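The child-by-child serialization can be wrapped in a reusable function. A minimal sketch, assuming PHP 7+ for the type declarations; bodyInnerHtml is a hypothetical name chosen for this example, and the empty-string guard is there because newer PHP versions reject an empty loadHTML() argument.

```php
<?php
/**
 * Hypothetical helper: parse an HTML fragment and return the markup of
 * everything inside the implied <body>, without doctype, <html> or <body>.
 */
function bodyInnerHtml(string $html): string
{
    if ($html === '') {
        return ''; // loadHTML() does not accept an empty string
    }
    $dom = new DOMDocument();
    // Keep parser warnings (e.g. about HTML5 tags) out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    // Serialize each direct child of <body>; nothing is moved or removed,
    // so iterating the live childNodes list is safe here.
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo bodyInnerHtml('<p>First.</p><p>Second.</p>');
```

Because this only reads the tree, it does not depend on LIBXML_HTML_NOIMPLIED or on how many children <body> has. The serializer may insert newlines between block elements, so compare results with whitespace in mind.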

I came across this topic while looking for a way to remove the HTML wrapper. Using LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD works great, but I had a problem with UTF-8. After much effort I found a solution. I post it below for anyone who has the same problem.
The problem is caused by the <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> tag.
The problem:
$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$dom->saveHTML();
Solution 1:
$dom->loadHTML(mb_convert_encoding($document, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$dom->saveHTML($dom->documentElement);
Solution 2:
$dom->loadHTML($document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// note: utf8_decode() is deprecated as of PHP 8.2
$html = utf8_decode($dom->saveHTML($dom->documentElement));
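Both mb_convert_encoding(..., 'HTML-ENTITIES', ...) and utf8_decode() are deprecated as of PHP 8.2. One alternative, sketched here under the assumption that the mbstring extension is installed, is to turn every non-ASCII character into a numeric entity before parsing, so the parser's default encoding assumption no longer matters. loadUtf8Html is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: load UTF-8 HTML without relying on a <meta> tag.
 * Assumes the mbstring extension is available.
 */
function loadUtf8Html(string $html): DOMDocument
{
    // Replace every character from U+0080 upwards with a numeric entity
    // such as &#228; so the input handed to the parser is pure ASCII.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

$dom = loadUtf8Html('<p>äöü</p><p>ß</p>');
$paragraphs = $dom->getElementsByTagName('p');
// DOM strings are exposed as UTF-8, so the characters survive the round trip.
echo $paragraphs->item(0)->textContent, "\n";
```

How saveHTML() writes those characters back (raw or as entities) depends on the document's encoding settings, so check the serialized output on your own setup.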

Adding the <meta> tag will trigger the fixing behavior of DOMDocument. The good part is that you don't need to add that tag at all. If you want to use an encoding of your choosing, just pass it as a constructor argument.
http://php.net/manual/en/domdocument.construct.php
$doc = new DOMDocument('1.0', 'UTF-8');
$node = $doc->createElement('div', 'Hello World');
$doc->appendChild($node);
echo $doc->saveHTML();
Output
<div>Hello World</div>
Thanks to @Bart

I had this requirement too, and liked the solution posted by Alex above. There are a couple of issues, though: if the <body> element contains more than one child element, the resulting document will contain only the first child element of <body>, not all of them. Also, I needed the stripping to be conditional, applied only when the document actually has the HTML headings. So I refined it as follows. Instead of removing <body>, I transformed it into a <div>, and stripped out the doctype declaration and <html>.
function strip_html_headings($html_doc)
{
if (is_null($html_doc))
{
// might be better to issue an exception, but we silently return
return;
}
// remove <!DOCTYPE
if (!is_null($html_doc->firstChild) &&
$html_doc->firstChild->nodeType == XML_DOCUMENT_TYPE_NODE)
{
$html_doc->removeChild($html_doc->firstChild);
}
if (!is_null($html_doc->firstChild) &&
strtolower($html_doc->firstChild->tagName) == 'html' &&
!is_null($html_doc->firstChild->firstChild) &&
strtolower($html_doc->firstChild->firstChild->tagName) == 'body')
{
// we have 'html/body' - replace both nodes with a single "div"
$div_node = $html_doc->createElement('div');
// copy all the child nodes of 'body' to 'div'
foreach ($html_doc->firstChild->firstChild->childNodes as $child)
{
// deep copies each child node, with attributes
$child = $html_doc->importNode($child, true);
// adds node to 'div''
$div_node->appendChild($child);
}
// replace 'html/body' with 'div'
$html_doc->removeChild($html_doc->firstChild);
$html_doc->appendChild($div_node);
}
}

I have PHP 5.3 and the answers here did not work for me.
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild); replaced the whole document with only the first child: I had many paragraphs and only the first was being saved. Still, the solution gave me a good starting point to write something without regex. I left some comments, and I am pretty sure this can be improved, but if someone has the same problem as me it can be a good starting point.
function extractDOMContent($doc){
# remove <!DOCTYPE
$doc->removeChild($doc->doctype);
// lets get all children inside the body tag
foreach ($doc->firstChild->firstChild->childNodes as $k => $v) {
if($k !== 0){ // don't store the first element since that one will be used to replace the html tag
$doc->appendChild( clone($v) ); // appending element to the root so we can remove the first element and still have all the others
}
}
// replace the body tag with the first children
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
return $doc;
}
Then we could use it like this:
$doc = new DOMDocument();
$doc->encoding = 'UTF-8';
$doc->loadHTML('<p>Some html here</p><p>And more html</p><p>and some html</p>');
$doc = extractDOMContent($doc);
Note that appendChild accepts a DOMNode, so we do not need to create new elements; we can reuse existing ones that implement DOMNode, such as DOMElement. This can be important to keep code "sane" when manipulating multiple HTML/XML documents.

I faced three problems with the DOMDocument class.
1- This class loads HTML as ISO-8859-1, so UTF-8 characters do not show correctly in the output.
2- Even if we pass the LIBXML_HTML_NOIMPLIED flag to the loadHTML method, the input is not parsed correctly unless it contains a single root tag.
3- This class considers HTML5 tags invalid.
So I extended this class and overrode some of its methods to solve these problems.
class DOMEditor extends DOMDocument
{
/**
* Temporary wrapper tag , It should be an unusual tag to avoid problems
*/
protected $tempRoot = 'temproot';
public function __construct($version = '1.0', $encoding = 'UTF-8')
{
//turn off html5 errors
libxml_use_internal_errors(true);
parent::__construct($version, $encoding);
}
public function loadHTML($source, $options = LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD)
{
// this is a bitwise check if LIBXML_HTML_NOIMPLIED is set
if ($options & LIBXML_HTML_NOIMPLIED) {
// it loads the content with a temporary wrapper tag and utf-8 encoding
parent::loadHTML("<{$this->tempRoot}>" . mb_convert_encoding($source, 'HTML', 'UTF-8') . "</{$this->tempRoot}>", $options);
} else {
// it loads the content with utf-8 encoding and default options
parent::loadHTML(mb_convert_encoding($source, 'HTML', 'UTF-8'), $options);
}
}
private function unwrapTempRoot($output)
{
if ($this->firstChild->nodeName === $this->tempRoot) {
return substr($output, strlen($this->tempRoot) + 2, -strlen($this->tempRoot) - 4);
}
return $output;
}
public function saveHTML(DOMNode $node = null)
{
$html = html_entity_decode(parent::saveHTML($node));
if (is_null($node)) {
$html = $this->unwrapTempRoot($html);
}
return $html;
}
public function saveXML(DOMNode $node = null, $options = null)
{
if (is_null($node)) {
return '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' . PHP_EOL . $this->saveHTML();
}
return parent::saveXML($node);
}
}
Now I'm using DOMEditor instead of DOMDocument, and it has worked well for me so far:
$editor = new DOMEditor();
$editor->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// works like a charm!
echo $editor->saveHTML();

My universal solution independent of how the HTML was loaded:
function getNodeHtml(DOMNode $node, $outer = true) {
$doc = new DOMDocument();
$node = $node instanceof DOMDocument ? $node->documentElement : $node;
foreach(($outer ? array($node) : $node->childNodes) as $n) {
$doc->appendChild($doc->importNode($n->cloneNode(true), true));
}
return $doc->saveHTML();
}
Sample results:
<p>foo bar </p> ━▶ <p>foo bar </p>
<p>foo</p><p>bar</p> ━▶ <p>foo</p><p>bar</p>
<p>foo </p> <p> bar</p> ━▶ <p>foo </p> <p> bar</p>
Hello! ━▶ Hello!
<html><body><b>foo</b></body></html> ━▶ <html><body><b>foo</b></body></html>

After reading lots of code about this topic, I ended up with the following solution that works very well for me and is easy to understand.
It fixes unwanted Doctype and <html> and <body> as well as encoding issues.
This code assumes that $htmlContent is encoded in utf-8.
$htmlContent = "<h1>This is a heading</h1><p>This is a paragraph</p>";
// 1.) Load the html
$dom = new DOMDocument();
$dom->loadHTML("<meta http-equiv='Content-Type' content='charset=utf-8' /><div>$htmlContent</div>");
// 2.) Do your logic
$dom->getElementsByTagName('h1')[0]->setAttribute('class', 'happy');
// 3.) Render the html
$wrapperNode = $dom->getElementsByTagName('div')[0];
$renderedHtml = $dom->saveHTML($wrapperNode);
// If you want to keep the wrapper div
echo $renderedHtml;
// Or remove the wrapper <div>
echo substr(trim($renderedHtml), 5, -6);
The key takeaways are:
loadHTML assumes the content is ISO-8859-1; if this is not the case, you need to add encoding information.
Wrap your HTML in a div and render just this div; you can remove it with substr if you don't want to keep it.
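The wrapper idea can also be written without counting characters for substr. The sketch below serializes the wrapper's children one by one; roundTripFragment is a hypothetical name, encoding handling is left out to keep it short, and it assumes the fragment does not itself contain unbalanced <div> tags.

```php
<?php
/**
 * Hypothetical helper: round-trip an HTML fragment through DOMDocument
 * using a temporary wrapper <div>, then serialize only the wrapper's
 * children so the wrapper itself never appears in the result.
 */
function roundTripFragment(string $fragment): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<div>' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first <div> in document order.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo roundTripFragment('Hello <b>world</b>');
```

A side benefit of the wrapper is that leading loose text stays a plain text node inside the <div> instead of being wrapped in an implied <p> by the parser.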

I may be too late, but maybe somebody (like me) still has this issue. None of the above worked for me, because $dom->loadHTML also closes open tags; it does not only add html and body tags.
So adding a <div> element does not work for me, because I sometimes have 3-4 unclosed divs in the HTML piece.
My solution:
1.) Add marker to cut, then load the html piece
$html_piece = "[MARK]".$html_piece."[/MARK]";
$dom->loadHTML($html_piece);
2.) do whatever you want with the document
3.) save html
$new_html_piece = $dom->saveHTML();
4.) before you return it, remove the <p></p> tags from around the marker; strangely, they only appear around [MARK] but not around [/MARK]...!?
$new_html_piece = preg_replace( "/<p[^>]*?>(\[MARK\]|\s)*?<\/p>/", "[MARK]" , $new_html_piece );
5.) remove everything before and after marker
$pattern_contents = '{\[MARK\](.*?)\[\/MARK\]}is';
if (preg_match($pattern_contents, $new_html_piece, $matches)) {
$new_html_piece = $matches[1];
}
6.) return it
return $new_html_piece;
It would be a lot easier if LIBXML_HTML_NOIMPLIED worked for me. It should, but it does not. PHP 5.4.17, libxml version 2.7.8.
I find it really strange: I use the HTML DOM parser and then, to fix this "thing", I have to use regex... The whole point was not to use regex ;)

I came upon this issue as well.
Unfortunately, I did not feel comfortable using any of the solutions provided in this thread, so I went looking for one that would satisfy me.
Here's what I came up with, and it works without issues:
$domxpath = new \DOMXPath($domDocument);
/** @var \DOMNodeList $subset */
$subset = $domxpath->query('descendant-or-self::body/*');
$html = '';
foreach ($subset as $domElement) {
/** @var \DOMElement $domElement */
$html .= $domDocument->saveHTML($domElement);
}
In essence it works in a similar way to most of the solutions provided here, but instead of doing manual labor it uses an XPath selector to select all the elements within the body and concatenates their HTML code.
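One detail worth knowing about the XPath approach: a selector ending in /* matches element nodes only, so text that sits directly inside <body> is dropped. A small sketch comparing it with node(), using a made-up sample input:

```php
<?php
$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);
// Sample input (made up for this sketch): one element followed by loose text.
$dom->loadHTML('<div>boxed</div>loose text');
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// "*" selects element children only.
$elementsOnly = '';
foreach ($xpath->query('//body/*') as $node) {
    $elementsOnly .= $dom->saveHTML($node);
}

// "node()" also selects text and comment children.
$allNodes = '';
foreach ($xpath->query('//body/node()') as $node) {
    $allNodes .= $dom->saveHTML($node);
}

echo $elementsOnly, "\n";
echo $allNodes, "\n";
```

If the content may contain top-level text or comments, node() is the safer choice.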

My server runs PHP 5.3 and can't be upgraded, so the options
LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
are not available to me.
To solve this, I tell the saveXML function to print the body element and then just replace "body" with "div".
Here is my code; I hope it helps someone:
<?php
$html = "your html here";
$tabContentDomDoc = new DOMDocument();
$tabContentDomDoc->loadHTML('<?xml encoding="UTF-8">'.$html);
$tabContentDomDoc->encoding = 'UTF-8';
$tabContentDomDocBody = $tabContentDomDoc->getElementsByTagName('body')->item(0);
if(is_object($tabContentDomDocBody)){
echo (str_replace("body","div",$tabContentDomDoc->saveXML($tabContentDomDocBody)));
}
?>
The UTF-8 handling is there for Hebrew support.

Alex's answer is correct, but it might cause the following error on empty nodes:
Argument 1 passed to DOMNode::removeChild() must be an instance of
DOMNode
Here comes my little mod:
$output = '';
$doc = new DOMDocument();
$doc->loadHTML($htmlString); //feed with html here
if (isset($doc->firstChild)) {
/* remove doctype */
$doc->removeChild($doc->firstChild);
/* remove html and body */
if (isset($doc->firstChild->firstChild->firstChild)) {
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
$output = trim($doc->saveHTML());
}
}
return $output;
Adding the trim() is also a good idea to remove whitespace.

Add html tag to string in PHP

I would like to add HTML tags to a string of HTML in PHP, for example:
<h2><b>Hello World</b></h2>
<p>First</p>
Second
<p>Third</p>
Second is not wrapped in any HTML element, so the system should add a p tag around it. Expected result:
<h2><b>Hello World</b></h2>
<p>First</p>
<p>Second</p>
<p>Third</p>
I tried PHP Simple HTML DOM Parser but have no clue how to approach it. Here is a sketch of my idea:
function htmlParser($html)
{
foreach ($html->childNodes() as $node) {
if ($node->childNodes()) {
htmlParser($node);
}
// Ideally: add a p tag around the node's innertext if it is not wrapped in any tag
}
return $html;
}
But childNodes will not reach Second because it is not wrapped in any element, and regex is not recommended for dealing with HTML tags. Any ideas?
Much appreciated, thanks.

This was a cool question because it prompted some thought about the DOM.
I asked a related question, "How do HTML Parsers process untagged text", which @sideshowbarker commented on generously. That made me think and improved my knowledge of the DOM, especially about text nodes.
Below is a DOM-based way of finding candidate text nodes and wrapping them in 'p' tags. Many text nodes should be left alone, such as the spaces, carriage returns and line feeds used for formatting (which an "uglifier" may strip out).
<?php
$html = file_get_contents("nodeTest.html"); // read the test file
$dom = new DOMDocument(); // a new DOM object
$dom->loadHTML($html); // build the DOM
$bodyNodes = $dom->getElementsByTagName('body'); // returns a DOMNodeList object
foreach ($bodyNodes->item(0)->childNodes as $child) // assuming one <body> node
{
    $text = "";
    // test for an untagged text node that contains more than formatting characters
    if (($child->nodeType == XML_TEXT_NODE) && (strlen($text = trim($child->nodeValue)) > 0))
    { // it's a candidate for adding tags
        $newText = "<p>" . $text . "</p>";
        echo str_replace($text, $newText, $child->nodeValue);
    }
    else
    { // not a candidate for adding tags
        echo $dom->saveHTML($child);
    }
}
nodeTest.html contains this.
<!DOCTYPE HTML>
<html>
<body>
<h2><b>Hello World</b></h2>
<p>First</p>
Second
<p>Third</p>
fourth
<p>Third</p>
<!-- comment -->
</body>
</html>
The output is below. I did not bother echoing the outer tags. Notice that comments and formatting are handled properly.
<h2><b>Hello World</b></h2>
<p>First</p>
<p>Second</p>
<p>Third</p>
<p>fourth</p>
<p>Third</p>
<!-- comment -->
Obviously you need to traverse the DOM and repeat the search/replace at each element node if you want to make this more general. In this example we stop at the body node and process only its direct child nodes.
I'm not 100% sure this code is the most efficient possible. I may think about it some more and update the answer if I find a better way.
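As a variation on the approach above, the text nodes can be wrapped inside the DOM itself instead of being echoed one by one. This is only a minimal sketch: the function name wrapBareText is hypothetical, it handles only direct children of body, and it trims the whitespace around each wrapped text node.

```php
<?php
// Hypothetical helper (not part of the answer above): wraps bare text nodes
// that are direct children of <body> in <p> elements, editing the DOM in place.
function wrapBareText(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Snapshot the children first: childNodes is a live list and some of
    // its members are about to be replaced.
    foreach (iterator_to_array($body->childNodes) as $child) {
        if ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) !== '') {
            $p = $dom->createElement('p');
            $p->appendChild($dom->createTextNode(trim($child->nodeValue)));
            $body->replaceChild($p, $child);
        }
    }

    // Serialize only the children of <body>, so no wrapper tags are emitted.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo wrapBareText(
    "<!DOCTYPE html><html><body><h2><b>Hello World</b></h2>\n<p>First</p>\nSecond\n<p>Third</p></body></html>"
);
```

Using createTextNode() rather than string concatenation means characters such as & or < in the text are escaped when the document is serialized.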

I used a crude way to solve this problem. Here is my code:
function addPTag($html)
{
$contents = preg_split("/(<\/.*?>)/", $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
foreach ($contents as &$content) {
if (substr($content, 0, 1) != '<') {
$chars = preg_split("/(<)/", $content, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$chars[0] = '<p>' . $chars[0] . '</p>';
$content = implode($chars);
}
}
return implode($contents);
}
I hope there is a more elegant way than this, thanks.

You can try Simple HTML DOM Parser:
$stringHtml = 'Your received html';
$html = str_get_html($stringHtml);
// Find the necessary element and read or edit it
$exampleText = $html->find('Your selector here', 0)->last_child()->innertext;

php dom document remove some html tags but keep inner tags and text

I need to remove some tags (e.g. <div></div>) from an HTML document while keeping the inner tags and text.
I managed to do that with Simple HTML Dom Parser. But it can't process big files due to huge memory requirements.
I would prefer to use native PHP tools like DOMDocument because I read that it's more optimized and quicker at processing HTML documents.
But I am stuck at the first stage: how to remove some tags while keeping the inner text and tags.
A sample of the source HTML:
<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>
I tried this code:
$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
foreach ($oldnodes as $node) {
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
}
echo $doc->saveHTML();
It produces the output:
<html><body>00000aaaaa<div>bbbbbbccc<a>link</a>cccdddddd</div>eeeee<div>1111</div></body></html>
I need the following:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
Could someone please help me with proper code for the task?

You can use the strip_tags function in PHP.
$htmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo strip_tags($htmltext, '<html><body><a>');
This removes all tags except html, body and a.
The output is:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
EDIT:
If the input comes from a user, it is better for security reasons to whitelist tags than to blacklist them.
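One caveat worth showing alongside the whitelist advice: strip_tags() filters by tag name only and leaves the attributes of allowed tags untouched. A minimal sketch, with a made-up input string:

```php
<?php
// Made-up input: the <a> tag carries an inline event handler.
$input = '<div><a href="/x" onclick="alert(1)">link</a> and <b>text</b></div>';

// Keep only <a>; every other tag is dropped while its text is kept.
$clean = strip_tags($input, '<a>');

// The onclick attribute survives, because strip_tags() does not inspect attributes.
echo $clean, PHP_EOL;
```

So for untrusted input, a tag whitelist alone is not a complete sanitizer; attributes would still need to be checked separately.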

If your code only contains simple HTML tags without any attributes you can keep it simple like:
$value = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
$pattern = '/<[\/]*(div|h1)>/';
$removedTags = preg_replace($pattern, '', $value);
Since you wrote in your comment that there are more than just div tags you want to remove, I added a h1 tag to the pattern in case you also want to remove h1 tags.
This snippet only suits simple markup, but it matches your HTML input and output example.

Try this: replace the foreach loop with the code below.
foreach ($oldnodes as $node) {
$children = $node->childNodes;
$string = "";
foreach($children as $child) {
$childString = $doc->saveXML($child);
$string = $string."".$childString;
}
$fragment = $doc->createDocumentFragment();
$fragment->appendXML($string);
$node->parentNode->insertBefore($fragment,$node);
$node->parentNode->removeChild($node);
}

I found a way to make it work.
The code in the question fails because manipulating nodes that belong to the node list corrupts the list: it is live, so it changes as nodes are removed. As a result, foreach visits only 2 of the 4 items in the list and skips the other 2.
So I process only the first element of the list, then rebuild the list, and repeat until no items are left.
The code is:
$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo "<!--
".$htmltext."
-->
";
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
while ($oldnodes->length>0){
$node=$oldnodes->item(0);
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
$oldnodes = $doc->getElementsByTagName('div');
}
echo $doc->saveHTML();
I hope this helps someone who runs into the same difficulty.
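An alternative to rebuilding the list on every pass is to copy the live list into a plain array once and iterate over that copy. The sketch below assumes this snapshot approach; the function name unwrapTag is hypothetical.

```php
<?php
// Hypothetical helper: removes every <$tag> element but keeps its children.
function unwrapTag(DOMDocument $doc, string $tag): void
{
    // Copy the live DOMNodeList into an array so that removing elements
    // does not change what the loop iterates over.
    $elements = iterator_to_array($doc->getElementsByTagName($tag));

    foreach ($elements as $element) {
        $parent = $element->parentNode;
        if ($parent === null) {
            continue;
        }
        // Move each child in front of the element, preserving order.
        while ($element->firstChild !== null) {
            $parent->insertBefore($element->firstChild, $element);
        }
        $parent->removeChild($element);
    }
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>');
libxml_clear_errors();

unwrapTag($doc, 'div');
$result = $doc->saveHTML();
echo $result;
```

Nested elements are handled because each element in the snapshot stays attached to the document until its own turn comes.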

Get contents of BODY without DOCTYPE, HTML, HEAD and BODY tags

What I am trying to do is include an HTML file within a PHP system (not a problem), but that HTML file also needs to be usable on its own, for various reasons. So I need to know how I can strip the doctype, html, head and body tags in the context of the PHP include, if that's possible.
I'm not particularly good at PHP (doh!), so my searches of the PHP manual and the web haven't helped me figure this out. Any help or reading tips, or both, are much appreciated.

Since the substr() method seemed to be too much for some to swallow, here is a DOM parser method:
$d = new DOMDocument;
$mock = new DOMDocument;
$d->loadHTML(file_get_contents('/path/to/my.html'));
$body = $d->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child){
$mock->appendChild($mock->importNode($child, true));
}
echo $mock->saveHTML();
http://codepad.org/MQVQ3XQP
Anybody who wishes to see that "other one" can check the revisions.

$site = file_get_contents("http://www.google.com/");
preg_match("/<body[^>]*>(.*?)<\/body>/is", $site, $matches);
echo($matches[1]);

Use DOMDocument to keep what you need rather than strip what you don't need (PHP >= 5.3.6)
$d = new DOMDocument;
$d->loadHTMLFile($fileLocation);
$body = $d->getElementsByTagName('body')->item(0);
// perform innerhtml on $body by enumerating child nodes
// and saving them individually
foreach ($body->childNodes as $childNode) {
echo $d->saveHTML($childNode);
}

As miken32 said:
Hey why not answer a 9 year old question? PHP version 5.4 (released 3
years after this question was asked) added the options parameter to
DomDocument::loadHTML(). With it you can do this:
$dom = new DomDocument();
$dom->loadHTML($string, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
// do stuff
echo $dom->saveHTML();
We pass two constants: LIBXML_HTML_NODEFDTD says not to add a document type definition, and LIBXML_HTML_NOIMPLIED says not to add implied elements like <html> and <body>.
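A minimal sketch of these two flags in use follows. It assumes a temporary <div> container around the input, because LIBXML_HTML_NOIMPLIED is widely reported to rearrange markup that has more than one top-level node; the function name roundTripFragment is hypothetical.

```php
<?php
// Hypothetical helper: parses a fragment and re-serializes it
// without DOCTYPE, <html> or <body>.
function roundTripFragment(string $fragment): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    // The temporary <div> gives the parser a single root element.
    $dom->loadHTML(
        '<div>' . $fragment . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();

    // "do stuff" with the DOM would go here.

    // Serialize the children of the temporary container only.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo roundTripFragment('<p>First</p><p>Second</p>');
```

Serializing the container's children, rather than the whole document, keeps the temporary <div> out of the result.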

You may want to use PHP tidy extension which can fix invalid XHTML structures (in which case DOMDocument load crashes) and also extract body only:
$tidy = new tidy();
$htmlBody = $tidy->repairString($html, array(
'output-xhtml' => true,
'show-body-only' => true,
), 'utf8');
Then load extracted body into DOMDocument:
$xml = new DOMDocument();
$xml->loadHTML($htmlBody);
Then traverse, extract, move around XML nodes etc .. and save:
$output = $xml->saveXML();

Use a DOM parser. This is not tested, but it ought to do what you want:
$domDoc = new DOMDocument();
$domDoc->loadHTMLFile('/path/to/file');
$body = $domDoc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
    echo $child->C14N(); // Note: this canonicalizes the representation of the node, but that's not necessarily a bad thing
}
If you want to avoid canonicalization, you can use this version (thanks to @Jared Farrish)

A solution with only one instance of DOMDocument and without loops
$d = new DOMDocument();
$d->loadHTML(file_get_contents('/path/to/my.html'));
$body = $d->getElementsByTagName('body')->item(0);
echo $d->saveHTML($body);

This may be a solution in JavaScript (it uses the browser's DOMParser, not PHP). I tried it and it works fine.
function parseHTML(string) {
var parser = new DOMParser
, result = parser.parseFromString(string, "text/html");
return result.firstChild.lastChild.firstChild;
}

php DOMDocument adds <html> headers with DOCTYPE declaration

I'm adding a #b hash to each link via the DOMDocument class.
$dom = new DOMDocument();
$dom->loadHTML($output);
$a_tags = $dom->getElementsByTagName('a');
foreach($a_tags as $a)
{
$value = $a->getAttribute('href');
$a->setAttribute('href', $value . '#b');
}
return $dom->saveHTML();
That works fine, however the returned output includes a DOCTYPE declaration and a <head> and <body> tag. Any idea why that happens or how I can prevent that?

That's what DOMDocument::saveHTML() generally does, yes: it generates a full HTML document, with the doctype declaration, the <head> tag, and so on.
Two possible solutions:
If you are working with PHP >= 5.3.6, saveHTML() accepts one additional parameter that might help you;
see The DOM Goodie in PHP 5.3.6 for more information.
If you need your code to work with PHP < 5.3.6, you'll have to use str_replace(), a regex or some equivalent to remove the portions of HTML you don't need.
For an example, see this note in the manual's user notes.
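For the first option, the additional parameter is a node: saveHTML($node) serializes just that node instead of the whole document. Applied to the question's code, it could look like the sketch below. The function name addHashToLinks and the sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical helper: appends a fragment identifier to every link's href
// and returns the markup without DOCTYPE, <html> or <body>.
function addHashToLinks(string $html, string $hash = '#b'): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('a') as $a) {
        $a->setAttribute('href', $a->getAttribute('href') . $hash);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        // Passing a node to saveHTML() is what requires PHP 5.3.6 or later.
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo addHashToLinks('<p>See <a href="/docs">the docs</a>.</p>');
```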

The real problem is the way the DOM is loaded. Use this instead:
$html->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
Please upvote the original answer here.

Calling $doc->saveHTML(false); will not work: it returns an error because it expects a node, not a bool.
The solution I used:
return preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $doc->saveHTML()));
I'm using PHP > 5.4

I solved this problem by creating a new DOMDocument and copying the child nodes from the original into it.
function removeDocType($oldDom) {
    // documentElement is <html>; its first child is assumed to be <body>
    $node = $oldDom->documentElement->firstChild;
    $dom = new DOMDocument();
    foreach ($node->childNodes as $child) {
        $dom->appendChild($dom->importNode($child, true));
    }
    return $dom->saveHTML();
}
So instead of using
return $dom->saveHTML();
I use:
return removeDocType($dom);

In my case I wanted the html wrapper but not the DOCTYPE. The solution was in line with Tiago A.'s:
// Avoid adding the DOCTYPE header
$dom->loadHTML($bodyContent, LIBXML_HTML_NODEFDTD);
// Avoid adding the DOCTYPE header AND html/body wrapper
$dom->loadHTML($bodyContent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
