Remove parent element, keep all inner children in DOMDocument with saveHTML - php

I'm manipulating a short HTML snippet with XPath; when I output the changed snippet back with $doc->saveHTML(), DOCTYPE gets added, and HTML / BODY tags wrap the output. I want to remove those, but keep all the children inside by only using the DOMDocument functions. For example:
$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p>
<img src="http://" alt="">
<p>...to be one of those crowning achievements...</p>');
// manipulation goes here
echo htmlentities( $doc->saveHTML() );
This produces:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html><body>
<p><strong>Title...</strong></p>
<img src="http://" alt="">
<p>...to be one of those crowning achievements...</p>
</body></html>
I've attempted some of the simple tricks, such as:
# removes doctype
$doc->removeChild($doc->firstChild);
# <body> replaces <html>
$doc->replaceChild($doc->firstChild->firstChild, $doc->firstChild);
So far that only removes DOCTYPE and replaces HTML with BODY. However, what remains is body > variable number of elements at this point.
How do I remove the <body> tag but keep all of its children, given that they will be structured variably, in a neat - clean way with PHP's DOM manipulation?

UPDATE
Here's a version that doesn't extend DOMDocument, though I think extending is the proper approach, since you're trying to achieve functionality that isn't built-in to the DOM API.
Note: I'm interpreting "clean" and "without workarounds" as keeping all manipulation to the DOM API. As soon as you hit string manipulation, that's workaround territory.
What I'm doing, just as in the original answer, is leveraging DOMDocumentFragment to manipulate multiple nodes all sitting at the root level. There is no string manipulation going on, which to me qualifies as not being a workaround.
$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p><img src="http://" alt=""><p>...to be one of those crowning achievements...</p>');
// Remove doctype node
$doc->doctype->parentNode->removeChild($doc->doctype);
// Remove html element, preserving child nodes
$html = $doc->getElementsByTagName("html")->item(0);
$fragment = $doc->createDocumentFragment();
while ($html->childNodes->length > 0) {
$fragment->appendChild($html->childNodes->item(0));
}
$html->parentNode->replaceChild($fragment, $html);
// Remove body element, preserving child nodes
$body = $doc->getElementsByTagName("body")->item(0);
$fragment = $doc->createDocumentFragment();
while ($body->childNodes->length > 0) {
$fragment->appendChild($body->childNodes->item(0));
}
$body->parentNode->replaceChild($fragment, $body);
// Output results
echo htmlentities($doc->saveHTML());
ORIGINAL ANSWER
This solution is rather lengthy, but it's because it goes about it by extending the DOM in order to keep your end code as short as possible.
sliceOutNode is where the magic happens. Let me know if you have any questions:
<?php
class DOMDocumentExtended extends DOMDocument
{
public function __construct( $version = "1.0", $encoding = "UTF-8" )
{
parent::__construct( $version, $encoding );
$this->registerNodeClass( "DOMElement", "DOMElementExtended" );
}
// This method will need to be removed once PHP supports LIBXML_NOXMLDECL
public function saveXML( DOMNode $node = NULL, $options = 0 )
{
$xml = parent::saveXML( $node, $options );
if( $options & LIBXML_NOXMLDECL )
{
$xml = $this->stripXMLDeclaration( $xml );
}
return $xml;
}
public function stripXMLDeclaration( $xml )
{
return preg_replace( "|<\?xml(.+?)\?>[\n\r]?|i", "", $xml );
}
}
class DOMElementExtended extends DOMElement
{
public function sliceOutNode()
{
$nodeList = new DOMNodeListExtended( $this->childNodes );
$this->replaceNodeWithNode( $nodeList->toFragment( $this->ownerDocument ) );
}
public function replaceNodeWithNode( DOMNode $node )
{
return $this->parentNode->replaceChild( $node, $this );
}
}
class DOMNodeListExtended extends ArrayObject
{
public function __construct( $mixedNodeList )
{
parent::__construct( array() );
$this->setNodeList( $mixedNodeList );
}
private function setNodeList( $mixedNodeList )
{
if( $mixedNodeList instanceof DOMNodeList )
{
$this->exchangeArray( array() );
foreach( $mixedNodeList as $node )
{
$this->append( $node );
}
}
elseif( is_array( $mixedNodeList ) )
{
$this->exchangeArray( $mixedNodeList );
}
else
{
throw new DOMException( "DOMNodeListExtended only supports a DOMNodeList or array as its constructor parameter." );
}
}
public function toFragment( DOMDocument $contextDocument )
{
$fragment = $contextDocument->createDocumentFragment();
foreach( $this as $node )
{
$fragment->appendChild( $contextDocument->importNode( $node, true ) );
}
return $fragment;
}
// Built-in methods of the original DOMNodeList
public function item( $index )
{
return $this->offsetGet( $index );
}
public function __get( $name )
{
switch( $name )
{
case "length":
return $this->count();
break;
}
return false;
}
}
// Load HTML/XML using our fancy DOMDocumentExtended class
$doc = new DOMDocumentExtended();
$doc->loadHTML('<p><strong>Title...</strong></p><img src="http://" alt=""><p>...to be one of those crowning achievements...</p>');
// Remove doctype node
$doc->doctype->parentNode->removeChild( $doc->doctype );
// Slice out html node
$html = $doc->getElementsByTagName("html")->item(0);
$html->sliceOutNode();
// Slice out body node
$body = $doc->getElementsByTagName("body")->item(0);
$body->sliceOutNode();
// Pick your poison: XML or HTML output
echo htmlentities( $doc->saveXML( NULL, LIBXML_NOXMLDECL ) );
echo htmlentities( $doc->saveHTML() );

saveHTML() can output a subset of the document, so we can ask it to output each child node of body one by one.
$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p>
<img src="http://google.com/img.jpeg" alt="">
<p>...to be one of those crowning achievements...</p>');
// manipulation goes here
// Let's traverse the body and output every child node
$bodyNode = $doc->getElementsByTagName('body')->item(0);
foreach ($bodyNode->childNodes as $childNode) {
echo $doc->saveHTML($childNode);
}
This might not be the most elegant solution, but it works. Alternatively, we can wrap all child nodes inside a container element (say, a div) and output only that container (but the container tag will be included in the output).
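The loop above can be folded into a small reusable function. This is only a sketch: innerHtml() is a hypothetical helper name, not part of the DOM API, and it assumes PHP 5.3.6 or later so that saveHTML() accepts a node argument.

```php
<?php
// Hypothetical helper: serialize only the children of a node (its "inner HTML").
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes just that subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p><img src="http://google.com/img.jpeg" alt=""><p>...to be one of those crowning achievements...</p>');

// The implied <body> is only used as a container; it is not part of the output.
$body = $doc->getElementsByTagName('body')->item(0);
$snippet = innerHtml($body);
echo $snippet;
```

Because nothing is removed from the tree, the document itself stays intact and can still be queried afterwards.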

Here's how I've done it:
-- Quick helper function that gives you HTML contents for specific DOM element
function nodeContent($n, $outer=false) {
$d = new DOMDocument('1.0');
$b = $d->importNode($n->cloneNode(true),true);
$d->appendChild($b); $h = $d->saveHTML();
// remove outer tags
if (!$outer) $h = substr($h,strpos($h,'>')+1,-(strlen($n->nodeName)+4));
return $h;
}
-- Find body node in your doc and get its contents
$query = $xpath->query("//body")->item(0);
if($query)
{
echo nodeContent($query);
}
UPDATE 1:
Some extra info: Since PHP/5.3.6, DOMDocument->saveHTML() accepts an optional DOMNode parameter similarly to DOMDocument->saveXML(). You can do
$xpath = new DOMXPath($doc);
$query = $xpath->query("//body")->item(0);
echo $doc->saveHTML($query);
On older versions, the helper function above does the same job.

tl;dr
requires: PHP 5.4.0 and Libxml 2.7.8 (LIBXML_HTML_NOIMPLIED needs Libxml >= 2.7.7, LIBXML_HTML_NODEFDTD needs Libxml >= 2.7.8)
$doc->loadHTML("<p>test</p>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
explanation
http://php.net/manual/en/domdocument.loadhtml.php
"Since PHP 5.4.0 and Libxml 2.6.0, you may also use the options parameter to specify additional Libxml parameters."
LIBXML_HTML_NOIMPLIED Sets HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements.
LIBXML_HTML_NODEFDTD Sets HTML_PARSE_NODEFDTD flag, which prevents a default doctype being added when one is not found.
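One caveat, as far as I know: with LIBXML_HTML_NOIMPLIED the parser may treat the first element as the document root, so a snippet with several top-level elements can come out nested in unexpected ways. A common workaround, sketched here under that assumption, is to wrap the snippet in a single container element (the div wrapper is my own addition, not something the flags require):

```php
<?php
$doc = new DOMDocument();

// A single wrapper element gives the parser exactly one top-level node.
$doc->loadHTML(
    '<div><p>test</p><p>more</p></div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// No doctype and no html/body wrappers should be added on output.
$out = $doc->saveHTML();
echo $out;
```

The wrapper itself is still part of the output; if that is unwanted, its children can be serialized one by one with saveHTML($child).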

You have 2 ways to accomplish this:
$content = substr($content, strpos($content, '<html><body>') + 12); // Remove Everything Before & Including The Opening HTML & Body Tags.
$content = substr($content, 0, -14); // Remove Everything After & Including The Closing HTML & Body Tags.
Or even better is this way:
$dom->normalizeDocument();
$content = $dom->saveHTML();

Related

PHP XML generation, chain `appendChild()`

I'm generating an XML file via PHP and I'm doing it this way:
$dom = new DOMDocument();
$root = $dom->createElement('Root');
...
// some node definitions here etc
$root->appendChild($product);
$root->appendChild($quantity);
$root->appendChild($measureUnit);
$root->appendChild($lineNumber);
...
$dom->appendChild($root);
$dom->save( '/some/dir/some-name.xml');
It all worked well until I got to a part where I needed to append, say, N child nodes. That meant calling appendChild() N times too, which resulted in a very long PHP script that is hard to maintain.
I know we can split the main script into smaller files for better maintenance, but is there a better way to 'chain' the appendChild() calls so it would save us a lot of written lines, or is there some magic function such as 'appendChildren' available?
This is my first time using the DOMDocument class; I hope someone can shed some light.
Thank you
You can nest the DOMDocument::createElement() into DOMNode::appendChild() calls and chain child nodes or text content assignments.
Since PHP 8.0 DOMNode::append() can be used to append multiple nodes and strings.
$document = new DOMDocument();
// nest createElement inside appendChild
$document->appendChild(
// store node in variable
$root = $document->createElement('root')
);
// chain textContent assignment to appendChild
$root
->appendChild($document->createElement('product'))
->textContent = 'Example';
// use append to add multiple nodes
$root->append(
$measureUnit = $document->createElement('measureUnit'),
$quantity = $document->createElement('quantity'),
);
$measureUnit->textContent = 'cm';
$quantity->textContent = '42';
$document->formatOutput = true;
echo $document->saveXML();
Output:
<?xml version="1.0"?>
<root>
<product>Example</product>
<measureUnit>cm</measureUnit>
<quantity>42</quantity>
</root>
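For PHP versions before 8.0, where DOMNode::append() is not available, a small variadic helper gives a similar effect. This is a sketch only: appendChildren() is a hypothetical function name taken from the question, not something the DOM extension provides.

```php
<?php
// Hypothetical helper, not part of the DOM API: append any number of nodes.
function appendChildren(DOMNode $parent, DOMNode ...$children): DOMNode
{
    foreach ($children as $child) {
        $parent->appendChild($child);
    }
    return $parent; // returned so that calls can be nested or chained
}

$document = new DOMDocument();
$root = $document->appendChild($document->createElement('root'));

// createElement() accepts the text content as its second argument.
appendChildren(
    $root,
    $document->createElement('product', 'Example'),
    $document->createElement('measureUnit', 'cm'),
    $document->createElement('quantity', '42')
);

$xml = $document->saveXML($root);
echo $xml;
```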
I am using an interface for reusable and maintainable parts, usually:
interface XMLAppendable {
public function appendTo(DOMElement $parent): void;
}
class YourXMLPart implements XMLAppendable {
private $_product;
private $_unit;
private $_quantity;
public function __construct(string $product, string $unit, int $quantity) {
$this->_product = $product;
$this->_unit = $unit;
$this->_quantity = $quantity;
}
public function appendTo(DOMElement $parent): void {
$document = $parent->ownerDocument;
$parent
->appendChild($document->createElement('product'))
->textContent = $this->_product;
$parent
->appendChild($document->createElement('measureUnit'))
->textContent = $this->_unit;
$parent
->appendChild($document->createElement('quantity'))
->textContent = (string) $this->_quantity;
}
}
$document = new DOMDocument();
// nest createElement inside appendChild
$document->appendChild(
// store node in variable
$root = $document->createElement('root')
);
$part = new YourXMLPart('Example', 'cm', 42);
$part->appendTo($root);
$document->formatOutput= true;
echo $document->saveXML();

PHP DOM: How do I replace the DocumentType node in my XML document?

I'm writing a script to clean up the so-called HTML document that MS Word creates when you Save As "Web Page, Filtered". I want the resulting document to be valid XHTML1.
The first thing I want to do is to change the !DOCTYPE so it will be XHTML 1.0 Strict instead of ...4.0 Transitional.
I wrote code that looked as if it should work, but when I run it I get a Segmentation fault from PHP. At first, I thought this was occurring in the save function, but after adding some echo statements for debugging I now think that the problem is at the places marked {{{1}}} and {{{2}}} in the code (below).
Here's what I think is going on: at {{{1}}} I am iterating through the DOMNodeList, treating it as if it were an ordinary array that I can traverse with foreach.
But at {{{2}}} I change the parent's subnode list. I suspect this breaks my foreach: either the DOMNodeList or my foreach pointer becomes invalid.
So what is the "right" way to make changes to a DOM tree while you're traversing it? I came up with two possible options:
Copy the DOMNodeList into an ordinary array:
$nodelist = [];
foreach ($node->childNodes as $subnode) {
$nodelist[] = $subnode;
// Or perhaps an object that contains the appropriate code and parameters for the change I want to make
}
foreach ($nodelist as $subnode) {
// make the appropriate change
}
Traverse the DOM tree, but do not make any changes. Instead, create an array of all the places where I want to make changes. When finished, go through that array and make the changes.
Maybe there's some "official" way of doing this?
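For reference, the snapshot idea from the first option can be written compactly with iterator_to_array(). This is only a sketch on a made-up document (the element names keep and drop are illustrative); the point is that the loop runs over a plain array copy, so changing the tree cannot disturb the iteration.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root><keep/><drop/><keep/><drop/></root>');

// Snapshot the live DOMNodeList into a plain array before changing the tree.
$children = iterator_to_array($dom->documentElement->childNodes);

foreach ($children as $child) {
    if ($child->nodeName === 'drop') {
        // Safe: we iterate over the array copy, not over the live list.
        $child->parentNode->removeChild($child);
    }
}

$result = $dom->saveXML($dom->documentElement);
echo $result;
```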
The relevant parts of my code below:
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile($htmFName);
$trav = new DOMTraverser($dom);
$storyParms = new StoryParams("some string");
$callback = new StoryDocCallback($htmFName);
$trav->traverse($callback, $storyParms);
$dom->save("y");
class DOMTraverser
{
private $docNode;
private $callback;
private $param;
public function __construct(DOMNode $node)
{
$this->docNode = $node;
}
public function traverse(GeneralCallBack $cb, $param)
{
$this->callback = $cb;
$this->param = $param;
$this->traverseNode($this->docNode);
}
public function traverseNode($node)
{
$this->callback->callBefore($node, $this->param);
if ($node->hasChildNodes()) {
foreach ($node->childNodes as $subnode) { // {{{1}}}
if($subnode != null) {
$this->traverseNode($subnode);
}
}
}
}
}
class StoryDocCallback implements GeneralCallBack
{
public function callBefore($node, $param)
{
$name = $node->nodeName;
if (is_a($node, "DOMDocumentType")) {
$this->repairDocType($node);
return;
}
...
}
protected function repairDocType(DOMNode $node)
{
$impl = new DOMImplementation();
$rootName = "html";
$pubID = "-//W3C//DTD XHTML 1.0 Strict//EN";
$sysID = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd";
$newDocType = $impl->createDocumentType($rootName, $pubID, $sysID);
$parent = $node->parentNode;
$rc = $parent->replaceChild($newDocType, $node); // {{{2}}}
assert($rc != false);
}
...
}

How to remove unwanted HTML tags from user input but keep text inside the tags in PHP using DOMDocument

I have around ~2 Million stored HTML pages in S3 that contain various HTML. I'm trying to extract only the content from those stored pages, but I wish to retain the HTML structure with certain constraints. This HTML is all user-supplied input and should be considered unsafe. So for display purposes, I want to retain only some of the HTML tags with a constraint on attributes and attribute values, but still retain all of the properly encoded text content inside even disallowed tags.
For example, I'd like to allow only specific tags like <p>, <h1>, <h2>, <h3>, <ul>, <ol>, <li>, etc.. But I also want to keep whatever text is found between disallowed tags and maintain its structure. I also want to be able to restrict attributes in each tag or force certain attributes to be applied to specific tags.
For example, in the following HTML...
<div id="content">
Some text...
<p class="someclass">Hello <span style="color: purple;">PHP</span>!</p>
</div>
I'd like the result to be...
Some text...
<p>Hello PHP!</p>
Thus stripping out the unwanted <div> and <span> tags, the unwanted attributes of all tags, and still maintaining the text inside <div> and <span>.
Simply using strip_tags() won't work here. So I tried doing the following with DOMDocument.
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach($dom->childNodes as $node) {
if ($node->nodeName != "p") { // only allow paragraph tags
$text = $node->nodeValue;
$node->parentNode->nodeValue .= $text;
$node->parentNode->removeChild($node);
}
}
echo $dom->saveHTML();
Which would work on simple cases where there aren't nested tags, but obviously fails when the HTML is complex.
I can't exactly call this function recursively on each of the node's child nodes because if I delete the node I lose all further nested children. Even if I defer node deletion until after the recursion the order of text insertion becomes tricky. Because I try to go deep and return all valid nodes then start concatenating the values of the invalid child nodes together and the result is really messy.
For example, let's say I want to allow <p> and <em> in the following HTML
<p>Hello <strong>there <em>PHP</em>!</strong></p>
But I don't want to allow <strong>. If the <strong> has nested <em> my approach gets really confusing. Because I'd get something like ...
<p>Hello there !<em>PHP</em></p>
Which is obviously wrong. I realized getting the entire nodeValue is a bad way of doing this. So instead I started digging into other ways to go through the entire tree one node at a time. Just finding it very difficult to generalize this solution so that it works sanely every time.
Update
A solution to use strip_tags() or the answer provided here isn't helpful to my use case, because the former does not allow me to control the attributes and the latter removes any tag that has attributes. I don't want to remove any tag that has an attribute. I want to explicitly allow certain tags but still have extensible control over what attributes can be kept/modified in the HTML.
It seems this problem needs to be broken down into two smaller steps in order to generalize the solution.
First, Walking the DOM Tree
In order to get to a working solution I found I need to have a sensible way to traverse every node in the DOM tree and inspect it in order to determine if it should be kept as-is or modified.
So I wrote the following method as a simple generator in a class extending DOMDocument.
class HTMLFixer extends DOMDocument {
public function walk(DOMNode $node, $skipParent = false) {
if (!$skipParent) {
yield $node;
}
if ($node->hasChildNodes()) {
foreach ($node->childNodes as $n) {
yield from $this->walk($n);
}
}
}
}
This way doing something like foreach($dom->walk($dom) as $node) gives me a simple loop to traverse the entire tree. Of course this is a PHP 7 only solution because of the yield from syntax, but I'm OK with that.
Second, Removing Tags but Keeping their Text
The tricky part was figuring out how to keep the text and not the tag while making modifications inside the loop. So after struggling with a few different approaches I found the simplest way was to build a list of tags to be removed from inside the loop and then remove them later using DOMNode::insertBefore() to append the text nodes up the tree. That way removing those nodes later has no side effects.
So I added another generalized stripTags method to this child class for DOMDocument.
public function stripTags(DOMNode $node) {
$change = $remove = [];
/* Walk the entire tree to build a list of things that need to be removed */
foreach($this->walk($node) as $n) {
if ($n instanceof DOMText || $n instanceof DOMDocument) {
continue;
}
$this->stripAttributes($n); // strips all node attributes not allowed
$this->forceAttributes($n); // forces any required attributes
if (!in_array($n->nodeName, $this->allowedTags, true)) {
// track the disallowed node for removal
$remove[] = $n;
// we take all of its child nodes for modification later
foreach($n->childNodes as $child) {
$change[] = [$child, $n];
}
}
}
/* Go through the list of changes first so we don't break the
referential integrity of the tree */
foreach($change as list($a, $b)) {
$b->parentNode->insertBefore($a, $b);
}
/* Now we can safely remove the old nodes */
foreach($remove as $a) {
if ($a->parentNode) {
$a->parentNode->removeChild($a);
}
}
}
The trick here is that insertBefore() moves the child nodes (e.g. text nodes) of a disallowed tag up into its parent; it moves them rather than copying them, so doing this while still walking the tree could easily break it. This confused me a lot at first, but looking at the way the method works, it makes sense. Deferring the move makes sure we don't break the parentNode reference when, for example, a deeper node is allowed but its parent is not in the allowed tags list.
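For a single node, the same "remove the tag, keep its contents" step can be sketched in isolation. The function name unwrapNode() is hypothetical, and the sketch assumes PHP 7.1+ (for the void return type) and the LIBXML_HTML_* flags used earlier.

```php
<?php
// Hypothetical helper: remove an element but keep its children in place.
function unwrapNode(DOMNode $node): void
{
    $parent = $node->parentNode;
    // insertBefore() moves each child (it does not copy it), so the loop
    // ends once the node has no children left.
    while ($node->firstChild !== null) {
        $parent->insertBefore($node->firstChild, $node);
    }
    $parent->removeChild($node);
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<p>Hello <strong>there <em>PHP</em>!</strong></p>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// Drop the <strong> wrapper; its text and the nested <em> stay in order.
unwrapNode($dom->getElementsByTagName('strong')->item(0));

$result = $dom->saveHTML($dom->documentElement);
echo $result;
```

Since each child is inserted directly before the node being removed, document order is preserved, which avoids the misplaced-text problem described in the question.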
Complete Solution
Here's the complete solution I came up with to more generally solve this problem. I'll include in my answer since I struggled to find a lot of the edge cases in doing this with DOMDocument elsewhere. It allows you to specify which tags to allow, and all other tags are removed. It also allows you to specify which attributes are allowed and all other attributes can be removed (even forcing certain attributes on certain tags).
class HTMLFixer extends DOMDocument {
protected static $defaultAllowedTags = [
'p',
'h1',
'h2',
'h3',
'h4',
'h5',
'h6',
'pre',
'code',
'blockquote',
'q',
'strong',
'em',
'del',
'img',
'a',
'table',
'thead',
'tbody',
'tfoot',
'tr',
'th',
'td',
'ul',
'ol',
'li',
];
protected static $defaultAllowedAttributes = [
'a' => ['href'],
'img' => ['src'],
'pre' => ['class'],
];
protected static $defaultForceAttributes = [
'a' => ['target' => '_blank'],
];
protected $allowedTags = [];
protected $allowedAttributes = [];
protected $forceAttributes = [];
public function __construct($version = null, $encoding = null, $allowedTags = [],
$allowedAttributes = [], $forceAttributes = []) {
$this->setAllowedTags($allowedTags ?: static::$defaultAllowedTags);
$this->setAllowedAttributes($allowedAttributes ?: static::$defaultAllowedAttributes);
$this->setForceAttributes($forceAttributes ?: static::$defaultForceAttributes);
// Fall back to DOMDocument's own defaults when null is passed.
parent::__construct($version ?? '1.0', $encoding ?? '');
}
public function setAllowedTags(Array $tags) {
$this->allowedTags = $tags;
}
public function setAllowedAttributes(Array $attributes) {
$this->allowedAttributes = $attributes;
}
public function setForceAttributes(Array $attributes) {
$this->forceAttributes = $attributes;
}
public function getAllowedTags() {
return $this->allowedTags;
}
public function getAllowedAttributes() {
return $this->allowedAttributes;
}
public function getForceAttributes() {
return $this->forceAttributes;
}
public function saveHTML(DOMNode $node = null) {
if (!$node) {
$node = $this;
}
$this->stripTags($node);
return parent::saveHTML($node);
}
protected function stripTags(DOMNode $node) {
$change = $remove = [];
foreach($this->walk($node) as $n) {
if ($n instanceof DOMText || $n instanceof DOMDocument) {
continue;
}
$this->stripAttributes($n);
$this->forceAttributes($n);
if (!in_array($n->nodeName, $this->allowedTags, true)) {
$remove[] = $n;
foreach($n->childNodes as $child) {
$change[] = [$child, $n];
}
}
}
foreach($change as list($a, $b)) {
$b->parentNode->insertBefore($a, $b);
}
foreach($remove as $a) {
if ($a->parentNode) {
$a->parentNode->removeChild($a);
}
}
}
protected function stripAttributes(DOMNode $node) {
$attributes = $node->attributes;
$len = $attributes->length;
for ($i = $len - 1; $i >= 0; $i--) {
$attr = $attributes->item($i);
if (!isset($this->allowedAttributes[$node->nodeName]) ||
!in_array($attr->name, $this->allowedAttributes[$node->nodeName], true)) {
$node->removeAttributeNode($attr);
}
}
}
protected function forceAttributes(DOMNode $node) {
if (isset($this->forceAttributes[$node->nodeName])) {
foreach ($this->forceAttributes[$node->nodeName] as $attribute => $value) {
$node->setAttribute($attribute, $value);
}
}
}
protected function walk(DOMNode $node, $skipParent = false) {
if (!$skipParent) {
yield $node;
}
if ($node->hasChildNodes()) {
foreach ($node->childNodes as $n) {
yield from $this->walk($n);
}
}
}
}
So if we have the following HTML
<div id="content">
Some text...
<p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
</div>
And we only want to allow <p>, and <em>.
$html = <<<'HTML'
<div id="content">
Some text...
<p class="someclass">Hello <span style="color: purple;">P<em>H</em>P</span>!</p>
</div>
HTML;
$dom = new HTMLFixer(null, null, ['p', 'em']);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
echo $dom->saveHTML($dom);
We'd get something like this...
Some text...
<p>Hello P<em>H</em>P!</p>
Since you can limit this to a specific subtree in the DOM as well the solution could be generalized even more.
You can use strip_tags() like this:
$html = '<div id="content">
Some text...
<p class="someclass">Hello <span style="color: purple;">PHP</span>!</p>
</div>';
$updatedHTML = strip_tags($html, "<p><h1><h2><h3><ul><ol><li>");
//in second parameter we need to provide which html tag we need to retain.
You can get more information here: http://php.net/manual/en/function.strip-tags.php

DOMDocument::loadXML() for parts of XML

I have simple XML templates like these:
structure.xml :
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<document>
<book>first book</book>
<book>second book</book>
((other_books))
</document>
book_element.xml :
<book>((name))</book>
And this test:
<?php
Header("Content-type: text/xml; charset=UTF-8");
class XMLTemplate extends DOMDocument
{
private $_content_storage;
private $_filepath;
private $_tags;
public function __construct( $sFilePath )
{
parent::__construct();
if( !file_exists( $sFilePath ) ) throw new Exception("file not found");
$this->_filepath = $sFilePath;
$this->_tags = [];
$this->_content_storage = file_get_contents( $this->_filepath );
}
public function Get()
{
$this->merge();
$this->loadXML( $this->_content_storage );
return $this->saveXML();
}
public function SetTag( $sTagName, $sReplacement )
{
$this->_tags[ $sTagName ] = $sReplacement;
}
private function merge()
{
foreach( $this->_tags as $k=>$v)
{
$this->_content_storage = preg_replace(
"/\({2}". $k ."\){2}/i",
$v,
$this->_content_storage
);
}
$this->_content_storage = preg_replace(
"/\({2}[a-z0-9_\-]+\){2}/i",
"",
$this->_content_storage
);
}
}
$aBooks = [
"troisième livre",
"quatrième livre"
];
$Books = "";
foreach( $aBooks as $bookName )
{
$XMLBook = new XMLTemplate("book_element.xml");
$XMLBook->SetTag( "name", $bookName );
$Books .= $XMLBook->Get();
}
$XMLTemplate = new XMLTemplate("structure.xml");
$XMLTemplate->SetTag("other_books", $Books);
echo $XMLTemplate->Get();
?>
This gives me the error:
Warning: DOMDocument::loadXML(): XML declaration allowed only at the start of the document in Entity, line: 5
I think this is because the loadXML() method automatically adds the declaration to the content, but I need to inject parts of XML into the final template as shown above. How can I disable this automatic declaration and use my own? Or is there another way to work around the problem?
If you dislike the error and you want to save the document you'd like to merge without the XML declaration, just save the document element instead of the whole document.
See both variants in the following example code:
$doc = new DOMDocument();
$doc->loadXML('<root><child/></root>');
echo "The whole doc:\n\n";
echo $doc->saveXML();
echo "\n\nThe root element only:\n\n";
echo $doc->saveXML($doc->documentElement);
The output is as follows:
The whole doc:
<?xml version="1.0"?>
<root><child/></root>
The root element only:
<root><child/></root>
This should already help. In addition, there is a libxml constant that is supposed to control whether the XML declaration is output, but I have never used it:
LIBXML_NOXMLDECL (integer)
Drop the XML declaration when saving a document
Note: Only available in Libxml >= 2.6.21
From: http://php.net/libxml.constants
See the link for additional options, you might want to use the one or the other in the future.
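To connect this back to the template problem, here is a rough sketch of injecting a part into a larger document without any string replacement. It assumes the part is well-formed XML on its own and uses DOMDocumentFragment::appendXML(); the book titles and variable names are placeholders of mine.

```php
<?php
$template = new DOMDocument();
$template->loadXML('<document><book>first book</book><book>second book</book></document>');

// Build the part in its own document and serialize only the root element,
// so no XML declaration ends up in the string.
$part = new DOMDocument();
$part->loadXML('<book>third book</book>');
$partXml = $part->saveXML($part->documentElement);

// appendXML() parses a markup string into a fragment of the target document.
$fragment = $template->createDocumentFragment();
$fragment->appendXML($partXml);
$template->documentElement->appendChild($fragment);

$result = $template->saveXML($template->documentElement);
echo $result;
```

Calling saveXML() without an argument on $template would output the whole document, with exactly one declaration at the top.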

removing all divs with a certain class

I have an HTML string, and I want to remove from it all DIVs whose class is "toremove".
This is trivial to do on the client side with jQuery etc., but I want to do it on the server side with PHP.
A simple regular expression won't work, because divs may be nested...
You could use the DOM object and xPath to remove the DIVs.
/** UNTESTED **/
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//div[@class='yourClasshere']");
foreach($elements as $e){
// removeChild() must be called on the node's actual parent
$e->parentNode->removeChild($e);
}
$doc->saveHTMLFile($file);
You can replace the load from file and save to file with load from and save to string if you prefer.
Here's a snippet of code I'm using to remove content from pages:
/**
* A method to remove unwanted parts of an HTML-page. Can remove elements by
* id, tag name and/or class names.
*
* @param string $html The HTML to manipulate
* @param array $toRemove An array of arrays, with the keys specifying
* what type of values the array holds. The following keys are used:
* 'elements' - An array of element ids to remove from the html
* 'tags' - An array of tag names to remove from the html
* 'classNames' - An array of class names. Each tag that contains one of the
* class names will be removed from the html.
*
* Also, note that descendants of the removed elements will also be removed.
*
* @return string The manipulated HTML content
*
* @example removeHtmlParts($html, array (
* 'elements' => array ('headerSection', 'nav', 'footerSection'),
* 'tags' => array ('form'),
* 'classNames' => array ('promotion')
* ));
*/
public function removeHtmlParts ($html, array $toRemove = array())
{
$document = new \DOMDocument('1.0', 'UTF-8');
$document->encoding = 'UTF-8';
// Hack to force DOMDocument to load the HTML using UTF-8.
@$document->loadHTML('<?xml encoding="UTF-8">' . $html);
$partsToRemove = array ();
if(isset($toRemove['elements']))
{
$partsToRemove['elements'] = $toRemove['elements'];
}
if(isset($toRemove['tags']))
{
$partsToRemove['tags'] = $toRemove['tags'];
}
if(isset($toRemove['classNames']))
{
$partsToRemove['classNames'] = $toRemove['classNames'];
}
foreach ($partsToRemove as $type => $content)
{
if($type == 'elements')
{
foreach ($content as $elementId)
{
$element = $document->getElementById($elementId);
if($element)
{
$element->parentNode->removeChild($element);
}
}
}
elseif($type == 'tags')
{
foreach($content as $tagName)
{
$tags = $document->getElementsByTagName($tagName);
while($tags->length)
{
$tag = $tags->item(0);
if($tag)
{
$tag->parentNode->removeChild($tag);
}
}
}
}
elseif($type == 'classNames')
{
foreach ($content as $className)
{
$xpath = new \DOMXPath($document);
$xpathExpression = sprintf(
'//*[contains(@class,"%s")]',
$className
);
$domNodeList = $xpath->evaluate($xpathExpression);
for($i = 0; $i < $domNodeList->length; $i++)
{
$node = $domNodeList->item($i);
if($node && $node->parentNode)
{
$node->parentNode->removeChild($node);
}
}
}
}
}
return $document->saveHTML();
}
Note:
This code has not undergone proper unit testing and probably contains bugs in edge cases
This method should be refactored into a class, and the contents of the method split into separate methods to ease testing.
Based on the short answer of jebbench and the long answer of PatrikAkerstrand, I created a medium function that exactly solves my problem:
/**
* remove, from the given xhtml string, all divs with the given class.
*/
function remove_divs_with_class($xhtml, $class) {
$doc = new DOMDocument();
// Hack to force DOMDocument to load the HTML using UTF-8:
$doc->loadHTML('<?xml encoding="UTF-8">'.$xhtml);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[contains(@class,'$class')]");
foreach ($elements as $element)
$element->parentNode->removeChild($element);
return $doc->saveHTML();
}
/* UNIT TEST */
if (basename(__FILE__)==basename($_SERVER['PHP_SELF'])) {
$xhtml = "<div class='near future'>near future</div><div>start</div><div class='future'>future research</div><div class='summary'>summary</div><div class='a future b'>far future</div>";
$xhtml2 = remove_divs_with_class($xhtml, "future");
print "<h2>before</h2>$xhtml<h2>after</h2>$xhtml2";
}
/* OUTPUT:
before
near future
start
future research
summary
far future
after
start
summary
*/
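Note that contains(@class, ...) is a substring test on any element, so it would also match a class such as "futures" and elements other than div. A stricter variant, sketched here as an assumption about what is wanted, pads the class list with spaces so that only whole class tokens on div elements match. The function name is hypothetical and $class is assumed to contain no quote characters.

```php
<?php
// Hypothetical variant: remove only <div> elements that carry $class
// as a complete, space-separated class token.
function removeDivsWithClassToken(string $html, string $class): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Pad the normalized class list with spaces so only whole tokens match.
    $query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";

    // The XPath result is a static list, so removing nodes while looping is fine.
    foreach ($xpath->query($query) as $div) {
        $div->parentNode->removeChild($div);
    }
    return $doc->saveHTML();
}

$out = removeDivsWithClassToken(
    '<div class="future">a</div><div class="futures">b</div>'
    . '<div class="near future">c</div><p class="future">d</p>',
    'future'
);
echo $out;
```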
Never, ever try and use regex to parse XML/HTML. Instead use a parsing library. Apparently, one for PHP is http://sourceforge.net/projects/simplehtmldom/files/
