I am in the middle of a process that should extract something from an HTML page. I am fairly new to DOMDocument in PHP, but I got this together from some tutorials and Stack Overflow.
Unfortunately, I need to know what element I am currently getting in the foreach loop below. As far as I know, the getName() function has something to do with XML, because it gives an Undefined Function Fatal error. Do you guys know any way to do this?
$rawdom = new DOMDocument();
$rawdom->loadHTML($page);
$finder = new DomXPath($rawdom);
$nodes = $finder->query("//dl[contains(@class, 'layout__definitionlist')]");
$tmp_dom = new DOMDocument();
foreach ($nodes as $node) {
echo $node->getName();
$tmp_dom->appendChild($tmp_dom->importNode($node,true));
}
$innerHTML = $tmp_dom->saveHTML();
echo $innerHTML;
With DOMElement objects, the element name is not available through a getName() method but through the tagName property:
echo $node->tagName;
getName() is only available with SimpleXMLElement, which is another XML/XPath API for PHP.
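As a minimal sketch of the point above (the sample markup in $page is made up for illustration), the loop below reads the name of each child element. It uses nodeName, which exists on every DOMNode, whereas tagName exists only on DOMElement:

```php
<?php
// Hypothetical sample input; any HTML containing the <dl> would do.
$page = '<dl class="layout__definitionlist"><dt>Term</dt><dd>Definition</dd></dl>';

$dom = new DOMDocument();
$dom->loadHTML($page);

$xpath = new DOMXPath($dom);
$names = [];
foreach ($xpath->query("//dl[contains(@class, 'layout__definitionlist')]/*") as $node) {
    // For elements, nodeName and tagName hold the same value.
    $names[] = $node->nodeName;
}

echo implode(', ', $names), "\n"; // prints "dt, dd"
```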
Related
How to edit html of elements? I tried this, but I get this error:
Fatal error: Uncaught InvalidArgumentException: Attaching DOM nodes
from multiple documents in the same crawler is forbidden.
$crawler = new Crawler('<h1>The title</h1>');
$crawler
->filter('h1,h2,h3,h4,h5,h6')
->each(function (Crawler $crawler, $i) use (&$replace) {
$crawler->html('<span>test</span>' . $crawler->html());
});
Use this:
$doc = new DOMDocument;
$doc->loadHTML($html);
$crawler = new Crawler($doc);
$crawler
->filter('h1,h2,h3,h4,h5,h6')
->each(function (Crawler $crawler) use ($doc) {
foreach ($crawler as $node) {
$span = $doc->createElement('span', 'test');
$node->parentNode->insertBefore($span, $node);
}
});
Important: create the new element with the same DOMDocument object that the Crawler was built from.
As explained in The DomCrawler Component docs:
An instance of the Crawler represents a set of DOMElement objects, which are nodes that can be traversed...
So you need to traverse the Crawler object before manipulating the DOMElement objects.
How to only change root's tag name of a DOM node?
In the DOM document model we cannot change the documentElement property of a DOMDocument object, so we need to "rebuild" the node. But how do we rebuild it from its childNodes property?
NOTE: I can do this by converting to a string with saveXML and cutting out the root with regular expressions, but that is a workaround, not a DOM solution.
Attempts that did not work (PHP examples)
PHP example (does not work, but why?):
Try-1
// DOMElement::documentElement can not be changed, so...
function DomElement_renameRoot1($ele,$ROOTAG='newRoot') {
if (gettype($ele)=='object' && $ele->nodeType==XML_ELEMENT_NODE) {
$doc = new DOMDocument();
$eaux = $doc->createElement($ROOTAG); // DOMElement
foreach ($ele->childNodes as $node)
if ($node->nodeType == 1) // DOMElement
$eaux->appendChild($node); // error!
elseif ($node->nodeType == 3) // DOMText
$eaux->appendChild($node); // error!
return $eaux;
} else
die("ERROR: invalid DOM object as input");
}
The appendChild($node) call causes an error:
Fatal error: Uncaught exception 'DOMException'
with message 'Wrong Document Error'
Try-2
From @can's suggestion (only a link) and my interpretation of the poor dom-domdocument-renamenode manual.
function DomElement_renameRoot2($ele,$ROOTAG='newRoot') {
$ele->ownerDocument->renameNode($ele,null,"h1");
return $ele;
}
The renameNode() method caused an error,
Warning: DOMDocument::renameNode(): Not yet implemented
Try-3
From PHP manual, comment 1.
function renameNode(DOMElement $node, $newName)
{
$newNode = $node->ownerDocument->createElement($newName);
foreach ($node->attributes as $attribute)
$newNode->setAttribute($attribute->nodeName, $attribute->nodeValue);
while ($node->firstChild)
$newNode->appendChild($node->firstChild); // changes firstChild to next!?
$node->ownerDocument->replaceChild($newNode, $node); // changes $node?
// not need return $newNode;
}
The replaceChild() method caused an error,
Fatal error: Uncaught exception 'DOMException' with message 'Not Found Error'
As this has not really been answered yet: the "Not Found Error" you get is caused by a small mistake in the renameNode() function you copied.
In a somewhat related question about renaming different elements in the DOM I ran into this problem as well, and in my answer I used an adaptation of that function that does not have this error:
/**
* Renames a node in a DOM Document.
*
* @param DOMElement $node
* @param string $name
*
* @return DOMNode
*/
function dom_rename_element(DOMElement $node, $name) {
$renamed = $node->ownerDocument->createElement($name);
foreach ($node->attributes as $attribute) {
$renamed->setAttribute($attribute->nodeName, $attribute->nodeValue);
}
while ($node->firstChild) {
$renamed->appendChild($node->firstChild);
}
return $node->parentNode->replaceChild($renamed, $node);
}
You might have spotted it in the last line of the function body: it uses ->parentNode instead of ->ownerDocument. Because $node was not a direct child of the document, you got the error, and it was wrong to assume that it would be. Instead, ask the parent node to replace its child ;)
This has not been pointed out in the PHP manual user notes so far. However, if you follow the link to the blog post that originally suggested the renameNode() function, you will find a comment below it offering this solution as well.
Anyway, my variant here uses slightly different variable names and is more explicit about the types. Like the example in the PHP manual, it lacks a variant that deals with namespaced nodes. I have not yet decided what would be best: creating an additional function for that, taking over the namespace from the node being renamed, or changing the namespace explicitly in a different function.
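To illustrate the difference between the two calls, here is a small sketch with made-up element names (root, item, entry). It first tries the replacement on the document node, which should raise a DOMException because item is a child of root and not of the document, and then repeats it on the real parent:

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<root><item id="1">text</item></root>');

$item    = $doc->getElementsByTagName('item')->item(0);
$renamed = $doc->createElement('entry');

// Copy the attributes and move the children, as in the function above.
foreach ($item->attributes as $attribute) {
    $renamed->setAttribute($attribute->nodeName, $attribute->nodeValue);
}
while ($item->firstChild) {
    $renamed->appendChild($item->firstChild);
}

try {
    // Wrong parent: <item> is not a direct child of the document node.
    $doc->replaceChild($renamed, $item);
    $result = 'replaced';
} catch (DOMException $e) {
    $result = $e->getMessage();
}
echo $result, "\n";

// Right parent: ask the node that actually contains <item>.
$item->parentNode->replaceChild($renamed, $item);
echo $doc->saveXML($doc->documentElement), "\n";
```

The same parentNode call also covers the root element itself, because the parent of the root element is the document node.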
First, you need to understand that the DOMDocument is only the hierarchical root of the document tree. Its name is always #document. You want to rename the root element, which is $document->documentElement.
If you want to copy nodes from one document to another, you'll need to use the importNode() function: $document->importNode($nodeInAnotherDocument)
Edit:
renameNode() is not implemented yet, so you should make another root, and simply replace it with the old one. If you use DOMDocument->createElement() you don't need to use importNode() on it later.
$oldRoot = $doc->documentElement;
$newRoot = $doc->createElement('new-root');
foreach ($oldRoot->attributes as $attr) {
$newRoot->setAttribute($attr->nodeName, $attr->nodeValue);
}
while ($oldRoot->firstChild) {
$newRoot->appendChild($oldRoot->firstChild);
}
$doc->replaceChild($newRoot, $oldRoot);
This is a variation of my "Try-3" (see question), and it works fine!
function xml_renameNode(DOMElement $node, $newName, $cpAttr=true) {
$newNode = $node->ownerDocument->createElement($newName);
if ($cpAttr && is_array($cpAttr)) {
foreach ($cpAttr as $k=>$v)
$newNode->setAttribute($k, $v);
} elseif ($cpAttr)
foreach ($node->attributes as $attribute)
$newNode->setAttribute($attribute->nodeName, $attribute->nodeValue);
while ($node->firstChild)
$newNode->appendChild($node->firstChild);
return $newNode;
}
Of course, if you show how to use DOMDocument::renameNode (without errors!), the bounty goes to you!
ISTM in your approach you attempt to import nodes from another DOMDocument, so you need to use the importNode() method:
$d = new DOMDocument();
/* Make a `foo` element the root element of $d */
$root = $d->createElement("foo");
$d->appendChild($root);
/* Append a `bar` element as the child element of the root of $d */
$child = $d->createElement("bar");
$root->appendChild($child);
/* New document */
$d2 = new DOMDocument();
/* Make a `baz` element the root element of $d2 */
$root2 = $d2->createElement("baz");
$d2->appendChild($root2);
/*
* Import a clone of $child (from $d) into $d2,
* with its child nodes imported recursively
*/
$child2 = $d2->importNode($child, true);
/* Add the clone as the child node of the root of $d2 */
$root2->appendChild($child2);
However, it is far easier to append the child nodes to a new parent element (thereby moving them), and replace the old root with that parent element:
$d = new DOMDocument();
/* Make a `foo` element the root element of $d */
$root = $d->createElement("foo");
$d->appendChild($root);
/* Append a `bar` element as the child element of the root of $d */
$child = $d->createElement("bar");
$root->appendChild($child);
/* <?xml version="1.0"?>
<foo><bar/></foo> */
echo $d->saveXML();
$root2 = $d->createElement("baz");
/* Make the `bar` element the child element of `baz` */
$root2->appendChild($child);
/* Replace `foo` with `baz` */
$d->replaceChild($root2, $root);
/* <?xml version="1.0"?>
<baz><bar/></baz> */
echo $d->saveXML();
I hope I am not missing anything, but I had a similar problem and was able to solve it by using DOMDocument::replaceChild(...).
/* @var $doc DOMDocument */
// createDocument() is an instance method, so create a DOMImplementation first
$doc = (new DOMImplementation())->createDocument(null, 'oldRoot');
/* @var $newRoot DOMElement */
$newRoot = $doc->createElement('newRoot');
/* all the code to create the elements under $newRoot */
$doc->replaceChild($newRoot, $doc->documentElement);
$doc->documentElement->isSameNode($newRoot) === true;
What threw me off initially was that $doc->documentElement is read-only, but the above worked and seems to be a much simpler solution IF $newRoot was created with the same DOMDocument; otherwise you'll need the importNode() solution described above. From your question it appears that $newRoot could be created from the same $doc.
Let us know if this worked out for you. Cheers.
EDIT: Noticed in version 20031129 that the DomDocument::$formatOutput, if set, does not format $newRoot output when you finally call $doc->saveXML()
I'm pretty new to PHP, DOM, and the PHP DOM implementation. What I'm trying to do is save the root element of the DOMDocument in a $_SESSION variable so I can access it and modify it on subsequent page loads.
But I get an error in PHP when using $_SESSION to save state of DOMElement:
Warning: DOMNode::appendChild() [domnode.appendchild]: Couldn't fetch DOMElement
I have read that a PHP DOMDocument object cannot be saved to $_SESSION natively. However it can be saved by saving the serialization of the DOMDocument (e.g. $_SESSION['dom'] = $dom->saveXML()).
I don't know if the same holds true for saving a DOMElement to a $_SESSION variable as well, but that's what I was trying. My reason for wanting to do this is to use an extended class of DOMElement with one additional property. I was hoping that by saving the root DOMElement in $_SESSION that I could later retrieve the element and modify this additional property and perform a test like, if (additionalProperty === false) { do something; }. I've also read that by saving a DOMDocument, and later retrieving it, all elements are returned as objects from native DOM classes. That is to say, even if I used an extended class to create elements, the property that I subsequently need will not be accessible, because the variable holding reference to the extended-class object has gone out of scope--which is why I'm trying this other thing. I tried using the extended class (not included below) first, but got errors...so I reverted to using a DOMElement object to see if that was the problem, but I'm still getting the same errors. Here's the code:
<?php
session_start();
$rootTag = 'root';
$doc = new DOMDocument;
if (!isset($_SESSION[$rootTag])) {
$_SESSION[$rootTag] = new DOMElement($rootTag);
}
$root = $doc->appendChild($_SESSION[$rootTag]);
//$root = $doc->appendChild($doc->importNode($_SESSION[$rootTag], true));
$child = new DOMElement('child_element');
$n = $root->appendChild($child);
$ct = 0;
foreach ($root->childNodes as $ch) echo '<br/>'.$ch->tagName.' '.++$ct;
$_SESSION[$rootTag] = $doc->documentElement;
?>
This code gives the following errors (depending on whether I use appendChild directly or the commented line of code using importNode):
Warning: DOMNode::appendChild() [domnode.appendchild]: Couldn't fetch DOMElement in C:\Program Files\wamp_server_2.2\www\test2.php on line 11
Warning: DOMDocument::importNode() [domdocument.importnode]: Couldn't fetch DOMElement in C:\Program Files\wamp_server_2.2\www\test2.php on line 12
I have several questions. First, what is causing this error and how do I fix it? Also, if what I'm trying to do isn't possible, then how can I accomplish my general objective of saving the 'state' of a DOM tree while using a custom property for each element? Note that the additional property is only used in the program and is not an attribute to be saved in the XML file. Also, I can't just save the DOM back to file each time, because the DOMDocument, after a modification, may not be valid according to a schema I'm using until later, when additional modifications/additions have been performed on the DOMDocument. That's why I need to save a temporarily invalid DOMDocument. Thanks for any advice!
EDITED:
After trying hakre's solution, the code worked. Then I moved on to trying to use an extended class of DOMElement, and, as I suspected, it did not work. Here's the new code:
<?php
session_start();
//$_SESSION = array();
$rootTag = 'root';
$doc = new DOMDocument;
if (!isset($_SESSION[$rootTag])) {
$root = new FreezableDOMElement($rootTag);
$doc->appendChild($root);
} else {
$doc->loadXML($_SESSION[$rootTag]);
$root = $doc->documentElement;
}
$child = new FreezableDOMElement('child_element');
$n = $root->appendChild($child);
$ct = 0;
foreach ($root->childNodes as $ch) {
$frozen = $ch->frozen ? 'is frozen' : 'is not frozen';
echo '<br/>'.$ch->tagName.' '.++$ct.': '.$frozen;
//echo '<br/>'.$ch->tagName.' '.++$ct;
}
$_SESSION[$rootTag] = $doc->saveXML();
/**********************************************************************************
* FreezableDOMElement class
*********************************************************************************/
class FreezableDOMElement extends DOMElement {
public $frozen; // boolean value
public function __construct($name) {
parent::__construct($name);
$this->frozen = false;
}
}
?>
It gives me the error Undefined property: DOMElement::$frozen. Like I mentioned in my original post, after saveXML and loadXML, an element originally instantiated with FreezableDOMElement is returning type DOMElement which is why the frozen property is not recognized. Is there any way around this?
You cannot store a DOMElement object inside $_SESSION. It will work at first, but on the next request it will be unset because it cannot be serialized.
The same is true for DOMDocument, as you write in your question.
Store it as XML instead or encapsulate the serialization mechanism.
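A minimal sketch of the "store it as XML" approach follows. The $session array stands in for $_SESSION so that the example runs from the command line, and the helper name loadDocument() and the key 'dom' are made up for illustration:

```php
<?php
// Stand-in for $_SESSION (assumption: a real script would call session_start() and use $_SESSION).
$session = [];

// Hypothetical helper: rebuild the document from the stored string, or start a new one.
function loadDocument(array $session): DOMDocument
{
    $doc = new DOMDocument();
    if (isset($session['dom'])) {
        $doc->loadXML($session['dom']);
    } else {
        $doc->appendChild($doc->createElement('root'));
    }
    return $doc;
}

// Simulate three page loads, each appending one child element.
for ($request = 1; $request <= 3; $request++) {
    $doc = loadDocument($session);
    $doc->documentElement->appendChild($doc->createElement('child_element'));
    $session['dom'] = $doc->saveXML(); // store a string, not an object
}

echo $doc->documentElement->childNodes->length, "\n"; // prints 3
```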
You are basically facing three problems here:
Serialize the DOMDocument (you do this too)
Serialize the FreezableDOMElement (you do this too)
Keep the private member FreezableDOMElement::$frozen with the document.
As written, serialization is not available out of the box. Additionally, DOMDocument does not persist your FreezableDOMElement even w/o serialization. The following example demonstrates that the instance is not automatically kept, the default value FALSE is returned (Demo):
class FreezableDOMElement extends DOMElement
{
private $frozen = FALSE;
public function getFrozen()
{
return $this->frozen;
}
public function setFrozen($frozen)
{
$this->frozen = (bool)$frozen;
}
}
class FreezableDOMDocument extends DOMDocument
{
public function __construct()
{
parent::__construct();
$this->registerNodeClass('DOMElement', 'FreezableDOMElement');
}
}
$doc = new FreezableDOMDocument();
$doc->loadXML('<root><child></child></root>');
# own objects do not persist
$doc->documentElement->setFrozen(TRUE);
printf("Element is frozen (should): %d\n", $doc->documentElement->getFrozen()); # it is not (0)
As PHP does not so far support setUserData (DOM Level 3), one way could be to store the additional information inside a namespaced attribute with the element. This can also be serialized by creating the XML string when serializing the object and loading it when unserializing (see Serializable). This then solves all three problems (Demo):
class FreezableDOMElement extends DOMElement
{
public function getFrozen()
{
return $this->getFrozenAttribute()->nodeValue === 'YES';
}
public function setFrozen($frozen)
{
$this->getFrozenAttribute()->nodeValue = $frozen ? 'YES' : 'NO';
}
private function getFrozenAttribute()
{
return $this->getSerializedAttribute('frozen');
}
protected function getSerializedAttribute($localName)
{
$namespaceURI = FreezableDOMDocument::NS_URI;
$prefix = FreezableDOMDocument::NS_PREFIX;
if ($this->hasAttributeNS($namespaceURI, $localName)) {
$attrib = $this->getAttributeNodeNS($namespaceURI, $localName);
} else {
$this->ownerDocument->documentElement->setAttributeNS('http://www.w3.org/2000/xmlns/', 'xmlns:' . $prefix, $namespaceURI);
$attrib = $this->ownerDocument->createAttributeNS($namespaceURI, $prefix . ':' . $localName);
$attrib = $this->appendChild($attrib);
}
return $attrib;
}
}
class FreezableDOMDocument extends DOMDocument implements Serializable
{
const NS_URI = '/frozen.org/freeze/2';
const NS_PREFIX = 'freeze';
public function __construct()
{
parent::__construct();
$this->registerNodeClasses();
}
private function registerNodeClasses()
{
$this->registerNodeClass('DOMElement', 'FreezableDOMElement');
}
/**
* @return DOMNodeList
*/
private function getNodes()
{
$xp = new DOMXPath($this);
return $xp->query('//*');
}
public function serialize()
{
return parent::saveXML();
}
public function unserialize($serialized)
{
parent::__construct();
$this->registerNodeClasses();
$this->loadXML($serialized);
}
public function saveBareXML()
{
$doc = new DOMDocument();
$doc->loadXML(parent::saveXML());
$xp = new DOMXPath($doc);
foreach ($xp->query('//@*[namespace-uri()=\'' . self::NS_URI . '\']') as $attr) {
/* @var $attr DOMAttr */
$attr->parentNode->removeAttributeNode($attr);
}
$doc->documentElement->removeAttributeNS(self::NS_URI, self::NS_PREFIX);
return $doc->saveXML();
}
public function saveXMLDirect()
{
return parent::saveXML();
}
}
$doc = new FreezableDOMDocument();
$doc->loadXML('<root><child></child></root>');
$doc->documentElement->setFrozen(TRUE);
$child = $doc->getElementsByTagName('child')->item(0);
$child->setFrozen(TRUE);
echo "Plain XML:\n", $doc->saveXML(), "\n";
echo "Bare XML:\n", $doc->saveBareXML(), "\n";
$serialized = serialize($doc);
echo "Serialized:\n", $serialized, "\n";
$newDoc = unserialize($serialized);
printf("Document Element is frozen (should be): %s\n", $newDoc->documentElement->getFrozen() ? 'YES' : 'NO');
printf("Child Element is frozen (should be): %s\n", $newDoc->getElementsByTagName('child')->item(0)->getFrozen() ? 'YES' : 'NO');
It's not really feature complete but a working demo. It's possible to obtain the full XML without the additional "freeze" data.
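On newer PHP versions the Serializable interface is deprecated (as of PHP 8.1) unless __serialize() and __unserialize() are implemented as well. The sketch below shows the same idea of serializing via the XML string with those two magic methods. It is not the code from the demo above: the wrapper class DomSnapshot is hypothetical, it uses composition rather than extending DOMDocument, and it keeps the flag in a plain, non-namespaced attribute to stay short. It assumes PHP 7.4 or later.

```php
<?php
// Hypothetical wrapper: holds a DOMDocument and serializes it as an XML string.
final class DomSnapshot
{
    private DOMDocument $doc;

    public function __construct(string $xml)
    {
        $this->doc = new DOMDocument();
        $this->doc->loadXML($xml);
    }

    public function document(): DOMDocument
    {
        return $this->doc;
    }

    // Called by serialize(): only the XML string is stored.
    public function __serialize(): array
    {
        return ['xml' => $this->doc->saveXML()];
    }

    // Called by unserialize(): rebuild the document from the stored string.
    public function __unserialize(array $data): void
    {
        $this->doc = new DOMDocument();
        $this->doc->loadXML($data['xml']);
    }
}

$snapshot = new DomSnapshot('<root><child/></root>');
// Simplification: a plain attribute instead of the namespaced one used above.
$snapshot->document()->documentElement->setAttribute('frozen', 'YES');

$restored = unserialize(serialize($snapshot));
echo $restored->document()->documentElement->getAttribute('frozen'), "\n"; // prints YES
```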
Is there any way I can insert an HTML template to existing DOMNode without content being encoded?
I have tried to do that with:
$dom->createElement('div', '<h1>Hello world</h1>');
$dom->createTextNode('<h1>Hello world</h1>');
The output is pretty much the same; the only difference is that the first call wraps it in a div.
I have tried to loadHTML from a string, but I have no idea how to append its body content to another DOMDocument.
In JavaScript, this process seems quite simple and obvious.
You can use
DOMDocumentFragment::appendXML — Append raw XML data
Example:
// just some setup
$dom = new DOMDocument;
$dom->loadXml('<html><body/></html>');
$body = $dom->documentElement->firstChild;
// this is the part you are looking for
$template = $dom->createDocumentFragment();
$template->appendXML('<h1>This is <em>my</em> template</h1>');
$body->appendChild($template);
// output
echo $dom->saveXml();
Output:
<?xml version="1.0"?>
<html><body><h1>This is <em>my</em> template</h1></body></html>
If you want to import from another DOMDocument, replace the three lines with
$tpl = new DOMDocument;
$tpl->loadXml('<h1>This is <em>my</em> template</h1>');
$body->appendChild($dom->importNode($tpl->documentElement, TRUE));
Using TRUE as the second argument to importNode will do a recursive import of the node tree.
If you need to import (malformed) HTML, change loadXml to loadHTML. This will trigger the HTML parser of libxml (what ext/DOM uses internally):
libxml_use_internal_errors(true);
$tpl = new DOMDocument;
$tpl->loadHtml('<h1>This is <em>malformed</em> template</h2>');
$body->appendChild($dom->importNode($tpl->documentElement, TRUE));
libxml_use_internal_errors(false);
Note that libxml will try to correct the markup, e.g. it will change the wrong closing </h2> to </h1>.
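If you want to see what the parser complained about, the collected messages can be read with libxml_get_errors() before clearing them. The following is a sketch under the same assumptions as the snippet above; which messages appear, and how many, depends on the libxml version, so no output is shown:

```php
<?php
// Collect parser messages instead of emitting warnings; remember the previous setting.
$previous = libxml_use_internal_errors(true);

$tpl = new DOMDocument();
$tpl->loadHTML('<h1>This is <em>malformed</em> template</h2>');

// Each entry is a LibXMLError object with message, line and column properties.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Print the repaired heading as the parser stored it.
$h1 = $tpl->getElementsByTagName('h1')->item(0);
echo $tpl->saveHTML($h1), "\n";
```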
It works with another DOMDocument for parsing the HTML code. But you need to import the nodes into the main document before you can use them in it:
$newDiv = $dom->createElement('div');
$tmpDoc = new DOMDocument();
$tmpDoc->loadHTML($str);
foreach ($tmpDoc->getElementsByTagName('body')->item(0)->childNodes as $node) {
$node = $dom->importNode($node, true);
$newDiv->appendChild($node);
}
And as a handy function:
function appendHTML(DOMNode $parent, $source) {
$tmpDoc = new DOMDocument();
$tmpDoc->loadHTML($source);
foreach ($tmpDoc->getElementsByTagName('body')->item(0)->childNodes as $node) {
$node = $parent->ownerDocument->importNode($node, true);
$parent->appendChild($node);
}
}
Then you can simply do this:
$elem = $dom->createElement('div');
appendHTML($elem, '<h1>Hello world</h1>');
I do not want to struggle with XML, because it throws errors more readily, and I am not a fan of prefixing calls with @ to suppress error output. loadHTML does the better job in my opinion, and it is as simple as this:
$doc = new DOMDocument();
$div = $doc->createElement('div');
// use a helper to load the HTML into a string
$helper = new DOMDocument();
$helper->loadHTML('This is my HTML Link.');
// now the magic!
// import the document node of the $helper object deeply (true)
// into the $div and append as child.
$div->appendChild($doc->importNode($helper->documentElement, true));
// add the div to the $doc
$doc->appendChild($div);
// final output
echo $doc->saveHTML();
Here is a simple example using DOMDocumentFragment:
$doc = new DOMDocument();
$doc->loadXML("<root/>");
$f = $doc->createDocumentFragment();
$f->appendXML("<foo>text</foo><bar>text2</bar>");
$doc->documentElement->appendChild($f);
echo $doc->saveXML();
Here is a helper function for replacing a DOMNode:
/**
* Helper function for replacing $node (DOMNode)
* with an XML code (string)
*
* @var DOMNode $node
* @var string $xml
*/
public function replaceNodeXML(&$node, $xml) {
$f = $this->dom->createDocumentFragment();
$f->appendXML($xml);
$node->parentNode->replaceChild($f,$node);
}
Source: Some old "PHP5 Dom Based Template" article.
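The helper above is a class method that relies on $this->dom. A standalone variant might look like the sketch below; the function name replaceNodeWithXml() and the sample element names are made up, and it takes the document from the node itself. It also checks the return value of appendXML(), which is false when the string is not well-formed XML:

```php
<?php
// Hypothetical standalone variant of the helper above.
function replaceNodeWithXml(DOMNode $node, string $xml): void
{
    $fragment = $node->ownerDocument->createDocumentFragment();
    if (!$fragment->appendXML($xml)) {
        throw new InvalidArgumentException('Not well-formed XML: ' . $xml);
    }
    // Replacing with a fragment inserts the fragment's children in place of $node.
    $node->parentNode->replaceChild($fragment, $node);
}

$doc = new DOMDocument();
$doc->loadXML('<root><placeholder/></root>');

$placeholder = $doc->getElementsByTagName('placeholder')->item(0);
replaceNodeWithXml($placeholder, '<foo>text</foo><bar>text2</bar>');

echo $doc->saveXML($doc->documentElement), "\n";
// prints <root><foo>text</foo><bar>text2</bar></root>
```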
And here is another suggestion, posted by Pian0_M4n, to use the value attribute as a workaround:
$dom = new DomDocument;
// main object
$object = $dom->createElement('div');
// html attribute
$attr = $dom->createAttribute('value');
// ugly html string
$attr->value = "<div> this is a really html string ©</div><i></i> with all the © that XML hates!";
$object->appendChild($attr);
// jquery fix (or javascript as well)
$('div').html($(this).attr('value')); // and it works!
$('div').removeAttr('value'); // to clean-up
Not ideal, but at least it works.
Gumbo's code works perfectly! Just one small enhancement: add the TRUE parameter so that it works with nested HTML snippets.
// before
$node = $parent->ownerDocument->importNode($node);
// after
$node = $parent->ownerDocument->importNode($node, TRUE);
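To see what the second argument changes, this small sketch (with made-up sample markup) imports the same element twice, once shallow and once deep, and prints both results:

```php
<?php
$src = new DOMDocument();
$src->loadXML('<h1>This is <em>my</em> template</h1>');

$dst = new DOMDocument();
$dst->loadXML('<body/>');

// Without the second argument only the element itself is copied.
$shallow = $dst->importNode($src->documentElement);
// With true, the element is copied together with all of its descendants.
$deep = $dst->importNode($src->documentElement, true);

echo $dst->saveXML($shallow), "\n"; // prints <h1/>
echo $dst->saveXML($deep), "\n";    // prints <h1>This is <em>my</em> template</h1>
```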
I've got my HTML inside of $html.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@id="header"]');
foreach($tags as $tag) {
var_dump($tag->nodeValue); // the innerHTML of that element
var_dump($tag); // object(DOMElement)#3 (0) { }
}
Is there a way to get that node, or remove it?
Basically, I'm parsing an existing website and need to remove elements from it. What method do I call to do that?
Thanks
Have you checked out DOMNode::removeChild ?
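A minimal sketch of how that could look, assuming sample markup in $html: removeChild() is called on the parent of the node you want to remove. The matches are copied into an array first so that removing nodes cannot disturb the iteration.

```php
<?php
// Hypothetical sample input standing in for the parsed website.
$html = '<html><body><div id="header">Header</div><p>Content</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
foreach (iterator_to_array($xpath->query('//div[@id="header"]')) as $tag) {
    // A node is removed through its parent.
    $tag->parentNode->removeChild($tag);
}

$body = $dom->getElementsByTagName('body')->item(0);
echo $dom->saveHTML($body), "\n"; // prints <body><p>Content</p></body>
```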