I've been messing around with DOMDocument lately, and I've noticed that in order to transfer elements from one document to the next, I have to call $DOMDocument->importNode() on the target DOMDocument.
However, I'm running into weird issues, where once the originating document is destroyed, the cloned element misbehaves.
For example, here's some lovely working code:
$dom1 = new DOMDocument;
$dom2 = new DOMDocument;
$dom2->loadHTML('<div id="div"><span class="inner"></span></div>');
$div = $dom2->getElementById('div');
$children = $dom1->importNode( $div, true )->childNodes;
echo $children->item(0)->tagName; // Output: "span"
Here's a demo: http://codepad.viper-7.com/pjd9Ty
The problem arises when I try using the elements after their original document is out of scope:
global $dom;
$dom = new DOMDocument;
function get_div_children () {
global $dom;
$local_dom = new DOMDocument;
$local_dom->loadHTML('<div id="div"><span class="inner"></span></div>');
$div = $local_dom->getElementById('div');
return $dom->importNode( $div, true )->childNodes;
}
echo get_div_children()->item(0)->tagName;
The above results in the following errors:
PHP Warning: Couldn't fetch DOMElement. Node no longer exists in ...
PHP Notice: Undefined property: DOMElement::$tagName in ...
Here's a demo: http://codepad.viper-7.com/c0kqOA
My question is twofold:
Shouldn't the returned elements exist even after the original document was destroyed, since they were cloned into the current document?
A workaround. For various reasons, I have to manipulate the elements after the original document is destroyed, but before I actually insert them into the DOM of the other DOMDocument. Is there any way to accomplish this?
Clarification: I understand that if the elements are inserted into the DOM, it behaves as expected. But, as outlined above, my setup calls for the elements to be manipulated before being inserted into the DOM (long story). Given that the first example here works - and that manipulating elements outside of the DOM is standard procedure in JavaScript - shouldn't this be possible here as well?
The cloned node holds a reference to $dom, but $dom holds no reference back to the node. PHP's internal garbage collector destroys such nodes when the calling context changes. The only way to create that back-reference is to attach the node to the document, e.g. $dom->documentElement->appendChild($node).
So, use code like this (the static keyword keeps the garbage collector from destroying your variable):
global $dom;
$dom = new DOMDocument;
function get_div_children () {
global $dom;
$local_dom = new DOMDocument;
$local_dom->loadHTML('<div id="div"><span class="inner"></span></div>');
$div = $local_dom->getElementById('div');
static $nodes;
$nodes = $dom->importNode( $div, true )->childNodes;
return $nodes;
}
echo get_div_children()->item(0)->tagName;
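A minimal sketch of another way to keep the imported nodes usable, assuming it is acceptable to return the imported element itself rather than its childNodes list. The helper name import_div is hypothetical and not part of the original code. Holding the imported element in a variable keeps the detached subtree referenced from PHP while you manipulate it:

```php
<?php
// Hypothetical helper: imports the div into $target and returns the imported element.
function import_div(DOMDocument $target): DOMNode
{
    $local_dom = new DOMDocument;
    $local_dom->loadHTML('<div id="div"><span class="inner"></span></div>');
    $div = $local_dom->getElementById('div');
    // Deep-copy the div into $target; the copy is not attached to $target's tree yet.
    return $target->importNode($div, true);
}

$dom = new DOMDocument;
$imported = import_div($dom);              // keep a reference to the imported parent
$first = $imported->childNodes->item(0);   // the <span> child of the imported copy
$first->setAttribute('class', 'changed');  // manipulate before insertion
echo $first->tagName, "\n";

// Insert into the target document later, once the manipulation is done.
$dom->appendChild($imported);
echo $dom->saveHTML();
```

Here $local_dom goes out of scope when import_div() returns, and the children are only reached through $imported afterwards.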
Related
I'd like to remove <font> tags from my html and am trying to use replaceChild to do so, but it doesn't seem to work properly. Can anyone catch what might be wrong?
$html = '<html><body><br><font class="heading2">Limited Size and Resources</font><p><br><strong>Q: When can a member use the limited size and resources exception?</strong></p></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$font_tags = $dom->GetElementsByTagName('font');
foreach($font_tags as $font_tag) {
foreach($font_tag as $child) {
$child->replaceChild($child->nodeValue, $font_tag);
}
}
echo $dom->saveHTML();
From what I understand, $font_tags is a DOMNodeList, so I need to iterate through it twice in order to use the DOMNode::replaceChild function. I then want to replace the current value with just the content inside of the tags. However, when I output the $html nothing changes. Any ideas what could be wrong?
Here is a PHP Sandbox to test the code.
I'll put my remarks inline
$html = '<html><body><br><font class="heading2">Limited Size and Resources</font><p><br><strong>Q: When can a member use the limited size and resources exception?</strong></p></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$font_tags = $dom->GetElementsByTagName('font');
/* You only need one loop, as it is iterating your collection
You would only need a second loop if each font tag had children of their own
*/
foreach($font_tags as $font_tag) {
/* replaceChild replaces children of the node being called
So, to replace the font tag, call the function on its parent
$prent will be that reference
*/
$prent = $font_tag->parentNode;
/* You can't insert arbitrary text, you have to create a textNode
That textNode must also be a member of your document
*/
$prent->replaceChild($dom->createTextNode($font_tag->nodeValue), $font_tag);
}
echo $dom->saveHTML();
Updated Sandbox: Hopefully I understood your requirements correctly
First, let us find out what wasn't working in your code.
foreach($font_tag as $child) wasn't even iterating once as $font_tag is a single 'font' tag element from font_tags array, and not an array itself.
$child->replaceChild($child->nodeValue, $font_tag); - A child node can't replace its parent ($font_tag), though the reverse is possible, because replaceChild is a method that a parent node uses to replace one of its children. Also, the first argument must be a DOMNode, not a string such as nodeValue.
For more details check the PHP: DOMNode::replaceChild documentation, or point 2 below my code.
echo $html will output the $html string, but not the updated $dom object that we are modifying.
This would work -
$html = '<html><body><br><font class="heading2">Limited Size and Resources</font><p><br><strong>Q: When can a member use the limited size and resources exception?</strong></p></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$font_tags = $dom->GetElementsByTagName('font');
foreach($font_tags as $font_tag)
{
$new_node = $dom->createTextNode($font_tag->nodeValue);
$font_tag->parentNode->replaceChild($new_node, $font_tag);
}
echo $dom->saveHTML();
I am creating a $new_node directly in the $dom, so the node is live in the DOMDocument and not any local variable.
To replace the child object $font_tag, we have to first traverse to the parent node using the parentNode method.
Finally, we are printing out the modified $dom using saveHTML method, which will convert the DOMDocument into a HTML String.
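Both answers replace each font tag with a plain text node, which drops any markup nested inside it. The following is a sketch of a variant that keeps nested markup by moving the children out before removing the tag. The sample HTML is made up for illustration. It also copies the node list into an array first, because the list returned by getElementsByTagName is live and changes while nodes are removed:

```php
<?php
// Sample input invented for this sketch; the first font tag contains nested markup.
$html = '<html><body><font class="heading2">Limited <b>Size</b></font>'
      . '<p><font>Resources</font></p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Take a snapshot: the DOMNodeList is live and would shrink during the loop.
$font_tags = iterator_to_array($dom->getElementsByTagName('font'));

foreach ($font_tags as $font_tag) {
    // Move every child (text or element) in front of the font tag, in order.
    while ($font_tag->firstChild) {
        $font_tag->parentNode->insertBefore($font_tag->firstChild, $font_tag);
    }
    // The font tag is now empty and can be removed.
    $font_tag->parentNode->removeChild($font_tag);
}

echo $dom->saveHTML();
```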
Remove a specific span tag from HTML while preserving/keeping the inside content using PHP and DOMDocument
<?php
$content = '<span style="font-family: helvetica; font-size: 12pt;"><div>asdf</div><span>TWO</span>Business owners are fearful of leading. They would rather follow the leader than embrace a bold move that challenges their confidence. </span>';
$dom = new DOMDocument();
// Use LIBXML for preventing output of doctype, <html>, and <body> tags
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//span[@style="font-family: helvetica; font-size: 12pt;"]') as $span) {
// Move all span tag content to its parent node just before it.
while ($span->hasChildNodes()) {
$child = $span->removeChild($span->firstChild);
$span->parentNode->insertBefore($child, $span);
}
// Remove the span tag.
$span->parentNode->removeChild($span);
}
// Get the final HTML with span tags stripped
$output = $dom->saveHTML();
print_r($output);
I am using the function below, but I am not sure whether it is always stable/secure. Is it?
When is it stable/secure to reuse parts of the DOMXpath preparation procedure, and for which parts?
To simplify the use of the XPath query() method, we can adopt a function that memoizes the last call using static variables:
function DOMXpath_reuser($file) {
static $doc=NULL;
static $docName='';
static $xp=NULL;
if (!$doc)
$doc = new DOMDocument();
if ($file!=$docName) {
$doc->loadHTMLFile($file);
$docName = $file; // remember the loaded file; otherwise it is reloaded on every call
$xp = NULL;
}
if (!$xp)
$xp = new DOMXpath($doc);
return $xp; // ??RETURNED VALUES ARE ALWAYS STABLE??
}
The present question is similar to this other one about XSLTProcessor reuse.
In both questions the problem can be generalized to any language or framework that uses LibXML2 as its DOMDocument implementation.
There is another related question: How to "refresh" DOMDocument instances of LibXML2?
Illustrating
Reuse is very common (examples):
$f = "my_XML_file.xml";
$elements = DOMXpath_reuser($f)->query("//*[@id]");
// use elements to get information
$elements = DOMXpath_reuser($f)->query("/html/body/div[1]");
// use elements to get information
But, if you do something like removeChild, replaceChild, etc. (example),
$div = DOMXpath_reuser($f)->query("/html/body/div[1]")->item(0); //STABLE
$div->parentNode->removeChild($div); // CHANGES DOM
$elements = DOMXpath_reuser($f)->query("//div[@id]"); // UNSTABLE!
strange things can occur, and the queries do not work as expected!
When does this happen (which DOMDocument methods affect XPath)?
Why can we not use something like normalizeDocument to "refresh" the DOM (does such a method exist)?
Is only "new DOMXpath($doc);" always safe? Do we need to reload $doc as well?
DOMXpath is affected by the load*() methods on DOMDocument. After loading a new xml or html, you need to recreate the DOMXpath instance:
$xml = '<xml/>';
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
var_dump($xpath->document === $dom); // bool(true)
$dom->loadXml($xml);
var_dump($xpath->document === $dom); // bool(false)
In DOMXpath_reuser() you store a static variable and recreate the xpath depending on the file name. If you want to reuse an Xpath object, I suggest extending DOMDocument. This way you only need to pass the $dom variable around. It would work with a stored xml file as well as with an xml string or a document you are creating.
The following class extends DOMDocument with a method xpath() that always returns a valid DOMXpath instance for it. It stores and registers the namespaces, too:
class MyDOMDocument
extends DOMDocument {
private $_xpath = NULL;
private $_namespaces = array();
public function xpath() {
// if the xpath instance is missing or not attached to the document
if (is_null($this->_xpath) || $this->_xpath->document != $this) {
// create a new one
$this->_xpath = new DOMXpath($this);
// and register the namespaces for it
foreach ($this->_namespaces as $prefix => $namespace) {
$this->_xpath->registerNamespace($prefix, $namespace);
}
}
return $this->_xpath;
}
public function registerNamespaces(array $namespaces) {
$this->_namespaces = array_merge($this->_namespaces, $namespaces);
if (isset($this->_xpath)) {
foreach ($namespaces as $prefix => $namespace) {
$this->_xpath->registerNamespace($prefix, $namespace);
}
}
}
}
$xml = <<<'ATOM'
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Test</title>
</feed>
ATOM;
$dom = new MyDOMDocument();
$dom->registerNamespaces(
array(
'atom' => 'http://www.w3.org/2005/Atom'
)
);
$dom->loadXml($xml);
// created, first access
var_dump($dom->xpath()->evaluate('string(/atom:feed/atom:title)', NULL, FALSE));
$dom->loadXml($xml);
// recreated, connection was lost
var_dump($dom->xpath()->evaluate('string(/atom:feed/atom:title)', NULL, FALSE));
The DOMXpath class (unlike XSLTProcessor in your other question) uses a reference to the DOMDocument object given in its constructor. DOMXpath creates a libxml context object based on the given DOMDocument and saves it in its internal class data. Besides the libxml context, it saves a reference to the original DOMDocument given in the constructor arguments.
What that means:
Part of sample from ThomasWeinert answer:
var_dump($xpath->document === $dom); // bool(true)
$dom->loadXml($xml);
var_dump($xpath->document === $dom); // bool(false)
gives false after the load because $dom already holds a pointer to the new libxml data, while DOMXpath holds the libxml context for $dom as it was before the load, and a pointer to the real document after the load.
Now, about how query works:
If it should return XPATH_NODESET (as in your case), it makes a node copy, iterating node by node through the detected node set (\ext\dom\xpath.c, from line 468). It is a copy, but with the original document node as parent. This means that you can modify the result, but doing so breaks the connection between your XPath and your DOMDocument.
XPath results provide a parentNode member that knows their origin:
for attribute values, parentNode returns the element that carries them. An example is //foo/@attribute, where the parent would be a foo Element.
for the text() function (as in //text()), it returns the element that contains the text or tail that was returned.
note that parentNode may not always return an element. For example, the XPath functions string() and concat() will construct strings that do not have an origin. For them, parentNode will return None.
So,
There is no reason to cache XPath. It does nothing besides xmlXPathNewContext (it just allocates a lightweight internal struct).
Each time you modify your DOMDocument (removeChild, replaceChild, etc.) you should recreate XPath.
We cannot use something like normalizeDocument to "refresh" the DOM because it changes the internal document structure and invalidates the xmlXPathNewContext created in the Xpath constructor.
Is only "new DOMXpath($doc);" always safe? Yes, if you do not change $doc between Xpath usages. Do you need to reload $doc as well? No, because that invalidates the previously created xmlXPathNewContext.
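The advice above can be sketched as follows: instead of caching the DOMXPath object, create a fresh one each time it is needed, including after the document has been modified. The helper name fresh_xpath and the sample XML are assumptions made for this illustration only:

```php
<?php
// Hypothetical helper: always builds a new DOMXPath bound to the current document state.
function fresh_xpath(DOMDocument $doc): DOMXPath
{
    return new DOMXPath($doc);
}

$doc = new DOMDocument();
// Sample document invented for this sketch.
$doc->loadXML('<root><div id="a"/><div id="b"/><div/></root>');

// First query: fetch the first div and remove it, which changes the DOM.
$div = fresh_xpath($doc)->query('/root/div[1]')->item(0);
$div->parentNode->removeChild($div);

// Second query: a new DOMXPath is created after the modification.
$elements = fresh_xpath($doc)->query('//div[@id]');
foreach ($elements as $element) {
    echo $element->getAttribute('id'), "\n";
}
```

Since creating the object is cheap according to the answer above, this trades a negligible allocation for not having to track which operations invalidate a cached instance.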
(this is not a real answer, but a consolidation of comments and answers posted here and in related questions)
This new version of the question's DOMXpath_reuser function incorporates @ThomasWeinert's suggestion (to guard against DOM changes caused by an external reload) and an option $enforceRefresh to work around the instability problem (as the related question shows, the programmer must detect when it is needed).
function DOMXpath_reuser_v2($file, $enforceRefresh=0) { //changed here
static $doc=NULL;
static $docName='';
static $xp=NULL;
if (!$doc)
$doc = new DOMDocument();
if ( $file!=$docName || ($xp && $doc !== $xp->document) ) { // changed here
$doc->load($file);
$docName = $file; // remember the loaded file so the next call can reuse it
$xp = NULL;
} elseif ($enforceRefresh==2) { // add this new refresh mode
$doc->loadXML($doc->saveXML());
$xp = NULL;
}
if (!$xp || $enforceRefresh==1) //changed here
$xp = new DOMXpath($doc);
return $xp;
}
When must $enforceRefresh=1 be used?
... perhaps an open problem; only a few tips and clues...
when the DOM has been subjected to setAttribute, removeChild, replaceChild, etc.
...? more cases?
When must $enforceRefresh=2 be used?
... perhaps an open problem; only a few tips and clues...
when the DOM has been subject to index inconsistencies, etc. See this question/solution.
...? more cases?
I have the following script snippet. Originally I did not realize that, to use getElementById, I needed to include createDocumentType, but now I get the error listed above. What am I doing wrong here? Thanks in advance!
...
$result = curl_exec($ch); //contains some webpage i am grabbing remotely
$dom = new DOMDocument();
$dom->createDocumentType('html', '-//W3C//DTD HTML 4.01 Transitional//EN', 'http://www.w3.org/TR/html4/loose.dtd');
$elements = $dom->loadHTML($result);
$e = $elements->getElementById('1');
...
Edit: Additional note, I verified the DOM is correct on the remote page.
DOMDocument does not have a method named createDocumentType, as you can see in the Manual. The method belongs to the DOMImplementation class. It is used like this (taken from the manual):
// Creates an instance of the DOMImplementation class
$imp = new DOMImplementation;
// Creates a DOMDocumentType instance
$dtd = $imp->createDocumentType('graph', '', 'graph.dtd');
// Creates a DOMDocument instance
$dom = $imp->createDocument("", "", $dtd);
Since you want to load HTML into the document, you don't need to specify a document type, since it is determined from the imported HTML. You just need some id attributes, or a DTD that identifies another attribute as an id. This is part of the HTML file, not of the parsing PHP code.
$dom = new DOMDocument();
$dom->loadHTML($result);
$element = $dom->getElementById('my_id');
will do the job.
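As a small self-contained check of the above, the sketch below loads an inline HTML string (a stand-in for the curl result, invented for this example) and looks the element up both with getElementById and with an equivalent XPath query on the id attribute. The XPath form is a common fallback when getElementById returns null:

```php
<?php
// Stand-in for the remotely fetched page; invented for this sketch.
$result = '<html><body><div id="my_id">Hello</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($result);

// Lookup via the id registered by the HTML parser.
$element = $dom->getElementById('my_id');

// Equivalent lookup via XPath, matching on the attribute value directly.
$xpath = new DOMXPath($dom);
$same = $xpath->query('//*[@id="my_id"]')->item(0);

echo $element->textContent, "\n";
var_dump($element->isSameNode($same));
```

Note that loadHTML() returns a boolean, so getElementById must be called on $dom, not on the return value of loadHTML().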
I'm using DOMDocument to retrieve a particular div from an HTML page.
I just want to retrieve the content of this div, without the div tag.
For example :
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML()
Here, I get this result:
<div id="inter">
//SOME THINGS IN MY DIV
</div>
And I just want to have:
//SOME THINGS IN MY DIV
Ideas ? Thanks !
I'm going to go with simple does it. You already have:
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML();
Now, DOMDocument::getElementById() returns one DOMElement, which extends DOMNode, which has the public string nodeValue. Since you don't specify if you are expecting anything but text within that div, I'm going to assume that you want anything that may be stored in there as plain text. For that, we are going to remove $dom->saveHTML(); and replace it with:
$divString = $main->nodeValue;
With that, $divString will contain //SOME THINGS IN MY DIV, which, from your example, is the desired output.
If, however, you want the HTML of the inside of it and not just a String representation - replace it with the following instead:
$divString = "";
foreach($main->childNodes as $c)
$divString .= $c->ownerDocument->saveXML($c);
What that does is takes advantage of the inherited DOMNode::childNodes which contains a DOMNodeList each containing its own DOMNode (for reference, see above), and we loop through each one getting the ownerDocument which is a DOMDocument and we call the DOMDocument::saveXML() function. The reason we pass the current $c node in to the function is to prevent an entire valid document from being outputted, and because the ownerDocument is what we are looping through - we need to get one child at a time, with no children left behind. (sorry, it's late, couldn't resist.)
Now, after either option, you can do with $divString what you will. I hope this has helped explain the process to you and hopefully you walk away with a better understanding of what is going on instead of rote copying of code just because it works. ^^
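For HTML input, a variant of the loop above can pass each child to DOMDocument::saveHTML() instead of saveXML(), so that the fragment is serialized with HTML rules rather than XML rules. This is only a sketch, and the sample markup is invented for illustration:

```php
<?php
// Sample content invented for this sketch.
$content = '<div id="inter"><p>Some <br>things</p> in my div</div>';

$dom = new DOMDocument;
$dom->loadHTML($content);
$main = $dom->getElementById('inter');

$divString = "";
foreach ($main->childNodes as $c) {
    // Passing a node limits the output to that node's subtree.
    $divString .= $dom->saveHTML($c);
}

echo $divString, "\n";
```

The wrapping div itself is never serialized, because only its children are passed to saveHTML().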
You can use my custom function to remove the extra div from the content:
$html_string = '<div id="inter">
SOME THINGS IN MY DIV
</div>';
// custom function
function DOMgetinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
Your code will look like this:
$dom = new DOMDocument;
$dom->loadHTML($html_string);
$divs = $dom->getElementsByTagName('div');
$innerHTML_contents = DOMgetinnerHTML($divs->item(0));
echo $innerHTML_contents;
and your output will be
SOME THINGS IN MY DIV
You can use XPath:
$xpath = new DOMXPath($dom);
// node() also matches text nodes; * would match only element children
foreach($xpath->query('//div[@id="inter"]/node()') as $node)
{
echo $node->nodeValue;
}
or you can simply edit your code, like this:
$main = $dom->getElementById('inter');
echo $main->nodeValue;
I am seeing strange behavior in my script that has me confused.
Script 1.
$dom = new DOMDocument();
$dom->loadHTMLFile("html/signinform.html");//loads file here
$form = $dom->getElementsByTagName("form")->item(0);
$div = $dom->createElement("div");
$dom->appendChild($div)->appendChild($form);
echo $dom->saveHTML();
Script 2.
$dom = new DOMDocument();
$div = $dom->createElement("div");
$dom->loadHTMLFile("html/signinform.html");//loads file here
$form = $dom->getElementsByTagName("form")->item(0);
$dom->appendChild($div)->appendChild($form);
echo $dom->saveHTML();
Script 1 works without problem. It shows the form. However Script 2 throws the following error: Fatal error: Uncaught exception 'DOMException' with message 'Wrong Document Error' in C:\Users
Could someone explain to me why the mere changing of position of the loadHTMLFile function results in such error? Thanks
You have added an element to the DOM (div) and then attempted to load a file to be parsed and its DOM structure used.
Load the file first if you intend to use one.
For DOM manipulation you do not need to insert an already existing element, so doing something like $dom->appendChild($form) only reinserts the same form element. When you pull an element using $dom->getElementsByTagName("form")->item(0), it becomes its own DOM object, which you can reference directly and append to. A proper example would be:
$dom = new DOMDocument();
$dom->loadHTMLFile("assets/dom_document-form.html");
$div = $dom->createElement("div");
$form = $dom->getElementsByTagName("form")->item(0);
$form->appendChild($div);
echo $dom->saveHTML();
One should load the document first, and then append directly to the object pulled from the DOM.
To help with your initial questions, too:
Append directly to the element that you pulled, as it references the object.
new DOMDocument can be used to create multiple documents.
Using DOMDocument::createElement before loadHTMLFile creates 2 DOMDocuments.
Using DOMDocument::createDocumentFragment acts the same and creates its own DOM.
If you would like to keep your code the same and create two DOMDocuments, then you should use DOMDocument::importNode. An example of this would be:
$dom = new DOMDocument();
$div = $dom->createElement("div");
$dom->loadHTMLFile("assets/dom_document-form.html");
$node = $dom->importNode($div);
$form = $dom->getElementsByTagName("form")->item(0);
$form->appendChild($node);
echo $dom->saveHTML();
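If the goal of the original scripts was to wrap the form in a new div, a sketch along these lines may be closer to that intent. It assumes an inline HTML string in place of html/signinform.html, whose contents are not shown in the question. The document is loaded first, the div is created from that same document, and the div then takes the form's place in the tree before the form is appended to it:

```php
<?php
// Inline stand-in for html/signinform.html; invented for this sketch.
$html = '<html><body><form action="/signin"><input name="user"></form></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);                       // load first

$form = $dom->getElementsByTagName("form")->item(0);
$div  = $dom->createElement("div");          // created from the loaded document

// Put the div where the form was, then move the form inside the div.
$form->parentNode->replaceChild($div, $form);
$div->appendChild($form);

echo $dom->saveHTML();
```

Because both nodes belong to the same document, no importNode() call is needed here.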