In DOMDocument, is reuse of DOMXpath stable? - php

I am using the function below, but I am not sure whether it is always stable/secure. Is it?
When is it stable/secure to "reuse parts of the DOMXpath preparation procedure"?
To simplify the use of the XPath query() method, we can adopt a function that memoizes the last call with static variables:
function DOMXpath_reuser($file) {
static $doc=NULL;
static $docName='';
static $xp=NULL;
if (!$doc)
$doc = new DOMDocument();
if ($file!=$docName) {
$doc->loadHTMLFile($file);
$docName = $file; // remember the loaded file so later calls can reuse it
$xp = NULL;
}
if (!$xp)
$xp = new DOMXpath($doc);
return $xp; // ??RETURNED VALUES ARE ALWAYS STABLE??
}
The present question is similar to this other one about XSLTProcessor reuse.
In both questions the problem can be generalized to any language or framework that uses LibXML2 as its DOMDocument implementation.
There is another related question: How to "refresh" DOMDocument instances of LibXML2?
Illustrating
Reuse is very common (examples):
$f = "my_XML_file.xml";
$elements = DOMXpath_reuser($f)->query("//*[@id]");
// use elements to get information
$elements = DOMXpath_reuser($f)->query("/html/body/div[1]");
// use elements to get information
But, if you do something like removeChild, replaceChild, etc. (example),
$div = DOMXpath_reuser($f)->query("/html/body/div[1]")->item(0); //STABLE
$div->parentNode->removeChild($div); // CHANGES DOM
$elements = DOMXpath_reuser($f)->query("//div[@id]"); // UNSTABLE!
strange things can occur, and the queries do not work as expected!
When does this happen (which DOMDocument methods affect XPath)?
Why can't we use something like normalizeDocument to "refresh the DOM" (does such a method exist)?
Is only "new DOMXpath($doc);" always safe? Do we need to reload $doc as well?

DOMXpath is affected by the load*() methods on DOMDocument. After loading a new xml or html, you need to recreate the DOMXpath instance:
$xml = '<xml/>';
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
var_dump($xpath->document === $dom); // bool(true)
$dom->loadXml($xml);
var_dump($xpath->document === $dom); // bool(false)
In DOMXpath_reuser() you store a static variable and recreate the xpath depending on the file name. If you want to reuse an Xpath object, I suggest extending DOMDocument. This way you only need to pass the $dom variable around. It works with a stored xml file as well as with an xml string or a document you are creating.
The following class extends DOMDocument with a method xpath() that always returns a valid DOMXpath instance for it. It stores and registers the namespaces, too:
class MyDOMDocument
extends DOMDocument {
private $_xpath = NULL;
private $_namespaces = array();
public function xpath() {
// if the xpath instance is missing or not attached to the document
if (is_null($this->_xpath) || $this->_xpath->document != $this) {
// create a new one
$this->_xpath = new DOMXpath($this);
// and register the namespaces for it
foreach ($this->_namespaces as $prefix => $namespace) {
$this->_xpath->registerNamespace($prefix, $namespace);
}
}
return $this->_xpath;
}
public function registerNamespaces(array $namespaces) {
$this->_namespaces = array_merge($this->_namespaces, $namespaces);
if (isset($this->_xpath)) {
foreach ($namespaces as $prefix => $namespace) {
$this->_xpath->registerNamespace($prefix, $namespace);
}
}
}
}
$xml = <<<'ATOM'
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Test</title>
</feed>
ATOM;
$dom = new MyDOMDocument();
$dom->registerNamespaces(
array(
'atom' => 'http://www.w3.org/2005/Atom'
)
);
$dom->loadXml($xml);
// created, first access
var_dump($dom->xpath()->evaluate('string(/atom:feed/atom:title)', NULL, FALSE));
$dom->loadXml($xml);
// recreated, connection was lost
var_dump($dom->xpath()->evaluate('string(/atom:feed/atom:title)', NULL, FALSE));

The DOMXpath class (unlike XSLTProcessor in your other question) uses a reference to the DOMDocument object given in its constructor. DOMXpath creates a libxml context object based on the given DOMDocument and saves it in its internal class data. Besides the libxml context, it saves a reference to the original DOMDocument given in the constructor arguments.
What that means:
Part of the sample from ThomasWeinert's answer:
var_dump($xpath->document === $dom); // bool(true)
$dom->loadXml($xml);
var_dump($xpath->document === $dom); // bool(false)
gives false after the load because $dom already holds a pointer to the new libxml data, while DOMXpath still holds the libxml context for $dom as it was before the load.
Now, about how query works:
If it should return XPATH_NODESET (as in your case), it makes a node copy, iterating node by node through the detected node set (\ext\dom\xpath.c, from line 468). The copies keep the original document node as their parent. This means that you can modify the result, but doing so breaks the connection between your XPath and the DOMDocument.
XPath results provide a parentNode member that knows their origin:
for attribute values, parentNode returns the element that carries them. An example is //foo/@attribute, where the parent would be a foo Element.
for the text() function (as in //text()), it returns the element that contains the text or tail that was returned.
note that parentNode may not always return an element. For example, the XPath functions string() and concat() will construct strings that do not have an origin. For them, parentNode will return None.
So,
There is no reason to cache XPath. Its constructor does nothing besides calling xmlXPathNewContext (it just allocates a lightweight internal struct).
Each time you modify your DOMDocument (removeChild, replaceChild, etc.), you should recreate the XPath object.
We cannot use something like normalizeDocument to "refresh the DOM", because it changes the internal document structure and invalidates the context that xmlXPathNewContext created in the Xpath constructor.
Is only "new DOMXpath($doc);" always safe? Yes, if you do not change $doc between XPath uses. Do you need to reload $doc as well? No, because reloading invalidates the previously created XPath context.
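The advice above can be sketched as follows. This is only a minimal illustration, assuming a small inline XML document instead of a file; the ids "a" and "b" are made up for the example.

```php
<?php
// Minimal sketch: build a fresh DOMXPath after each structural change.
$doc = new DOMDocument();
$doc->loadXML('<root><div id="a"/><div id="b"/></root>');

// First query, with its own DOMXPath instance.
$xpath = new DOMXPath($doc);
$first = $xpath->query('/root/div[1]')->item(0);

// Modify the tree.
$first->parentNode->removeChild($first);

// Recreate the XPath object before querying again.
$xpath = new DOMXPath($doc);
$remaining = $xpath->query('//div[@id]');

echo $remaining->length, "\n";                      // 1
echo $remaining->item(0)->getAttribute('id'), "\n"; // b
```

Creating the object is cheap, so recreating it after a change costs little compared with parsing the document again.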

(This is not a real answer, but a consolidation of comments and answers posted here and in related questions.)
This new version of the question's DOMXpath_reuser function incorporates @ThomasWeinert's suggestion (to guard against DOM changes caused by an external reload) and an option, $enforceRefresh, to work around the instability problem (as the related question shows, the programmer must detect when it is needed).
function DOMXpath_reuser_v2($file, $enforceRefresh=0) { //changed here
static $doc=NULL;
static $docName='';
static $xp=NULL;
if (!$doc)
$doc = new DOMDocument();
if ( $file!=$docName || ($xp && $doc !== $xp->document) ) { // changed here
$doc->load($file);
$docName = $file; // remember the loaded file so later calls can reuse it
$xp = NULL;
} elseif ($enforceRefresh==2) { // add this new refresh mode
$doc->loadXML($doc->saveXML());
$xp = NULL;
}
if (!$xp || $enforceRefresh==1) //changed here
$xp = new DOMXpath($doc);
return $xp;
}
When must $enforceRefresh=1 be used?
... perhaps an open problem; only a few tips and clues so far...
when the DOM has been subjected to setAttribute, removeChild, replaceChild, etc.
...? more cases?
When must $enforceRefresh=2 be used?
... perhaps an open problem; only a few tips and clues so far...
when the DOM has been subject to index inconsistencies, etc. See this question/solution.
...? more cases?

Related

How to add elements to DOMNodeList in PHP?

Is there a way to create my own DOMNodeList? E.g.:
$doc = new DOMDocument();
$elem = $doc->createElement('div');
$nodeList = new DOMNodeList;
$nodeList->addItem($elem); // ?
My idea is to extend DOMDocument class adding some useful methods that return data as DOMNodeList.
Is it possible to do it without writing my own version of DOMNodeList class?
You cannot add items to a DOMNodeList via its public interface. However, DOMNodeLists are live collections when connected to a DOM tree, so adding a child element to a DOMElement will add an element to that element's child collection (which is a DOMNodeList):
$doc = new DOMDocument();
$nodelist = $doc->childNodes; // a DOMNodeList
echo $nodelist->length; // 0
$elem = $doc->createElement('div');
$doc->appendChild($elem);
echo $nodelist->length; // 1
You say you want to add "some useful methods that return data as DOMNodeList". In the context of DOMDocument, this is what XPath does. It allows you to query all the nodes in the document and return them in a DOMNodeList. Maybe that's what you are looking for.
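As a rough sketch of that idea, a DOMDocument subclass can expose a helper that hands back a DOMNodeList produced by XPath. The class name QueryableDocument and the method select() are hypothetical names chosen for this illustration; they are not part of the DOM extension.

```php
<?php
// Hypothetical subclass: select() is an illustrative helper, not a built-in method.
class QueryableDocument extends DOMDocument
{
    public function select(string $expression, ?DOMNode $context = null): DOMNodeList
    {
        $xpath = new DOMXPath($this);
        $result = $xpath->query($expression, $context);
        if ($result === false) {
            // query() returns false for a malformed expression
            throw new InvalidArgumentException("Invalid XPath expression: $expression");
        }
        return $result;
    }
}

$doc = new QueryableDocument();
$doc->loadXML('<root><div class="a"/><div class="b"/><p class="a"/></root>');

$nodes = $doc->select('//*[@class="a"]');
echo $nodes->length, "\n"; // 2
```

The list is built by the query, so the code never has to instantiate or fill a DOMNodeList itself.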

XML Xpath Failing on getElementsByTagName

<?xml version="1.0" encoding="UTF-8"?>
<AddProduct>
<auth><id>vendor123</id><auth_code>abc123</auth_code></auth>
</AddProduct>
What am I doing wrong to get : Fatal error: Call to undefined method DOMNodeList::getElementsByTagName()
$xml = $_GET['xmlRequest'];
$dom = new DOMDocument();
@$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
$auth = $xpath->query('*/auth');
$id = $auth->getElementsByTagName('id')->item(0)->nodeValue;
$code = $auth->getElementsByTagName('auth_code')->item(0)->nodeValue;
You could retrieve the data (in the XML you posted) you want using XPath only:
$id = $xpath->query('//auth/id')->item(0)->nodeValue;
$code = $xpath->query('//auth/auth_code')->item(0)->nodeValue;
You are also calling getElementsByTagName() on $auth (a DOMNodeList returned by DOMXPath::query()), as @Ohgodwhy pointed out in the comments, which is causing the error. If you want to use it, you should call it on $dom.
Your XPath expression, */auth, is relative: it selects auth elements that are children of any child element of the current (context) node. Unless your XML file is different, it's clearer to use one of:
/*/auth # returns auth nodes two levels below root
/AddProduct/auth # returns auth nodes directly below /AddProduct
//auth # returns all auth nodes
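A short sketch of how these expressions can be combined with the XML from the question. It assumes the XML is available as a string literal rather than coming from $_GET, and it picks the first matching node with item(0).

```php
<?php
$xml = '<AddProduct><auth><id>vendor123</id><auth_code>abc123</auth_code></auth></AddProduct>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// An absolute path makes the result independent of the context node.
$auth = $xpath->query('/AddProduct/auth')->item(0);

// query() returns a DOMNodeList; item(0) yields a DOMElement,
// and DOMElement does offer getElementsByTagName().
$id = $auth->getElementsByTagName('id')->item(0)->nodeValue;

// Alternatively, evaluate a relative expression with $auth as the context node.
$code = $xpath->evaluate('string(auth_code)', $auth);

echo $id, "\n";   // vendor123
echo $code, "\n"; // abc123
```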
This is what I came up with after reviewing php's documentation (http://us1.php.net/manual/en/class.domdocument.php, http://us1.php.net/manual/en/domdocument.loadxml.php, http://us3.php.net/manual/en/domxpath.query.php, http://us3.php.net/domxpath)
$dom = new DOMDocument();
$dom->loadXML($xml);
$id = $dom->getElementsByTagName("id")->item(0)->nodeValue;
$code = $dom->getElementsByTagName("auth_code")->item(0)->nodeValue;
As helderdarocha and Ohgodwhy pointed out, getElementsByTagName is a DOMDocument method, not a method of the DOMNodeList that DOMXPath::query() returns. I like helderdarocha's solution that only uses XPath; the solution I posted accomplishes the same thing but only uses DOMDocument.

Call to undefined method DOMDocument::createDocumentType()

I have the following script snippet. Originally I did not realize that, in order to use getElementById, I needed to include createDocumentType, but now I get the error listed above. What am I doing wrong here? Thanks in advance!
...
$result = curl_exec($ch); //contains some webpage i am grabbing remotely
$dom = new DOMDocument();
$dom->createDocumentType('html', '-//W3C//DTD HTML 4.01 Transitional//EN', 'http://www.w3.org/TR/html4/loose.dtd');
$elements = $dom->loadHTML($result);
$e = $elements->getElementById('1');
...
Edit: Additional note, I verified the DOM is correct on the remote page.
DOMDocument does not have a method named createDocumentType, as you can see in the Manual. The method belongs to the DOMImplementation class. It is used like this (taken from the manual):
// Creates an instance of the DOMImplementation class
$imp = new DOMImplementation;
// Creates a DOMDocumentType instance
$dtd = $imp->createDocumentType('graph', '', 'graph.dtd');
// Creates a DOMDocument instance
$dom = $imp->createDocument("", "", $dtd);
Since you want to load HTML into the document, you don't need to specify a document type, because it is determined from the imported HTML. You just need some id attributes, or a DTD that identifies another attribute as an id. This is part of the HTML file, not of the PHP code that parses it.
$dom = new DOMDocument();
$dom->loadHTML($result);
$element = $dom->getElementById('my_id');
will do the job.
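For reference, a self-contained sketch of the same lookup. The HTML string is a made-up stand-in for the page fetched with curl, and libxml_use_internal_errors() is only used here to keep parser warnings for real-world markup out of the output. The XPath lookup is shown as an alternative way to reach the same element.

```php
<?php
// Stand-in for the remote page; the real $result would come from curl_exec().
$result = '<!DOCTYPE html><html><body><div id="my_id">Hello</div></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($result);
libxml_clear_errors();

// Lookup by id directly on the document.
$element = $dom->getElementById('my_id');

// The same element, found through XPath.
$viaXpath = (new DOMXPath($dom))->query('//*[@id="my_id"]')->item(0);

echo $element->textContent, "\n";           // Hello
var_dump($element->isSameNode($viaXpath)); // bool(true)
```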

Accessing an imported element after the original DOMDocument is destroyed

I've been messing around with DOMDocument lately, and I've noticed that in order to transfer elements from one document to the next, I have to call $DOMDocument->importNode() on the target DOMDocument.
However, I'm running into weird issues, where once the originating document is destroyed, the cloned element misbehaves.
For example, here's some lovely working code:
$dom1 = new DOMDocument;
$dom2 = new DOMDocument;
$dom2->loadHTML('<div id="div"><span class="inner"></span></div>');
$div = $dom2->getElementById('div');
$children = $dom1->importNode( $div, true )->childNodes;
echo $children->item(0)->tagName; // Output: "span"
Here's a demo: http://codepad.viper-7.com/pjd9Ty
The problem arises when I try using the elements after their original document is out of scope:
global $dom;
$dom = new DOMDocument;
function get_div_children () {
global $dom;
$local_dom = new DOMDocument;
$local_dom->loadHTML('<div id="div"><span class="inner"></span></div>');
$div = $local_dom->getElementById('div');
return $dom->importNode( $div, true )->childNodes;
}
echo get_div_children()->item(0)->tagName;
The above results in the following errors:
PHP Warning: Couldn't fetch DOMElement. Node no longer exists in ...
PHP Notice: Undefined property: DOMElement::$tagName in ...
Here's a demo: http://codepad.viper-7.com/c0kqOA
My question is twofold:
Shouldn't the returned elements exist even after the original document was destroyed, since they were cloned into the current document?
A workaround. For various reasons, I have to manipulate the elements after the original document is destroyed, but before I actually insert them into the DOM of the other DOMDocument. Is there any way to accomplish this?
Clarification: I understand that if the elements are inserted into the DOM, it behaves as expected. But, as outlined above, my setup calls for the elements to be manipulated before being inserted into the DOM (long story). Given that the first example here works - and that manipulating elements outside of the DOM is standard procedure in JavaScript - shouldn't this be possible here as well?
The cloned node holds a reference to $dom, but $dom holds no reference to the node. PHP's internal garbage collector destroys such nodes when the calling context changes. There is only one way to create this reference: $dom->documentElement->appendChild($node).
So, use code like this (the static keyword will prevent the garbage collector from destroying your variable):
global $dom;
$dom = new DOMDocument;
function get_div_children () {
global $dom;
$local_dom = new DOMDocument;
$local_dom->loadHTML('<div id="div"><span class="inner"></span></div>');
$div = $local_dom->getElementById('div');
static $nodes;
$nodes = $dom->importNode( $div, true )->childNodes;
return $nodes;
}
echo get_div_children()->item(0)->tagName;
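If attaching the imported node to the target document early is acceptable, the appendChild route mentioned above can be sketched like this. The function name import_div_children is hypothetical, and because the target document starts out empty in this sketch, the node is appended to the document itself rather than to documentElement.

```php
<?php
$dom = new DOMDocument();

// Hypothetical helper: imports the div and attaches it to the target document.
function import_div_children(DOMDocument $target): DOMNodeList
{
    $local = new DOMDocument();
    $local->loadHTML('<div id="div"><span class="inner"></span></div>');
    $div = $local->getElementById('div');

    // Deep-copy the node into the target document...
    $imported = $target->importNode($div, true);
    // ...and attach it, so the target document itself keeps the copy alive.
    $target->appendChild($imported);

    return $imported->childNodes;
}

$children = import_div_children($dom);
echo $children->item(0)->tagName, "\n"; // span
```

This does not cover the case where the nodes must stay detached until later; for that case the static-variable version above is the approach the answer proposes.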

Returning element within DocumentFragment fails, because node no longer exists

Here is a testcase that highlights an error I've run into. I think the node is being destroyed/garbage collected/something after the function returns -- is there a better way I can go about this?
function render($doc) {
$fragment = $doc -> createDocumentFragment();
$fragment -> appendXML('<iframe foo="bar"/>');
return $fragment -> childNodes -> item(0);
}
$doc = new \DOMDocument();
$element = render($doc);
// Exception: Couldn't fetch DOMElement. Node no longer exists
echo $element -> tagName; // fails -- because element no longer exists
Since you're creating only one element there is no need to make a fragment. Just create the element and set its attribute.
function render($doc) {
$element = $doc -> createElement('iframe');
$element -> setAttribute('foo', 'bar');
return $element;
}
$doc = new DOMDocument();
$element = render($doc);
echo $element -> tagName;
I found a workaround: simply call cloneNode() and return the clone:
return $element->cloneNode();
I agree that this is weird behavior...I don't understand why PHP does this, but at least there's a workaround that still allows you to use document fragments. For more complex fragments you may need to pass true to cloneNode to tell it to make a deep copy, I'm not sure.
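Putting the workaround together with the original test case gives roughly the following sketch. The nested p element is added only to show the deep copy; it is not part of the original question.

```php
<?php
function render(DOMDocument $doc): DOMNode
{
    $fragment = $doc->createDocumentFragment();
    $fragment->appendXML('<iframe foo="bar"><p>fallback</p></iframe>');

    // Deep-clone the first child so the returned node no longer
    // depends on the fragment, which goes away when the function returns.
    return $fragment->childNodes->item(0)->cloneNode(true);
}

$doc = new DOMDocument();
$element = render($doc);

echo $element->tagName, "\n";              // iframe
echo $element->getAttribute('foo'), "\n";  // bar
echo $element->firstChild->tagName, "\n";  // p
```

Passing true to cloneNode() copies the descendants as well; without it only the iframe element and its attributes would be copied.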
