Is there a way to create my own DOMNodeList? E.g.:
$doc = new DOMDocument();
$elem = $doc->createElement('div');
$nodeList = new DOMNodeList;
$nodeList->addItem($elem); // ?
My idea is to extend DOMDocument class adding some useful methods that return data as DOMNodeList.
Is it possible to do it without writing my own version of DOMNodeList class?
You cannot add items to DOMNodeList via it's public interface. However, DOMNodeLists are live collections when connected to a DOM Tree, so adding a Child Element to a DOMElement will add an element in that element's child collection (which is a DOMNodeList):
$doc = new DOMDocument();
$nodelist = $doc->childNodes; // a DOMNodeList
echo $nodelist->length; // 0
$elem = $doc->createElement('div');
$doc->appendChild($elem);
echo $nodelist->length; // 1
You say you want to add "some useful methods that return data as DOMNodeList". In the context of DOMDocument, this is what XPath does. It allows you to query all the nodes in the document and return them in a DOMNodeList. Maybe that's what you are looking for.
Related
I am using the function below, but not sure about it is always stable/secure... Is it?
When and who is stable/secure to "reuse parts of the DOMXpath preparing procedures"?
To simlify the use of the XPath query() method we can adopt a function that memorizes the last calls with static variables,
function DOMXpath_reuser($file) {
static $doc=NULL;
static $docName='';
static $xp=NULL;
if (!$doc)
$doc = new DOMDocument();
if ($file!=$docName) {
$doc->loadHTMLFile($file);
$xp = NULL;
}
if (!$xp)
$xp = new DOMXpath($doc);
return $xp; // ??RETURNED VALUES ARE ALWAYS STABLE??
}
The present question is similar to this other one about XSLTProcessor reuse.
In both questions the problem can be generalized for any language or framework that use LibXML2 as DomDocument implementation.
There are another related question: How to "refresh" DOMDocument instances of LibXML2?
Illustrating
The reuse is very commom (examples):
$f = "my_XML_file.xml";
$elements = DOMXpath_reuser($f)->query("//*[#id]");
// use elements to get information
$elements = DOMXpath_reuser($f)->("/html/body/div[1]");
// use elements to get information
But, if you do something like removeChild, replaceChild, etc. (example),
$div = DOMXpath_reuser($f)->query("/html/body/div[1]")->item(0); //STABLE
$div->parentNode->removeChild($div); // CHANGES DOM
$elements = DOMXpath_reuser($f)->query("//div[#id]"); // INSTABLE! !!
extrange things can be occur, and the queries not works as expected!!
When (what DOMDocument methods affect XPath?)
Why we can not use something like normalizeDocument to "refresh DOM" (exist?)?
Only a "new DOMXpath($doc);" is allways secure? need to reload $doc also?
DOMXpath is affected by the load*() methods on DOMDocument. After loading a new xml or html, you need to recreate the DOMXpath instance:
$xml = '<xml/>';
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
var_dump($xpath->document === $dom); // bool(true)
$dom->loadXml($xml);
var_dump($xpath->document === $dom); // bool(false)
In DOMXpath_reuser() you store a static variable and recreate the xpath depending on the file name. If you want to reuse an Xpath object, suggest extending DOMDocument. This way you only need pass the $dom variable around. It would work with a stored xml file as well with xml string or a document your are creating.
The following class extends DOMDocument with an method xpath() that always returns a valid DOMXpath instance for it. It stores and registers the namespaces, too:
class MyDOMDocument
extends DOMDocument {
private $_xpath = NULL;
private $_namespaces = array();
public function xpath() {
// if the xpath instance is missing or not attached to the document
if (is_null($this->_xpath) || $this->_xpath->document != $this) {
// create a new one
$this->_xpath = new DOMXpath($this);
// and register the namespaces for it
foreach ($this->_namespaces as $prefix => $namespace) {
$this->_xpath->registerNamespace($prefix, $namespace);
}
}
return $this->_xpath;
}
public function registerNamespaces(array $namespaces) {
$this->_namespaces = array_merge($this->_namespaces, $namespaces);
if (isset($this->_xpath)) {
foreach ($namespaces as $prefix => $namespace) {
$this->_xpath->registerNamespace($prefix, $namespace);
}
}
}
}
$xml = <<<'ATOM'
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Test</title>
</feed>
ATOM;
$dom = new MyDOMDocument();
$dom->registerNamespaces(
array(
'atom' => 'http://www.w3.org/2005/Atom'
)
);
$dom->loadXml($xml);
// created, first access
var_dump($dom->xpath()->evaluate('string(/atom:feed/atom:title)', NULL, FALSE));
$dom->loadXml($xml);
// recreated, connection was lost
var_dump($dom->xpath()->evaluate('string(/atom:feed/atom:title)', NULL, FALSE));
The DOMXpath class (instead of XSLTProcessor in your another question) use reference to given DOMDocument object in contructor. DOMXpath create libxml context object based on given DOMDocument and save it to internal class data. Besides libxml context its saves references to originalDOMDocument` given in contructor arguments.
What that means:
Part of sample from ThomasWeinert answer:
var_dump($xpath->document === $dom); // bool(true)
$dom->loadXml($xml);
var_dump($xpath->document === $dom); // bool(false)
gives false after load becouse of $dom already holds pointer to new libxml data but DOMXpath holds libxml context for $dom before load and pointer to real document after load.
Now about query works
If it should return XPATH_NODESET (as in your case) its make a node copy - node by node iterating throw detected node set(\ext\dom\xpath.c from 468 line). Copy but with original document node as parent. Its means that you can modify result but this gone away you XPath and DOMDocument connection.
XPath results provide a parentNode memeber that knows their origin:
for attribute values, parentNode returns the element that carries them. An example is //foo/#attribute, where the parent would be a foo Element.
for the text() function (as in //text()), it returns the element that contains the text or tail that was returned.
note that parentNode may not always return an element. For example, the XPath functions string() and concat() will construct strings that do not have an origin. For them, parentNode will return None.
So,
There is no any reasons to cache XPath. It do not anything besides xmlXPathNewContext (just allocate lightweight internal struct).
Each time your modify your DOMDocument (removeChild, replaceChild, etc.) your should recreate XPath.
We can not use something like normalizeDocument to "refresh DOM" because of it change internal document structure and invalidate xmlXPathNewContext created in Xpath constructor.
Only "new DOMXpath($doc);" is allways secure? Yes, if you do not change $doc between Xpath usage. Need to reload $doc also - no, because of it invalidated previously created xmlXPathNewContext.
(this is not a real answer, but a consolidation of comments and answers posted here and related questions)
This new version of the question's DOMXpath_reuser function contains the #ThomasWeinert suggestion (for avoid DOM changes by external re-load) and an option $enforceRefresh to workaround the problem of instability (as related question shows the programmer must detect when).
function DOMXpath_reuser_v2($file, $enforceRefresh=0) { //changed here
static $doc=NULL;
static $docName='';
static $xp=NULL;
if (!$doc)
$doc = new DOMDocument();
if ( $file!=$docName || ($xp && $doc !== $xp->document) ) { // changed here
$doc->load($file);
$xp = NULL;
} elseif ($enforceRefresh==2) { // add this new refresh mode
$doc->loadXML($doc->saveXML());
$xp = NULL;
}
if (!$xp || $enforceRefresh==1) //changed here
$xp = new DOMXpath($doc);
return $xp;
}
When must to use $enforceRefresh=1 ?
... perhaps an open problem, only little tips and clues...
when DOM submited to setAttribute, removeChild, replaceChild, etc.
...? more cases?
When must to use $enforceRefresh=2 ?
... perhaps an open problem, only little tips and clues...
when DOM was subject to indexes inconsistences, etc. See this question/solution.
...? more cases?
Say I have an html file that I have loaded, I run this query:
$url = 'http://www.fangraphs.com/players.aspx';
$html = file_get_contents($url);
$myDom = new DOMDocument;
$myDom->formatOutput = true;
#$myDom->loadHTML($html);
$anchor = $xpath->query('//a[contains(#href,"letter")]');
That gives me a list of these anchors that look like the following:
Aa
But I need a way to only get "players.aspx?letter=Aa".
I thought I could try:
$anchor = $xpath->query('//a[contains(#href,"letter")]/#href');
But that gives me a php error saying I couldn't append node when I try the following:
$xpath = new DOMXPath($myDom);
$newDom = new DOMDocument;
$j = 0;
while( $myAnchor = $anchor->item($j++) ){
$node = $newDom->importNode( $myAnchor, true ); // import node
$newDom->appendChild($node);
}
Any idea how to obtain just the value of the href tags that the first query selects?? Thanks!
Use:
//a/#href[contains(., 'letter')]
this selects any href attribute of any a whose string value (of the attribute) contains the string "letter" .
Your XPath query is returning attributes themselves (i.e., DOMAttr objects) rather than elements (i.e., DOMElement objects). That's fine, and that seems to be what you want, but appending them to the document is the problem. A DOMAttr is not a standalone node in the document tree; it's associated with a DOMElement but is not a child in the usual sense. Thus, directly appending a DOMAttr to the document is invalid.
From the W3C specs:
Attr objects inherit the Node interface, but since they are not actually child nodes of the element they describe, the DOM does not consider them part of the document tree. . . . The DOM takes the view that attributes are properties of elements rather than having a separate identity from the elements they are associated with
Either associate the DOMAttr with a DOMElement and append that element, or pull out the DOMAttr's value and use that as you wish.
To just append its plain text value, use its value in a DOMText node and append that. For example, change this line:
$newDom->appendChild($node);
to this:
$newDom->appendChild(new DOMText($node->value));
try this..
$xml_string = 'your xml string';
$xml = simplexml_load_string($xml_string);
foreach($xml->a[0]->attributes() as $href => $value) {
$myAnchorsValues[] = $value;
}
var_dump($myAnchorsValues);
I'm using PHP/Zend to load html into a DOM, and then I get a specific div id that I want to modify.
$dom = new Zend_Dom_Query($html);
$element = $dom->query('div[id="someid"]');
How do I modify the text/content/html displayed inside that $element div, and then save the changes to the $dom or $html so I can print the modified html. Any idea how to do this?
Zend_Dom_Query is tailored just for querying a dom, so it doesn't provide an interface in and of itself to alter the dom and save it, but it does expose the PHP Native DOM objects that will let you do so. Something like this should work:
$dom = new Zend_Dom_Query($html);
$document = $dom->getDocument();
$elements = $dom->query('div[id="someid"]');
foreach($elements AS $element) {
//$element is an instance of DOMElement (http://www.php.net/DOMElement)
//You have to create new nodes off the document
$node = $document->createElement("div", "contents of div");
$element->appendChild($node)
}
$newHtml = $document->saveXml();
Take a look at the PHP Doc for DOMElement to get an idea of how you can alter the dom:
http://www.php.net/DOMElement
I'd like to search for nodes with the same node name in a SimpleXML Object no matter how deep they are nested and create an instance of them as an array.
In the HTML DOM I can do that with JavaScript by using getElementsByTagName(). Is there a way to do that in PHP as well?
Yes use xpath
$xml->xpath('//div');
Here $xml is your SimpleXML object.
In this example you will get array of all 'div' elements
$fname = dirname(__FILE__) . '\\xml\\crRoll.xml';
$dom = new DOMDocument;
$dom->load($fname, LIBXML_DTDLOAD|LIBXML_DTDATTR);
$root = $dom->documentElement;
$xpath = new DOMXpath($dom);
$xpath->registerNamespace('cr', "http://www.w3.org/1999/xhtml");
$candidateNodes = $xpath->query("//cr:break");
foreach ($candidateNodes as $child) {
$max = $child->getAttribute('tstamp');
}
This finds all the BREAK nodes (tstamp attr) using XPath ...
Only on DOMDocument::getElementsByTagName,
however, you can import/export SimpleXML into DOMDocument,
or simply use DOMDocument to parse XML.
Another answer mentioned about Xpath,
it will return duplication of node, if you have something like :-
<div><div>1</div></div>
Is there a way to remove a HTML element by using the DOMDocument class?
In addition to Dave Morgan's answer you can use DOMNode::removeChild to remove child from list of children:
Removing a child by tag name
//The following example will delete the table element of an HTML content.
$dom = new DOMDocument();
//avoid the whitespace after removing the node
$dom->preserveWhiteSpace = false;
//parse html dom elements
$dom->loadHTML($html_contents);
//get the table from dom
if($table = $dom->getElementsByTagName('table')->item(0)) {
//remove the node by telling the parent node to remove the child
$table->parentNode->removeChild($table);
//save the new document
echo $dom->saveHTML();
}
Removing a child by class name
//same beginning
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html_contents);
//use DomXPath to find the table element with your class name
$xpath = new DomXPath($dom);
$classname='MyTableName';
$xpath_results = $xpath->query("//table[contains(#class, '$classname')]");
//get the first table from XPath results
if($table = $xpath_results->item(0)){
//remove the node the same way
$table ->parentNode->removeChild($table);
echo $dom->saveHTML();
}
Resources
http://us2.php.net/manual/en/domnode.removechild.php
How to delete element with DOMDocument?
How to get full HTML from DOMXPath::query() method?
http://us2.php.net/manual/en/domnode.removechild.php
DomDocument is a DomNode.. You can just call remove child and you should be fine.
EDIT: Just noticed you were probably talking about the page you are working with currently. Don't know if DomDocument would work. You may wanna look to use javascript at that point (if its already been served up to the client)