How to get full HTML from DOMXPath::query() method? - php

I have a document from which I want to extract a specific div with its content untouched.
I do:
$dom = new DOMDocument();
$dom->loadHTML($string);//that's HTML of my document, string
and xpath query:
$xpath = new DOMXPath($dom);
$xpath_resultset = $xpath->query("//div[@class='text']");
/*I'm after div class="text"*/
Now I call item(0) on $xpath_resultset:
$my_content = $xpath_resultset->item(0);
What I get is an object (not a string), $my_content, which I can echo or settype() to string, but the result has its markup fully stripped.
How do I get everything from div class='text', markup included?

Just pass the node to the DOMDocument::saveHTML method:
$htmlString = $dom->saveHTML($xpath_resultset->item(0));
This will give you a string representation of that particular DOMNode and all its children.
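A minimal sketch of the approach above, assuming a small sample string in place of the real document (the markup and the $outerHtml/$innerHtml names are made up for illustration). Passing a node to saveHTML() requires PHP 5.3.6 or later. The loop shows one way to get only the inner content of the div, if that is what is wanted:

```php
<?php
// Hypothetical sample markup standing in for the real document.
$string = '<html><body><div class="text"><p>Hello <b>world</b></p></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($string);

$xpath = new DOMXPath($dom);
$node = $xpath->query("//div[@class='text']")->item(0);

// Outer HTML: the div itself plus everything inside it.
$outerHtml = $dom->saveHTML($node);

// Inner HTML: serialize each child node and concatenate the pieces.
$innerHtml = '';
foreach ($node->childNodes as $child) {
    $innerHtml .= $dom->saveHTML($child);
}

echo $outerHtml, "\n";
echo $innerHtml, "\n";
```

In real code, check that item(0) did not return null before using the node.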

Related

Xpath accessing the value of an attribute within an XML tag

I'm attempting to extract the date saved in the <PersonDetails> tag for some XML I am working with, example:
<Record>
<PersonDetails RecordDate="2017-03-31T00:00:00">
<FirstName>Joe</FirstName>
<Surname>Blogs</Surname>
<Status>Active</Status>
</PersonDetails>
</Record>
Currently I have been trying the following:
if (isset($XML->Record->xpath("//PersonDetails[@RecordDate]")[0])) {
$theDate = $XML->Record->xpath("//PersonDetails[@RecordDate]")[0]->textContent;
} else {
$theDate = "no date";
}
My intention is to have $theDate = 2017-03-31T00:00:00
A valid XPath expression for selecting the attribute node looks like this:
$theDate = $XML->xpath("//Record/PersonDetails/@RecordDate")[0];
echo $theDate; // 2017-03-31T00:00:00
You're mixing SimpleXML and DOM here. Additionally, your expression selects a PersonDetails element that has a RecordDate attribute, not the attribute itself: square brackets [] are conditions.
SimpleXML
So to fetch the attribute node you need to use //PersonDetails/@RecordDate. In SimpleXML this will create a SimpleXMLElement for a non-existent element node that returns the attribute value when cast to a string. SimpleXMLElement::xpath() always returns an array, so you need to cast the first element of that array to a string.
$theDate = (string)$XML->xpath("//PersonDetails/@RecordDate")[0];
DOM
$textContent is a property of DOM nodes. It contains the text content of all descendant nodes. But you don't need it in this case. If you use DOMXpath::evaluate(), the Xpath expression can return the string value directly.
$document = new DOMDocument();
$document->loadXml($xmlString);
$xpath = new DOMXpath($document);
$theDate = $xpath->evaluate('string(//PersonDetails/@RecordDate)');
The string typecast is moved into the Xpath expression.
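Both variants can be tried side by side against the record from the question. This is a sketch, not the only way to do it; the "no date" fallback mirrors the question, and the variable names $dateFromSimpleXml and $dateFromDom are made up for illustration:

```php
<?php
// The sample record from the question.
$xmlString = <<<XML
<Record>
  <PersonDetails RecordDate="2017-03-31T00:00:00">
    <FirstName>Joe</FirstName>
    <Surname>Blogs</Surname>
    <Status>Active</Status>
  </PersonDetails>
</Record>
XML;

// SimpleXML: xpath() returns an array, which is empty when nothing matches.
$xml = simplexml_load_string($xmlString);
$matches = $xml->xpath('//PersonDetails/@RecordDate');
$dateFromSimpleXml = $matches ? (string) $matches[0] : 'no date';

// DOM: string() evaluates to '' when the attribute does not exist.
$document = new DOMDocument();
$document->loadXML($xmlString);
$xpath = new DOMXPath($document);
$value = $xpath->evaluate('string(//PersonDetails/@RecordDate)');
$dateFromDom = $value !== '' ? $value : 'no date';

echo $dateFromSimpleXml, "\n", $dateFromDom, "\n";
```

Note that the DOM variant cannot distinguish a missing attribute from an empty one; if that matters, query for the attribute node instead and check the length of the result.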

What instead of createTextNode to not avoid XML paragraphs

I've got PHP code that loads an external XML file, adds something before the last paragraph, and saves it as a new file.
<?php
$xmldoc = new DOMDocument();
$xmldoc->load('xml.xml');
$root = $xmldoc->firstChild;
$newElement = $xmldoc->createTextNode('<o id="1" url="link.html" price="899.00" avail="1" weight="0" stock="0" set="0" basket="0"></o>');
$root->appendChild($newElement);
$newText = $xmldoc->createTextNode($newAct);
$newElement->appendChild($newText);
$xmldoc->save('sample.xml');
?>
However, I don't want to lose XML characters like " <> ". What should I use instead of createTextNode? Right now I get output like this:
&lt;o id="1" url="link.html" price="899.00" avail="1" weight="0" stock="0" set="0" basket="0"&gt;
DOMDocument::createTextNode() does exactly that: it creates a node containing text. It does not lose the special characters; they are encoded as entities during serialization.
There are other methods, like DOMDocument::createElement() to create an element node and DOMElement::setAttribute() to set attributes on an element node.
If you have an XML fragment as a string literal, there is a node type that can consume it: the DOMDocumentFragment.
$document = new DOMDocument();
$root = $document->appendChild($document->createElement('foo'));
$fragment = $document->createDocumentFragment();
$fragment->appendXml('<p>some text</p>');
$root->appendChild($fragment);
echo $document->saveXml();
Document fragments are a kind of virtual node: they are a list of nodes with a single node as a container, so that they can be passed to the DOM methods. Be careful using DOMDocumentFragment::appendXML(), or you might open yourself up to HTML/XML injection.
You are looking for createElement.
<?php
$newElement = $xmldoc->createElement('o');
$newElement->setAttribute("url", "link.html");
You can then add the remaining attributes to match your example.
See setAttribute. createTextNode creates only a text node, not XML markup. createElement creates an XML element.
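Putting the createElement/setAttribute approach together, here is a minimal sketch. It assumes a small inline document in place of xml.xml (the <offers> root and its first child are invented for illustration) and uses documentElement rather than firstChild, since the first child of a document can be a comment or doctype:

```php
<?php
// Hypothetical stand-in for the contents of xml.xml.
$xmldoc = new DOMDocument();
$xmldoc->loadXML('<offers><o id="0" url="first.html"/></offers>');

// documentElement is always the root element.
$root = $xmldoc->documentElement;

// Build the element node, then set each attribute from the question.
$newElement = $xmldoc->createElement('o');
$attributes = [
    'id' => '1', 'url' => 'link.html', 'price' => '899.00', 'avail' => '1',
    'weight' => '0', 'stock' => '0', 'set' => '0', 'basket' => '0',
];
foreach ($attributes as $name => $value) {
    $newElement->setAttribute($name, $value);
}
$root->appendChild($newElement);

$result = $xmldoc->saveXML();
echo $result;
```

Because the element is built through the DOM API, attribute values are escaped automatically, which avoids the injection risk that comes with appending raw XML strings.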

php - parentNode of a string found with preg_match

I'm trying to access the parentNode of an element found with preg_match, because I would like to read the result found with the regex through the DOM of the document. I can't access it directly through PHP's DOMDocument because the number of divs is variable and they have no ID or any other attribute that could be matched.
To illustrate this: in the example below I'd match match_me with preg_match, and then I'd want to access the parentNode (div) and put all the child elements (the p's) in a DOMDocument object, so I can easily display them.
<div>
.... variable amount of divs
<div>
<div>
<p>1 match_me</p><p>2</p>
</div>
</div>
</div>
Use DOMXpath to query for the node by the value of its child:
$dom = new DOMDocument();
// Load your doc however necessary...
$xpath = new DOMXpath($dom);
// This query should match the parent div itself, at any depth
$nodes = $xpath->query('//div[p/text() = "1 match_me"]');
$your_div = $nodes->item(0);
// Do something with the children
$p_tags = $your_div->childNodes;
// Or in this version, the query returns the `<p>` on which `parentNode` is called
$nodes = $xpath->query('//p[text() = "1 match_me"]');
$your_div = $nodes->item(0)->parentNode;
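A runnable sketch of the second variant against the sample markup from the question. It swaps the exact text comparison for contains(), which is an assumption on my part: it is closer to what a preg_match for "match_me" would find, but it also matches paragraphs where that string is only part of the text. The $paragraphs array is a made-up name:

```php
<?php
// The nested structure from the question, with a fixed number of divs.
$html = '<div><div><div><p>1 match_me</p><p>2</p></div></div></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Find any <p> whose text contains the search string, at any depth.
$nodes = $xpath->query('//p[contains(., "match_me")]');

$paragraphs = [];
if ($nodes->length > 0) {
    // Step up to the enclosing div and collect all of its <p> children.
    $parent = $nodes->item(0)->parentNode;
    foreach ($parent->childNodes as $child) {
        if ($child instanceof DOMElement && $child->tagName === 'p') {
            $paragraphs[] = $child->textContent;
        }
    }
}

print_r($paragraphs);
```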

PHP Xpath: Get all href's that contain "letter"

Say I have an html file that I have loaded, I run this query:
$url = 'http://www.fangraphs.com/players.aspx';
$html = file_get_contents($url);
$myDom = new DOMDocument;
$myDom->formatOutput = true;
@$myDom->loadHTML($html);
$anchor = $xpath->query('//a[contains(@href,"letter")]');
That gives me a list of these anchors that look like the following:
<a href="players.aspx?letter=Aa">Aa</a>
But I need a way to only get "players.aspx?letter=Aa".
I thought I could try:
$anchor = $xpath->query('//a[contains(@href,"letter")]/@href');
But that gives me a php error saying I couldn't append node when I try the following:
$xpath = new DOMXPath($myDom);
$newDom = new DOMDocument;
$j = 0;
while( $myAnchor = $anchor->item($j++) ){
$node = $newDom->importNode( $myAnchor, true ); // import node
$newDom->appendChild($node);
}
Any idea how to obtain just the value of the href tags that the first query selects?? Thanks!
Use:
//a/@href[contains(., 'letter')]
This selects any href attribute of any a element whose string value (of the attribute) contains the string "letter".
Your XPath query is returning attributes themselves (i.e., DOMAttr objects) rather than elements (i.e., DOMElement objects). That's fine, and that seems to be what you want, but appending them to the document is the problem. A DOMAttr is not a standalone node in the document tree; it's associated with a DOMElement but is not a child in the usual sense. Thus, directly appending a DOMAttr to the document is invalid.
From the W3C specs:
Attr objects inherit the Node interface, but since they are not actually child nodes of the element they describe, the DOM does not consider them part of the document tree. . . . The DOM takes the view that attributes are properties of elements rather than having a separate identity from the elements they are associated with
Either associate the DOMAttr with a DOMElement and append that element, or pull out the DOMAttr's value and use that as you wish.
To just append its plain text value, use its value in a DOMText node and append that. For example, change this line:
$newDom->appendChild($node);
to this:
$newDom->appendChild(new DOMText($node->value));
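If only the href strings are needed, there is no need for a second document at all. The following sketch assumes a small inline snippet in place of the downloaded page (the markup and the $hrefs name are invented for illustration) and reads the value property of each DOMAttr returned by the query:

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$html = '<p><a href="players.aspx?letter=Aa">Aa</a> '
      . '<a href="players.aspx?letter=Ab">Ab</a> '
      . '<a href="about.aspx">About</a></p>';

$myDom = new DOMDocument();
$myDom->loadHTML($html);
$xpath = new DOMXPath($myDom);

// The query returns DOMAttr nodes; collect their string values.
$hrefs = [];
foreach ($xpath->query('//a/@href[contains(., "letter")]') as $attr) {
    $hrefs[] = $attr->value;
}

print_r($hrefs);
```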
Try this:
$xml_string = 'your xml string';
$xml = simplexml_load_string($xml_string);
foreach($xml->a[0]->attributes() as $href => $value) {
$myAnchorsValues[] = $value;
}
var_dump($myAnchorsValues);

How to remove an HTML element using the DOMDocument class

Is there a way to remove a HTML element by using the DOMDocument class?
In addition to Dave Morgan's answer you can use DOMNode::removeChild to remove child from list of children:
Removing a child by tag name
//The following example will delete the table element of an HTML content.
$dom = new DOMDocument();
//avoid the whitespace after removing the node
$dom->preserveWhiteSpace = false;
//parse html dom elements
$dom->loadHTML($html_contents);
//get the table from dom
if($table = $dom->getElementsByTagName('table')->item(0)) {
//remove the node by telling the parent node to remove the child
$table->parentNode->removeChild($table);
//save the new document
echo $dom->saveHTML();
}
Removing a child by class name
//same beginning
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html_contents);
//use DomXPath to find the table element with your class name
$xpath = new DomXPath($dom);
$classname='MyTableName';
$xpath_results = $xpath->query("//table[contains(@class, '$classname')]");
//get the first table from XPath results
if($table = $xpath_results->item(0)){
//remove the node the same way
$table->parentNode->removeChild($table);
echo $dom->saveHTML();
}
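One caveat with contains(@class, ...) is that it matches substrings, so 'MyTableName' would also match a class such as 'MyTableNameOther'. The sketch below uses a common XPath 1.0 idiom to match a whole class token and removes every matching table rather than just the first. The sample markup is made up for illustration:

```php
<?php
// Hypothetical markup: the first table should go, the second should stay.
$html_contents = '<div><table class="MyTableName wide"><tr><td>remove</td></tr></table>'
               . '<table class="MyTableNameOther"><tr><td>keep</td></tr></table></div>';

$dom = new DOMDocument();
$dom->loadHTML($html_contents);
$xpath = new DOMXPath($dom);

// Pad the class attribute with spaces so only a whole token can match.
$classname = 'MyTableName';
$query = "//table[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";

// Copying to an array keeps the iteration independent of the removals.
foreach (iterator_to_array($xpath->query($query)) as $table) {
    $table->parentNode->removeChild($table);
}

$result = $dom->saveHTML();
echo $result;
```

If $classname comes from user input, validate it first, since it is interpolated directly into the XPath expression.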
Resources
http://us2.php.net/manual/en/domnode.removechild.php
How to delete element with DOMDocument?
How to get full HTML from DOMXPath::query() method?
http://us2.php.net/manual/en/domnode.removechild.php
DomDocument is a DomNode. You can just call removeChild and you should be fine.
EDIT: I just noticed you were probably talking about the page you are currently working with. I don't know whether DomDocument would work there. You may want to use JavaScript at that point (if the page has already been served to the client).
