Using PHP 7.1 I want to count the number of nodes in the root of this string:
<p>Lorem</p>
<p>Ipsum</p>
<div>Dolores</div>
<b>Amet</b>
Using the following PHP:
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadHTML($content);
$root = $dom->documentElement;
$children = $root->childNodes;
var_dump($children);
Returns:
object(DOMNodeList)#4 (1) {
["length"]=>
int(1)
}
I don't understand why the string of HTML only returns as 1 node. Additionally, I am unable to iterate through the nodes.
After a nice conversation in chat with @bart we found a solution.
$content = "
<p>Lorem</p>
<p>Ipsum</p>
<div>Dolores</div>
<b>Amet</b>
";
$dom = new DOMDocument;
$dom->loadHTML($content);
$allElements = $dom->getElementsByTagName('*');
echo $allElements->length;
echo "<br />";
$node = array();
foreach($allElements as $element) {
if(array_key_exists($element->tagName, $node)) {
$node[$element->tagName] += 1;
} else {
$node[$element->tagName] = 1;
}
}
print_r($node);
P.S.: the html and body tags are added by loadHTML() and counted by default, increasing the result by 2.
For the record (and despite another answer being accepted), here is the correct way to list the child nodes :-). This includes the text nodes, which people forget are there!
<?php
$content = "
<p>Lorem</p>
<p>Ipsum</p>
<div>Dolores</div>
<b>Amet</b>
";
$dom = new DOMDocument;
$dom->loadHTML($content);
$nodes=[];
$bodyNodes = $dom->getElementsByTagName('body'); // returns DOMNodeList object
foreach($bodyNodes[0]->childNodes as $child) // assuming 1 <body> node
{
$nodes[]=$child->nodeName;
}
print_r($nodes);
Outputs this, illustrating the point...:
Array
(
[0] => p
[1] => #text
[2] => p
[3] => #text
[4] => div
[5] => #text
[6] => b
[7] => #text
)
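If only the elements at the root of the fragment should be counted, the text nodes can be filtered out by node type. This is a minimal sketch under the same assumption as above (loadHTML() wraps the fragment in a single <body>); the variable name $elementCount is only illustrative.

```php
<?php
$content = "
<p>Lorem</p>
<p>Ipsum</p>
<div>Dolores</div>
<b>Amet</b>
";

$dom = new DOMDocument;
$dom->loadHTML($content);

// The fragment's top-level nodes end up as children of the implied <body>.
$body = $dom->getElementsByTagName('body')->item(0);

$elementCount = 0;
foreach ($body->childNodes as $child) {
    // Count element nodes only; whitespace between tags shows up as #text nodes.
    if ($child->nodeType === XML_ELEMENT_NODE) {
        $elementCount++;
    }
}

echo $elementCount, "\n";
```

For the sample string this counts the two p elements, the div and the b, and ignores the four #text nodes listed above.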
Well I was already typing this answer up so I'll add it here anyway.
You have to iterate through the contents of a DOMNodeList object; it is not an array structure whose contents var_dump() and friends can show. When iterating with foreach, each item is an instance of DOMNode. The number of items in the DOMNodeList is stored in its length property.
$content = "
<p>Lorem</p>
<p>Ipsum</p>
<div>Dolores</div>
<b>Amet</b>
";
$dom = new DomDocument();
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$allElements = $dom->getElementsByTagName('*');
echo "We found $allElements->length elements\n";
foreach ($allElements as $element) {
echo "$element->tagName = $element->nodeValue\n";
}
Related
Why is the HTML attribute not displayed via XPath in PHP?
<?php
$content = '<div class="keep-me">Keep this div</div><div class="remove-me" id="test">Remove this div</div>';
$badClasses = array('');
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();
$xPath = new DOMXpath($dom);
foreach($badClasses as $badClass){
$domNodeList = $xPath->query('//div[@class="remove-me"]/@id');
$domElemsToRemove = ''; // container of deleted elements
foreach ( $domNodeList as $domElement ) {
$domElemsToRemove .= $dom->saveHTML($domElement); // concat them
$domElement->parentNode->removeChild($domElement); // then remove
}
}
$content = $dom->saveHTML();
echo htmlentities($domElemsToRemove);
?>
Works - //div[@class="remove-me"] or //div[@class="remove-me"]/text()
Not working - //div[@class="remove-me"]/@id
Maybe there is an easier way?
The XPath //div[@class="remove-me"]/@id is correct, but you need to just loop over the returned attribute nodes and add each nodeValue to a list of matching IDs...
$xPath = new DOMXpath($dom);
$domNodeList = $xPath->query('//div[@class="remove-me"]/@id');
$ids = []; // container of matching id values
foreach ( $domNodeList as $domElement ) {
$ids[] = $domElement->nodeValue;
}
print_r($ids);
If the aim is to fetch the ID of any element with the class "remove-me", which is how I interpret the question, then perhaps you can try something like this (untested, by the way):
.... other code before
$xp=new DOMXpath( $dom );
$col= $xp->query( '//*[@class="remove-me"]' ); // leading // searches the whole document, not just children of the root
if( $col->length > 0 ){
foreach($col as $node){
$id=$node->hasAttribute('id') ? $node->getAttribute('id') : 'banana';
echo $id;
}
}
However, the code in the question suggests that you want to delete nodes. In that case, build an array of the nodes (from the node list) and iterate through it from the end to the front, i.e. backwards.
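The backwards removal could be sketched roughly as follows. It reuses the HTML string from the question; the names $matches and $removedIds are illustrative and not part of the original code.

```php
<?php
$content = '<div class="keep-me">Keep this div</div>'
         . '<div class="remove-me" id="test">Remove this div</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select the elements themselves, not their id attributes.
$matches = $xpath->query('//div[@class="remove-me"]');

$removedIds = [];
// Walk the list from the last item to the first.
for ($i = $matches->length - 1; $i >= 0; $i--) {
    $node = $matches->item($i);
    if ($node->hasAttribute('id')) {
        $removedIds[] = $node->getAttribute('id'); // remember the id before removal
    }
    $node->parentNode->removeChild($node);
}

print_r($removedIds);
echo $dom->saveHTML();
```

Selecting the element and reading its id with getAttribute() avoids calling removeChild() on an attribute node, which is what the code in the question attempts.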
Given the following HTML string:
<div
class="example-class"
data-caption="Example caption"
data-link="https://www.example.com"
data-image-url="https://example.com/example.jpg">
</div>
How can I use PHP with xpath to output / retrieve an array with all attributes as key / value pairs?
Hoping for output like:
Array
(
[data-caption] => Example caption
[data-link] => https://www.example.com
[data-image-url] => https://example.com/example.jpg
)
// etc etc...
I know how to get individual attributes, but I'm hoping to do it in one fell swoop. Here's what I currently have:
function get_data($html = '') {
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div/#data-link');
foreach ($nodes as $node) {
var_dump($node);
}
}
Thanks!
In XPath, you can use @* to reference attributes of any name, for example:
$nodes = $xpath->query('//div/@*');
foreach ($nodes as $node) {
echo $node->nodeName ." : ". $node->nodeValue ."<br>";
}
eval.in demo
output :
class : example-class
data-caption : Example caption
data-link : https://www.example.com
data-image-url : https://example.com/example.jpg
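To get the key / value array the question asks for, the same @* query can fill an associative array instead of echoing. This is a small sketch; the function name get_attributes() is hypothetical, and it assumes a single div, since attributes with the same name on later divs would overwrite earlier ones.

```php
<?php
function get_attributes(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $attributes = [];
    // @* selects every attribute of the matched div elements.
    foreach ($xpath->query('//div/@*') as $attr) {
        $attributes[$attr->nodeName] = $attr->nodeValue;
    }
    return $attributes;
}

$html = '<div
    class="example-class"
    data-caption="Example caption"
    data-link="https://www.example.com"
    data-image-url="https://example.com/example.jpg">
</div>';

print_r(get_attributes($html));
```

The result also contains the class attribute. If only the data attributes are wanted, the expression //div/@*[starts-with(name(), "data-")] should narrow the selection.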
I think this should do what you want - or at least, give you the basis to proceed.
define('BR','<br />');
$strhtml='<div
class="example-class"
data-caption="Example caption"
data-link="https://www.example.com"
data-image-url="https://example.com/example.jpg">
</div>';
$dom=new DOMDocument;
$dom->loadHTML( $strhtml );
$xpath=new DOMXPath( $dom );
$col=$xpath->query('//div');
if( $col ){
foreach( $col as $node ) if( $node->nodeType==XML_ELEMENT_NODE ) {
foreach( $node->attributes as $attr ) echo $attr->nodeName.' '.$attr->nodeValue.BR;
}
}
$dom = $col = $xpath = null;
We have thousands of Closed Caption XML files that we have to import to a database as plain text, as well as preserve the HTML markup for conversion to another CC format. I have been able to extract the plain text quite easily, but can't seem to find the correct way of extracting the raw HTML as well.
Is there a way to accomplish something like "->htmlContent" in the same way that ->textContent works below?
$ctx = stream_context_create(array('http' => array('timeout' => 60)));
$xml = @file_get_contents('http://blah-blah-blah/16TH.xml', 0, $ctx);
$dom = new DOMDocument;
$dom->loadXML($xml);
$ptags = $dom->getElementsByTagName( "p" );
foreach( $ptags as $p ) {
$text = $p->textContent;
}
Typical <p> being processed:
<p begin="00:00:14.83" end="00:00:18.83" tts:textAlign="left">
<metadata ccrow="12" cccol="8"/>
(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS
</p>
Successful ->textContent Result
(male narrator) THE 16TH AND 17TH CENTURIES WERE THE FORMATIVE 200 YEARS
Desired HTML Result
(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS
In other words, you would like to save specific nodes: br elements and text nodes. You can do this with DOM + XPath:
$document = new DOMDocument();
$document->preserveWhiteSpace = false;
$document->loadXml($html);
$xpath = new DOMXpath($document);
foreach ($xpath->evaluate('//p') as $p) {
$content = '';
foreach ($xpath->evaluate('.//br|.//text()', $p) as $node) {
$content .= $document->saveHtml($node);
}
var_dump($content);
}
Output:
string(86) "
(male narrator)<br> THE 16TH AND 17TH CENTURIES<br> WERE THE FORMATIVE 200 YEARS
"
The Xpath Expression
Any descendant br: .//br
Any descendant text node: .//text()
Combined expression: .//br|.//text()
Namespaces
If your XML uses namespaces, you have to register and use them.
$document = new DOMDocument();
$document->preserveWhiteSpace = false;
$document->loadXml($html);
$xpath = new DOMXpath($document);
$xpath->registerNamespace('tt', 'http://www.w3.org/2006/04/ttaf1');
foreach ($xpath->evaluate('//tt:p') as $p) {
$content = '';
foreach ($xpath->evaluate('.//tt:br|.//text()', $p) as $node) {
$content .= $document->saveHtml($node);
}
var_dump($content);
}
I couldn't see the forest for the trees...quite a simple solution after I realized that strip_tags() was failing because of the closing tags of the BR tag:
foreach( $ptags as $p ) {
$text = $p->textContent;
$html = $p->ownerDocument->saveXML($p); // Raw HTML
$html = str_ireplace('<br></br>','<br>',$html); // Cleanup the BR usage
$html = strip_tags($html,'<br>'); // Strip the tags I don't need
}
There's likely a more elegant solution with the DOM, or with regex, but this did get it done.
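One possible DOM-only variant, sketched under some assumptions: the sample <p> is simplified (the tts:textAlign attribute is dropped so that no namespace declaration is needed, and a <body> wrapper is added), and only direct children of <p> are inspected. It keeps text nodes, turns each br element into a plain <br>, and skips everything else, such as <metadata>.

```php
<?php
// Simplified stand-in for one caption paragraph (assumption: no namespaces).
$xml = '<body><p begin="00:00:14.83" end="00:00:18.83">'
     . '<metadata ccrow="12" cccol="8"/>'
     . '(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS'
     . '</p></body>';

$dom = new DOMDocument;
$dom->loadXML($xml);

foreach ($dom->getElementsByTagName('p') as $p) {
    $html = '';
    foreach ($p->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            // saveXML() on a text node returns the escaped text.
            $html .= $dom->saveXML($child);
        } elseif ($child->nodeType === XML_ELEMENT_NODE && $child->localName === 'br') {
            // Normalise <br></br> to a single <br>.
            $html .= '<br>';
        }
        // Any other node (for example <metadata>) is skipped.
    }
    echo $html, "\n";
}
```

This avoids the str_ireplace() and strip_tags() steps, at the cost of ignoring markup nested deeper than one level inside <p>.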
Here is an XML bit:
[11] => SimpleXMLElement Object
(
[@attributes] => Array
(
[id] => 46e8f57e67db48b29d84dda77cf0ef51
[label] => Publications
)
[section] => Array
(
[0] => SimpleXMLElement Object
(
[@attributes] => Array
(
[id] => 9a34d6b273914f18b2273e8de7c48fd6
[label] => Journal Articles
[recordId] => 1a5a5710b0e0468e92f9a2ced92906e3
)
I know the value "46e8f57e67db48b29d84dda77cf0ef51" but its location varies across files. Can I use XPath to find the path to this value? If not what could be used?
Latest trial that does not work:
$search = $xml->xpath("//text()=='047ec63e32fe450e943cb678339e8102'");
while(list( , $node) = each($search)) {
echo '047ec63e32fe450e943cb678339e8102',$node,"\n";
}
PHP's DOMNode objects have a method for that: DOMNode::getNodePath()
$xml = <<<'XML'
<root>
<child key="1">
<child key="2"/>
<child key="3"/>
</child>
</root>
XML;
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
$nodes = $xpath->evaluate('//child');
foreach ($nodes as $node) {
var_dump($node->getNodePath());
}
Output:
string(11) "/root/child"
string(20) "/root/child/child[1]"
string(20) "/root/child/child[2]"
SimpleXML is a wrapper for DOM, and there is a function that lets you get the DOMNode for a SimpleXMLElement: dom_import_simplexml().
$xml = <<<'XML'
<root>
<child key="1">
<child key="2"/>
<child key="3"/>
</child>
</root>
XML;
$structure = simplexml_load_string($xml);
$elements = $structure->xpath('//child');
foreach ($elements as $element) {
$node = dom_import_simplexml($element);
var_dump($node->getNodePath());
}
To fetch an element by its attribute, XPath can be used.
Select all element nodes anywhere in the document using the element wildcard:
//*
Filter them by the id attribute:
//*[@id = "46e8f57e67db48b29d84dda77cf0ef51"]
$dom = new DOMDocument();
$dom->loadXml('<node id="46e8f57e67db48b29d84dda77cf0ef51"/>');
$xpath = new DOMXpath($dom);
foreach ($xpath->evaluate('//*[@id = "46e8f57e67db48b29d84dda77cf0ef51"]') as $node) {
var_dump(
$node->getNodePath()
);
}
Is this string always in the @id attribute? Then a valid and distinct path is always //*[@id='46e8f57e67db48b29d84dda77cf0ef51'], no matter where it is.
To construct a path to a given node, use $node->getNodePath(), which returns an XPath expression for the current node. Also take into account this answer on constructing XPath expressions using @id attributes, similar to what Firebug does.
For SimpleXML you will have to do everything by hand. This code only supports element nodes; if you need to support attribute and other paths, you will have to add that yourself.
$results = $xml->xpath("/highways/route[66]");
foreach($results as $result) {
$path = "";
while (true) {
// Is there an @id attribute? Shorten the path.
if ($id = $result['id']) {
$path = "//".$result->getName()."[@id='".(string) $id."']".$path;
break;
}
// Determine preceding and following elements, build a position predicate from it.
$preceding = $result->xpath("preceding-sibling::".$result->getName());
$following = $result->xpath("following-sibling::".$result->getName());
$predicate = (count($preceding) + count($following)) > 0 ? "[".(count($preceding)+1)."]" : "";
$path = "/".$result->getName().$predicate.$path;
// Is there a parent node? Then go on.
$result = $result->xpath("parent::*");
if (count($result) > 0) $result = $result[0];
else break;
}
echo $path."\n";
}
$tags = array(
"applet" => 1,
"script" => 1
);
$html = file_get_contents("test.html");
$dom = new DOMdocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$body = $xpath->query("//body")->item(0);
I want to loop through the "body" of the web page and remove all unwanted tags listed in the $tags array, but I can't find a way. How can I do it?
Have you considered HTML Purifier? Writing your own HTML sanitizer is just reinventing the wheel, and it isn't easy to get right.
Furthermore, a blacklist approach is also bad; see SO/why-use-a-whitelist-for-html-sanitizing.
You may also be interested in reading how to configure allowed tags & attributes, or in testing the HTML Purifier demo.
$tags = array(
    "applet" => 1,
    "script" => 1
);
$html = file_get_contents("test.html");
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// $tags is keyed by tag name, so iterate over its keys rather than numeric indexes.
foreach (array_keys($tags) as $tag) {
    // query() returns a static list, so nodes can be removed while looping over it.
    foreach ($xpath->query("//" . $tag) as $node) {
        $node->parentNode->removeChild($node);
    }
}
$string = $dom->saveXML();
Something like that.