How can I get all attributes with PHP xpath? - php

Given the following HTML string:
<div
class="example-class"
data-caption="Example caption"
data-link="https://www.example.com"
data-image-url="https://example.com/example.jpg">
</div>
How can I use PHP with xpath to output / retrieve an array with all attributes as key / value pairs?
Hoping for output like:
Array
(
[data-caption] => Example caption
[data-link] => https://www.example.com
[data-image-url] => https://example.com/example.jpg
)
// etc etc...
I know how to get individual attributes, but I'm hoping to do it in one fell swoop. Here's what I currently have:
function get_data($html = '') {
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div/@data-link');
foreach ($nodes as $node) {
var_dump($node);
}
}
Thanks!

In XPath, you can use @* to reference attributes of any name, for example:
$nodes = $xpath->query('//div/@*');
foreach ($nodes as $node) {
echo $node->nodeName ." : ". $node->nodeValue ."<br>";
}
output :
class : example-class
data-caption : Example caption
data-link : https://www.example.com
data-image-url : https://example.com/example.jpg
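To get the key / value array the question asks for, the same //div/@* query can be collected into an associative array. This is a minimal sketch, assuming the HTML from the question; the function name get_attributes is made up for illustration:

```php
<?php
// Hypothetical helper (not from the original answers): collects every
// attribute of every <div> into one associative array.
function get_attributes(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $attributes = [];
    foreach ($xpath->query('//div/@*') as $attr) {
        // If several divs share an attribute name, the last one wins.
        $attributes[$attr->nodeName] = $attr->nodeValue;
    }
    return $attributes;
}

$html = '<div
class="example-class"
data-caption="Example caption"
data-link="https://www.example.com"
data-image-url="https://example.com/example.jpg">
</div>';

$result = get_attributes($html);
print_r($result);
```

Note that the result also contains the class attribute. To keep only the data attributes, a query such as //div/@*[starts-with(name(), "data-")] should do it.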

I think this should do what you want - or at least, give you the basis to proceed.
define('BR','<br />');
$strhtml='<div
class="example-class"
data-caption="Example caption"
data-link="https://www.example.com"
data-image-url="https://example.com/example.jpg">
</div>';
$dom=new DOMDocument;
$dom->loadHTML( $strhtml );
$xpath=new DOMXPath( $dom );
$col=$xpath->query('//div');
if( $col ){
foreach( $col as $node ) if( $node->nodeType==XML_ELEMENT_NODE ) {
foreach( $node->attributes as $attr ) echo $attr->nodeName.' '.$attr->nodeValue.BR;
}
}
$dom = $col = $xpath = null;

Related

Why is the HTML attribute not displayed via XPath in PHP?

<?php
$content = '<div class="keep-me">Keep this div</div><div class="remove-me" id="test">Remove this div</div>';
$badClasses = array('');
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();
$xPath = new DOMXpath($dom);
foreach($badClasses as $badClass){
$domNodeList = $xPath->query('//div[@class="remove-me"]/@id');
$domElemsToRemove = ''; // container of deleted elements
foreach ( $domNodeList as $domElement ) {
$domElemsToRemove .= $dom->saveHTML($domElement); // concat them
$domElement->parentNode->removeChild($domElement); // then remove
}
}
$content = $dom->saveHTML();
echo htmlentities($domElemsToRemove);
?>
Works: //div[@class="remove-me"] or //div[@class="remove-me"]/text()
Not working: //div[@class="remove-me"]/@id
Is there perhaps an easier way?
The XPath //div[@class="remove-me"]/@id is correct; you just need to loop over the returned attribute nodes and add each nodeValue to a list of matching IDs:
$xPath = new DOMXpath($dom);
$domNodeList = $xPath->query('//div[@class="remove-me"]/@id');
$ids = []; // container for the matching IDs
foreach ( $domNodeList as $domElement ) {
$ids[] = $domElement->nodeValue;
}
print_r($ids);
If the aim is to fetch the ID of any element with the class "remove-me", which is how I interpret the question, then perhaps you can try something like this (untested):
// ... other code before
$xp=new DOMXpath( $dom );
$col= $xp->query( '//*[@class="remove-me"]' );
if( $col->length > 0 ){
foreach($col as $node){
$id=$node->hasAttribute('id') ? $node->getAttribute('id') : 'banana';
echo $id;
}
}
However, the code in the question suggests that you want to delete nodes. In that case, build an array of nodes (a node list) and iterate through it from the end to the front, i.e. backwards.
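The backwards iteration could look roughly like this. It is only a sketch, reusing the $content string from the question; the variable names $removedIds and $remaining are made up for illustration:

```php
<?php
$content = '<div class="keep-me">Keep this div</div><div class="remove-me" id="test">Remove this div</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select the elements themselves, not their id attributes,
// so that each match can be removed from its parent.
$nodes = $xpath->query('//div[@class="remove-me"]');

$removedIds = [];
// Walk the list from the last item to the first.
for ($i = $nodes->length - 1; $i >= 0; $i--) {
    $node = $nodes->item($i);
    $removedIds[] = $node->getAttribute('id');
    $node->parentNode->removeChild($node);
}

$remaining = $dom->saveHTML();
print_r($removedIds);
echo htmlentities($remaining);
```

Reading the id with getAttribute() before removing the element avoids calling removeChild() on an attribute node, which is what the /@id query in the question ends up doing.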

How to parse body class with Xpath?

I'm trying to parse a page with XPath, but I can't get the body class.
Here is what I'm trying:
<?php
$url = 'http://figurinepop.com/mickey-paintbrush-disney-funko';
$html = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//link[@rel="canonical"]/@href');
foreach($nodes as $node) {
$canonical = $node->nodeValue;
}
$nodes = $xpath->query('//html/body/@class');
foreach($nodes as $node) {
$bodyclass = $node->nodeValue;
}
$output['canonical'] = $canonical;
$output['bodyclass'] = $bodyclass;
echo '<pre>'; print_r ($output); echo '</pre>';
?>
Here is what I get :
Array
(
[canonical] => http://figurinepop.com/mickey-paintbrush-disney-funko
[bodyclass] =>
)
It works for many elements (title, canonical, div...) but not for the body class.
I've tested the XPath query with a Chrome extension and it seems to be well formed.
What is wrong?

simple_html_dom for faster retrieval

I am trying to do some web scraping using simple_html_dom, but I only want the inner text of a single span element. Do I have to load the entire page for that? It takes a lot of time because I am running it in a loop. What are the alternatives to do this faster?
Here is what I am doing now:
$html = file_get_html($url);
foreach($html->find('span') as $element) {
if($element->innertext=="some text") {
$html->clear();
unset($html);
break;
}
else {
//do something
}
}
This is too slow when it is used inside a loop. Is there a faster way to do this?
I am not sure about the speed, but instead of a foreach loop you can ask find() for a single element by index:
$html->find( $selector, $idx )
<?php
$html = file_get_html( $url );
if ( is_object( $html ) ) {
if ( $span = $html->find( "span", 0 ) ) {
$span->innertext = "some text";
}
}
?>
You could give the following a try:
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$content = $xpath->query("//span")->item(0)->nodeValue;
echo $content;
Filtering inside the XPath expression itself is probably the fastest option:
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$content = $xpath->query("//span[contains(text(), 'some text')]")->item(0)->nodeValue;
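One caveat with the two snippets above: item(0) returns null when nothing matches, so reading ->nodeValue directly fails on pages without a matching span. Here is a guarded sketch; the inline $html string is an assumption that stands in for the downloaded page so the example runs on its own:

```php
<?php
// Stand-in markup (assumption); in the real script this would come from $url.
$html = '<div><span>other</span><span>some text here</span></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Let XPath do the filtering, then check that something was found.
$node = $xpath->query("//span[contains(text(), 'some text')]")->item(0);
$content = $node !== null ? $node->nodeValue : null;

// A query that matches nothing yields null instead of an error.
$missing = $xpath->query("//span[contains(text(), 'no such text')]")->item(0);

var_dump($content, $missing);
```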

How to loop through all the children under a tag in PHP DOMDocument

I have the following html
$html = '<body><div style="font-color:#000">Hello</div>
<span style="what">My name is rasid</span><div>new to you
</div><div style="rashid">New here</div></body>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$elements = $dom->getElementsByTagName('body');
I have tried
foreach($elements as $child)
{
echo $child->nodeName;
}
The output is
body
But I need to loop through all the tags under body, not the body itself. How can I do that?
In the example above, I have also tried replacing
$elements = $dom->getElementsByTagName('body');
with
$elements = $dom->getElementsByTagName('body')->item(0);
But it gives an error. Is there any solution?
Try this:
$elements = $dom->getElementsByTagName('*');
$i = 1; // counter to start output at the 3rd element, since the foreach loop below visits "html body div span div div"
foreach($elements as $child)
{
if ($i > 2) echo $child->nodeName."<br>"; //output "div span div div"
++$i;
}
If you only want child nodes of the body element, you can use:
$body = $dom->getElementsByTagName( 'body' )->item( 0 );
foreach( $body->childNodes as $node )
{
echo $node->nodeName . PHP_EOL;
}
If you want all descendant nodes of the body element, you could use DOMXPath:
$xpath = new DOMXPath( $dom );
$bodyDescendants = $xpath->query( '//body//node()' );
foreach( $bodyDescendants as $node )
{
echo $node->nodeName . PHP_EOL;
}
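Both variants above can also return non-element nodes such as text nodes (node() matches every node type, and childNodes is not limited to elements). If only the tags directly under body are wanted, an XPath query for body/* is one option. Here is a sketch using the HTML from the question; the $names array is made up for illustration:

```php
<?php
$html = '<body><div style="font-color:#000">Hello</div>
<span style="what">My name is rasid</span><div>new to you
</div><div style="rashid">New here</div></body>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$names = [];
// "*" matches element nodes only, so text nodes are skipped.
foreach ($xpath->query('//body/*') as $element) {
    $names[] = $element->nodeName;
}

echo implode(' ', $names) . PHP_EOL;
```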
Use this code to output every element in the document:
$elements = $dom->getElementsByTagName('*');
foreach($elements as $child)
{
echo $child->nodeName;
}

Trying to use PHP DOM to replace node text without changing child nodes

I am trying to use the dom object to simplify the implementation of a glossary tooltip. What I need to do is to replace a text element in a paragraph, but NOT in an anchor tag that may be embedded in the paragraph.
$html = '<p>Replace this tag not this <a>tag</a></p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
foreach ($nodes as $node) {
$node->nodeValue = str_replace("tag","element",$node->nodeValue);
}
echo $document->saveHTML();
I get:
'...<p>Replace this element not this element</p>...'
I want:
'...<p>Replace this element not this <a>tag</a></p>...'
How do I implement this such that only the parent node text is changed and the child node (a tag) is not changed?
Try this:
$html = '<p>Replace this tag not this <a>tag</a></p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
foreach ($nodes as $node) {
while( $node->hasChildNodes() ) {
$node = $node->childNodes->item(0);
}
$node->nodeValue = str_replace("tag","element",$node->nodeValue);
}
echo $document->saveHTML();
Hope this helps.
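An alternative to descending to the first child is to select only the text nodes that sit directly under the paragraph. This is a sketch, assuming the paragraph contains an anchor as the question describes (the exact markup of that anchor is an assumption):

```php
<?php
// Assumed input: the second "tag" is wrapped in an anchor.
$html = '<p>Replace this tag not this <a>tag</a></p>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// //p/text() matches text nodes whose parent is a <p>,
// so the text inside the <a> is left alone.
foreach ($xpath->query('//p/text()') as $textNode) {
    $textNode->nodeValue = str_replace('tag', 'element', $textNode->nodeValue);
}

$out = $document->saveHTML();
echo $out;
```

Unlike the first-child loop, this also covers text that follows the anchor inside the same paragraph.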
UPDATE
To answer @paul's question in the comments below, you can create the replacement element like this:
$html = '<p>Replace this tag not this <a>tag</a></p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
//create the element which should replace the text in the original string
$elem = $document->createElement( 'dfn', 'tag' );
$attr = $document->createAttribute('title');
$attr->value = 'element';
$elem->appendChild( $attr );
foreach ($nodes as $node) {
while( $node->hasChildNodes() ) {
$node = $node->childNodes->item(0);
}
//dump the new string here, which replaces the source string
$node->nodeValue = str_replace("tag",$document->saveHTML($elem),$node->nodeValue);
}
echo $document->saveHTML();
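Be aware that a string assigned to nodeValue is stored as text, so markup in it is escaped when the document is serialized. To insert a real element, one option is to split the text node around the match and swap the middle piece for the new element. This is a sketch under the same assumption about the anchor; it handles only the first match per text node and uses byte offsets, which is fine for ASCII text like this:

```php
<?php
// Assumed input: the second "tag" is wrapped in an anchor.
$html = '<p>Replace this tag not this <a>tag</a></p>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

foreach ($xpath->query('//p/text()') as $textNode) {
    $pos = strpos($textNode->nodeValue, 'tag');
    if ($pos === false) {
        continue;
    }
    // Split so that $match starts exactly at "tag" ...
    $match = $textNode->splitText($pos);
    // ... and ends right after it; the tail becomes its own text node.
    $match->splitText(strlen('tag'));

    // Build <dfn title="element">tag</dfn> and swap it in for the match.
    $dfn = $document->createElement('dfn', 'tag');
    $dfn->setAttribute('title', 'element');
    $match->parentNode->replaceChild($dfn, $match);
}

$out = $document->saveHTML();
echo $out;
```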
