Why is an HTML attribute not displayed via XPath in PHP?
<?php
$content = '<div class="keep-me">Keep this div</div><div class="remove-me" id="test">Remove this div</div>';
$badClasses = array('');
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();
$xPath = new DOMXpath($dom);
foreach($badClasses as $badClass){
    $domNodeList = $xPath->query('//div[@class="remove-me"]/@id');
    $domElemsToRemove = ''; // container of deleted elements
    foreach ( $domNodeList as $domElement ) {
        $domElemsToRemove .= $dom->saveHTML($domElement); // concat them
        $domElement->parentNode->removeChild($domElement); // then remove
    }
}
$content = $dom->saveHTML();
echo htmlentities($domElemsToRemove);
?>
Works: //div[@class="remove-me"] or //div[@class="remove-me"]/text()
Does not work: //div[@class="remove-me"]/@id
Maybe there is an easier way?

The XPath //div[@class="remove-me"]/@id is correct, but it returns attribute nodes rather than elements, so you just need to loop over the returned nodes and add each nodeValue to a list of matching IDs:
$xPath = new DOMXpath($dom);
$domNodeList = $xPath->query('//div[@class="remove-me"]/@id');
$ids = []; // container of matching IDs
foreach ( $domNodeList as $domElement ) {
    $ids[] = $domElement->nodeValue;
}
print_r($ids);

If the aim is to fetch the ID of any element with the class "remove-me", which is how I interpret the question, then perhaps you can try something like this (untested, by the way):
// ... other code before
$xp = new DOMXpath( $dom );
$col = $xp->query( '//*[@class="remove-me"]' );
if( $col->length > 0 ){
    foreach( $col as $node ){
        $id = $node->hasAttribute('id') ? $node->getAttribute('id') : 'banana';
        echo $id;
    }
}
However, the code in the question suggests that you wish to delete nodes. In that case, build an array of nodes (a node list) and iterate through it from the end to the front, i.e. backwards.
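The backwards iteration described above could look roughly like the following. This is only a sketch that reuses the markup from the question; the names $nodes and $removedIds are made up for illustration.

```php
<?php
$content = '<div class="keep-me">Keep this div</div><div class="remove-me" id="test">Remove this div</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // silence parser warnings for the fragment
$dom->loadHTML($content);
libxml_clear_errors();

$xPath = new DOMXPath($dom);
// Select the elements themselves, not their id attributes.
$nodes = $xPath->query('//div[@class="remove-me"]');

$removedIds = [];
// Walk the node list from the end to the front.
for ($i = $nodes->length - 1; $i >= 0; $i--) {
    $node = $nodes->item($i);
    if ($node->hasAttribute('id')) {
        $removedIds[] = $node->getAttribute('id'); // remember the id before removal
    }
    $node->parentNode->removeChild($node);
}

print_r($removedIds);
echo $dom->saveHTML();
```

Reading the id from the element before removing it avoids calling removeChild() on an attribute node, which is what the code in the question ends up doing.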

Related

How to parse body class with Xpath?

I'm trying to parse a page with XPath, but I can't manage to get the body class.
Here is what I'm trying:
<?php
$url = 'http://figurinepop.com/mickey-paintbrush-disney-funko';
$html = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//link[@rel="canonical"]/@href');
foreach($nodes as $node) {
    $canonical = $node->nodeValue;
}
$nodes = $xpath->query('//html/body/@class');
foreach($nodes as $node) {
    $bodyclass = $node->nodeValue;
}
$output['canonical'] = $canonical;
$output['bodyclass'] = $bodyclass;
echo '<pre>'; print_r ($output); echo '</pre>';
?>
Here is what I get :
Array
(
[canonical] => http://figurinepop.com/mickey-paintbrush-disney-funko
[bodyclass] =>
)
It works for many elements (title, canonical, div...) but not for the body class.
I've tested the XPath query with a Chrome extension and it seems to be well written.
What is wrong?

How can I get all attributes with PHP xpath?

Given the following HTML string:
<div
class="example-class"
data-caption="Example caption"
data-link="https://www.example.com"
data-image-url="https://example.com/example.jpg">
</div>
How can I use PHP with xpath to output / retrieve an array with all attributes as key / value pairs?
Hoping for output like:
Array
(
[data-caption] => Example caption
[data-link] => https://www.example.com
[data-image-url] => https://example.com/example.jpg
)
// etc etc...
I know how to get individual attributes, but I'm hoping to do it in one fell swoop. Here's what I currently have:
function get_data($html = '') {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//div/@data-link');
    foreach ($nodes as $node) {
        var_dump($node);
    }
}
Thanks!
In XPath, you can use @* to reference attributes of any name, for example:
$nodes = $xpath->query('//div/@*');
foreach ($nodes as $node) {
    echo $node->nodeName ." : ". $node->nodeValue ."<br>";
}
output :
class : example-class
data-caption : Example caption
data-link : https://www.example.com
data-image-url : https://example.com/example.jpg
I think this should do what you want - or at least, give you the basis to proceed.
define('BR','<br />');
$strhtml='<div
class="example-class"
data-caption="Example caption"
data-link="https://www.example.com"
data-image-url="https://example.com/example.jpg">
</div>';
$dom=new DOMDocument;
$dom->loadHTML( $strhtml );
$xpath=new DOMXPath( $dom );
$col=$xpath->query('//div');
if( $col ){
foreach( $col as $node ) if( $node->nodeType==XML_ELEMENT_NODE ) {
foreach( $node->attributes as $attr ) echo $attr->nodeName.' '.$attr->nodeValue.BR;
}
}
$dom = $col = $xpath = null;
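To get the key/value array the question asks for, the same attribute loop can fill an array instead of echoing. A minimal sketch, assuming one result array per div is wanted; get_attributes() is a hypothetical helper name, not something from the answers above.

```php
<?php
// Hypothetical helper: returns one array of attribute name => value pairs
// for every <div> found in the given HTML string.
function get_attributes(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $result = [];
    foreach ($xpath->query('//div') as $div) {
        $attributes = [];
        foreach ($div->attributes as $attr) {
            $attributes[$attr->nodeName] = $attr->nodeValue;
        }
        $result[] = $attributes;
    }
    return $result;
}

$html = '<div class="example-class" data-caption="Example caption" data-link="https://www.example.com" data-image-url="https://example.com/example.jpg"></div>';
$all = get_attributes($html);
print_r($all[0]);
```

Note that this also includes the class attribute; filter on a "data-" name prefix if only the data attributes are wanted.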

simple_html_dom for faster retrieval

I am trying to do some web scraping using simple_html_dom. But I just want inner text of a span element only. Do I have to load the entire page for that? It is taking a lot of time since I am running it in a loop. What are other alternatives to do this faster?
Here is what I am doing now-
$html = file_get_html($url);
foreach($html->find('span') as $element) {
    if($element->innertext=="some text") {
        $html->clear();
        unset($html);
        break;
    }
    else {
        //do something
    }
}
This is too slow if this is used inside a loop. Faster way to do this?
I am not sure about the speed, but instead of a foreach loop you can pass an index to find(), i.e. $html->find( $selector, $idx ), to fetch a single element:
<?php
$html = file_get_html( $url );
if ( is_object( $html ) ) {
if ( $span = $html->find( "span", 0 ) ) {
$span->innertext = "some text";
}
}
?>
You could give the following a try:
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$content = $xpath->query("//span")->item(0)->nodeValue;
echo $content;
Probably faster still, because the text filter runs inside the XPath query itself:
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$content = $xpath->query("//span[contains(text(), 'some text')]")->item(0)->nodeValue;
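Two caveats with the queries above: contains() also matches longer strings that merely include "some text", and item(0) returns null when nothing matches. The following is a sketch of an exact-match variant with a guard. The inline markup stands in for the downloaded page and is purely an assumption for the example.

```php
<?php
// Inline markup replaces the remote page so the sketch is self-contained.
$html = '<html><body><span>other</span><span> some text </span><span>some text here</span></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // real-world pages often trigger parser warnings
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// normalize-space(.) trims the span's text, so only an exact match is selected.
$matches = $xpath->query("//span[normalize-space(.) = 'some text']");

// Guard against an empty result before reading nodeValue.
$found = $matches->length > 0 ? trim($matches->item(0)->nodeValue) : null;
var_dump($found);
```

This does not change the cost of downloading and parsing the page, which is likely where most of the time goes.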

Trying to use PHP DOM to replace node text without changing child nodes

I am trying to use the dom object to simplify the implementation of a glossary tooltip. What I need to do is to replace a text element in a paragraph, but NOT in an anchor tag that may be embedded in the paragraph.
$html = '<p>Replace this tag not this tag</p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
foreach ($nodes as $node) {
$node->nodeValue = str_replace("tag","element",$node->nodeValue);
}
echo $document->saveHTML();
I get:
'...<p>Replace this element not this element</p>...'
I want:
'...<p>Replace this element not this tag</p>...'
How do I implement this such that only the parent node text is changed and the child node (a tag) is not changed?
Try this:
$html = '<p>Replace this tag not this tag</p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
foreach ($nodes as $node) {
while( $node->hasChildNodes() ) {
$node = $node->childNodes->item(0);
}
$node->nodeValue = str_replace("tag","element",$node->nodeValue);
}
echo $document->saveHTML();
Hope this helps.
UPDATE
To answer @paul's question in the comments below, you can create
$html = '<p>Replace this tag not this tag</p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
//create the element which should replace the text in the original string
$elem = $document->createElement( 'dfn', 'tag' );
$attr = $document->createAttribute('title');
$attr->value = 'element';
$elem->appendChild( $attr );
foreach ($nodes as $node) {
while( $node->hasChildNodes() ) {
$node = $node->childNodes->item(0);
}
//dump the new string here, which replaces the source string
$node->nodeValue = str_replace("tag",$document->saveHTML($elem),$node->nodeValue);
}
echo $document->saveHTML();
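An alternative that does not depend on the text being in the first leaf node is to select only the text nodes that are direct children of the paragraph. A minimal sketch: it assumes the second "tag" is wrapped in an anchor, as the question describes; the exact markup is a guess.

```php
<?php
// Assumption: the second "tag" sits inside an anchor, as described in the question.
$html = '<p>Replace this tag not this <a href="#">tag</a></p>';

$document = new DOMDocument();
$document->loadHTML($html);

$xpath = new DOMXPath($document);
// //p/text() selects only text nodes that are direct children of <p>,
// so the text inside the nested <a> is left untouched.
foreach ($xpath->query('//p/text()') as $textNode) {
    $textNode->nodeValue = str_replace('tag', 'element', $textNode->nodeValue);
}

$result = $document->saveHTML();
echo $result;
```

Because nodeValue is plain text, any markup assigned to it is escaped on output. Inserting a real dfn element would mean splitting the text node and appending the element as a node instead.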

Remove empty tags from a XML with PHP

Question
How can I remove empty xml tags in PHP?
Example:
$value1 = "2";
$value2 = "4";
$value3 = "";
$xml = '<parentnode>
<tag1> ' .$value1. '</tag1>
<tag2> ' .$value2. '</tag2>
<tag3> ' .$value3. '</tag3>
</parentnode>';
XML Result:
<parentnode>
<tag1>2</tag1>
<tag2>4</tag2>
<tag3></tag3> // <- Empty tag
</parentnode>
What I want!
<parentnode>
<tag1>2</tag1>
<tag2>4</tag2>
</parentnode>
The XML without the empty tags like "tag3"
Thanks!
You can use XPath with the predicate not(node()) to select all elements that do not have child nodes.
<?php
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->loadxml('<parentnode>
<tag1>2</tag1>
<tag2>4</tag2>
<tag3></tag3>
<tag2>4</tag2>
<tag3></tag3>
<tag2>4</tag2>
<tag3></tag3>
</parentnode>');
$xpath = new DOMXPath($doc);
foreach( $xpath->query('//*[not(node())]') as $node ) {
$node->parentNode->removeChild($node);
}
$doc->formatOutput = true;
echo $doc->savexml();
prints
<?xml version="1.0"?>
<parentnode>
<tag1>2</tag1>
<tag2>4</tag2>
<tag2>4</tag2>
<tag2>4</tag2>
</parentnode>
This works recursively and removes nodes that:
contain only whitespace
do not have attributes
do not have child nodes
// not(*): the element has no child elements
// not(@*): the element has no attributes
// not(text()[normalize-space()]): the element has no text node with non-whitespace content
while (($node_list = $xpath->query('//*[not(*) and not(@*) and not(text()[normalize-space()])]')) && $node_list->length) {
    foreach ($node_list as $node) {
        $node->parentNode->removeChild($node);
    }
}
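For reference, here is a self-contained run of that loop. The sample document, including the group and tag4 elements, is made up to show the recursive effect and the attribute check.

```php
<?php
// Made-up sample: <group> only becomes empty after its children are removed,
// and <tag4> is empty but carries an attribute.
$xml = '<parentnode><tag1>2</tag1><group><tag2> </tag2><tag3></tag3></group><tag4 id="x"/></parentnode>';

$doc = new DOMDocument;
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

$query = '//*[not(*) and not(@*) and not(text()[normalize-space()])]';
// Repeat the query until it no longer finds anything to remove.
while (($nodeList = $xpath->query($query)) && $nodeList->length) {
    foreach ($nodeList as $node) {
        $node->parentNode->removeChild($node);
    }
}

$result = $doc->saveXML($doc->documentElement);
echo $result;
```

The first pass removes tag2 and tag3, the second pass removes the now-empty group, and tag4 survives because of its id attribute.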
$dom = new DOMDocument;
$dom->loadXML($xml);
// getElementsByTagName() returns a live list, so copy it before removing nodes
$elements = iterator_to_array($dom->getElementsByTagName('*'));
foreach($elements as $element) {
    if ( ! $element->hasChildNodes() OR $element->nodeValue == '') {
        $element->parentNode->removeChild($element);
    }
}
echo $dom->saveXML();
The solution that worked with my production PHP SimpleXMLElement object code, by using Xpath, was:
/*
* Remove empty (no children) and blank (no text) XML element nodes, but not an empty root element (/child::*).
* This does not work recursively; meaning after empty child elements are removed, parents are not reexamined.
*/
foreach( $this->xml->xpath('/child::*//*[not(*) and not(text()[normalize-space()])]') as $emptyElement ) {
unset( $emptyElement[0] );
}
Note that it is not required to use PHP DOM, DOMDocument, DOMXPath, or dom_import_simplexml().
// This is a recursive option: repeat until a pass removes nothing
do {
    $removed = false;
    foreach( $this->xml->xpath('/child::*//*[not(*) and not(text()[normalize-space()])]') as $emptyElement ) {
        unset( $emptyElement[0] );
        $removed = true;
    }
} while ($removed) ;
If you're going to be doing a lot of this, just do something like:
$value[] = "2";
$value[] = "4";
$value[] = "";
$xml = '<parentnode>';
for($i=1,$m=count($value); $i<$m+1; $i++)
$xml .= !empty($value[$i-1]) ? "<tag{$i}>{$value[$i-1]}</tag{$i}>" : null;
$xml .= '</parentnode>';
echo $xml;
Ideally though, you should probably use DOMDocument.
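A rough sketch of what the DOMDocument route might look like, building the XML node by node and skipping empty values, so there is nothing to clean up afterwards. The variable names are illustrative only.

```php
<?php
$values = ['2', '4', ''];

$doc = new DOMDocument('1.0');
$parent = $doc->appendChild($doc->createElement('parentnode'));

foreach ($values as $index => $value) {
    if ($value === '') {
        continue; // skip empty values so no empty tag is ever created
    }
    $tag = $doc->createElement('tag' . ($index + 1));
    // Text nodes are escaped on output, so special characters are safe here.
    $tag->appendChild($doc->createTextNode($value));
    $parent->appendChild($tag);
}

$xml = $doc->saveXML($parent);
echo $xml;
```

The strict comparison with '' is deliberate: empty() would also drop the string "0", which is usually a valid value.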
