This is driving me up the wall. The DOM looks like this:
<div class="product-intro"><p class="product-desc"><span class="product-model">234</span>Product Description</p></div>
Neither of these works. I get an invalid-argument error or no nodes found:
$node3 = $xp->query("//p[@class='product-desc and not(@class='product-model']");
$node3 = $xp->query("//p[@class='product-desc'][not([@class='product-model'])]");
This alone:
$node3 = $xp->query("//p[@class='product-desc']");
works fine, as far as getting a result goes.
The output is
234Product Description
I know I could just do a string replace, but that is not ideal. How do I get my query to exclude the product-model class?
Entire script:
$x = '<div class="product-item productContainer" data-itemno="234">
<div class="product-and-intro">
<div class="product">
<a href="/en/234.html" title="Product Description">
<img src="/ProductImages/106/234.jpg" alt="Product Description" class="itemImage" />
<div class="product-intro">
<p class="product-desc"><span class="product-model">234</span>Product Description</p>
<p class="price"><span class="us">US$</span>6.50 <span class="oldprice"><s>$ 13.00</s></span></p>
</div>
</a>
</div>
</div>
</div>';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($x);
$xp = new DOMXPath($dom);
$node1 = $xp->query("//div[@class='product']//img");
$node2 = $xp->query("//span[@class='product-model']");
$node3 = $xp->query("//p[@class='product-desc']");
// $node3 = $xp->query("//p[@class='product-desc']/text()[2]");
// $node3 = $xp->query("//p[@class='product-desc' and not(span/@class='product-model')]");
$node4 = $xp->query("//p[@class='price']");
foreach ($node1 as $n) {
echo $n->getAttribute('src');
echo '<br>';
}
foreach ($node2 as $n2) {
echo $n2->nodeValue;
echo '<br>';
}
foreach ($node3 as $n3) {
echo $n3->nodeValue;
echo '<br>';
}
foreach ($node4 as $n4) {
echo $n4->nodeValue;
echo '<br>';
}
If you want to select all p elements with a @class attribute value of product-desc and then filter out those that have a span child element with the @class attribute value product-model, you can use this XPath expression:
//p[@class='product-desc' and not(span/@class='product-model')]
Or, in full:
$node3 = $xp->query("//p[@class='product-desc' and not(span/@class='product-model')]");
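Note that this expression drops the whole p element when it contains such a span, so for the markup in the question it returns no nodes. If the goal is the description text without the model number, one option is to select only the text nodes that are direct children of the p. A minimal sketch, assuming the markup from the question ($descriptions is just an illustrative name):

```php
<?php
$x = '<div class="product-intro"><p class="product-desc"><span class="product-model">234</span>Product Description</p></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($x);
$xp = new DOMXPath($dom);

// text() matches only text nodes that are direct children of the <p>,
// so the text inside the nested <span> is skipped.
$descriptions = [];
foreach ($xp->query("//p[@class='product-desc']/text()") as $textNode) {
    $descriptions[] = trim($textNode->nodeValue);
}

echo implode(' ', $descriptions), PHP_EOL;
```

This avoids the string replace entirely, at the cost of assuming the description is never wrapped in another element.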
I know nothing, ZERO, about xpath or DOM.
In the end I need the href value and the content of the span from 12 H2 tags on the page. I have figured out how to get each item individually but getting them all in one shot isn't clicking, no matter how much I read. A little help?
<h2 class="make-it-pretty">
<a class="more-pretty" href="some-file-somewhere">
<span class="another-class">Product Name</span>
</a>
</h2>
Here is what I use to get them individually.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$htext = $xpath->query('//h2[contains(@class, "make-it-pretty")]')->item(0);
echo $htext->textContent;
I would probably use $doc->loadHTMLFile instead, but:
<?php
$html = '<html lang="en"><head><meta charset="UTF-8" /><title>Title Here</title></head>
<body>
<h2 class="make-it-pretty"><a class="more-pretty" href="some-file-somewhere"><span class="another-class">Product Name</span></a></h2>
</body></html>';
$doc = new DOMDocument(); $doc->loadHTML($html);
function getElementsByClassName($className, $withinNode = null){
global $doc;
$d = $withinNode ?? $doc;
$r = []; $a = $d->getElementsByTagName('*');
foreach($a as $n){
if($n->getAttribute('class') === $className)$r[] = $n;
}
return $r;
}
$anotherClass = getElementsByClassName('another-class');
// getElementsByClassName('make-it-pretty'); works as well, in this case
echo $anotherClass[0]->textContent;
?>
Try this without XPath:
<?php
$html ='<h2 class="make-it-pretty"> <a class="more-pretty" href="some-file-somewhere"> <span class="another-class">Product Name</span> </a> </h2><h2 class="make-it-pretty"> <a class="more-pretty" href="some-file-somewhere"> <span class="another-class">Product Name</span> </a> </h2><h2 class="make-it-pretty"> <a class="more-pretty" href="some-file-somewhere"> <span class="another-class">Product Name</span> </a> </h2>';
$dom = new DOMDocument("1.0", "utf-8");
if($dom->loadHTML($html, LIBXML_NOWARNING)){
$h2s = $dom->getElementsByTagName('h2');
foreach ($h2s as $h2) {
$as = $h2->getElementsByTagName('a');
echo '<pre>';
//print_r($as);
foreach($as as $a){
print_r('link :'.$a->getAttribute('href')."\n");
// Read the spans inside this link so each link is paired with its own content
foreach($a->getElementsByTagName('span') as $span){
print_r('content :'.$span->nodeValue."\n");
}
}
}
}
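For comparison, the same href/span pairs can also be collected with a single XPath query over the links. This is only a sketch, assuming every h2 follows the structure in the question; the $results array and the sample href values are made up for illustration.

```php
<?php
$html = '<h2 class="make-it-pretty"><a class="more-pretty" href="file-one"><span class="another-class">Product One</span></a></h2>'
      . '<h2 class="make-it-pretty"><a class="more-pretty" href="file-two"><span class="another-class">Product Two</span></a></h2>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$results = [];
// Select every link that sits directly inside a matching h2
foreach ($xpath->query('//h2[contains(@class, "make-it-pretty")]/a') as $a) {
    // "span" is evaluated relative to the current link ($a is the context node)
    $span = $xpath->query('span', $a)->item(0);
    $results[] = [
        'href' => $a->getAttribute('href'),
        'text' => $span ? trim($span->textContent) : '',
    ];
}

foreach ($results as $r) {
    echo $r['href'], ' => ', $r['text'], PHP_EOL;
}
```

Collecting the pairs into an array first keeps the extraction separate from the output, which may be convenient if the values are stored rather than printed.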
Below is my HTML structure. I want output that pairs the content inside each post_message div with its respective images.
Something like:
test 123 -> 1.png
test 1232 -> 2.png
test 1232 -> 3.png
HTML content:
<div class="abc">
<div>
<div class="udata">
<div class="post_message"><p>test 123</p></div>
<div class="">
<img class="scaledImageFitWidth img" src="1.png">
</div>
</div>
</div>
</div>
<div class="abc">
<div>
<div class="udata">
<div class="post_message"><p>test 1232</p></div>
<div class="">
<img class="scaledImageFitWidth img" src="2.png">
<img class="scaledImageFitWidth img" src="3.png">
</div>
</div>
</div>
</div>
Below is my PHP code, but it doesn't seem to work:
<?php
$dom = new DomDocument();
// $dom->load($filePath);
@$dom->loadHTML($fop);
$finder = new DomXPath($dom);
$classname="udata";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
// print_r($nodes);
foreach ($nodes as $i => $node) {
$entries = $finder->query("//*[contains(@class, 'post_message')]", $node);
print_r($entries);
$isrc = $node->query("//img/@src");
print_r($isrc);
}
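An XPath expression that starts with // always searches the whole document, even when you pass a context node. To limit the search to the start node, make the expression relative to it, for example with the descendant axis, so that only nodes inside the start node are matched. Also note that query() is a method of DOMXPath, not of the node.
So the code would look more like this: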
foreach ($nodes as $i => $node) {
$entries = $finder->query("descendant::*[contains(@class, 'post_message')]", $node);
echo $entries[0]->textContent .":";
$isrc = $finder->query("descendant::img/@src", $node);
foreach ( $isrc as $src ) {
echo $src->textContent.",";
}
echo PHP_EOL;
}
which would output
test 123:1.png,
test 1232:2.png,3.png,
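To see why the leading // matters, the following sketch counts the matches for an absolute and a relative expression from the same context node. It assumes a trimmed-down version of the markup above; the variable names are illustrative.

```php
<?php
$fop = '<div class="udata"><div class="post_message"><p>test 123</p></div><div><img src="1.png"></div></div>'
     . '<div class="udata"><div class="post_message"><p>test 1232</p></div><div><img src="2.png"><img src="3.png"></div></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($fop);
$finder = new DOMXPath($dom);

// Use the first udata block as the context node
$first = $finder->query("//*[contains(@class, 'udata')]")->item(0);

// Absolute path: starts at the document root and ignores the context node
$absolute = $finder->query("//img/@src", $first)->length;
// Relative path: only looks below the context node
$relative = $finder->query("descendant::img/@src", $first)->length;

echo "absolute: $absolute, relative: $relative", PHP_EOL;
```

With this input the absolute expression finds all three images, while the relative one finds only the single image inside the first block.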
My input:
<div id='makeme' class='testme'>
<span id='whatspanID'>somthing</span>
<p class='ptagclass'></p>
</div>
My expected output
<div>
<span></span>
<p></p>
</div>
To remove the content inside the tags, I can use the snippet below, but how do I remove the attributes from the tags?
$html = str_get_html($str);
foreach($html->find("text") as $ht) {
$ht->innertext = "";
}
$html->save();
Using DOM and Xpath allows you to select text and attribute nodes.
$html = <<<'HTML'
<div id='makeme' class='testme'>
<span id='whatspanID'>somthing</span>
<p class='ptagclass'></p>
</div>
HTML;
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);
$div = $xpath->evaluate('//div[@id="makeme"]')->item(0);
$nodes = $xpath->evaluate('.//text()|@*|.//*/@*', $div);
foreach ($nodes as $node) {
if ($node instanceof DOMAttr) {
$node->parentNode->removeAttributeNode($node);
} else {
$node->parentNode->removeChild($node);
}
}
echo $dom->saveHtml($div);
Output:
<div>
<span></span><p></p>
</div>
My code below retrieves a series of images from the search results of a site, along with the corresponding age data. It works fine, but I get a list of images followed by a list of the information in the age field:
img img img img age age age age and so on.
How do I combine these so I can display them in sets: img age img age img age
<?php
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.site.com/searchresults.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
$tags = $html->getElementsByTagName('img');
foreach ($tags as $tag) {
$image = $tag->getAttribute('src');
echo '<img src='. $image .' alt="image" ><br>';
}
foreach ($nodelist as $n)
{
echo $n->nodeValue."<br>";
}
?>
Sample page. I want to extract the img source and the title data from <div class="age" title="30 usa">:
<div id="sr-15763292" class="search-result">
<div class="thumb-wrapper">
<a class="bioLink" href="http://www.site.com/user/" title="View user"><img src="http://www.site.com/img/15763292.jpg" class="thumb" alt="user" width="140" height="105"></a>
<p class="status"><a href="http://www.site.com/user/" >Online</a></p>
</div>
<div class="rating">
<div class="rating-stars rating4"></div>
</div>
<div class="age" title="30 usa">
<p>30</p>
<p class="gender m">m</p>
<p>USA</p>
</div>
<div>
<p class="headline">Hello there.</p>
</div>
</div>
It's hard to answer if we don't know what the HTML looks like! Assuming it looks something like this
<div class="age"><p>21</p>
<img src="a.jpg" />
</div>
<div class="age"><p>51</p>
<img src="b.jpg" />
</div>
you need to find each div and then find the image inside each div. getElementsByTagName() will give you a list even if there's only one result, so use item() to fetch the first.
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('results.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
foreach ($nodelist as $node) {
$tags = $node->getElementsByTagName('img');
$image = $tags->item(0)->getAttribute('src');
echo '<img src="'. $image .'" alt="image" ><br>';
echo $node->textContent . '<br>';
}
If the HTML is like this
<div class="age"><p>21</p></div><img src="a.jpg" />
you can try
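$node->nextSibling
(nextSibling is a property, not a method, and it may be a whitespace text node if there is any whitespace between the two tags.)
As a general point, trace through the HTML and think: how do I get from A to B? Go forwards? Backwards? Up to the parent, across to the next node and down again?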
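For the sample page in the question, the image and the age div are both inside the same search-result container, so one way to go "from A to B" is to loop over the containers and query each part relative to it. A sketch under that assumption, using a trimmed copy of the sample markup ($pairs is an illustrative name):

```php
<?php
$page = '<div id="sr-15763292" class="search-result">
  <div class="thumb-wrapper">
    <a class="bioLink" href="http://www.site.com/user/"><img src="http://www.site.com/img/15763292.jpg" class="thumb" alt="user"></a>
  </div>
  <div class="age" title="30 usa"><p>30</p><p class="gender m">m</p><p>USA</p></div>
</div>';

$html = new DOMDocument();
libxml_use_internal_errors(true);
$html->loadHTML($page);
$xpath = new DOMXPath($html);

$pairs = [];
foreach ($xpath->query("//div[@class='search-result']") as $result) {
    // Both lookups are relative to the current search result
    $img = $xpath->query(".//img[@class='thumb']", $result)->item(0);
    $age = $xpath->query(".//div[@class='age']", $result)->item(0);
    if ($img && $age) {
        $pairs[] = [$img->getAttribute('src'), $age->getAttribute('title')];
    }
}

foreach ($pairs as [$src, $title]) {
    echo '<img src="' . htmlspecialchars($src) . '" alt="image"> ' . htmlspecialchars($title) . '<br>', PHP_EOL;
}
```

Because each image and age value is read from the same container, they stay paired even if some results lack one of the two.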
I have the following HTML markup
<div contenteditable="true" class="text"></div>
<div contenteditable="true" class="text"></div>
<div style="display: block;" class="ui-draggable">
<img class='avatar' src=""/>
<p style="">
<img class='pic' src=""/><br>
<span class='fulltext' style="display:none"></span>
</p>-<span class='create'></span>
<a class='permalink' href=""></a>
</div>
<div contenteditable="true" class="text"></div>
<div style="display: block;" class="ui-draggable">
<img class='avatar' src=""/>
<p style="">
<img class='pic' src=""/><br>
<span class='fulltext' style="display:none"></span>
</p><span class='create'></span><a class='permalink' href=""></a>
</div>
There can be more parent divs. To parse the information and insert it into the DB, I'm using the following code:
$dom = new DOMDocument();
$dom->loadHTML($xml);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div');
$i=0;
$q=1;
foreach($div as $book) {
$attr = $book->getAttribute('class');
//if div contenteditable
if($attr == 'text') {
echo '</br>'.$book->nodeValue."</br>";
}
else {
$new = new DOMDocument();
$newxpath = new DOMXPath($new);
$avatar = $xpath->query("(//img[@class='avatar']/@src)[$q]");
$picture = $xpath->query("(//p/img[@class='pic']/@src)[$q]");
$fulltext = $xpath->query("(//p/span[@class='fulltext'])[$q]");
$permalink = $xpath->query("(//a[@class='permalink'])[$q]");
echo $permalink->item(0)->nodeValue; //date
echo $permalink->item(0)->getAttribute('href');
echo $fulltext->item(0)->nodeValue;
echo $avatar->item(0)->value;
echo $picture->item(0)->value;
$q++;
}
$i++;
}
But I think that there's a better way for parsing the HTML. Is there? Thank you in advance
Note that DOMXPath::query supports a second parameter, the context node. You also won't need a second DOMDocument and DOMXPath inside the loop. Use:
$avatar = $xpath->query("img[@class='avatar']/@src", $book);
to get the <img src=""> attribute nodes relative to the div nodes. If you follow this advice, your example should be fine.
Here is a version of your code that follows the above:
$dom = new DOMDocument();
$dom->loadHTML($xml);
$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div');
foreach($divs as $book) {
$attr = $book->getAttribute('class');
if($attr == 'text') {
echo '</br>'.$book->nodeValue."</br>";
} else {
$avatar = $xpath->query("img[@class='avatar']/@src", $book);
$picture = $xpath->query("p/img[@class='pic']/@src", $book);
$fulltext = $xpath->query("p/span[@class='fulltext']", $book);
$permalink = $xpath->query("a[@class='permalink']", $book);
echo $permalink->item(0)->nodeValue; //date
echo $permalink->item(0)->getAttribute('href');
echo $fulltext->item(0)->nodeValue;
echo $avatar->item(0)->value;
echo $picture->item(0)->value;
}
}
As a matter of fact, you are doing it the right way: HTML has to be parsed with a DOM object.
Some optimisations can still be made:
$div = $xpath->query('//div');
is quite greedy; getElementsByTagName should be more appropriate:
$div = $dom->getElementsByTagName('div');