Below is my HTML structure. I want to output the content inside each post_message div together with its respective images,
something like:
test 123 -> 1.png
test 1232 -> 2.png
test 1232 -> 3.png
HTML content:
<div class="abc">
<div>
<div class="udata">
<div class="post_message"><p>test 123</p></div>
<div class="">
<img class="scaledImageFitWidth img" src="1.png">
</div>
</div>
</div>
</div>
<div class="abc">
<div>
<div class="udata">
<div class="post_message"><p>test 1232</p></div>
<div class="">
<img class="scaledImageFitWidth img" src="2.png">
<img class="scaledImageFitWidth img" src="3.png">
</div>
</div>
</div>
</div>
Below is my PHP code, but it doesn't seem to work:
<?php
$dom = new DomDocument();
// $dom->load($filePath);
@$dom->loadHTML($fop);
$finder = new DomXPath($dom);
$classname="udata";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
// print_r($nodes);
foreach ($nodes as $i => $node) {
$entries = $finder->query("//*[contains(@class, 'post_message')]", $node);
print_r($entries);
$isrc = $node->query("//img/@src");
print_r($isrc);
}
When you pass a context node to query(), the XPath expression must be relative to that node. An expression that starts with // always searches from the document root, whatever context node you pass, so use the descendant axis to limit each search to the nodes below the starting point.
So the code would look more like this:
foreach ($nodes as $i => $node) {
$entries = $finder->query("descendant::*[contains(@class, 'post_message')]", $node);
echo $entries[0]->textContent .":";
$isrc = $finder->query("descendant::img/@src", $node);
foreach ( $isrc as $src ) {
echo $src->textContent.",";
}
echo PHP_EOL;
}
which would output
test 123:1.png,
test 1232:2.png,3.png,
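To see why the original loop returned the same nodes for every post, here is a minimal sketch comparing an absolute //img query with a relative descendant::img query when both are given a context node. The trimmed markup and the names $html and $rows are assumptions made for illustration, not part of the question. It also builds the "text -> image" lines the question asks for.

```php
<?php
// Trimmed-down version of the markup from the question (assumed for illustration).
$html = '
<div class="udata">
  <div class="post_message"><p>test 123</p></div>
  <div><img src="1.png"></div>
</div>
<div class="udata">
  <div class="post_message"><p>test 1232</p></div>
  <div><img src="2.png"><img src="3.png"></div>
</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

$rows = [];
foreach ($finder->query("//div[@class='udata']") as $node) {
    // Absolute path: starts at the document root, so the context node is ignored.
    $absolute = $finder->query("//img", $node)->length;
    // Relative path: only searches below the context node.
    $relative = $finder->query("descendant::img", $node)->length;
    echo "absolute: $absolute, relative: $relative", PHP_EOL;

    $message = $finder->query("descendant::div[@class='post_message']", $node)->item(0);
    foreach ($finder->query("descendant::img/@src", $node) as $src) {
        $rows[] = trim($message->textContent) . " -> " . $src->nodeValue;
    }
}
echo implode(PHP_EOL, $rows), PHP_EOL;
```

The absolute count stays at 3 for both posts, while the relative count follows the images that actually belong to each post.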
I have some HTML that contains this:
<div class="test">
Outer
<div class="test">Inner 1</div>
<div class="test">Inner 2</div>
</div>
I'm doing str_replace() on the contents of these elements:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
foreach($xpath->query("//div[@class='test']") as $node) {
$node->nodeValue = str_replace(" ", "X", $node->nodeValue);
}
That should replace any spaces with an "X".
But it results in this error:
Warning: Couldn't fetch DOMElement. Node no longer exists in /path/to/my/file.php on line 63
It works if there's only one nested div:
<div class="test">
Outer
<div class="test">Inner 1</div>
</div>
Why does this happen, and how can I get it working?
Try changing
foreach($xpath->query("//div[@class='test']") as $node)
to
foreach($xpath->query('//div[@class="test"]//div[@class="test"]') as $node)
Edit per comments:
Assuming there's a space in the outer element (i.e., it's "Outer 1"):
<?php
$string = <<<XML
<div class="test">
Outer 1
<div class="test">Inner 1</div>
<div class="test">Inner 2</div>
</div>
XML;
$dom = new DOMDocument();
$dom->loadHTML($string);
$xpath = new DOMXpath($dom);
foreach($xpath->query('//div[@class="test"]//text()') as $node) {
$nnode = trim($node->nodeValue);
echo $nnode = str_replace(" ", "X", $nnode);
}
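A possible explanation for the warning, offered as my reading rather than something the answer above spells out: assigning to nodeValue on an element replaces all of its children with a single text node, so inner divs that the query already returned may no longer exist when the loop reaches them. The sketch below avoids that by rewriting each text node in place, assuming the goal is to keep the div structure intact. The inline markup and the names $outer and $result are chosen here for illustration.

```php
<?php
$html = '<div class="test">Outer 1<div class="test">Inner 1</div><div class="test">Inner 2</div></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the text nodes rather than the elements, so no element is ever replaced.
foreach ($xpath->query('//div[@class="test"]//text()') as $text) {
    $text->nodeValue = str_replace(" ", "X", $text->nodeValue);
}

// Serialise only the outer div to inspect the result.
$outer = $xpath->query('//div[@class="test"]')->item(0);
$result = $dom->saveHTML($outer);
echo $result, PHP_EOL;
```

Because only text nodes are modified, all three div elements are still in the document after the loop.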
This is driving me up the wall. The DOM looks like this:
<div class="product-intro"><p class="product-desc"><span class="product-model">234</span>Product Description</p></div>
Neither of these works (invalid argument, no nodes found):
$node3 = $xp->query("//p[@class='product-desc and not(@class='product-model']");
$node3 = $xp->query("//p[@class='product-desc'][not([@class='product-model'])]");
This alone:
$node3 = $xp->query("//p[@class='product-desc']");
works perfectly well, as far as getting a result goes.
The output is
234Product Description
I know I could just do a string replace, but that's not ideal. How do I get it to exclude the product-model class in my query?
Entire script:
$x = '<div class="product-item productContainer" data-itemno="234">
<div class="product-and-intro">
<div class="product">
<a href="/en/234.html" title="Product Description">
<img src="/ProductImages/106/234.jpg" alt="Product Description" class="itemImage" />
<div class="product-intro">
<p class="product-desc"><span class="product-model">234</span>Product Description</p>
<p class="price"><span class="us">US$</span>6.50 <span class="oldprice"><s>$ 13.00</s></span></p>
</div>
</a>
</div>
</div>
</div>';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($x);
$xp = new DOMXPath($dom);
$node1 = $xp->query("//div[@class='product']//img");
$node2 = $xp->query("//span[@class='product-model']");
$node3 = $xp->query("//p[@class='product-desc']");
// $node3 = $xp->query("//p[@class='product-desc']/text()[2]");
// $node3 = $xp->query("//p[@class='product-desc' and not(span/@class='product-model')]");
$node4 = $xp->query("//p[@class='price']");
foreach ($node1 as $n) {
echo $n->getAttribute('src');
echo '<br>';
}
foreach ($node2 as $n2) {
echo $n2->nodeValue;
echo '<br>';
}
foreach ($node3 as $n3) {
echo $n3->nodeValue;
echo '<br>';
}
foreach ($node4 as $n4) {
echo $n4->nodeValue;
echo '<br>';
}
If you want to select all p elements with a @class attribute value of product-desc and then filter out those that have a span child element with the @class attribute value product-model, you can use this XPath expression:
//p[@class='product-desc' and not(span/@class='product-model')]
Or, as a whole:
$node3 = $xp->query("//p[@class='product-desc' and not(span/@class='product-model')]");
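Note that the expression above drops any p that contains such a span, so for the sample markup it selects nothing. If the aim is instead to keep the paragraph but read only its own text, one option (a sketch, not part of the answer above) is to select the text-node children of the p, which skips the text inside the span. The name $description is chosen here for illustration.

```php
<?php
$x = '<div class="product-intro"><p class="product-desc"><span class="product-model">234</span>Product Description</p></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($x);
$xp = new DOMXPath($dom);

// text() matches only text nodes that are direct children of the p,
// so the "234" inside the span is not included.
$description = '';
foreach ($xp->query("//p[@class='product-desc']/text()") as $t) {
    $description .= $t->nodeValue;
}
$description = trim($description);
echo $description, PHP_EOL;
```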
Trying to extract the value "Output" between the spans, only if the title is "ABCD (1,2)", using PHP. Basically, find the title and extract "Output".
Here is the section of html:
<div class="wrap">
<strong title="ABCD (1,2)" class="name">ABCD (1,2):</strong>
<div id="test1">
<div class="testclass" id="test2">
<span>Output</span>
</div>
</div>
</div>
Here is the code I'd like to use:
<?php
$html = file_get_contents('test.html');
$dom = new DOMDocument;
@$dom->loadHTML($html);
//Some code needs to go here!
$tags = $dom->getElementsByTagName('strong');
?>
One way would be to use XPath here, with a query that selects the desired element: find the element that has that title, move to the following sibling div, and from there go down to the span.
Example (using the markup above):
$html = '
<div class="wrap">
<strong title="ABCD (1,2)" class="name">ABCD (1,2):</strong>
<div id="test1">
<div class="testclass" id="test2">
<span>Output</span>
</div>
</div>
</div>
';
$search_string = 'ABCD (1,2)';
$dom = new DOMDocument;
@$dom->loadHTML($html);
$query = "//strong[@title = '{$search_string}']/following-sibling::div/div/span";
$xpath = new DOMXpath($dom);
$result = $xpath->query($query);
if($result->length > 0) {
echo $result->item(0)->nodeValue;
}
I have PHP code which removes all nodes that have at least one attribute. Here is my code:
<?php
$data = <<<DATA
<div>
<p>These line shall stay</p>
<p class="myclass">Remove this one</p>
<p>But keep this</p>
<div style="color: red">and this</div>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED);
$dom->removeChild($dom->doctype);
$xpath = new DOMXPath($dom);
$lines_to_be_removed = $xpath->query("//*[count(@*)>0]");
foreach ($lines_to_be_removed as $line) {
$line->parentNode->removeChild($line);
}
// just to check
echo $dom->saveHTML();
?>
As you see in the fiddle, this is the current output of code above:
<div>
<p>These line shall stay</p>
<p>But keep this</p>
</div>
While this is desired result:
<div>
<p>These line shall stay</p>
Remove this one
<p>But keep this</p>
and this
</div>
How can I do that?
Prior to removing the elements you want to pluck out their child nodes and tack them on behind it.
Example:
$data = <<<DATA
<div>
<p>These line shall stay</p>
<p class="myclass">Remove this one</p>
<p>But keep this</p>
<div style="color: red">and this</div>
<div style="color: red">and <p>also</p> this</div>
<div style="color: red">and this <div style="color: red">too</div></div>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//*[@*]") as $node) {
$parent = $node->parentNode;
while ($node->hasChildNodes()) {
$parent->insertBefore($node->lastChild, $node->nextSibling);
}
$parent->removeChild($node);
}
echo $dom->saveHTML();
Outputs:
<div>
<p>These line shall stay</p>
Remove this one
<p>But keep this</p>
and this
and <p>also</p> this
and this too
</div>
https://3v4l.org/9qHRM
(I added some nested elements to demonstrate the safety of this approach.)
Couple of asides:
You don't need $dom->removeChild($dom->doctype) if you load with the additional LIBXML_HTML_NODEFDTD flag.
Your XPath expression can be simplified to //*[@*]
You could use replaceChild() with the text content of that node:
foreach ($lines_to_be_removed as $line) {
$line->parentNode->replaceChild($dom->createTextNode($line->textContent),$line);
}
// <div>
// <p>These line shall stay</p>
// Remove this one
// <p>But keep this</p>
// and this
// </div>
However, this may prove problematic when the // in your XPath selector matches nested elements.
A more manual approach is to copy the child contents of the target nodes into the parent nodes:
$data = '
<div>
<div>1A</div>
<div class="foo">1B
<div>2C</div>
<div class="foo">2D</div>
<div>2E</div>
<div class="foo">2F
<div>3G</div>
<div class="foo">3H</div>
<div>3I</div>
<div class="foo">3J</div>
</div>
</div>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED);
$dom->removeChild($dom->doctype);
SomeFunctionName( $dom->documentElement );
$html = $dom->saveHTML();
function SomeFunctionName( $parent )
{
$nodesToDelete = array();
if( $parent->hasChildNodes() )
{
foreach( $parent->childNodes as $node )
{
SomeFunctionName( $node );
if( $node->hasAttributes() and count( $node->attributes ) > 0 )
{
foreach( $node->childNodes as $childNode )
{
$node->parentNode->insertBefore( clone $childNode, $node );
}
$nodesToDelete[] = $node;
}
}
}
foreach( $nodesToDelete as $delete)
{
$delete->parentNode->removeChild( $delete );
}
}
// <div>
// <div>1A</div>
// 1B
// <div>2C</div>
// 2D
// <div>2E</div>
// 2F
// <div>3G</div>
// 3H
// <div>3I</div>
// 3J
// </div>
If you want to nest the child elements in a new "div" container, swap out this portion of the code:
foreach( $parent->childNodes as $node )
{
SomeFunctionName( $node );
if( $node->hasAttributes() and count( $node->attributes ) > 0 )
{
$newNode = $node->ownerDocument->createElement('div');
foreach( $node->childNodes as $childNode )
{
$newNode->appendChild( clone $childNode );
}
$node->parentNode->insertBefore( $newNode, $node );
$nodesToDelete[] = $node;
}
}
// <div>
// <div>1A</div>
// <div>1B
// <div>2C</div>
// <div>2D</div>
// <div>2E</div>
// <div>2F
// <div>3G</div>
// <div>3H</div>
// <div>3I</div>
// <div>3J</div>
// </div>
// </div>
// </div>
This removes all tags that have a class or style attribute, so it is not bulletproof:
<?php
$data = <<<DATA
<div>
<p>These line shall stay</p>
<p class="myclass">Remove this one</p>
<p>But keep this</p>
<div style="color: red">and this</div>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED);
$dom->removeChild($dom->doctype);
$xpath = new DOMXPath($dom);
$lines_to_be_removed = $xpath->query("//*[count(@class)>0 or count(@style)>0]");
foreach ($lines_to_be_removed as $line) {
$line->parentNode->removeChild($line);
}
// just to check
echo $dom->saveHTML();
?>
Note this line:
$lines_to_be_removed = $xpath->query("//*[count(@class)>0 or count(@style)>0]");
My code below retrieves a series of images from a site's search results, along with the corresponding age data. It works, but I get a list of images followed by a list of the age values:
img img img img age age age age, and so on.
How do I combine these so I can display them in sets (img age, img age, img age)?
<?php
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.site.com/searchresults.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
$tags = $html->getElementsByTagName('img');
foreach ($tags as $tag) {
$image = $tag->getAttribute('src');
echo '<img src='. $image .' alt="image" ><br>';
}
foreach ($nodelist as $n)
{
echo $n->nodeValue."<br>";
}
?>
Sample page. I want to extract the img src and the title data from <div class="age" title="30 usa">:
<div id="sr-15763292" class="search-result">
<div class="thumb-wrapper">
<a class="bioLink" href="http://www.site.com/user/" title="View user"><img src="http://www.site.com/img/15763292.jpg" class="thumb" alt="user" width="140" height="105"></a>
<p class="status"><a href="http://www.site.com/user/" >Online</a></p>
</div>
<div class="rating">
<div class="rating-stars rating4"></div>
</div>
<div class="age" title="30 usa">
<p>30</p>
<p class="gender m">m</p>
<p>USA</p>
</div>
<div>
<p class="headline">Hello there.</p>
</div>
</div>
It's hard to answer if we don't know what the HTML looks like! Assuming it looks something like this
<div class="age"><p>21</p>
<img src="a.jpg" />
</div>
<div class="age"><p>51</p>
<img src="b.jpg" />
</div>
you need to find each div and then find the image inside each div. getElementsByTagName() will give you a list even if there's only one result, so use item() to fetch the first.
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('results.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
foreach ($nodelist as $node) {
$tags = $node->getElementsByTagName('img');
$image = $tags->item(0)->getAttribute('src');
echo '<img src="'. $image .'" alt="image" ><br>';
echo $node->textContent . '<br>';
}
If the HTML is like this
<div class="age"><p>21</p></div><img src="a.jpg" />
you can try
$node->nextSibling
As a general point, trace through the HTML and think: how do I get from A to B? Go forwards? Backwards? Up to the parent, across to the next node and down again?
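For the sample page in the question, where the image and the age div both sit inside one div with class search-result, a minimal sketch could loop over each result and run relative queries from it. The inline markup is a shortened copy of the sample, and $page and $pairs are names chosen here for illustration.

```php
<?php
$page = '
<div id="sr-15763292" class="search-result">
  <div class="thumb-wrapper">
    <a class="bioLink" href="http://www.site.com/user/"><img src="http://www.site.com/img/15763292.jpg" class="thumb" alt="user"></a>
  </div>
  <div class="age" title="30 usa"><p>30</p><p class="gender m">m</p><p>USA</p></div>
</div>';

$html = new DOMDocument();
libxml_use_internal_errors(true);
$html->loadHTML($page);
$xpath = new DOMXPath($html);

$pairs = [];
foreach ($xpath->query("//div[@class='search-result']") as $result) {
    // Both queries start with "." so they stay inside the current result.
    $img = $xpath->query(".//img[@class='thumb']", $result)->item(0);
    $age = $xpath->query(".//div[@class='age']", $result)->item(0);
    if ($img === null || $age === null) {
        continue; // skip results missing either part
    }
    $pairs[] = [$img->getAttribute('src'), $age->getAttribute('title')];
}

// Print each image next to its age data.
foreach ($pairs as [$src, $title]) {
    echo '<img src="' . htmlspecialchars($src) . '" alt="image"> ' . htmlspecialchars($title) . '<br>', PHP_EOL;
}
```

Collecting the pairs first keeps the extraction separate from the output, so the display format can change without touching the queries.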