The code below retrieves a series of images from a site's search results, along with the corresponding age data. It works, but I get a list of images followed by a list of the values from the age field:
img img img img age age age age and so on.
How do I combine these so I can display them in pairs: img age img age img age?
<?php
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.site.com/searchresults.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
$tags = $html->getElementsByTagName('img');
foreach ($tags as $tag) {
$image = $tag->getAttribute('src');
echo '<img src='. $image .' alt="image" ><br>';
}
foreach ($nodelist as $n)
{
echo $n->nodeValue."<br>";
}
?>
Sample page. I want to extract the img src and the title data from <div class="age" title="30 usa">:
<div id="sr-15763292" class="search-result">
<div class="thumb-wrapper">
<a class="bioLink" href="http://www.site.com/user/" title="View user"><img src="http://www.site.com/img/15763292.jpg" class="thumb" alt="user" width="140" height="105"></a>
<p class="status"><a href="http://www.site.com/user/" >Online</a></p>
</div>
<div class="rating">
<div class="rating-stars rating4"></div>
</div>
<div class="age" title="30 usa">
<p>30</p>
<p class="gender m">m</p>
<p>USA</p>
</div>
<div>
<p class="headline">Hello there.</p>
</div>
</div>
It's hard to answer if we don't know what the HTML looks like! Assuming it looks something like this
<div class="age"><p>21</p>
<img src="a.jpg" />
</div>
<div class="age"><p>51</p>
<img src="b.jpg" />
</div>
You need to find each div and then find the image inside that div. getElementsByTagName() returns a list even if there is only one result, so use item() to fetch the first.
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('results.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
foreach ($nodelist as $node) {
$tags = $node->getElementsByTagName('img');
$image = $tags->item(0)->getAttribute('src');
echo '<img src="'. $image .'" alt="image" ><br>';
echo $node->textContent . '<br>';
}
If the HTML is like this
<div class="age"><p>21</p></div><img src="a.jpg" />
you can try
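$node->nextSibling
Note that nextSibling is a property, not a method, and it may return a whitespace text node rather than the img element.
As a general point, trace through the HTML and think: how do I get from A to B? Go forwards? Backwards? Up to the parent, across to the next node, and down again?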
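For the sample page in the question, where the img and the div class="age" both sit inside one div class="search-result", the same idea can be sketched as below. This is only a minimal sketch: the function name pairImagesWithAge and the inline $sample string are made up for illustration, and it assumes every result follows the structure shown in the question.

```php
<?php
// Hypothetical helper: returns one [src, age title] pair per search result.
function pairImagesWithAge(string $htmlSource): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc->loadHTML($htmlSource);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $pairs = [];
    // Walk the containers, then look inside each one with relative queries.
    foreach ($xpath->query("//div[@class='search-result']") as $result) {
        $img = $xpath->query(".//img[@class='thumb']", $result)->item(0);
        $age = $xpath->query(".//div[@class='age']", $result)->item(0);
        if ($img === null || $age === null) {
            continue;                   // skip results missing either part
        }
        $pairs[] = [
            'src' => $img->getAttribute('src'),
            'age' => $age->getAttribute('title'),
        ];
    }
    return $pairs;
}

// Shortened copy of the sample markup from the question.
$sample = '<div id="sr-15763292" class="search-result">
<div class="thumb-wrapper"><a class="bioLink" href="http://www.site.com/user/"><img src="http://www.site.com/img/15763292.jpg" class="thumb" alt="user"></a></div>
<div class="age" title="30 usa"><p>30</p><p class="gender m">m</p><p>USA</p></div>
</div>';

foreach (pairImagesWithAge($sample) as $pair) {
    echo '<img src="' . htmlspecialchars($pair['src']) . '" alt="image"> '
        . htmlspecialchars($pair['age']) . "<br>\n";
}
```

Looping over the container first keeps each image tied to its own age data, even if one result is missing a field.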
Related
Below is my HTML structure. I want to output the content of each post_message div together with its respective images,
something like:
test 123 -> 1.png
test 1232 -> 2.png
test 1232 -> 3.png
Html content
<div class="abc">
<div>
<div class="udata">
<div class="post_message"><p>test 123</p></div>
<div class="">
<img class="scaledImageFitWidth img" src="1.png">
</div>
</div>
</div>
</div>
<div class="abc">
<div>
<div class="udata">
<div class="post_message"><p>test 1232</p></div>
<div class="">
<img class="scaledImageFitWidth img" src="2.png">
<img class="scaledImageFitWidth img" src="3.png">
</div>
</div>
</div>
</div>
Below is my PHP code, but it does not seem to work:
<?php
$dom = new DomDocument();
// $dom->load($filePath);
@$dom->loadHTML($fop);
$finder = new DomXPath($dom);
$classname="udata";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
// print_r($nodes);
foreach ($nodes as $i => $node) {
$entries = $finder->query("//*[contains(@class, 'post_message')]", $node);
print_r($entries);
$isrc = $node->query("//img/@src");
print_r($isrc);
}
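When you run an XPath query against a context node, the expression has to be relative to that node. An expression starting with // searches the whole document, whatever context node you pass. Using the descendant axis limits the search to nodes inside the start node. Also, query() is a method of DOMXPath, not of DOMNode, so call it on $finder and pass $node as the second argument.
So the code would look more like...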
foreach ($nodes as $i => $node) {
$entries = $finder->query("descendant::*[contains(@class, 'post_message')]", $node);
echo $entries[0]->textContent .":";
$isrc = $finder->query("descendant::img/@src", $node);
foreach ( $isrc as $src ) {
echo $src->textContent.",";
}
echo PHP_EOL;
}
which would output
test 123:1.png,
test 1232:2.png,3.png,
Driving me up the wall, it's like this, the DOM
<div class="product-intro"><p class="product-desc"><span class="product-model">234</span>Product Description</p></div>
It's not anything like this... invalid argument, no nodes found:
$node3 = $xp->query("//p[@class='product-desc and not(@class='product-model']");
$node3 = $xp->query("//p[@class='product-desc'][not([@class='product-model'])]");
This alone:
$node3 = $xp->query("//p[@class='product-desc']");
Works perfectly well and fine-- as far as getting a result.
The output is
234Product Description
I know I could just do a string replace, but that's not ideal. How do I get the query to exclude the product-model class?
Entire script:
$x = '<div class="product-item productContainer" data-itemno="234">
<div class="product-and-intro">
<div class="product">
<a href="/en/234.html" title="Product Description">
<img src="/ProductImages/106/234.jpg" alt="Product Description" class="itemImage" />
<div class="product-intro">
<p class="product-desc"><span class="product-model">234</span>Product Description</p>
<p class="price"><span class="us">US$</span>6.50 <span class="oldprice"><s>$ 13.00</s></span></p>
</div>
</a>
</div>
</div>
</div>';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($x);
$xp = new DOMXPath($dom);
$node1 = $xp->query("//div[@class='product']//img");
$node2 = $xp->query("//span[@class='product-model']");
$node3 = $xp->query("//p[@class='product-desc']");
// $node3 = $xp->query("//p[@class='product-desc']/text()[2]");
// $node3 = $xp->query("//p[@class='product-desc' and not(span/@class='product-model')]");
$node4 = $xp->query("//p[@class='price']");
foreach ($node1 as $n) {
echo $n->getAttribute('src');
echo '<br>';
}
foreach ($node2 as $n2) {
echo $n2->nodeValue;
echo '<br>';
}
foreach ($node3 as $n3) {
echo $n3->nodeValue;
echo '<br>';
}
foreach ($node4 as $n4) {
echo $n4->nodeValue;
echo '<br>';
}
If you want to select all p elements with a @class attribute value of product-desc and then filter out those that have a span child element with the @class attribute value product-model, you can use this XPath expression:
//p[@class='product-desc' and not(span/@class='product-model')]
Or, as a complete statement:
$node3 = $xp->query("//p[@class='product-desc' and not(span/@class='product-model')]");
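Note that this expression drops the whole p when it contains such a span. If the goal is instead the description text without the model number, one option is to select only the text nodes that are direct children of the p. A minimal sketch, using a shortened copy of the markup from the question (the helper name descriptionWithoutModel and the $snippet variable are made up here):

```php
<?php
// Hypothetical helper: returns the product description without the model number.
function descriptionWithoutModel(string $htmlSource): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($htmlSource);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    $text = '';
    // text() matches only text nodes that are direct children of the <p>,
    // so the text inside <span class="product-model"> is not included.
    foreach ($xp->query("//p[@class='product-desc']/text()") as $textNode) {
        $text .= $textNode->nodeValue;
    }
    return trim($text);
}

$snippet = '<div class="product-intro"><p class="product-desc"><span class="product-model">234</span>Product Description</p></div>';
echo descriptionWithoutModel($snippet); // prints: Product Description
```

The commented-out attempt with text()[2] in the question finds nothing because this p has only one direct text node.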
<span class="byline">
<ul class="foobar"></ul>
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<p style="text-align: justify;">
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<hr>
Hi, this is my HTML. I can fetch all images using DOMDocument, but I want only the first image that comes after the ul with class foobar. I don't want the other images. How can I query for that?
This is what I tried, which fetches all images:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($url);
//$xpath = new DomXpath($doc);
//$entries = $xpath->query("//div[@id='newsbox']/ul[@class='foobar']");
$elements = $dom->getElementsByTagName('img');
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "<br/>". $element->getAttribute('src'). ": ";
}
}
I think you can use DOMXPath query with this xpath expression:
$image = $xpath->query('//ul[@class="foobar"]/following-sibling::img')->item(0);
This will get the following img siblings for <ul class="foobar"> using following-sibling and then get the first item.
The $image is of type DOMElement.
In this example I've used loadHTML to load the html from a string $source.
If you want to load your html from a file, you could for example use loadHTMLFile.
$source = <<<SOURCE
<span class="byline">
<ul class="foobar"></ul>
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<p style="text-align: justify;">
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<hr>
SOURCE;
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($source);
$xpath = new DomXpath($dom);
$image = $xpath->query('//ul[@class="foobar"]/following-sibling::img')->item(0);
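Since item(0) returns null when nothing matches, it is worth checking the result before reading attributes. A hedged sketch with simplified markup (the helper name firstImageAfterList and the file names a.jpg and b.jpg are invented for illustration):

```php
<?php
// Hypothetical helper: src of the first img sibling after <ul class="foobar">, or null.
function firstImageAfterList(string $htmlSource): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($htmlSource);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // [1] keeps only the nearest following img sibling of the list.
    $image = $xpath->query('//ul[@class="foobar"]/following-sibling::img[1]')->item(0);

    return $image instanceof DOMElement ? $image->getAttribute('src') : null;
}

$markup = '<div><ul class="foobar"></ul><img alt="" src="a.jpg"><p><img alt="" src="b.jpg"></p></div>';
var_dump(firstImageAfterList($markup));
var_dump(firstImageAfterList('<p>no list here</p>'));
```

The second img is nested inside the p, so it is not a sibling of the ul and is never considered.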
We have the following RSS feed:
<title>THIS IS THE TITLE</title>
<link>http://www.website.com/....</link>
<description>
<div class="primary-image">
<img typeof="foaf:Image" src="http://website.com/" alt="Drink driving" title="Drink driving" />
</div>
<div class="field-group-format group_meta field-group-div group-meta speed-fast effect-none">
<span class="field field-name-field-published-date field-type-datetime field-label-hidden">
<span class="field-item even">
<span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-01-29T17:43:00+00:00">29 Jan, 2014 5:43pm</span>
</span>
</span>
<span class="field field-name-field-author field-type-node-reference field-label-hidden">
<span class="field-item even">Joe Finnerty</span>
</span>
</div>
<p class="short-desc">TEXT THAT I WANT TO EXTRACT FROM HERE</p>
</description>
I am trying to extract <p class="short-desc">TEXT THAT I WANT TO EXTRACT FROM HERE</p>. I checked some questions here but did not find a practical answer.
I tried adding
$htmlStr = $node->getElementsByTagName('description')->item(0)->nodeValue;
$html = new DOMDocument();
$html->loadHTML($htmlStr);
$xpath = new DOMXPath($html);
$desc = $xpath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' short-desc')]");
before $item = array ( , within the foreach loop, but it did not do the job. Also, in the output
&lt; is replacing <,
&quot; is replacing ", and
&gt; is replacing >.
Please help. I have been trying to find an answer for some days now without success.
Assuming that you are passing the above HTML content to the $html variable ..
$dom = new DOMDocument;
@$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('p') as $tag) {
if ($tag->getAttribute('class') === 'short-desc') {
echo $tag->nodeValue; //"prints" TEXT THAT I WANT TO EXTRACT FROM HERE
}
}
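The same lookup can also be written as an XPath query, which avoids looping over every p. This is only a sketch: the helper name extractShortDesc and the shortened $descriptionHtml are assumptions, standing in for the decoded contents of the description element.

```php
<?php
// Hypothetical helper: text of the first <p> whose class list contains "short-desc".
function extractShortDesc(string $descriptionHtml): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($descriptionHtml);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Pad the class list with spaces so "short-desc" matches as a whole token.
    $query = "//p[contains(concat(' ', normalize-space(@class), ' '), ' short-desc ')]";
    $p = $xpath->query($query)->item(0);

    return $p === null ? null : trim($p->textContent);
}

$descriptionHtml = '<div class="primary-image"><img src="http://website.com/" alt="Drink driving" /></div>'
    . '<p class="short-desc">TEXT THAT I WANT TO EXTRACT FROM HERE</p>';
echo extractShortDesc($descriptionHtml);
```

Returning null when no paragraph matches lets the caller decide what to store for feed items without a short description.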
If I understand correctly, you want to remove the tags from the feed content, so you can try this:
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
?>
output will be:
Test paragraph. Other text
For more info: http://in3.php.net/strip_tags
Why not use a regex?
$strRegex = '%<p class="short-desc">(.+?)</p>%s';
if (preg_match_all($strRegex, $strContent, $arrMatches))
{
var_dump($arrMatches[1][0]);
}
and to get the content use
$path = 'path/to/file';
$strContent = file_get_contents($path);
I want to remove all image-tags before the headline starts, but they are not nested the same way. And then remove the empty tags.
<div class="c2">
<img src="image/file" width="480" height="360" alt="Image" />
</div>
<div class="c2">
<div class="headline">
headline
</div>
<div class="headline">
headline2
</div>
</div>
and different nested tags like
<div class="c2">
<p>
<img src="image/A.JPG" width="480" height="319" alt="Image" />
</p>
<div class="headline">
A headline
</div>
</div>
I think that could be solved recursively, but I don't know how.
Thanks for your help!
EDIT: if you want to remove only an <img> that is followed by <div><div class="headline"> or <div class="headline">, use this XPath:
$imgs = $xpath->query("//img[../following-sibling::div[1]/div/@class='headline' or ../following-sibling::div[1]/@class='headline']");
see it working: http://codepad.viper-7.com/QhprLP
Do it like this:
$doc = new DOMDocument();
$doc->loadHTML($x); // assuming HTML in $x
$xpath = new DOMXpath($doc);
$imgs = $xpath->query("//img"); // select all <img> nodes
foreach ($imgs as $img) { // loop through list of all <img> nodes
$parent = $img->parentNode;
$parent->removeChild($img); // delete <img> node
// delete the parent only if nothing but whitespace is left in it
if (trim($parent->textContent) === '' && $parent->getElementsByTagName('*')->length === 0)
$parent->parentNode->removeChild($parent);
}
echo htmlentities($doc->saveHTML()); // display the new HTML
see it working: http://codepad.viper-7.com/350Hw6