Getting the first image after an element with DOMXPath::query - PHP

<span class="byline">
<ul class="foobar"></ul>
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<p style="text-align: justify;">
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<hr>
Hi, this is my HTML. I can fetch all images using DOMDocument, but I want to get only the first image that comes after the ul with class foobar. I don't want the other images. How can I query for that?
I tried this for all images.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($url);
//$xpath = new DomXpath($doc);
//$entries = $xpath->query("//div[@id='newsbox']/ul[@class='foobar']");
$elements = $dom->getElementsByTagName('img');
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "<br/>". $element->getAttribute('src'). ": ";
}
}

I think you can use DOMXPath query with this xpath expression:
$image = $xpath->query('//ul[@class="foobar"]/following-sibling::img')->item(0);
This will get the following img siblings for <ul class="foobar"> using following-sibling and then get the first item.
The $image is of type DOMElement.
In this example I've used loadHTML to load the html from a string $source.
If you want to load your html from a file, you could for example use loadHTMLFile.
$source = <<<SOURCE
<span class="byline">
<ul class="foobar"></ul>
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<p style="text-align: justify;">
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<hr>
SOURCE;
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($source);
$xpath = new DomXpath($dom);
$image = $xpath->query('//ul[@class="foobar"]/following-sibling::img')->item(0);
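If only the nearest image is wanted, the position can also go into the XPath expression itself. The following is a minimal sketch, assuming simplified markup with placeholder file names (first.jpg, second.jpg and third.jpg are made up for illustration and are not from the question):

```php
<?php
// Hypothetical, simplified markup: one <img> directly after the list,
// one nested in a <p>, and one more sibling further down.
$source = '<div><ul class="foobar"></ul><img src="first.jpg">'
        . '<p><img src="second.jpg"></p><img src="third.jpg"></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// [1] on a forward axis selects the nearest following sibling <img>.
$nodes = $xpath->query('//ul[@class="foobar"]/following-sibling::img[1]');

// item(0) returns null when nothing matched, so guard before reading attributes.
$image = $nodes->item(0);
$firstSrc = $image instanceof DOMElement ? $image->getAttribute('src') : null;

echo $firstSrc, PHP_EOL;
```

The image nested inside the <p> is never a candidate here, because following-sibling only looks at nodes that share the same parent as the <ul>.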

Related

How to get html tags starting with 'abcd' using PHP?

I have the following html code:
<div class="pictures">
<figure>
<img src="img/foo.jpg" height="400" width="400" id="abcd001"/>
<figcaption>foo</figcaption>
</figure>
<figure>
<img src="img/bar.jpg" height="400" width="400" id="abcd002"/>
<figcaption>bar</figcaption>
</figure>
<figure>
<img src="img/Joe.jpg" height="400" width="400" id="abcd003"/>
<figcaption>Joe</figcaption>
</figure>
</div>
<div id="abcd004">
Lorem Ipsum.
</div>
I am trying to obtain all HTML tags (and their children) whose ids start with 'abcd'. The query method of the DOMXPath class does not seem to work with regex.
$dom = new DomDocument();
$dom->load("/var/www/html/myWebsite.html");
$xpath = new DOMXPath($dom);
$xdom = $xpath->query("//img[@id='abcd*']");
foreach($xdom as $entry)
{
echo $entry;
}
Any suggestions?
EDIT: start-with does not work because that function does not seem to work on the IDs of html tags
I think you are not using the correct syntax: the XPath function is called starts-with, not start-with. Try the following:
$xpath = new DOMXPath($dom);
$xdom = $xpath->query("//img[starts-with(@id, 'abcd')]");
foreach($xdom as $entry) {
echo $entry->getAttribute('src') . "\n";
}
See it working here: https://3v4l.org/04i5a
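Since the question asks for all tags whose id starts with 'abcd', not only images, the element name can be replaced with a wildcard. This is a rough sketch under that assumption; the trimmed-down markup and the extra xyz001 div are illustrative additions, not part of the original page:

```php
<?php
$html = <<<'HTML'
<div class="pictures">
<figure><img src="img/foo.jpg" id="abcd001"></figure>
<figure><img src="img/bar.jpg" id="abcd002"></figure>
</div>
<div id="abcd004">Lorem Ipsum.</div>
<div id="xyz001">Not matched.</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // <figure> can trigger "invalid tag" warnings in libxml
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$matches = [];
// * matches any element name; starts-with() tests the beginning of the id value.
foreach ($xpath->query("//*[starts-with(@id, 'abcd')]") as $node) {
    $matches[] = $node->nodeName . '#' . $node->getAttribute('id');
}

echo implode("\n", $matches), PHP_EOL;
```

Each matched node is a DOMElement, so its children remain reachable through childNodes or a further query that uses the node as context.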

Add data-mfp-src attribute to image tags PHP

This is the content:
<div class="image">
<img src="https://www.gravatar.com/avatar/" alt="test" width="50" height="50">
</div>
I want to use preg_replace to add a data-mfp-src attribute (taking its value from the src attribute), so that the final code looks like this:
<div class="image">
<img src="https://www.gravatar.com/avatar/" data-mfp-src="https://www.gravatar.com/avatar/" alt="test" width="50" height="50">
</div>
This is my code and it works without any issues, but I want to use preg_replace for some specific reasons:
function lazyload_images( $content ){
$content = mb_convert_encoding($content, 'HTML-ENTITIES', "UTF-8");
$dom = new DOMDocument;
libxml_use_internal_errors(true);
@$dom->loadHTML($content);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//div[img]') as $paragraphWithImage) {
//$paragraphWithImage->setAttribute('class', 'test');
foreach ($paragraphWithImage->getElementsByTagName('img') as $image) {
$image->setAttribute('data-mfp-src', $image->getAttribute('src'));
$image->removeAttribute('src');
}
};
return preg_replace('~<(?:!DOCTYPE|/?(?:html|head|body))[^>]*>\s*~i', '', $dom->saveHTML($dom->documentElement));
}
As a robust means of isolating the src value and setting the new attribute to this value, I'll urge you to avoid regex. Not that it can't be done, but that my snippet to follow won't break if more classes are added to the <div> nor if the <img> attributes are shifted around.
Code:
$html = <<<HTML
<div class="image">
<img src="https://www.gravatar.com/avatar/" alt="test" width="50" height="50">
</div>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// using a loop in case there are multiple occurrences
foreach ($xpath->query("//div[contains(@class, 'image')]/img") as $node) {
$node->setAttribute('data-mfp-src', $node->getAttribute('src'));
}
echo $dom->saveHTML();
Output:
<div class="image">
<img src="https://www.gravatar.com/avatar/" alt="test" width="50" height="50" data-mfp-src="https://www.gravatar.com/avatar/">
</div>
Resources:
http://php.net/manual/en/domelement.setattribute.php
http://php.net/manual/en/domelement.getattribute.php
Just to show you what the regex might look like...
Find: ~<img src="([^"]*)"~
Replace: <img src="$1" data-mfp-src="$1"
Demo: https://regex101.com/r/lXIoFw/1 but again, I don't recommend it because it could silently let you down in the future.
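One caveat with a plain substring test on the class attribute is that it also matches class names such as images or image-wrapper. A common XPath 1.0 idiom pads the class list with spaces and looks for the whole token. The sketch below assumes two sample divs with placeholder file names (a.jpg, b.jpg) purely for illustration:

```php
<?php
$html = '<div class="image featured"><img src="a.jpg" alt=""></div>'
      . '<div class="images"><img src="b.jpg" alt=""></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the class list with spaces so only the whole token "image" matches.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' image ')]/img";

$updated = [];
foreach ($xpath->query($query) as $img) {
    $img->setAttribute('data-mfp-src', $img->getAttribute('src'));
    $updated[] = $img->getAttribute('data-mfp-src');
}

print_r($updated);
```

Only the image inside the first div receives the new attribute; the div with class images is skipped.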

Regular Expression to ignore a link text

I have the following code:
<p> <img src="spas01.jpg" alt="" width="630" height="480"></p>
<p style="text-align: right;">Spas</p>
<p>My Site Content [...]</p>
I need a regular expression to get only the "My Site Content [...]" text.
So I need to ignore the first image (and maybe others) and links.
Try This:
Use (?<=<p>)([^><]+)(?=</p>) or <p>\K([^><]+)(?=</p>)
Update
$re = "#<p>\\K([^><]+)(?=</p>)#m";
$str = "<p> <img src=\"spas01.jpg\" alt=\"\" width=\"630\" height=\"480\"></p>\n<p style=\"text-align: right;\">Spas</p>\n<p>My Site Content [...]</p>";
preg_match_all($re, $str, $matches);
With DOMDocument and DOMXPath:
$html = <<<'EOD'
<p> <img src="spas01.jpg" alt="" width="630" height="480"></p>
<p style="text-align: right;">Spas</p>
<p>My Site Content [...]</p>
EOD;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$query = '//p//text()[not(ancestor::a)]';
$textNodes = $xp->query($query);
foreach ($textNodes as $textNode) {
echo $textNode->nodeValue . PHP_EOL;
}
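The query above returns every text node inside a paragraph, including "Spas" and any whitespace-only nodes. To narrow it down to the content paragraph, one option is to filter on the paragraph itself. This is a sketch that assumes the wanted paragraph is the one without any attributes, which holds for the sample but may not hold for the real page:

```php
<?php
$html = <<<'EOD'
<p> <img src="spas01.jpg" alt="" width="630" height="480"></p>
<p style="text-align: right;">Spas</p>
<p>My Site Content [...]</p>
EOD;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Assumption: the wanted paragraph has no attributes at all.
// normalize-space() drops text nodes that contain only whitespace.
$query = '//p[not(@*)]/text()[normalize-space()]';

$texts = [];
foreach ($xp->query($query) as $textNode) {
    $texts[] = trim($textNode->nodeValue);
}

print_r($texts);
```

Because only direct text children of the paragraph are selected, text inside nested links or other elements is left out as well.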

how do I get sets of data with xpath

The code below retrieves a series of images from a site's search results, along with the corresponding age data. It works fine; however, I get a list of images followed by a list of the information in the age field.
img img img img age age age age and so on.
How do I combine these so I can display them in sets: img age img age img age
<?php
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.site.com/searchresults.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
$tags = $html->getElementsByTagName('img');
foreach ($tags as $tag) {
$image = $tag->getAttribute('src');
echo '<img src='. $image .' alt="image" ><br>';
}
foreach ($nodelist as $n)
{
echo $n->nodeValue."<br>";
}
?>
Sample page. I want to extract the img source and the title data from <div class="age" title="30 usa">:
<div id="sr-15763292" class="search-result">
<div class="thumb-wrapper">
<a class="bioLink" href="http://www.site.com/user/" title="View user"><img src="http://www.site.com/img/15763292.jpg" class="thumb" alt="user" width="140" height="105"></a>
<p class="status"><a href="http://www.site.com/user/" >Online</a></p>
</div>
<div class="rating">
<div class="rating-stars rating4"></div>
</div>
<div class="age" title="30 usa">
<p>30</p>
<p class="gender m">m</p>
<p>USA</p>
</div>
<div>
<p class="headline">Hello there.</p>
</div>
</div>
It's hard to answer if we don't know what the HTML looks like! Assuming it looks something like this
<div class="age"><p>21</p>
<img src="a.jpg" />
</div>
<div class="age"><p>51</p>
<img src="b.jpg" />
</div>
you need to find each div and then find the image inside each div. getElementsByTagName() will give you a list even if there's only one result, so use item() to fetch the first.
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('results.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
foreach ($nodelist as $node) {
$tags = $node->getElementsByTagName('img');
$image = $tags->item(0)->getAttribute('src');
echo '<img src="'. $image .'" alt="image" ><br>';
echo $node->textContent . '<br>';
}
If the HTML is like this
<div class="age"><p>21</p></div><img src="a.jpg" />
you can try
$node->nextSibling
As a general point, trace through the HTML and ask: how do I get from A to B? Go forwards? Backwards? Up to the parent, across to the next node and down again?
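For the sample page in the question, where the image and the age div both sit inside a search-result wrapper, the same idea can be sketched by looping over the wrappers and running relative queries inside each one. The markup below is a shortened stand-in for the real page, and the second result (img/2.jpg, "25 uk") is invented for illustration:

```php
<?php
$html = <<<'HTML'
<div class="search-result">
<div class="thumb-wrapper"><a href="/user/1"><img src="img/1.jpg" alt="user"></a></div>
<div class="age" title="30 usa"><p>30</p></div>
</div>
<div class="search-result">
<div class="thumb-wrapper"><a href="/user/2"><img src="img/2.jpg" alt="user"></a></div>
<div class="age" title="25 uk"><p>25</p></div>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$pairs = [];
foreach ($xpath->query("//div[@class='search-result']") as $result) {
    // string(...) makes evaluate() return a plain string ('' when nothing matches).
    $src = $xpath->evaluate('string(.//img/@src)', $result);
    $age = $xpath->evaluate("string(.//div[@class='age']/@title)", $result);
    $pairs[] = [$src, $age];
}

foreach ($pairs as [$src, $age]) {
    echo $src, ' => ', $age, PHP_EOL;
}
```

Keeping each image and its age in one loop iteration is what produces the img age img age ordering the question asks for.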

PHP DOMDocument parse HTML

I have the following HTML markup
<div contenteditable="true" class="text"></div>
<div contenteditable="true" class="text"></div>
<div style="display: block;" class="ui-draggable">
<img class='avatar' src=""/>
<p style="">
<img class='pic' src=""/><br>
<span class='fulltext' style="display:none"></span>
</p>-<span class='create'></span>
<a class='permalink' href=""></a>
</div>
<div contenteditable="true" class="text"></div>
<div style="display: block;" class="ui-draggable">
<img class='avatar' src=""/>
<p style="">
<img class='pic' src=""/><br>
<span class='fulltext' style="display:none"></span>
</p><span class='create'></span><a class='permalink' href=""></a>
</div>
There can be more parent divs. To parse the information and insert it into the DB, I'm using the following code:
$dom = new DOMDocument();
$dom->loadHTML($xml);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div');
$i=0;
$q=1;
foreach($div as $book) {
$attr = $book->getAttribute('class');
//if div contenteditable
if($attr == 'text') {
echo '</br>'.$book->nodeValue."</br>";
}
else {
$new = new DOMDocument();
$newxpath = new DOMXPath($new);
$avatar = $xpath->query("(//img[@class='avatar']/@src)[$q]");
$picture = $xpath->query("(//p/img[@class='pic']/@src)[$q]");
$fulltext = $xpath->query("(//p/span[@class='fulltext'])[$q]");
$permalink = $xpath->query("(//a[@class='permalink'])[$q]");
echo $permalink->item(0)->nodeValue; //date
echo $permalink->item(0)->getAttribute('href');
echo $fulltext->item(0)->nodeValue;
echo $avatar->item(0)->value;
echo $picture->item(0)->value;
$q++;
}
$i++;
}
But I think that there's a better way for parsing the HTML. Is there? Thank you in advance
Note that DOMXPath::query accepts a second parameter, the context node, which makes a relative expression start from that node. You also don't need a second DOMDocument and DOMXPath inside the loop. Use:
$avatar = $xpath->query("img[@class='avatar']/@src", $book);
to get the src attribute nodes of the <img> elements relative to each div node. If you follow this advice, your example should be fine.
Here is a version of your code that applies the above:
$dom = new DOMDocument();
$dom->loadHTML($xml);
$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div');
foreach($divs as $book) {
$attr = $book->getAttribute('class');
if($attr == 'text') {
echo '</br>'.$book->nodeValue."</br>";
} else {
$avatar = $xpath->query("img[@class='avatar']/@src", $book);
$picture = $xpath->query("p/img[@class='pic']/@src", $book);
$fulltext = $xpath->query("p/span[@class='fulltext']", $book);
$permalink = $xpath->query("a[@class='permalink']", $book);
echo $permalink->item(0)->nodeValue; //date
echo $permalink->item(0)->getAttribute('href');
echo $fulltext->item(0)->nodeValue;
echo $avatar->item(0)->value;
echo $picture->item(0)->value;
}
}
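One detail worth keeping in mind with a context node: an expression that starts with // is still evaluated from the document root, so the context only takes effect for relative paths. A small sketch with made-up markup (the ids a and b and the file names are placeholders) shows the difference:

```php
<?php
$html = '<div id="a"><img src="1.jpg"></div>'
      . '<div id="b"><img src="2.jpg"><img src="3.jpg"></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$divB = $xpath->query("//div[@id='b']")->item(0);

// A leading // always starts at the document root, even with a context node.
$absolute = $xpath->query('//img', $divB)->length;

// A relative path (img or .//img) is resolved from the context node.
$relative = $xpath->query('.//img', $divB)->length;

echo "absolute: $absolute, relative: $relative", PHP_EOL;
```

Here the absolute query counts all three images in the document, while the relative one counts only the two inside the second div.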
As a matter of fact, you are doing it the right way: HTML has to be parsed with a DOM object.
That said, some optimisation is possible:
$div = $xpath->query('//div');
is quite greedy; getElementsByTagName should be more appropriate:
$div = $dom->getElementsByTagName('div');
