PHP DOMDocument parse HTML - php

I have the following HTML markup
<div contenteditable="true" class="text"></div>
<div contenteditable="true" class="text"></div>
<div style="display: block;" class="ui-draggable">
<img class='avatar' src=""/>
<p style="">
<img class='pic' src=""/><br>
<span class='fulltext' style="display:none"></span>
</p>-<span class='create'></span>
<a class='permalink' href=""></a>
</div>
<div contenteditable="true" class="text"></div>
<div style="display: block;" class="ui-draggable">
<img class='avatar' src=""/>
<p style="">
<img class='pic' src=""/><br>
<span class='fulltext' style="display:none"></span>
</p><span class='create'></span><a class='permalink' href=""></a>
</div>
There can be more parent divs. To parse the information and insert it into the DB, I'm using the following code:
$dom = new DOMDocument();
$dom->loadHTML($xml);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div');
$i=0;
$q=1;
foreach($div as $book) {
$attr = $book->getAttribute('class');
//if div contenteditable
if($attr == 'text') {
echo '</br>'.$book->nodeValue."</br>";
}
else {
$new = new DOMDocument();
$newxpath = new DOMXPath($new);
$avatar = $xpath->query("(//img[@class='avatar']/@src)[$q]");
$picture = $xpath->query("(//p/img[@class='pic']/@src)[$q]");
$fulltext = $xpath->query("(//p/span[@class='fulltext'])[$q]");
$permalink = $xpath->query("(//a[@class='permalink'])[$q]");
echo $permalink->item(0)->nodeValue; //date
echo $permalink->item(0)->getAttribute('href');
echo $fulltext->item(0)->nodeValue;
echo $avatar->item(0)->value;
echo $picture->item(0)->value;
$q++;
}
$i++;
}
But I think that there's a better way for parsing the HTML. Is there? Thank you in advance

Note that DOMXPath::query() accepts an optional second parameter, the context node, which makes relative XPath expressions resolve against that node. You also don't need a second DOMDocument and DOMXPath inside the loop. Use:
$avatar = $xpath->query("img[@class='avatar']/@src", $book);
to get the src attribute nodes of the <img> elements relative to each div node. If you follow this advice, your example should be fine.
Here is a version of your code that applies the above:
$dom = new DOMDocument();
$dom->loadHTML($xml);
$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div');
foreach($divs as $book) {
$attr = $book->getAttribute('class');
if($attr == 'text') {
echo '</br>'.$book->nodeValue."</br>";
} else {
$avatar = $xpath->query("img[@class='avatar']/@src", $book);
$picture = $xpath->query("p/img[@class='pic']/@src", $book);
$fulltext = $xpath->query("p/span[@class='fulltext']", $book);
$permalink = $xpath->query("a[@class='permalink']", $book);
echo $permalink->item(0)->nodeValue; //date
echo $permalink->item(0)->getAttribute('href');
echo $fulltext->item(0)->nodeValue;
echo $avatar->item(0)->value;
echo $picture->item(0)->value;
}
}
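The context-node idea can also be combined with DOMXPath::evaluate() and the XPath string() function, which returns an empty string when nothing matches, so there is no item(0) to null-check. The following is only a minimal sketch: the sample markup is a reduced version of the question's, and the src, href and text values in it are made up for illustration.

```php
<?php
// Reduced sample based on the question's markup; attribute values are invented.
$xml = '<div class="text">Intro</div>
<div class="ui-draggable">
<img class="avatar" src="a.png"/>
<p><img class="pic" src="p.png"/><br/><span class="fulltext">Full text</span></p>
<a class="permalink" href="/post/1">2012-01-01</a>
</div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($xml);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$rows = [];
foreach ($xpath->query("//div[@class='ui-draggable']") as $book) {
    // Relative paths are resolved against $book, the context node.
    $rows[] = [
        'avatar'    => $xpath->evaluate("string(img[@class='avatar']/@src)", $book),
        'picture'   => $xpath->evaluate("string(p/img[@class='pic']/@src)", $book),
        'fulltext'  => $xpath->evaluate("string(p/span[@class='fulltext'])", $book),
        'permalink' => $xpath->evaluate("string(a[@class='permalink']/@href)", $book),
    ];
}
print_r($rows);
```

Collecting the values into an array first keeps the parsing separate from the database insert that the question mentions.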

As a matter of fact, you are doing it the right way: HTML has to be parsed with a DOM object.
That said, one optimisation can be made:
$div = $xpath->query('//div');
is fairly expensive for such a simple lookup; getElementsByTagName would be more appropriate:
$div = $dom->getElementsByTagName('div');

Related

Php Remove content html from specific class

Hi, I would like to remove all the HTML code inside a parent element selected by its id or class.
<?php
$html = '<div class="m-interstitial"><div class="m-interstitial">
<div class="m-interstitial__ad" data-readmore-target="">
<div class="m-block-ad" data-tms-ad-type="box" data-tms-ad-status="idle" data-tms-ad-pos="1">
<div class="m-block-ad__label m-block-ad__label--report-enabled"><span class="m-block-ad__label__text">Advertising</span> <button class="m-block-ad__label__report-link" title="Report this ad" data-tms-ad-report=""> </button></div>
<div class="m-block-ad__content"> </div>
</div>
</div>
<button class="m-interstitial__unlock-btn" data-readmore-unlocker=""> <span class="m-interstitial__unlock-btn__text">Read more</span>
</button></div>';
// I tried it with below code but it does not work
//$remove = preg_replace('#<div class="m-interstitial">(.*?)</div>#', '', $html);
$remove = preg_replace('#<div class="m-interstitial">(.*?)</div>#s', '', $html);
var_dump($remove); // I expect the result to be an empty string "", but that is not what I get.
My preg_replace does not work as I want. Any ideas?
Thank you.
Based on your code example, why don't you just set $html = ''; if that is what you want? If you have differing HTML, then use XPath to find matches:
<?php
$html = '<div class="m-interstitial">
<div class="m-interstitial">
<div class="m-interstitial__ad" data-readmore-target="">
<div class="m-block-ad" data-tms-ad-type="box" data-tms-ad-status="idle" data-tms-ad-pos="1">
<div class="m-block-ad__label m-block-ad__label--report-enabled"><span class="m-block-ad__label__text">Advertising</span> <button class="m-block-ad__label__report-link" title="Report this ad" data-tms-ad-report=""> </button></div>
<div class="m-block-ad__content"> </div>
</div>
</div>
<button class="m-interstitial__unlock-btn" data-readmore-unlocker=""> <span class="m-interstitial__unlock-btn__text">Read more</span></button>
</div>';
libxml_use_internal_errors(true);
$dom = new DOMDocument();
// DOMDocument has no omitXmlDeclaration property; saveXML($node) below already omits the XML declaration
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = false;
$dom->strictErrorChecking = false;
$dom->formatOutput = false;
$dom->loadHTML('<?xml encoding="utf-8" ?>'.$html);
libxml_clear_errors();
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$child = $xpath->query("(//div[@class='m-interstitial'])[1]");
$parent = $child[0]->parentNode;
$parent->removeChild($child[0]);
echo $dom->saveXML($dom->documentElement);
I am not 100% sure this is what you want to do, but this is how XPath/DOM would be used for it.
The result is an empty HTML document, since you are filtering out the root element of your HTML:
<html><body/></html>
I did almost the same, but yours seems better:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$styles = $xpath->query('//div[@class="m-interstitial"]');
if ($styles) {
foreach ($styles as $style) {
$style->textContent = "";
}
}
$html = $doc->saveHTML();
var_dump($html );
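One caveat with both snippets: @class='m-interstitial' only matches when the class attribute is exactly that string. If the real markup carries additional classes, a common XPath 1.0 idiom is to match the class as a whole token. The sketch below assumes such markup (the "is-open" class and the "Keep me" paragraph are made up for illustration) and removes every matching div.

```php
<?php
// Hypothetical markup: the target div has an extra class, so @class='m-interstitial' would not match it.
$html = '<p>Keep me</p><div class="m-interstitial is-open"><div class="m-interstitial__ad">Ad</div></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Match "m-interstitial" as a whole class token, not as a substring of "m-interstitial__ad".
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' m-interstitial ')]";

// Copying the result to an array is a defensive step before modifying the tree.
foreach (iterator_to_array($xpath->query($query)) as $node) {
    if ($node->parentNode !== null) {
        $node->parentNode->removeChild($node);
    }
}

// Serialize only the children of <body>, without the html/body wrapper added by loadHTML().
$body = $dom->getElementsByTagName('body')->item(0);
$result = '';
foreach ($body->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result;
```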

Get H2 text and href values from inside all H2 tags on the page using xpath?

I know nothing, ZERO, about xpath or DOM.
In the end I need the href value and the content of the span from 12 H2 tags on the page. I have figured out how to get each item individually but getting them all in one shot isn't clicking, no matter how much I read. A little help?
<h2 class="make-it-pretty">
<a class="more-pretty" href="some-file-somewhere">
<span class="another-class">Product Name</span>
</a>
</h2>
Here is what I use to get them individually.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$htext = $xpath->query('//h2[contains(@class, "make-it-pretty")]')->item(0);
echo $htext->textContent;
I would probably use $doc->loadHTMLFile instead, but:
<?php
$html = '<html lang="en"><head><meta charset="UTF-8" /><title>Title Here</title></head>
<body>
<h2 class="make-it-pretty"><a class="more-pretty" href="some-file-somewhere"><span class="another-class">Product Name</span></a></h2>
</body></html>';
$doc = @new DOMDocument(); $doc->loadHTML($html);
function getElementsByClassName($className, $withinNode = null){
global $doc;
$d = $withinNode ?? $doc;
$r = []; $a = $d->getElementsByTagName('*');
foreach($a as $n){
if($n->getAttribute('class') === $className)$r[] = $n;
}
return $r;
}
$anotherClass = getElementsByClassName('another-class');
// getElementsByClassName('make-it-pretty'); works as well, in this case
echo $anotherClass[0]->textContent;
?>
Try this without XPath:
<?php
$html ='<h2 class="make-it-pretty"> <a class="more-pretty" href="some-file-somewhere"> <span class="another-class">Product Name</span> </a> </h2><h2 class="make-it-pretty"> <a class="more-pretty" href="some-file-somewhere"> <span class="another-class">Product Name</span> </a> </h2><h2 class="make-it-pretty"> <a class="more-pretty" href="some-file-somewhere"> <span class="another-class">Product Name</span> </a> </h2>';
$dom = new DOMDocument("1.0", "utf-8");
if($dom->loadHTML($html, LIBXML_NOWARNING)){
$h2s = $dom->getElementsByTagName('h2');
foreach ($h2s as $h2) {
$as = $h2->getElementsByTagName('a');
echo '<pre>';
//print_r($as);
foreach($as as $a){
print_r('link :'.$a->getAttribute('href')."\n");
// read the spans inside this link, so $spans is never used while undefined
foreach($a->getElementsByTagName('span') as $span){
print_r('content :'.$span->nodeValue."\n");
}
}
}
}
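Since the question asked about XPath specifically, here is a minimal sketch of the same loop with DOMXPath: query all matching h2 elements once, then read the href and the span text relative to each one. The two-item sample markup and its href/name values are assumptions made for illustration.

```php
<?php
// Sample markup modelled on the question; href values and product names are invented.
$html = '<h2 class="make-it-pretty"><a class="more-pretty" href="file-one"><span class="another-class">Product One</span></a></h2>
<h2 class="make-it-pretty"><a class="more-pretty" href="file-two"><span class="another-class">Product Two</span></a></h2>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$products = [];
foreach ($xpath->query('//h2[contains(@class, "make-it-pretty")]') as $h2) {
    // The second argument makes the relative paths start at the current h2.
    $products[] = [
        'href' => $xpath->evaluate('string(a/@href)', $h2),
        'name' => trim($xpath->evaluate('string(a/span)', $h2)),
    ];
}
print_r($products);
```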

How to Exclude Class in Parent Xpath Class? PHP

Driving me up the wall, it's like this, the DOM
<div class="product-intro"><p class="product-desc"><span class="product-model">234</span>Product Description</p></div>
It's not anything like this... invalid argument, no nodes found:
$node3 = $xp->query("//p[@class='product-desc and not(@class='product-model']");
$node3 = $xp->query("//p[@class='product-desc'][not([@class='product-model'])]");
This alone:
$node3 = $xp->query("//p[@class='product-desc']");
Works perfectly well and fine-- as far as getting a result.
The output is
234Product Description
I know I could just do a string replace, but not ideally.. How do I get it to exclude the product-model class in my query?
Entire script:
$x = '<div class="product-item productContainer" data-itemno="234">
<div class="product-and-intro">
<div class="product">
<a href="/en/234.html" title="Product Description">
<img src="/ProductImages/106/234.jpg" alt="Product Description" class="itemImage" />
<div class="product-intro">
<p class="product-desc"><span class="product-model">234</span>Product Description</p>
<p class="price"><span class="us">US$</span>6.50 <span class="oldprice"><s>$ 13.00</s></span></p>
</div>
</a>
</div>
</div>
</div>';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($x);
$xp = new DOMXPath($dom);
$node1 = $xp->query("//div[@class='product']//img");
$node2 = $xp->query("//span[@class='product-model']");
$node3 = $xp->query("//p[@class='product-desc']");
// $node3 = $xp->query("//p[@class='product-desc']/text()[2]");
// $node3 = $xp->query("//p[@class='product-desc' and not(span/@class='product-model')]");
$node4 = $xp->query("//p[@class='price']");
foreach ($node1 as $n) {
echo $n->getAttribute('src');
echo '<br>';
}
foreach ($node2 as $n2) {
echo $n2->nodeValue;
echo '<br>';
}
foreach ($node3 as $n3) {
echo $n3->nodeValue;
echo '<br>';
}
foreach ($node4 as $n4) {
echo $n4->nodeValue;
echo '<br>';
}
If you want to select all p elements with a @class attribute value of product-desc and then filter out those that have a span child element with the @class attribute value product-model, you can use this XPath expression:
//p[@class='product-desc' and not(span/@class='product-model')]
Or, as a whole:
$node3 = $xp->query("//p[@class='product-desc' and not(span/@class='product-model')]");
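Note that the expression above drops the whole p element whenever it contains such a span, so for the markup in the question it returns no nodes. If the goal is the description text without the model number, one option (a sketch, not necessarily what the asker ended up using) is to select only the direct text-node children of the p, which skips the text inside the span:

```php
<?php
// Markup taken from the question.
$x = '<div class="product-intro"><p class="product-desc"><span class="product-model">234</span>Product Description</p></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($x);
libxml_clear_errors();
$xp = new DOMXPath($dom);

// text() selects only text nodes that are direct children of the p,
// so the "234" inside the span is not included.
$description = '';
foreach ($xp->query("//p[@class='product-desc']/text()") as $textNode) {
    $description .= $textNode->nodeValue;
}
$description = trim($description);
echo $description; // Product Description
```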

getting first images next to id with DOMXpath::query

<span class="byline">
<ul class="foobar"></ul>
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<p style="text-align: justify;">
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<hr>
Hi, this is my HTML. I can fetch all images using DOMDocument, but I only want the first image that comes after the ul with class foobar. I don't want the other images. How can I query for that?
I tried this for all images.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($url);
//$xpath = new DomXpath($doc);
//$entries = $xpath->query("//div[@id='newsbox']/ul[@class='foobar']");
$elements = $dom->getElementsByTagName('img');
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "<br/>". $element->getAttribute('src'). ": ";
}
}
I think you can use DOMXPath query with this xpath expression:
$image = $xpath->query('//ul[@class="foobar"]/following-sibling::img')->item(0);
This will get the following img siblings for <ul class="foobar"> using following-sibling and then get the first item.
The $image is of type DOMElement.
In this example I've used loadHTML to load the html from a string $source.
If you want to load your html from a file, you could for example use loadHTMLFile.
$source = <<<SOURCE
<span class="byline">
<ul class="foobar"></ul>
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<p style="text-align: justify;">
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<hr>
SOURCE;
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($source);
$xpath = new DomXpath($dom);
$image = $xpath->query('//ul[@class="foobar"]/following-sibling::img')->item(0);
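As a small variation, the position can be expressed in the XPath itself with [1], and item(0) should be checked before use because it is null when nothing matches. The markup below is a simplified stand-in for the question's HTML, with made-up file names:

```php
<?php
// Simplified stand-in for the question's markup; first.jpg and second.jpg are invented names.
$source = '<div><ul class="foobar"></ul><img alt="" src="first.jpg"><p><img alt="" src="second.jpg"></p></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($source);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// [1] keeps only the nearest following img sibling of the ul.
$image = $xpath->query('//ul[@class="foobar"]/following-sibling::img[1]')->item(0);

// item(0) returns null when there is no match, so guard before reading the attribute.
$src = $image instanceof DOMElement ? $image->getAttribute('src') : null;
echo $src;
```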

how do I get sets of data with xpath

My below code retrieves a series of images from the search results of a site and also the corresponding age data. It works fine however I get a list of images followed by a list of the information in the age field.
img img img img age age age age and so on.
How do I combine these so I can display them in sets: img age img age img age
<?php
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.site.com/searchresults.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
$tags = $html->getElementsByTagName('img');
foreach ($tags as $tag) {
$image = $tag->getAttribute('src');
echo '<img src='. $image .' alt="image" ><br>';
}
foreach ($nodelist as $n)
{
echo $n->nodeValue."<br>";
}
?>
Sample page. I want to extract the img source and the title data from <div class="age" title="30 usa">:
<div id="sr-15763292" class="search-result">
<div class="thumb-wrapper">
<a class="bioLink" href="http://www.site.com/user/" title="View user"><img src="http://www.site.com/img/15763292.jpg" class="thumb" alt="user" width="140" height="105"></a>
<p class="status"><a href="http://www.site.com/user/" >Online</a></p>
</div>
<div class="rating">
<div class="rating-stars rating4"></div>
</div>
<div class="age" title="30 usa">
<p>30</p>
<p class="gender m">m</p>
<p>USA</p>
</div>
<div>
<p class="headline">Hello there.</p>
</div>
</div>
It's hard to answer if we don't know what the HTML looks like! Assuming it looks something like this
<div class="age"><p>21</p>
<img src="a.jpg" />
</div>
<div class="age"><p>51</p>
<img src="b.jpg" />
</div>
you need to find each div and then find the image inside each div. getElementsByTagName() will give you a list even if there's only one result, so use item() to fetch the first.
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('results.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
foreach ($nodelist as $node) {
$tags = $node->getElementsByTagName('img');
$image = $tags->item(0)->getAttribute('src');
echo '<img src="'. $image .'" alt="image" ><br>';
echo $node->textContent . '<br>';
}
If the HTML is like this
<div class="age"><p>21</p></div><img src="a.jpg" />
you can try
$node->nextSibling // a property, not a method; it may be a whitespace text node
As a general point trace through the HTML and think how do I get from A to B? Go forwards? backwards? up to parent, to the next node and down again ...?
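For the sample page in the question, where the image and the age div share a search-result container, going "up to the parent and down again" can be sketched as follows: loop over each container and read both values relative to it. The markup is a trimmed copy of the question's sample; the array keys are names chosen here for illustration.

```php
<?php
// Trimmed copy of the sample search result from the question.
$page = '<div id="sr-15763292" class="search-result">
<div class="thumb-wrapper">
<a class="bioLink" href="http://www.site.com/user/" title="View user"><img src="http://www.site.com/img/15763292.jpg" class="thumb" alt="user" width="140" height="105"></a>
</div>
<div class="age" title="30 usa"><p>30</p><p class="gender m">m</p><p>USA</p></div>
</div>';

libxml_use_internal_errors(true);
$html = new DOMDocument();
$html->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($html);

$results = [];
foreach ($xpath->query("//div[@class='search-result']") as $result) {
    $results[] = [
        // .// searches anywhere below the current search-result div
        'image' => $xpath->evaluate("string(.//img[@class='thumb']/@src)", $result),
        // the age div is a direct child, so a plain relative step is enough
        'age'   => $xpath->evaluate("string(div[@class='age']/@title)", $result),
    ];
}

// Print each image next to its age data: img age, img age, ...
foreach ($results as $r) {
    echo '<img src="' . htmlspecialchars($r['image']) . '" alt="image"> ' . htmlspecialchars($r['age']) . "<br>\n";
}
```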
