PHP DOMDocument parse HTML

I have the following HTML markup
<div contenteditable="true" class="text"></div>
<div contenteditable="true" class="text"></div>
<div style="display: block;" class="ui-draggable">
<img class='avatar' src=""/>
<p style="">
<img class='pic' src=""/><br>
<span class='fulltext' style="display:none"></span>
</p>-<span class='create'></span>
<a class='permalink' href=""></a>
</div>
<div contenteditable="true" class="text"></div>
<div style="display: block;" class="ui-draggable">
<img class='avatar' src=""/>
<p style="">
<img class='pic' src=""/><br>
<span class='fulltext' style="display:none"></span>
</p><span class='create'></span><a class='permalink' href=""></a>
</div>
There can be more parent divs. To parse the information and insert it into the DB, I'm using the following code:
$dom = new DOMDocument();
$dom->loadHTML($xml);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div');
$i=0;
$q=1;
foreach($div as $book) {
$attr = $book->getAttribute('class');
//if div contenteditable
if($attr == 'text') {
echo '</br>'.$book->nodeValue."</br>";
}
else {
$new = new DOMDocument();
$newxpath = new DOMXPath($new);
$avatar = $xpath->query("(//img[@class='avatar']/@src)[$q]");
$picture = $xpath->query("(//p/img[@class='pic']/@src)[$q]");
$fulltext = $xpath->query("(//p/span[@class='fulltext'])[$q]");
$permalink = $xpath->query("(//a[@class='permalink'])[$q]");
echo $permalink->item(0)->nodeValue; //date
echo $permalink->item(0)->getAttribute('href');
echo $fulltext->item(0)->nodeValue;
echo $avatar->item(0)->value;
echo $picture->item(0)->value;
$q++;
}
$i++;
}
But I think that there's a better way for parsing the HTML. Is there? Thank you in advance

Note that DOMXPath::query() accepts a second parameter, the context node, which makes the expression relative to that node. You also don't need a second DOMDocument and DOMXPath inside the loop. Use:
$avatar = $xpath->query("img[@class='avatar']/@src", $book);
to get the src attribute nodes of the <img> elements relative to each div node. If you follow this advice, your example should be fine.
Here is a version of your code that follows the above:
$dom = new DOMDocument();
$dom->loadHTML($xml);
$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div');
foreach($divs as $book) {
$attr = $book->getAttribute('class');
if($attr == 'text') {
echo '</br>'.$book->nodeValue."</br>";
} else {
$avatar = $xpath->query("img[@class='avatar']/@src", $book);
$picture = $xpath->query("p/img[@class='pic']/@src", $book);
$fulltext = $xpath->query("p/span[@class='fulltext']", $book);
$permalink = $xpath->query("a[@class='permalink']", $book);
echo $permalink->item(0)->nodeValue; //date
echo $permalink->item(0)->getAttribute('href');
echo $fulltext->item(0)->nodeValue;
echo $avatar->item(0)->value;
echo $picture->item(0)->value;
}
}
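The loop above assumes every query finds a node; `item(0)` returns null otherwise, and the property access then fails. A minimal sketch of the same context-node approach with a guard follows. The helper name `firstValue` and the sample `src`/`href` values are made up for illustration; they are not part of the question.

```php
<?php
// Sample markup modelled on the question; attribute values are invented.
$html = '<div class="text">First note</div>
<div class="ui-draggable">
<img class="avatar" src="avatar1.png"/>
<p><img class="pic" src="pic1.png"/><br>
<span class="fulltext">Full text 1</span></p>
<span class="create"></span>
<a class="permalink" href="/post/1">2013-01-01</a>
</div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Hypothetical helper: value of the first node matching $expr relative to
// $context, or null when nothing matches.
function firstValue(DOMXPath $xpath, string $expr, DOMNode $context): ?string
{
    $nodes = $xpath->query($expr, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

$rows = [];
foreach ($xpath->query("//div[@class='ui-draggable']") as $book) {
    // Every expression is relative to the current div ($book).
    $rows[] = [
        'avatar'    => firstValue($xpath, "img[@class='avatar']/@src", $book),
        'picture'   => firstValue($xpath, "p/img[@class='pic']/@src", $book),
        'fulltext'  => firstValue($xpath, "p/span[@class='fulltext']", $book),
        'permalink' => firstValue($xpath, "a[@class='permalink']/@href", $book),
        'date'      => firstValue($xpath, "a[@class='permalink']", $book),
    ];
}
print_r($rows);
```

Collecting the values into an array first keeps the parsing separate from whatever inserts the rows into the DB.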

As a matter of fact, you are doing it the right way: HTML should be parsed with a DOM object.
Some optimisation is possible, though:
$div = $xpath->query('//div');
is quite greedy; getElementsByTagName should be more appropriate:
$div = $dom->getElementsByTagName('div');

Related

Php Remove content html from specific class

Hi, I would like to remove all the HTML code inside a parent element identified by its id or class.
<?php
$html = '<div class="m-interstitial"><div class="m-interstitial">
<div class="m-interstitial__ad" data-readmore-target="">
<div class="m-block-ad" data-tms-ad-type="box" data-tms-ad-status="idle" data-tms-ad-pos="1">
<div class="m-block-ad__label m-block-ad__label--report-enabled"><span class="m-block-ad__label__text">Advertising</span> <button class="m-block-ad__label__report-link" title="Report this ad" data-tms-ad-report=""> </button></div>
<div class="m-block-ad__content"> </div>
</div>
</div>
<button class="m-interstitial__unlock-btn" data-readmore-unlocker=""> <span class="m-interstitial__unlock-btn__text">Read more</span>
</button></div>';
// I tried it with below code but it does not work
//$remove = preg_replace('#<div class="m-interstitial">(.*?)</div>#', '', $html);
$remove = preg_replace('#<div class="m-interstitial">(.*?)</div>#s', '', $html);
var_dump($remove); // I expect an empty string "", but that is not what I get.
My preg_replace does not work as I wish. Any ideas?
Thank you.

Based on your code example, why don't you just set $html = ''; if that is what you want? If you have differing HTML, then use XPath to find matches:
<?php
$html = '<div class="m-interstitial">
<div class="m-interstitial">
<div class="m-interstitial__ad" data-readmore-target="">
<div class="m-block-ad" data-tms-ad-type="box" data-tms-ad-status="idle" data-tms-ad-pos="1">
<div class="m-block-ad__label m-block-ad__label--report-enabled"><span class="m-block-ad__label__text">Advertising</span> <button class="m-block-ad__label__report-link" title="Report this ad" data-tms-ad-report=""> </button></div>
<div class="m-block-ad__content"> </div>
</div>
</div>
<button class="m-interstitial__unlock-btn" data-readmore-unlocker=""> <span class="m-interstitial__unlock-btn__text">Read more</span></button>
</div>';
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = false;
$dom->strictErrorChecking = false;
$dom->formatOutput = false;
$dom->loadHTML('<?xml encoding="utf-8" ?>'.$html);
libxml_clear_errors();
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$child = $xpath->query("(//div[@class='m-interstitial'])[1]");
$parent = $child[0]->parentNode;
$parent->removeChild($child[0]);
echo $dom->saveXML($dom->documentElement);
I am not 100% sure this is what you want to do, but in theory this is how XPath/DOM would be used.
The result is an empty HTML document (since you are removing the parent or root element of your HTML):
<html><body/></html>
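If the real page contains other content around the block, a sketch along the following lines removes every matching div and keeps the rest. The sample markup and the `$clean` variable are assumptions for illustration. The class test uses the common `concat`/`normalize-space` idiom so that an element carrying additional classes still matches.

```php
<?php
// Invented sample: two paragraphs to keep, one nested block to remove.
$html = '<p>Keep me</p><div class="m-interstitial"><div class="m-interstitial extra">'
      . '<button>Read more</button></div></div><p>Keep me too</p>';

libxml_use_internal_errors(true);   // silence warnings about imperfect markup
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Match "m-interstitial" as a whole class token, not as a substring.
$expr = "//div[contains(concat(' ', normalize-space(@class), ' '), ' m-interstitial ')]";

// Copy the matches into an array first so that removal cannot disturb the iteration.
foreach (iterator_to_array($xpath->query($expr)) as $node) {
    if ($node->parentNode !== null) {
        $node->parentNode->removeChild($node);
    }
}

// Serialize only the children of <body>, without the implied <html>/<body> wrapper.
$body = $dom->getElementsByTagName('body')->item(0);
$clean = '';
foreach ($body->childNodes as $child) {
    $clean .= $dom->saveHTML($child);
}
echo $clean;
```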

I did almost the same thing, but yours seems better:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$styles = $xpath->query('//div[@class="m-interstitial"]');
if ($styles) {
foreach ($styles as $style) {
$style->textContent = "";
}
}
$html = $doc->saveHTML();
var_dump($html );

Get H2 text and href values from inside all H2 tags on the page using xpath?

I know nothing, ZERO, about xpath or DOM.
In the end I need the href value and the content of the span from 12 H2 tags on the page. I have figured out how to get each item individually but getting them all in one shot isn't clicking, no matter how much I read. A little help?
<h2 class="make-it-pretty">
<a class="more-pretty" href="some-file-somewhere">
<span class="another-class">Product Name</span>
</a>
</h2>
Here is what I use to get them individually.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$htext = $xpath->query('//h2[contains(@class, "make-it-pretty")]')->item(0);
echo $htext->textContent;

I would probably use $doc->loadHTMLFile instead, but:
<?php
$html = '<html lang="en"><head><meta charset="UTF-8" /><title>Title Here</title></head>
<body>
<h2 class="make-it-pretty"><a class="more-pretty" href="some-file-somewhere"><span class="another-class">Product Name</span></a></h2>
</body></html>';
$doc = @new DOMDocument(); $doc->loadHTML($html);
function getElementsByClassName($className, $withinNode = null){
global $doc;
$d = $withinNode ?? $doc;
$r = []; $a = $d->getElementsByTagName('*');
foreach($a as $n){
if($n->getAttribute('class') === $className)$r[] = $n;
}
return $r;
}
$anotherClass = getElementsByClassName('another-class');
// getElementsByClassName('make-it-pretty'); works as well, in this case
echo $anotherClass[0]->textContent;
?>

Try this, without XPath:
<?php
$html ='<h2 class="make-it-pretty"> <a class="more-pretty" href="some-file-somewhere"> <span class="another-class">Product Name</span> </a> </h2><h2 class="make-it-pretty"> <a class="more-pretty" href="some-file-somewhere"> <span class="another-class">Product Name</span> </a> </h2><h2 class="make-it-pretty"> <a class="more-pretty" href="some-file-somewhere"> <span class="another-class">Product Name</span> </a> </h2>';
$dom = new DOMDocument("1.0", "utf-8");
if($dom->loadHTML($html, LIBXML_NOWARNING)){
$h2s = $dom->getElementsByTagName('h2');
foreach ($h2s as $h2) {
$as = $h2->getElementsByTagName('a');
echo '<pre>';
//print_r($as);
foreach($as as $a){
print_r('link :'.$a->getAttribute('href')."\n");
// read the spans of this particular link, so each link is paired with its own text
$spans = $a->getElementsByTagName('span');
foreach($spans as $span){
print_r('content :'.$span->nodeValue."\n");
}
}
}
}
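Since the question asks for XPath, here is a minimal sketch that collects the href and the span text of every matching H2 in one pass. The two-item sample markup and the `$products` array are assumptions for illustration. `DOMXPath::evaluate()` with the XPath `string()` function returns a plain string, evaluated relative to the context node passed as the second argument.

```php
<?php
// Invented sample with two headings shaped like the one in the question.
$html = '<h2 class="make-it-pretty"><a class="more-pretty" href="file-one">'
      . '<span class="another-class">Product One</span></a></h2>'
      . '<h2 class="make-it-pretty"><a class="more-pretty" href="file-two">'
      . '<span class="another-class">Product Two</span></a></h2>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$products = [];
foreach ($xpath->query('//h2[contains(@class, "make-it-pretty")]') as $h2) {
    $products[] = [
        // Both expressions are relative to the current <h2>.
        'href' => $xpath->evaluate('string(a/@href)', $h2),
        'name' => trim($xpath->evaluate('string(a/span)', $h2)),
    ];
}
print_r($products);
```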

How to Exclude Class in Parent Xpath Class? PHP

Driving me up the wall, it's like this, the DOM
<div class="product-intro"><p class="product-desc"><span class="product-model">234</span>Product Description</p></div>
It's not anything like this... invalid argument, no nodes found:
$node3 = $xp->query("//p[@class='product-desc and not(@class='product-model']");
$node3 = $xp->query("//p[@class='product-desc'][not([@class='product-model'])]");
This alone:
$node3 = $xp->query("//p[@class='product-desc']");
Works perfectly well and fine-- as far as getting a result.
The output is
234Product Description
I know I could just do a string replace, but not ideally.. How do I get it to exclude the product-model class in my query?
Entire script:
$x = '<div class="product-item productContainer" data-itemno="234">
<div class="product-and-intro">
<div class="product">
<a href="/en/234.html" title="Product Description">
<img src="/ProductImages/106/234.jpg" alt="Product Description" class="itemImage" />
<div class="product-intro">
<p class="product-desc"><span class="product-model">234</span>Product Description</p>
<p class="price"><span class="us">US$</span>6.50 <span class="oldprice"><s>$ 13.00</s></span></p>
</div>
</a>
</div>
</div>
</div>';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($x);
$xp = new DOMXPath($dom);
$node1 = $xp->query("//div[@class='product']//img");
$node2 = $xp->query("//span[@class='product-model']");
$node3 = $xp->query("//p[@class='product-desc']");
// $node3 = $xp->query("//p[@class='product-desc']/text()[2]");
// $node3 = $xp->query("//p[@class='product-desc' and not(span/@class='product-model')]");
$node4 = $xp->query("//p[@class='price']");
foreach ($node1 as $n) {
echo $n->getAttribute('src');
echo '<br>';
}
foreach ($node2 as $n2) {
echo $n2->nodeValue;
echo '<br>';
}
foreach ($node3 as $n3) {
echo $n3->nodeValue;
echo '<br>';
}
foreach ($node4 as $n4) {
echo $n4->nodeValue;
echo '<br>';
}

If you want to select all p elements with a @class attribute value of product-desc and then filter out those that have a span child element with the @class attribute value product-model, you can use this XPath expression:
//p[@class='product-desc' and not(span/@class='product-model')]
Or, as a whole:
$node3 = $xp->query("//p[@class='product-desc' and not(span/@class='product-model')]");
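Note that this expression drops any p that contains such a span, so for the markup in the question it matches nothing. If the goal is the description text without the model number, one possible alternative (an assumption about what the asker wants, not part of the answer above) is to select only the text nodes that are direct children of the p:

```php
<?php
$x = '<div class="product-intro"><p class="product-desc">'
   . '<span class="product-model">234</span>Product Description</p></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($x);
libxml_clear_errors();
$xp = new DOMXPath($dom);

// text() selects the text nodes directly under <p>; the text inside the
// <span> belongs to the span element and is therefore skipped.
$description = '';
foreach ($xp->query("//p[@class='product-desc']/text()") as $textNode) {
    $description .= $textNode->nodeValue;
}
$description = trim($description);
echo $description;
```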

getting first images next to id with DOMXpath::query

<span class="byline">
<ul class="foobar"></ul>
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<p style="text-align: justify;">
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<hr>
Hi, this is my HTML. I can fetch all images using DOMDocument, but I want to get only the first image that comes after the ul with class foobar. I don't want the other images. How can I query for that?
I tried this, which returns all images:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($url);
//$xpath = new DomXpath($doc);
//$entries = $xpath->query("//div[@id='newsbox']/ul[@class='foobar']");
$elements = $dom->getElementsByTagName('img');
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "<br/>". $element->getAttribute('src'). ": ";
}
}

I think you can use DOMXPath query with this xpath expression:
$image = $xpath->query('//ul[@class="foobar"]/following-sibling::img')->item(0);
This will get the following img siblings for <ul class="foobar"> using following-sibling and then get the first item.
The $image is of type DOMElement.
In this example I've used loadHTML to load the html from a string $source.
If you want to load your html from a file, you could for example use loadHTMLFile.
$source = <<<SOURCE
<span class="byline">
<ul class="foobar"></ul>
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<p style="text-align: justify;">
<img alt="" src="resize_image.php?src=images/newsManagement/87600069ef0dffad5fd02f862ea3787b.jpg&w=675&h=675">
<hr>
SOURCE;
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($source);
$xpath = new DomXpath($dom);
$image = $xpath->query('//ul[@class="foobar"]/following-sibling::img')->item(0);
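As a variation, the position can be expressed inside the XPath itself with `following-sibling::img[1]`, which selects the nearest following img sibling. The sketch below uses simplified, well-formed markup and invented file names so that the sibling relationships are unambiguous; it also guards against the case where no image is found.

```php
<?php
// Invented sample: one direct sibling image, one nested in <p>, one later sibling.
$source = '<div><ul class="foobar"></ul><img alt="" src="first.jpg">'
        . '<p><img alt="" src="nested.jpg"></p><img alt="" src="later.jpg"></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($source);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// [1] on the following-sibling axis means "closest following sibling".
$image = $xpath->query('//ul[@class="foobar"]/following-sibling::img[1]')->item(0);

// item(0) is null when nothing matched, so check before reading the attribute.
$src = $image instanceof DOMElement ? $image->getAttribute('src') : null;
echo $src;
```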

how do I get sets of data with xpath

My below code retrieves a series of images from the search results of a site and also the corresponding age data. It works fine however I get a list of images followed by a list of the information in the age field.
img img img img age age age age and so on.
How do I combine these so I can display them in sets: img age img age img age
<?php
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.site.com/searchresults.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
$tags = $html->getElementsByTagName('img');
foreach ($tags as $tag) {
$image = $tag->getAttribute('src');
echo '<img src='. $image .' alt="image" ><br>';
}
foreach ($nodelist as $n)
{
echo $n->nodeValue."<br>";
}
?>
Sample page. I want to extract the img source and the title data from <div class="age" title="30 usa">:
<div id="sr-15763292" class="search-result">
<div class="thumb-wrapper">
<a class="bioLink" href="http://www.site.com/user/" title="View user"><img src="http://www.site.com/img/15763292.jpg" class="thumb" alt="user" width="140" height="105"></a>
<p class="status"><a href="http://www.site.com/user/" >Online</a></p>
</div>
<div class="rating">
<div class="rating-stars rating4"></div>
</div>
<div class="age" title="30 usa">
<p>30</p>
<p class="gender m">m</p>
<p>USA</p>
</div>
<div>
<p class="headline">Hello there.</p>
</div>
</div>

It's hard to answer if we don't know what the HTML looks like! Assuming it looks something like this
<div class="age"><p>21</p>
<img src="a.jpg" />
</div>
<div class="age"><p>51</p>
<img src="b.jpg" />
</div>
you need to find each div and then find the image inside each div. getElementsByTagName() will give you a list even if there's only one result, so use item() to fetch the first.
error_reporting(-1);
$html = new DOMDocument();
@$html->loadHtmlFile('results.html');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='age']" );
foreach ($nodelist as $node) {
$tags = $node->getElementsByTagName('img');
$image = $tags->item(0)->getAttribute('src');
echo '<img src="'. $image .'" alt="image" ><br>';
echo $node->textContent . '<br>';
}
If the HTML is like this
<div class="age"><p>21</p></div><img src="a.jpg" />
you can try
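$node->nextSibling
(nextSibling is a property, not a method, and it may be a whitespace text node rather than the img.)
As a general point, trace through the HTML and think: how do I get from A to B? Go forwards? Backwards? Up to the parent, across to the next node and down again ...?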
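For the sample page in the question, where the image and the age div both sit inside a `search-result` div, a sketch of the "up to the parent and down again" idea is to loop over each result container and query both values relative to it. The two-result sample, the file names and the `$results` array are assumptions for illustration.

```php
<?php
// Invented sample shaped like the search results in the question.
$page = '<div id="sr-1" class="search-result">
<div class="thumb-wrapper"><a class="bioLink" href="#"><img src="one.jpg" class="thumb" alt="user"></a></div>
<div class="age" title="30 usa"><p>30</p><p class="gender m">m</p><p>USA</p></div>
</div>
<div id="sr-2" class="search-result">
<div class="thumb-wrapper"><a class="bioLink" href="#"><img src="two.jpg" class="thumb" alt="user"></a></div>
<div class="age" title="41 uk"><p>41</p><p class="gender f">f</p><p>UK</p></div>
</div>';

libxml_use_internal_errors(true);
$html = new DOMDocument();
$html->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($html);

$results = [];
foreach ($xpath->query("//div[@class='search-result']") as $result) {
    // Both lookups are scoped to the current result, so image and age stay paired.
    $results[] = [
        'img' => $xpath->evaluate("string(.//img[@class='thumb']/@src)", $result),
        'age' => $xpath->evaluate("string(div[@class='age']/@title)", $result),
    ];
}

foreach ($results as $r) {
    echo '<img src="' . htmlspecialchars($r['img']) . '" alt="image"> '
       . htmlspecialchars($r['age']) . "<br>\n";
}
```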
