Parse between comments in Simple HTML Dom - php

Can I fetch the data between two HTML comments using Simple HTML Dom?
For example, see the code below:
<!-- start of comment -->
link1<br />
link2<br />
link3<br />
link4<br />
<!-- end of comment-->
link5<br />
link6<br />
There are six links in total, but only four of them are enclosed between the <!-- start of comment --> and <!-- end of comment--> comments.
I just want to get the links between the two comments.

You can do this:
// get all comments
$comments = $html->find('comment');
...then use next_sibling() to step through the elements that follow, checking whether each one is an anchor tag, and stop as soon as you reach the next comment.

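Try this code (note that it prints the text of every anchor in the document, not only those between the comments):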
$dom = new DOMDocument();
$dom->loadHTML($html);
$elements = $dom->getElementsByTagName('a');
foreach ($elements as $child) {
echo $child->nodeValue;
}
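A minimal sketch of how the filtering by comments could be done with DOMDocument and DOMXPath. It assumes the links are <a> elements and that they share a parent with both comments; the sample markup and the helper name linksBetweenComments are made up for illustration.

```php
<?php
// Hypothetical sample markup: the question's links are assumed to be <a> elements.
$html = <<<HTML
<div>
<!-- start of comment -->
<a href="#1">link1</a><br />
<a href="#2">link2</a><br />
<!-- end of comment-->
<a href="#5">link5</a><br />
</div>
HTML;

// Hypothetical helper: returns the text of every <a> that sits between the
// comment containing $startText and the comment containing $endText.
function linksBetweenComments(string $html, string $startText, string $endText): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//comment()') as $comment) {
        if (strpos($comment->nodeValue, $startText) === false) {
            continue; // not the opening comment
        }
        // Walk the following siblings until the closing comment is reached.
        for ($node = $comment->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeType === XML_COMMENT_NODE
                && strpos($node->nodeValue, $endText) !== false) {
                break;
            }
            if ($node->nodeType === XML_ELEMENT_NODE && $node->nodeName === 'a') {
                $links[] = $node->textContent;
            }
        }
        break; // only the first matching opening comment is handled
    }
    return $links;
}

print_r(linksBetweenComments($html, 'start of comment', 'end of comment'));
```

With the sample markup, only link1 and link2 are returned; link5 comes after the closing comment and is skipped.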

Related

Fetch nested tags in php using simplehtmldom

Let's say I have this code. I want to fetch the text of every p tag inside nested div tags. There can be up to 15 levels of nested div tags, so I want to write a script that digs through all the divs and returns the p tag data from them.
<div>
<div>
<div>
<p>Hi</p>
</div>
<p>Hello</p>
</div>
<p>Hey</p>
</div>
Required output (any order):
Hi
Hello
Hey
I have attempted the following:
function divDigger($div)
{
$internalP = $div->getElementsByTagName('p');
echo $internalP->innertext;
$internalDiv = $div->getElementsByTagName('div');
if (count($internalDiv) > 0) {
foreach ($internalDiv as $div) {
divDigger($div);
}
}
}
You may use the XPath API for this:
$doc = new \DOMDocument();
$doc->loadHTML($yourHtml);
$xpath = new \DOMXPath($doc);
foreach ($xpath->query('//div//p') as $pWithinDiv) {
echo $pWithinDiv->textContent, PHP_EOL;
}
This will find any <p> element under a <div> (not necessarily directly under it, otherwise you can change the expression to //div/p), and display its text content.
Demo: https://3v4l.org/43QqX
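To make the difference between //div//p and //div/p concrete, here is a small sketch. The markup is hypothetical and was chosen so that one <p> is not a direct child of any <div>; the helper name textsFor is also made up.

```php
<?php
// Hypothetical markup: "Nested" sits inside a <blockquote>, "Direct" sits
// directly under the <div>.
$doc = new DOMDocument();
$doc->loadHTML('<div><blockquote><p>Nested</p></blockquote><p>Direct</p></div>');
$xpath = new DOMXPath($doc);

// Hypothetical helper: run an XPath query and return the text of each match.
function textsFor(DOMXPath $xpath, string $query): array
{
    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = $node->textContent;
    }
    return $texts;
}

$anyDepth   = textsFor($xpath, '//div//p'); // <p> at any depth below a <div>
$directOnly = textsFor($xpath, '//div/p');  // <p> that is a direct child of a <div>

print_r($anyDepth);   // Nested, Direct
print_r($directOnly); // Direct
```

For the markup in the question, every <p> happens to be a direct child of some <div>, so both expressions would return the same three paragraphs there.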

How can I strip html tags except some of them?

I need to remove all html codes from a php string except:
<p>
<em>
<small>
The strip_tags() function is good, but it strips all HTML tags. How can I tell it to remove all HTML except the tags above?
You should check out the manual, in particular "Example #1 strip_tags() example".
Syntax: strip_tags(string, allowable_tags)
If you pass the second parameter, the tags listed in it will not be stripped:
strip_tags($string, '<p><em><small>');
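As a quick illustration of the allow-list behaviour, here is a sketch with a made-up input string:

```php
<?php
// Hypothetical input mixing allowed tags (<p>, <em>, <small>) with others.
$input = '<div><p>Hello <em>world</em> and <b>bold</b> <small>note</small></p></div>';

// Tags not named in the second argument are removed; their text is kept.
$clean = strip_tags($input, '<p><em><small>');

echo $clean, PHP_EOL;
// <p>Hello <em>world</em> and bold <small>note</small></p>
```

Note that strip_tags() leaves the attributes of allowed tags untouched, so it is not a sanitizer on its own.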
According to your comment, you want to remove HTML elements only if they have some class or attribute. You'll need to build up a DOM then:
<?php
$data = <<<DATA
<div>
<p>These line shall stay</p>
<p class="myclass">Remove this one</p>
<p>I will be deleted as well</p>
<p>But keep this</p>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
// select every element that has at least one attribute
$elements_to_be_removed = $xpath->query("//*[count(@*)>0]");
foreach ($elements_to_be_removed as $element) {
$element->parentNode->removeChild($element);
}
// just to check
echo $dom->saveHTML();
?>
To change which elements are removed, change the query. For example, to remove all elements with the class myclass, it must read "//*[@class='myclass']".
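A class-based variant might look like the sketch below. The markup is hypothetical. Because an exact comparison on the class attribute only matches when the attribute value is exactly myclass, the sketch uses the common concat/normalize-space idiom so that elements carrying several classes are matched too.

```php
<?php
// Hypothetical markup: one element has two classes.
$dom = new DOMDocument();
$dom->loadHTML('<div><p>Keep</p><p class="myclass">Remove</p><p class="note myclass">Remove too</p></div>');
$xpath = new DOMXPath($dom);

// Match "myclass" as a whole word inside the class attribute.
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' myclass ')]";
foreach ($xpath->query($query) as $element) {
    $element->parentNode->removeChild($element);
}

// Collect what is left, just to check.
$remaining = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    $remaining[] = $p->textContent;
}
print_r($remaining); // Keep
```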

PHP XPath how to wrap contents of p's in a span

I don't know if you can read JS/jQuery, but this is what I'd like to do server-side instead of client-side: $('p').wrapInner('<span class="contentsInP" />'); I'd like to take all existing paragraphs from a page and wrap their contents in a new span with a specific class.
Luckily all my documents are HTML5 in its XML flavour and are valid so that in PHP I can do this (simplified):
$xml=new DOMDocument();
$xml->loadXML($html);
$xpath = new DOMXPath($xml);
// How to go on in here to wrap my p's?
$output=$xml->saveXML();
How do I get PHP's DOMXPath to do my wrapping?
EDIT: Fiddled with this based on the comment but couldn't make it work
// based on http://stackoverflow.com/questions/8426391/wrap-all-images-with-a-div-using-domdocument
$xml=new DOMDocument();
$xml->loadXML(utf8_encode($temp));
$xpath = new DOMXPath($xml);
//Create new wrapper span
$new_span = $xml->createElement('span');
$new_span->setAttribute('class','contentsInP');
$ps = $xml->getElementsByTagName('p');
//Find all p
//Iterate through p
foreach ($ps AS $p) {
//Clone our created span
$new_span_clone = $new_span->cloneNode();
//Replace p with this wrapper span
$p->parentNode->replaceChild($new_span_clone,$p);
//Append the p's contents to wrapper span
// THIS IS THE PROBLEM RIGHT NOW:
$new_span_clone->appendChild($p);
}
$temp=$xml->saveXML();
The above wraps the p in a span but I need a span wrapping the p's contents while keeping the p around the span... Furthermore the above fails if the p has a class, then it won't be touched.
In attempting to adapt that other answer, the primary thing that needs to change with it is to get all child nodes of the <p> element, first remove them as children from <p> then append them as children onto the <span>. Then finally, append the <span> as a child node of the <p>.
$html = <<<HTML
<!DOCTYPE html>
<html>
<head><title>xyz</title></head>
<body>
<div>
<p><a>inner 1</a></p>
<p><a>inner 2</a><div>stuff</div><div>more stuff</div></p>
</div>
</body>
</html>
HTML;
$xml=new DOMDocument();
$xml->loadXML(utf8_encode($html));
//Create new wrapper span
$new_span = $xml->createElement('span');
$new_span->setAttribute('class','contentsInP');
$ps = $xml->getElementsByTagName('p');
//Find all p
//Iterate through p
foreach ($ps AS $p) {
//Clone our created span
$new_span_clone = $new_span->cloneNode();
// Get an array of child nodes from the <p>
// (because the foreach won't work properly over a live nodelist)
$children = array();
foreach ($p->childNodes as $child) {
$children[] = $child;
}
// Loop over that list of child nodes..
foreach ($children as $child) {
// Remove the child from the <p>
$p->removeChild($child);
// Append it to the span
$new_span_clone->appendChild($child);
}
// Lastly, append the <span> as a child to the <p>
$p->appendChild($new_span_clone);
}
$temp=$xml->saveXML();
Given the input HTML fragment, this should produce output like the following:
<!DOCTYPE html>
<html>
<head><title>xyz</title></head>
<body>
<div>
<p><span class="contentsInP"><a>inner 1</a></span></p>
<p><span class="contentsInP"><a>inner 2</a><div>stuff</div><div>more stuff</div></span></p>
</div>
</body>
</html>
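The same idea can be written more compactly by relying on the fact that appendChild() moves a node that is already in the document, so repeatedly appending the paragraph's first child to the span empties the paragraph. This is only a sketch; the helper name wrapParagraphContents and the sample XML are made up.

```php
<?php
// Hypothetical helper: wraps the contents of every <p> in <span class="$class">.
function wrapParagraphContents(DOMDocument $xml, string $class): void
{
    // Copy the live node list into an array before changing the tree.
    foreach (iterator_to_array($xml->getElementsByTagName('p')) as $p) {
        $span = $xml->createElement('span');
        $span->setAttribute('class', $class);

        // appendChild() moves each child out of the <p> and into the span.
        while ($p->firstChild !== null) {
            $span->appendChild($p->firstChild);
        }

        // The now-empty <p> gets the span as its only child.
        $p->appendChild($span);
    }
}

$xml = new DOMDocument();
$xml->loadXML('<div><p><a>inner 1</a> text</p><p class="intro">two</p></div>');
wrapParagraphContents($xml, 'contentsInP');

$result = $xml->saveXML($xml->documentElement);
echo $result, PHP_EOL;
// <div><p><span class="contentsInP"><a>inner 1</a> text</span></p><p class="intro"><span class="contentsInP">two</span></p></div>
```

Paragraphs that already have a class attribute are handled the same way as those without one.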

How to select Content of ALL div's with PHP

I want to select the contents of every DIV tag in PHP.
Imagine we have this HTML page:
<html>
<body>
<div class="one">Content1</div>
<span>blah..</span>
<div class="two">Content2</div>
</body>
</html>
Now I want to have the content of every DIV tag. For example, from that HTML code I want to have Content1 in one variable, Content2 in another variable, and so on.
I just need to access the parts easily.
Every page has a random number of DIV tags, so I need flexible code that detects the DIV tags and puts the content of each one in an array or any other type of variable.
How can I do it?
Use DOMDocument:
$divs = array();
$HTML = '<html>
<body>
<div class="one">Content1</div>
<span>blah..</span>
<div class="two">Content2</div>
</body>
</html>';
$doc = new DOMDocument();
$doc->loadHTML($HTML);
foreach($doc->getElementsByTagName('div') as $div) {
array_push($divs, $div->textContent);
}
var_dump($divs);
Try using the strip_tags() function:
http://php.net/manual/en/function.strip-tags.php
You can download PHP Simple HTML DOM Parser
and access the div tags like this:
$html = file_get_html('urltopage.com');
foreach($html->find('div') as $e)
echo $e->innertext . '<br>';

PHP preg_match_all - group without returning a match

How would I get content from HTML between h3 tags inside an element that has class pricebox? For example, the following string fragment
<!-- snip a lot of other html content -->
<div class="pricebox">
<div class="misc_info">Some misc info</div>
<h3>599.99</h3>
</div>
<!-- snip a lot of other html content -->
The catch is 599.99 has to be the first match returned, that is if the function call is
preg_match_all($regex,$string,$matches)
the 599.99 has to be in $matches[0][1] (because I use the same script to get numbers from dissimilar looking strings with different $regex - the script looks for the first match).
Try using XPath; definitely NOT RegEx.
Code:
$html = new DOMDocument();
// @ suppresses the warnings that malformed real-world HTML triggers
@$html->loadHTMLFile('http://www.path.to/your_html_file_html');
$xpath = new DOMXPath( $html );
$nodes = $xpath->query("//div[@class='pricebox']/h3");
foreach ($nodes as $node)
{
echo $node->nodeValue . PHP_EOL;
}
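Since the question asks for the first match only, a self-contained sketch that works on a string rather than a URL might look like this. The helper name firstPrice and the extra non-matching div in the sample are assumptions added for illustration.

```php
<?php
// Sample fragment based on the question, plus a hypothetical non-matching <div>.
$string = <<<HTML
<div class="other"><h3>not a price</h3></div>
<div class="pricebox">
<div class="misc_info">Some misc info</div>
<h3>599.99</h3>
</div>
HTML;

// Hypothetical helper: returns the text of the first <h3> that is a direct
// child of <div class="pricebox">, or null when there is none.
function firstPrice(string $html): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings silently
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $node = $xpath->query("//div[@class='pricebox']/h3")->item(0);

    return $node === null ? null : trim($node->textContent);
}

$price = firstPrice($string);
echo $price, PHP_EOL; // 599.99
```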
