Retrieve a text node with Simple HTML DOM Parser - php

I'm quite new to Simple HTML DOM Parser. I want to get a child element from the following HTML:
<div class="article">
<div style="text-align:justify">
<img src="image.jpg" title="image">
<br>
<br>
"Text to grab"
<div>......</div>
<br></br>
................
................
</div>
</div>
I'm trying to get the text "Text to grab".
So far I've tried the following query:
$html->find('div[class=article] div')->children(3);
But it's not working. Any idea how to solve this?

You don't need simple_html_dom here. It can be done with DOMDocument and DOMXPath. Both belong to PHP's bundled DOM extension, which is enabled by default.
Example:
// your sample data
$html = <<<EOF
<div class="article">
<div style="text-align:justify">
<img src="image.jpg" title="image">
<br>
<br>
"Text to grab"
<div>......</div>
<br></br>
................
................
</div>
</div>
EOF;
// create a document from the above snippet
// if you are loading the HTML from a remote URL, use instead:
// $doc->loadHTMLFile($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
// initialize an XPath selector
$selector = new DOMXPath($doc);
// get the text node (text in XML/HTML is a node, too)
$query = '//div[@class="article"]/div/br[2]/following-sibling::text()[1]';
$textToGrab = $selector->query($query)->item(0);
// remove newlines on start and end using trim() and output the text
echo trim($textToGrab->nodeValue);
Output:
"Text to grab"

If it's always in the same place you can do:
$html->find('.article text', 4);

Related

How exclude html comments from text node xpath?

I have the following HTML structure:
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
With the following query I get the second text node, but how do I get that node without the comment?
$spanx = $xpath->query('//a/div/div/span/text()[2]');
$span = $spanx->item($l)->nodeValue;
echo "<td>".$span."</td></tr>";
I get this result:
text node 2 //comments
What I want is:
text node 2
I've tested the following on my localhost. I've created the file named DOM_with_comment.html containing:
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
When I run:
<?php
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->preserveWhiteSpace = false;
$doc->loadHTMLFile('DOM_with_comment.html');
$xpath = new DOMXPath($doc);
echo "<pre>";
foreach ($xpath->query('//a/div/div/span/text()') as $item) {
var_dump($item->nodeValue);
}
The output is:
string(29) "
text node 1"
string(31) "
text node 2 "
string(14) "
"
The text() node test matches only text nodes, and a comment is a separate node type, so it never shows up in these results. Taking the first qualifying result [0] of your XPath query and printing its trim()med ->nodeValue with var_export() confirms that the targeted substring carries no comment text and no surrounding whitespace.
var_export(trim($xpath->query('//a/div/div/span/text()[2]')[0]->nodeValue));
// outputs: 'text node 2'
p.s. If your input is not coming from a file, but a variable, this works the same way:
$html = <<<HTML
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
HTML;
$doc->loadHTML($html);

DOMXPath to find specific image tag

My question is very direct. Here is my html dom,
<html>
...
<div class="A B">
<div class="C">
<img src="..." >
</div>
</div>
...
<div class="A">
</div>
...
</html>
Now I want to get the image's src in div[class="A B"] -> div[class="C"] -> img using DOMXPath in PHP code.
The main puzzle is that I do not know how to write its path correctly.
Update
I have tried "How to get data from HTML using regex", but it still doesn't work.
The actual html structure is :
My php code:
$doc = new DOMDocument();
$doc->loadHTML($html);
$title = $doc->getElementsByTagName('title')->item(0)->nodeValue;
$XPath = new DOMXPath($doc);
$vipImg = $XPath->query('//div[@class="show-midpic active-pannel"]/a/div[@class="zoomPad"]/img');
var_dump($vipImg);
foreach($vipImg as $vip)
{
var_dump($vip);
}
And the output is :
object(DOMNodeList)#2 (1) { ["length"]=> int(0) }
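An empty DOMNodeList means the expression is valid but matched nothing. A predicate such as @class="A B" compares the whole attribute string, so an extra class, a different class order, or stray whitespace makes it fail. It is also possible that some of the elements are only created by JavaScript in the browser and are absent from the HTML that PHP receives; the question does not show enough to tell. Below is a minimal sketch on made-up markup modelled on the simplified structure in the question (the file name pic.jpg is an assumption), comparing the exact match with a token-based class match:

```php
<?php
// Hypothetical markup modelled on the simplified structure in the question.
$html = '<html><body>
<div class="A B"><div class="C"><img src="pic.jpg"></div></div>
<div class="A"></div>
</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Exact match: the class attribute must equal "A B" character for character.
$exact = $xpath->query('//div[@class="A B"]/div[@class="C"]/img/@src');

// Token match: tolerates additional classes and a different class order.
$token = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " B ")]'
    . '//div[contains(concat(" ", normalize-space(@class), " "), " C ")]/img/@src'
);

// Guard against an empty result before dereferencing item(0).
$exactSrc = $exact->length ? $exact->item(0)->nodeValue : null;
$tokenSrc = $token->length ? $token->item(0)->nodeValue : null;

echo $exactSrc, "\n", $tokenSrc, "\n";
```

If the token match finds the image and the exact match does not, the class attribute in the real page differs from the string in the query. If neither matches, dumping $doc->saveHTML() and searching it for the class names shows whether the elements exist in the parsed document at all.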

PHP Regex - Remove text from HTML Tags

How to remove all text between tags.
Input
<div>
<p>testing</p>
<div>my world</div>
</div>
Output
<div>
<p></p>
<div></div>
</div>
You can use either DOMDocument or PHP Simple HTML DOM Parser.
The following example uses the latter, although you may want to use what suits you best.
include("simple_html_dom.php");
$str = '
<div>
<p>testing</p>
<div>my world</div>
</div>
';
$html = str_get_html($str);
foreach($html->find("text") as $ht) {
$ht->innertext = "";
}
$html->save();
echo $html;
You could use two capture groups and replace each match with just those groups, which drops the characters between them:
(\<.+\>).*(\<\/.+\>)
working example: http://ideone.com/Oq14El
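For comparison, here is a rough sketch of the DOMDocument route mentioned above, which avoids regular expressions entirely. It assumes the input has a single root element, because the LIBXML_HTML_NOIMPLIED flag used to suppress the html/body wrappers behaves best in that case. The variable names are illustrative only.

```php
<?php
// Sample input from the question, written without whitespace between tags.
$str = '<div><p>testing</p><div>my world</div></div>';

$doc = new DOMDocument();
// NOIMPLIED: do not add html/body wrappers; NODEFDTD: do not add a doctype.
$doc->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($doc);

// Select every text node, copy the list to an array, then detach each node.
foreach (iterator_to_array($xpath->query('//text()')) as $textNode) {
    $textNode->parentNode->removeChild($textNode);
}

$result = trim($doc->saveHTML());
echo $result;
```

Unlike the regex, this handles nested tags and attributes containing ">" because the parser, not a pattern, decides where text starts and ends.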

Extracting parts of an html code

Let's say I had the below HTML code:
<p>Test text</p>
<p><img src="test.jpg" /></p>
<div id="test"><p>test</p></div>
<div class="block">
<img src="test2.jpg">
</div>
<p>test</p>
Parameters:
There will exist a div block with class "block"
There can be any amount of HTML code above or below the div block with class "block"
There could even be two div blocks with class "block"
I was using PHP's XPath to look at this HTML code using DOM. I want to be able to return two things:
The div block with class "block"
All the rest of the code without the div element with class "block" in it
Something like:
Block Code:
<div class="block">
<img src="test2.jpg">
</div>
Original without block code:
<p>Test text</p>
<p><img src="test.jpg" /></p>
<div id="test"><p>test</p></div>
<p>test</p>
Using DOMDocument, you can do it like this:
$content = '<p>Test text</p>'.
'<p><img src="test.jpg" /></p>'.
'<div id="test"><p>test</p></div>'.
'<div class="block">'.
'<img src="test2.jpg">'.
'</div>'.
'<p>test</p>';
$blocks = array();
$doc = new DOMDocument();
$doc->loadHTML($content);
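// getElementsByTagName() returns a live list; copy it to an array so that
// removing nodes inside the loop cannot cause elements to be skipped
$elements = iterator_to_array($doc->getElementsByTagName("*"));
foreach ($elements as $element) {
if ($element->hasAttributes()) {
if ($element->getAttribute('class') == 'block') {
// add block HTML to block array
$blocks[] = $doc->saveHTML($element);
// remove block element
$element->parentNode->removeChild($element);
}
}
}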
echo '<pre>';
echo $blocks[0]; //iterate or print_r if multiple blocks
echo $doc->saveHTML();
echo '</pre>';
outputs the "block code" :
<div class="block"><img src="test2.jpg"></div>
and the "original without block code" :
<p>Test text</p><p><img src="test.jpg"></p><div id="test"><p>test</p></div><p>test</p>
DOMDocument "enriches" the HTML with a doctype and with html and body tags, which can be very annoying when you only want the markup you passed in. To get rid of them, you can use a helper function (DOMinnerHTML in the call below) that extracts the innerHTML of the body:
echo DOMinnerHTML($doc->getElementsByTagName('body')->item(0));
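The answer does not include the body of DOMinnerHTML. A minimal sketch of what such a helper might look like follows; this is an assumption about its behaviour, not the original author's code. It serializes each child of the given node and concatenates the results.

```php
<?php
/**
 * Hypothetical reconstruction of the DOMinnerHTML helper referenced above:
 * returns the serialized HTML of all child nodes of $element.
 */
function DOMinnerHTML(DOMNode $element): string
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes only that node and its descendants
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

// Usage on a small fragment; loadHTML() wraps it in html/body automatically.
$doc = new DOMDocument();
$doc->loadHTML('<p>Test text</p><div id="test"><p>test</p></div>');

$inner = DOMinnerHTML($doc->getElementsByTagName('body')->item(0));
echo $inner;
```

Because only the children of body are serialized, the doctype and the html/body wrappers that DOMDocument added do not appear in the result.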

PHP Strip_tags for div with a specific ID?

Does anybody know if a modified strip_tags function exists where you can specify the ID of the tags to be stripped, and possibly also specify that all the data inside those tags be removed? Take for example:
<div id="one">
<div id="two">
bla bla bla
</div>
</div>
Running:
new_strip_tags($data, 'two', true);
Must return:
<div id="one">
</div>
Is there something like this out there?
You can use DOMDocument and DOMXPath for that.
<?php
$html = '<html><head><title>...</title></head><body>
<div id="one">
<div id="two">
bla bla bla
</div>
</div>
</body></html>';
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$ns = $xpath->query('//div[@id="two"]');
// there can be only one... but anyway
foreach($ns as $node) {
$node->parentNode->removeChild($node);
}
echo $doc->saveHTML();
That's not exactly what strip_tags does: it strips the tags but leaves the content. What you want is something like this:
function remove_div_with_id($html, $id) {
return preg_replace('/<div[^>]+id="'.preg_quote($id, '/').'"[^>]*>(.*?)<\/div>/s', '', $html);
}
Note that this will not work correctly with nested tags. If you need that, you might want to use a DOM representation of your HTML.
