Getting element content with simple_html_dom - PHP

I'm using simple_html_dom to get elements from HTML pages.
I have some div elements like this. All I want is to get the "Fine Thanks" sentence in each div (the text that is not inside any sub-element).
How can I do it?
<div class="right">
<h2>
Hello
</h2>
<br/>
<span>How Are You?</span>
<span>How Are You?</span>
<span>How Are You?</span>
Fine Thanks
</div>

It should be simply $html->find('div.right > text'), but that won't work because Simple HTML DOM Parser doesn't seem to support direct descendant queries.
So you'd have to find all <div> elements first and search the child nodes for a text node. Unfortunately, the ->childNodes() method is mapped to ->children() and thus only returns elements.
A working solution is to call ->find('text') on each <div> element, after which you filter the results based on the parent node.
foreach ($doc->find('div.right') as $parent) {
foreach ($parent->find('text') as $node) {
if ($node->parent() === $parent && strlen($t = trim($node->plaintext))) {
echo $t, PHP_EOL;
}
}
}
Using DOMDocument, this XPath expression will do the same work without the pain:
$doc = new DOMDocument;
$doc->loadHTML($content);
$xp = new DOMXPath($doc);
foreach ($xp->query('//div/text()') as $node) {
if (strlen($t = trim($node->textContent))) {
echo $t, PHP_EOL;
}
}
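The query above matches text nodes under every div. A minimal sketch of how it could be narrowed to divs carrying the "right" class, assuming compacted sample markup based on the question (the extra "left" div is a hypothetical addition to show the filtering):

```php
<?php
// Sample markup based on the question; the "left" div is an assumption for illustration.
$content = '<div class="right"><h2>Hello</h2><br/><span>How Are You?</span>Fine Thanks</div>'
         . '<div class="left">Ignored</div>';

$doc = new DOMDocument;
$doc->loadHTML($content);
$xp = new DOMXPath($doc);

// Select only text nodes that are direct children of a div whose class list contains "right".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " right ")]/text()';

$texts = [];
foreach ($xp->query($query) as $node) {
    $t = trim($node->textContent);
    if ($t !== '') {          // skip whitespace-only text nodes
        $texts[] = $t;
    }
}
echo implode(PHP_EOL, $texts), PHP_EOL;
```

The concat/normalize-space idiom also matches elements with several classes, such as class="right wide".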

There is no built-in method to read the text property in simple_html_dom.php,
but this should work:
include 'parser.php';
$html = str_get_html('<div class="right">
<h2>
Hello
</h2>
<br/>
<span>How Are You?</span>
<span>How Are You?</span>
<span>How Are You?</span>
Fine Thanks
</div>');
function readTextNode($element){
$local = $element;
$childs = count($element->childNodes());
for($i = 0; $i < $childs; $i++)
$local->childNodes($i)->outertext = '';
return $local->innertext;
}
echo readTextNode($html->find('div.right',0));

I would switch to phpQuery for this one. You still need to use the DOM, but it's not too painful:
require('phpQuery.php');
$html =<<<EOF
<div class="right">
<h2>
Hello
</h2>
<br/>
<span>How Are You?</span>
<span>How Are You?</span>
<span>How Are You?</span>
Fine Thanks
</div>
EOF;
$dom = phpQuery::newDocumentHTML($html);
foreach($dom->find("div.right > *:last") as $last_element){
echo $last_element->nextSibling->nodeValue;
}
Update
These days I'm recommending this simple replacement which does let you avoid the dom ugliness:
$doc = str_get_html($html);
foreach($doc->find('div.right > text:last') as $el){
echo $el->text;
}

public function removeNode($selector)
{
foreach ($this->find($selector) as $node)
{
$node->outertext = '';
}
$this->load($this->save());
}
Use this function to remove the h2 and span elements from the div, then get the div's content.
Reference: Simple HTML Dom: How to remove elements?

Related

Replace class content using php

I want to replace the text inside specific classes in an HTML document.
The HTML also contains other content that I don't want to change.
In the code below, I want to change the content of classes one and three only; the content of class two should stay as it is.
I need to do this dynamically.
<div class="one"> I want to change this </div>
<div class="two"> I don't want to change this </div>
<div class="three"> I want to change this </div> 
The DOM functions are helpful here; see the PHP manual.
//your html file content
$str = '...<div class="one"> I want to change this </div>
<div class="two"> I don\'t want to change this </div>
<div class="three"> I want to change this </div>... ';
$dom = new DOMDocument();
$dom->loadHtml($str);
$domXpath = new DOMXPath($dom);
//query the nodes matched
$list = $domXpath->query('//div[@class!="two"]');
if ($list->length > 0) {
foreach ($list as $node) {
//change node value
$node->nodeValue = 'Content changed!';
}
}
//get the result
$new_str = $dom->saveHTML();
var_dump($new_str);
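When the classes to change are known up front, the XPath condition can also be built from a list rather than by excluding class two. A minimal sketch, assuming each div has exactly one class value; the $targets array and the $result summary are hypothetical names used only for illustration:

```php
<?php
$str = '<div class="one"> I want to change this </div>'
     . '<div class="two"> I don\'t want to change this </div>'
     . '<div class="three"> I want to change this </div>';

$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);

// Hypothetical list of classes whose content should be replaced.
$targets = ['one', 'three'];
$conditions = array_map(function ($class) {
    return sprintf('@class="%s"', $class);
}, $targets);
// Produces: //div[@class="one" or @class="three"]
$query = '//div[' . implode(' or ', $conditions) . ']';

foreach ($xpath->query($query) as $node) {
    $node->nodeValue = 'Content changed!';
}

// Summarize the outcome per class to check which divs were touched.
$result = [];
foreach ($dom->getElementsByTagName('div') as $div) {
    $result[$div->getAttribute('class')] = trim($div->textContent);
}
print_r($result);
```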

simple html dom traversal confusion when looping

I'm trying to use the php script simplehtmldom to loop over divs on a web page while scraping.
Right now I have this:
$url = "https://test.com/";
$html = new simple_html_dom();
$html->load_file($url);
$item_list = $html->find('div.main div[id]');
foreach ($item_list as $item)
{
echo $item->outertext . PHP_EOL;
}
This will give me many like this (from the echo in the loop above):
<div id=1>
<div>
stuff here
</div>
<div>
<span class="title">name</span>
</div>
</div>
<div id=2>
<div>
stuff here
</div>
<div>
<span class="title">name 2</span>
</div>
</div>
What I'm trying to do is loop over the spans with class=title, but no matter what I try, I can't quite get the right selector. Could someone help me out?
You can get the spans by adding span[class=title] to the selector:
$item_list = $html->find('div.main div[id] span[class=title]');
foreach ($item_list as $item)
{
echo $item->outertext . PHP_EOL;
}
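For comparison, the same nesting can be expressed without the library. This is a sketch using PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom, assuming compacted sample markup modelled on the output shown in the question:

```php
<?php
// Sample markup modelled on the question's output; wrapped in div.main as the selector implies.
$html = '<div class="main">'
      . '<div id="1"><div>stuff here</div><div><span class="title">name</span></div></div>'
      . '<div id="2"><div>stuff here</div><div><span class="title">name 2</span></div></div>'
      . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Rough XPath counterpart of the CSS selector: div.main div[id] span[class=title]
$titles = [];
foreach ($xpath->query('//div[@class="main"]//div[@id]//span[@class="title"]') as $span) {
    $titles[] = $span->textContent;
}
echo implode(PHP_EOL, $titles), PHP_EOL;
```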

Weird HTML DOM produced by PHP

I'm retrieving the RSS feed of a blog with this code:
<?php
$xml = ("https://serembangirl.wordpress.com/feed/");
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');
for ($i=0; $i<=5; $i++) {
$item_title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$item_link=$x->item($i)->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
$item_desc=$x->item($i)->getElementsByTagName('description')
->item(0)->childNodes->item(0)->nodeValue;
$item_content=$x->item($i)->getElementsByTagName('encoded')->item(0)->nodeValue;
?>
<a href='#'>
<div class="card">
<div class='inner'>
<p class='title'>
<?php echo $item_title;?>
</p>
<p class='desc'> <?php echo $item_desc; ?> </p>
</div>
</div>
</a>
<?php } ?>
With the above code, the a element should supposedly wrap the div, but it produced this instead:
http://i.imgur.com/YspeRe3.png
I really scratched my head solving this.
I think div within anchor tag is not recommended.
Check the actual source code that is generated by PHP. It will have the div inside the a.
div, p or other block level elements are not allowed inside an a element. The browser tries to "fix" your document.
Hint 1
Use XPath to fetch data from the DOM.
$xpath = new DOMXPath($xmlDoc);
foreach ($xpath->evaluate('//item') as $item) {
$item_title = $xpath->evaluate('string(title)', $item);
// ...
}
Hint 2
Don't forget the escaping if you output data as HTML source.
...
<p class='title'>
<?php echo htmlspecialchars($item_title); ?>
</p>
...
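Both hints can be combined in one loop. The following is a minimal sketch, assuming a small inline feed invented for illustration (the item values and the $cards array are hypothetical); the content namespace URI is the one RSS feeds commonly declare for content:encoded:

```php
<?php
// Invented sample feed; a real feed would be loaded with $xmlDoc->load($url).
$rss = <<<XML
<?xml version="1.0"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Tea &amp; Cake</title>
      <link>https://example.com/1</link>
      <description>First post</description>
      <content:encoded><![CDATA[<p>Body</p>]]></content:encoded>
    </item>
  </channel>
</rss>
XML;

$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($rss);
$xpath = new DOMXPath($xmlDoc);
// Register the prefix explicitly so content:encoded can be addressed in expressions.
$xpath->registerNamespace('content', 'http://purl.org/rss/1.0/modules/content/');

$cards = [];
foreach ($xpath->evaluate('//item') as $item) {
    // string(...) returns an empty string instead of failing when a child is missing.
    $title = $xpath->evaluate('string(title)', $item);
    $body  = $xpath->evaluate('string(content:encoded)', $item);
    $cards[] = [
        'title' => htmlspecialchars($title), // escaped for output as HTML text
        'body'  => $body,                    // already HTML, left as-is here
    ];
}
print_r($cards);
```

Unlike the chained item(0) calls in the question, the string() form does not raise an error when an item lacks one of the elements.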

Fetching Image from particular div Only via DOMDocument in PHP

I have a website where I have posted a few images inside a particular div:
<div class="posts">
<div class="separator">
<img src="http://www.example.com/image.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
<div class="separator">
<img src="http://www.example.com/imagesda.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
.... few more images
</div>
From my second website, I want to fetch all images in that particular div. I have the code below.
<?php
$htmlget = new DOMDocument();
@$htmlget->loadHTMLFile('http://www.example.com');
$xpath = new DOMXPath( $htmlget);
$nodelist = $xpath->query( "//img/@src" );
foreach ($nodelist as $images){
$value = $images->nodeValue;
echo "<img src='".$value."' /><br />";
}
?>
But this fetches all images from my website, not just those in the particular div. It also prints out my RSS image, social icon images, etc.
Can I specify a particular div in my PHP code so that it only fetches images from the div with class posts?
First give an "id" to the outer div container. Then get it by its id. Then get its child nodes.
An example:
// getElementById returns a single element, not a list
$container = $dom->getElementById('node_id');
// get the number of child nodes (text nodes included)
echo $container->childNodes->length;
// content of each child
foreach($container->childNodes as $child)
{
echo $child->ownerDocument->saveHTML($child);
}
Maybe this link will help you. It has a good tutorial.
http://www.binarytides.com/php-tutorial-parsing-html-with-domdocument/
With PHP Simple HTML Parser, this will be:
include('simple_html_dom.php');
$html=file_get_html("http://your_web_site.com");
foreach($html->find('div.posts img') as $img_posts){
echo $img_posts->src . '<br>'; // to show the source attribute
}
I'm still reading about PHP Simple HTML DOM Parser, and so far it's faster (to implement) than regex.
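The restriction to one container can also be written directly in the XPath query from the question. A minimal sketch, assuming compacted sample markup; the sidebar div and its rss.png image are hypothetical additions that stand in for the unwanted images:

```php
<?php
// Sample page: one image outside div.posts (hypothetical) and two inside it.
$page = '<div class="sidebar"><img src="http://www.example.com/rss.png" /></div>'
      . '<div class="posts">'
      . '<div class="separator"><img src="http://www.example.com/image.jpg" /><p>text</p></div>'
      . '<div class="separator"><img src="http://www.example.com/imagesda.jpg" /><p>text</p></div>'
      . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($page);
$xpath = new DOMXPath($doc);

// Only src attributes of img elements nested anywhere inside div.posts.
$sources = [];
foreach ($xpath->query('//div[@class="posts"]//img/@src') as $attr) {
    $sources[] = $attr->nodeValue;
}

foreach ($sources as $src) {
    echo '<img src="' . htmlspecialchars($src) . '" /><br />', PHP_EOL;
}
```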
Here is another piece of code that may help. You are looking for
$doc->getElementsByTagName
which can help target a tag directly.
<?php
$myhtml = <<<EOF
<html>
<body>
<div class="posts">
<div class="separator">
<img src="http://www.example.com/image.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
<div class="separator">
<img src="http://www.example.com/imagesda.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
.... few more images
</div>
</body>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($myhtml);
$divs = $doc->getElementsByTagName('img');
foreach ($divs as $div) {
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
}
?>
Demo here http://codepad.org/keZkC377
Also the answer here can provide further insights
Not finding elements using getElementsByTagName() using DomDocument

PHP - GET tag from url

I want to get a specific tag from a URL, for example:
If I have this content:
<div id="hey">
<div id="bla"></div>
</div>
<div id="hey">
<div id="bla"></div>
</div>
And I want to get all divs with the id "hey" (I think it's done with preg_match_all). How can I do that?
The content inside the tag can change.
I recommend using the DOMDocument class instead of regular expressions (it consumes fewer resources and is clearer, IMHO).
$content = '<div id="hey">
<div id="bla"></div>
</div>
<div id="hey">
<div id="bla"></div>
</div>';
$doc = new DOMDocument();
@$doc->loadHTML($content); // @ suppresses warnings for possibly non-standard HTML
$xpath = new DOMXPath($doc);
$elements = $xpath->query("//div[@id='hey']");
/** @var $elements DOMNodeList */
for ($i=0;$i<$elements->length;$i++) {
/** @var $curr_element DOMElement */
$curr_element = $elements->item($i);
// Here do what you want with the element
var_dump($curr_element);
}
If you want to get the content from a URL, you can use this line instead to fill the variable $content:
$content = file_get_contents('http://yourserver/urls/page.php');
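As an alternative to the error-suppression operator, libxml's own error handling can be switched on before parsing. A minimal sketch, assuming the sample content from the question in compacted form; the $matches array is a hypothetical name, and each match is serialized with saveHTML so the tag can be used as a string:

```php
<?php
$content = '<div id="hey"><div id="bla"></div></div>'
         . '<div id="hey"><div id="bla"></div></div>';

// Collect parser complaints (e.g. about duplicate ids) internally instead of emitting warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Serialize every matching div, including its children, back to an HTML string.
$matches = [];
foreach ($xpath->query("//div[@id='hey']") as $element) {
    $matches[] = $doc->saveHTML($element);
}
print_r($matches);
```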
