Get h2 html using Simple HTML DOM Parser - php

I have an HTML page with this code:
<div class="col-sm-9 xs-box2">
<h2 class="title-medium br-bottom">Your Name</h2>
</div>
Now I want to use Simple HTML DOM Parser to get the text value of h2 in this div.
My code is:
$name = $html->find('h2[class="title-medium br-bottom"]');
echo $name;
But it always returns this error:
Notice: Array to string conversion in C:\xampp\htdocs\index.php on line 21
Array
How can I fix this error?

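With Simple HTML DOM, find() without an index returns an array of elements, which is why echo prints "Array". Either loop over the result or pass an index to get a single element:
foreach ($html->find('h2') as $element) {
    echo $element->plaintext;
}
// or, for the first match only:
echo $html->find('h2[class="title-medium br-bottom"]', 0)->plaintext;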
There are other ways to parse the page without Simple HTML DOM.
Method 1.
You can get the h2 tags with DOMDocument and getElementsByTagName:
$received_str = '<div class="col-sm-9 xs-box2">
<h2 class="title-medium br-bottom">Your Name</h2>
</div>';
$dom = new DOMDocument;
$dom->loadHTML($received_str);
$h2tags = $dom->getElementsByTagName('h2');
foreach ($h2tags as $_h2){
echo $_h2->getAttribute('class');
echo $_h2->nodeValue;
}
Method 2.
You can also select the element with XPath:
$received_str = '<div class="col-sm-9 xs-box2">
<h2 class="title-medium br-bottom">Your Name</h2>
</div>';
$dom = new DOMDocument;
$dom->loadHTML($received_str);
$xpath = new DomXPath($dom);
$nodes = $xpath->query("//h2[@class='title-medium br-bottom']");
header("Content-type: text/plain");
foreach ($nodes as $i => $node) {
    echo $node->nodeValue, "\n";
}
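The exact-match predicate above only works when the class attribute is written exactly as "title-medium br-bottom". If the class order or spacing may vary, matching a single class token is more forgiving. A minimal sketch, assuming a hypothetical helper findByClassToken() that is not part of DOM or of the original answer:

```php
<?php
// Hypothetical helper (not from the original answer): match one class token
// instead of comparing the whole class attribute string.
function findByClassToken(DOMDocument $dom, string $tag, string $class): DOMNodeList
{
    $xpath = new DOMXPath($dom);
    // Pad the normalized class list with spaces so that "title-medium"
    // matches only as a whole token, not as part of a longer class name.
    $query = sprintf(
        "//%s[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $tag,
        $class
    );
    return $xpath->query($query);
}

$dom = new DOMDocument();
// Same markup as in the question, but with the classes in a different order.
$dom->loadHTML('<div class="col-sm-9 xs-box2"><h2 class="br-bottom  title-medium">Your Name</h2></div>');

foreach (findByClassToken($dom, 'h2', 'title-medium') as $h2) {
    echo trim($h2->textContent), "\n";
}
```

The helper interpolates $tag and $class straight into the expression, so it is only suitable for trusted, hard-coded values.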

Parse HTML with PHP without removing the HTML tags?

I want to parse html using the php.
My html file is like this
<div class="main">
<div class="text">
Welcom to Stackoverflow
</div>
</div>
Now I want to extract only this part:
<div class="text">
Welcom to Stackoverflow
</div>
For this I wrote the following code:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@class="main"]');
foreach ($tags as $tag) {
var_dump(trim($tag->nodeValue));
}
This code gives only the text:
Welcom to Stackoverflow
but I want the tag as well. How can I do this?
If you only want the div with class "text", try this:
Change your query to: $xpath->query('//div[@class="text"]');
For the output, use: echo $dom->saveHTML($tag);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@class="text"]');
foreach ($tags as $tag) {
echo $dom->saveHTML( $tag );
}
The QueryPath library for HTML/XML parsing makes such things much easier.

How to write regex rule for div with data tag

How do I grab the data-src value with regex in PHP, for data-scale="small" only?
<div data-src="http://exampleurl.com/image_url_s.jpg" data-scale="small"></div>
<div data-src="http://exampleurl.com/image_url_b.jpg" data-scale="big"></div>
Instead of using regex, make effective use of DOM and XPath to do this for you.
// loadHTML() is an instance method; calling it statically fails on PHP 8
$doc = new DOMDocument();
$doc->loadHTML('
<div data-src="http://exampleurl.com/image_url_a.jpg" data-scale="small"></div>
<div data-src="http://exampleurl.com/image_url_b.jpg" data-scale="small"></div>
<div data-src="http://exampleurl.com/image_url_c.jpg" data-scale="big"></div>
<div data-src="http://exampleurl.com/image_url_d.jpg" data-scale="big"></div>
');
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@data-scale="small"]');
foreach ($nodes as $node) {
echo $node->getAttribute('data-src'), "\n";
}
Output
http://exampleurl.com/image_url_a.jpg
http://exampleurl.com/image_url_b.jpg

Iterate through elements with DOMDocument & DOMXPath

I am trying to iterate through every child element of the containing div:
$html = ' <div id="roothtml">
<h1>
Introduction</h1>
<p>text</p>
<h2>
text</h2>
<p>
test</p>
</div>';
And I have this PHP:
$dom = new DOMDocument();
$dom->loadHTML($html);
$dom->preserveWhitespace = false;
$xpath = new DOMXPath($dom);
$els = $xpath->query("/div");
print_r($els);
All I get though is DOMNodeList Object ( )
Having looked at the IBM tutorial I should be getting an array. What is it I am doing wrong?
Any help is appreciated.
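Your query string is wrong: /div only matches a div that is the root element of the document, and loadHTML() wraps the fragment in html and body elements, so nothing matches. Use //div instead.
Also note that query() returns a DOMNodeList, not an array. Iterate over the list like this: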
$els = $xpath->query("//div");
foreach( $els as $el) {
echo $el->textContent;
}
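The question asks for every child element of the container, which //div on its own does not give. A minimal sketch, assuming the same fragment as in the question (whitespace trimmed here) and illustrative variable names:

```php
<?php
$html = '<div id="roothtml"><h1>Introduction</h1><p>text</p><h2>text</h2><p>test</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// "/div" is anchored at the document root, which is <html>, so it is empty.
$rootDivs = $xpath->query('/div');
// The full absolute path to the same div goes through html and body.
$absolute = $xpath->query('/html/body/div');

// "*" selects element children only, skipping whitespace text nodes.
$children = $xpath->query("//div[@id='roothtml']/*");
$names = [];
foreach ($children as $child) {
    $names[] = $child->nodeName;
}
echo implode(', ', $names), "\n";
```

Selecting with /* avoids having to filter text nodes out of $el->childNodes by hand.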

PHP: Fetch content from a html page using xpath()

I'm trying to fetch the content of a div in a html page using xpath and domdocument. This is the structure of the page:
<div id="content">
<div class="div1"></div>
<span class="span1"></span>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<div class="div2"></div>
</div>
I want to get only the content of the p elements, not the spans and divs. I came up with the XPath expression .//*[@id='content']/p, but something's not right because I'm getting only the first p. I tried other expressions with following-sibling and node(), but they all return only the first p:
.//*[@id='content']/span/following-sibling::p
.//*[@id='content']/node()[self::p]
This is how I use XPath:
$domDocument=new DOMDocument();
$domDocument->encoding = 'UTF-8';
$domDocument->loadHTML($page);
$domXPath = new DOMXPath($domDocument);
$domNodeList = $domXPath->query($this->xpath);
$content = $this->GetHTMLFromDom($domNodeList);
And this is how I get the HTML from the nodes:
private function GetHTMLFromDom($domNodeList){
$domDocument = new DOMDocument();
$node = $domNodeList->item(0);
foreach($node->childNodes as $childNode)
$domDocument->appendChild($domDocument->importNode($childNode, true));
return $domDocument->saveHTML();
}
This XPath expression:
//div[@id='content']/p
selects the wanted node set (the five p elements).
EDIT: Now it's clear what your problem is. GetHTMLFromDom() only looks at item(0), the first p. You need to iterate over the whole node list:
private function GetHTMLFromDom($domNodeList){
    $domDocument = new DOMDocument();
    foreach ($domNodeList as $node) {
        $domDocument->appendChild($domDocument->importNode($node, true));
    }
    return $domDocument->saveHTML();
}
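An alternative that avoids building a second document is to serialize each matched node with saveHTML($node). A minimal sketch, assuming a hypothetical standalone helper nodeListToHtml() and sample markup modeled on the question's page structure:

```php
<?php
// Hypothetical helper name; concatenates the outer HTML of every node in the list.
function nodeListToHtml(DOMNodeList $nodes): string
{
    $html = '';
    foreach ($nodes as $node) {
        // saveHTML($node) serializes just this node, tags included.
        $html .= $node->ownerDocument->saveHTML($node);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<div id="content"><div class="div1"></div><span class="span1"></span>'
    . '<p>one</p><p>two</p><div class="div2"></div></div>'
);
$xpath = new DOMXPath($dom);

// Only the p children of #content are selected; the span and divs are skipped.
$paragraphs = nodeListToHtml($xpath->query("//div[@id='content']/p"));
echo $paragraphs, "\n";
```

Because nothing is imported into a new DOMDocument, the result is a plain fragment without an added doctype or wrapper elements.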

how to use dom php parser

I'm new to DOM parsing in PHP:
I have a HTML file that I'm trying to parse. It has a bunch of DIVs like this:
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox">
......
I'm trying to get the contents of the many div boxes using php.
How can I use the DOM parser to do this?
Thanks!
First, note that you can't use the same id on two different divs: an id must be unique within the document. Use classes for elements that repeat.
Code to get the contents of the div with id="interestingbox"
$html = '
<html>
<head></head>
<body>
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox2">a link</div>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
// use DOMXPath to navigate the HTML document
$dom_xpath = new DOMXPath($dom_document);
// select the div with id="interestingbox" anywhere in the document
$elements = $dom_xpath->query("//div[@id='interestingbox']");
// query() returns false (not null) when the expression is invalid
if ($elements !== false) {
foreach ($elements as $element) {
echo "\n[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
//OUTPUT
[div] {
Content1
Content2
}
Example with classes:
$html = '
<html>
<head></head>
<body>
<div class="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div class="interestingbox">a link</div>
</body>
</html>';
//the same as before.. just change the xpath
[...]
$elements = $dom_xpath->query("//div[@class='interestingbox']");
[...]
//OUTPUT
[div] {
Content1
Content2
}
[div] {
a link
}
Refer to the DOMXPath page for more details.
I got this to work using simplehtmldom as a start:
$html = file_get_html('http://example.com/');
foreach ($html->find('div[id=interestingbox]') as $result)
{
echo $result->innertext;
}
Very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue
function innerXML($node)
{
$doc = $node->ownerDocument;
$frag = $doc->createDocumentFragment();
foreach ($node->childNodes as $child)
{
$frag->appendChild($child->cloneNode(TRUE));
}
return $doc->saveXML($frag);
}
$dom = new DOMDocument();
$dom->loadXML('
<html>
<body>
<table>
<tr>
<td id="foo">
The first bit of Data I want
<br />The second bit of Data I want
<br />The third bit of Data I want
</td>
</tr>
</table>
</body>
</html>
');
$xpath = new DOMXPath($dom);
$node = $xpath->evaluate("/html/body//td[@id='foo']");
$dataString = innerXML($node->item(0));
// saveXML() serializes empty elements as <br/>, without the space
$dataArr = explode("<br/>", $dataString);
$dataUno = $dataArr[0];
$dataDos = $dataArr[1];
$dataTres = $dataArr[2];
echo "firstdata = $dataUno<br />seconddata = $dataDos<br />thirddata = $dataTres<br />";
WebExtractor: https://github.com/knyga/webextractor
It can parse a page with CSS, regex, and XPath selectors.
See the package and its tests for examples:
use WebExtractor\DataExtractor\DataExtractorFactory;
use WebExtractor\DataExtractor\DataExtractorTypes;
use WebExtractor\Client\Client;
$factory = DataExtractorFactory::getFactory();
$extractor = $factory->createDataExtractor(DataExtractorTypes::CSS);
$client = new Client;
$content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
$extractor->setContent($content);
$h1 = $extractor->setSelector('h1')->extract();
