I'm new to DOM parsing in PHP:
I have an HTML file that I'm trying to parse. It has a bunch of DIVs like this:
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox">
......
I'm trying to get the contents of the many div boxes using php.
How can I use the DOM parser to do this?
Thanks!
First, I have to tell you that you can't use the same id on two different divs; that is what classes are for. Every element should have a unique id.
Code to get the contents of the div with id="interestingbox"
$html = '
<html>
<head></head>
<body>
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox2">a link</div>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);
// if you want to get the div with id=interestingbox
$elements = $dom_xpath->query("//div[@id='interestingbox']");
// query() returns false (not null) when the expression is invalid
if ($elements !== false) {
foreach ($elements as $element) {
echo "\n[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
//OUTPUT
[div] {
Content1
Content2
}
Example with classes:
$html = '
<html>
<head></head>
<body>
<div class="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div class="interestingbox">a link</div>
</body>
</html>';
//the same as before.. just change the xpath
[...]
$elements = $dom_xpath->query("//div[@class='interestingbox']");
[...]
//OUTPUT
[div] {
Content1
Content2
}
[div] {
a link
}
Refer to the DOMXPath page for more details.
I got this to work using simplehtmldom as a start:
$html = file_get_html('example.com');
foreach ($html->find('div[id=interestingbox]') as $result)
{
echo $result->innertext;
}
Very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue
function innerXML($node)
{
$doc = $node->ownerDocument;
$frag = $doc->createDocumentFragment();
foreach ($node->childNodes as $child)
{
$frag->appendChild($child->cloneNode(TRUE));
}
return $doc->saveXML($frag);
}
$dom = new DOMDocument();
$dom->loadXML('
<html>
<body>
<table>
<tr>
<td id="foo">
The first bit of Data I want
<br />The second bit of Data I want
<br />The third bit of Data I want
</td>
</tr>
</table>
</body>
</html>
');
$xpath = new DOMXPath($dom);
$node = $xpath->evaluate("/html/body//td[@id='foo']");
$dataString = innerXML($node->item(0));
// saveXML() serializes empty elements as <br/>, without the space
$dataArr = explode("<br/>", $dataString);
$dataUno = $dataArr[0];
$dataDos = $dataArr[1];
$dataTres = $dataArr[2];
echo "firstdata = $dataUno<br />seconddata = $dataDos<br />thirddata = $dataTres<br />";
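As an alternative to serializing the cell and splitting the string, the text nodes between the `<br>` elements can be collected directly. This is only a sketch: the sample markup and the variable names (`$cell`, `$parts`) are assumptions made for illustration, and it uses `loadHTML()` rather than `loadXML()`.

```php
<?php
// Assumed sample markup, modelled on the table cell above.
$html = '<table><tr><td id="foo">The first bit<br />The second bit<br />The third bit</td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$cell = $xpath->query("//td[@id='foo']")->item(0);

// Keep only the non-empty text nodes; the <br> elements are skipped.
$parts = [];
foreach ($cell->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        $text = trim($child->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
}

print_r($parts);
```

This avoids depending on how the serializer happens to write the `<br>` tag.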
WebExtractor: https://github.com/knyga/webextractor
It can parse a page with CSS, regex, or XPath selectors.
See the package and its tests for examples:
use WebExtractor\DataExtractor\DataExtractorFactory;
use WebExtractor\DataExtractor\DataExtractorTypes;
use WebExtractor\Client\Client;
$factory = DataExtractorFactory::getFactory();
$extractor = $factory->createDataExtractor(DataExtractorTypes::CSS);
$client = new Client;
$content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
$extractor->setContent($content);
$h1 = $extractor->setSelector('h1')->extract();
I have the HTML web page with this code:
<div class="col-sm-9 xs-box2">
<h2 class="title-medium br-bottom">Your Name</h2>
</div>
Now I want to use Simple HTML DOM Parser to get the text value of h2 in this div.
My code is:
$name = $html->find('h2[class="title-medium br-bottom"]');
echo $name;
But it always returns an error:
Notice: Array to string conversion in C:\xampp\htdocs\index.php on line 21
Array
How can I fix this error?
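With Simple HTML DOM, find() returns an array of elements unless you pass an index, so loop over the result (or call find('h2', 0)) and echo each element's text:
foreach($html->find('h2') as $element){
echo $element->plaintext;
}
There are other ways to parse it without Simple HTML DOM: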
Method 1.
You can get the H2 tags using the following code snippet, using DOMDocument and getElementsByTagName
$received_str = '<div class="col-sm-9 xs-box2">
<h2 class="title-medium br-bottom">Your Name</h2>
</div>';
$dom = new DOMDocument;
@$dom->loadHTML($received_str); // @ suppresses warnings about imperfect markup
$h2tags = $dom->getElementsByTagName('h2');
foreach ($h2tags as $_h2){
echo $_h2->getAttribute('class');
echo $_h2->nodeValue;
}
Method 2.
You can also parse it using XPath
$received_str = '<div class="col-sm-9 xs-box2">
<h2 class="title-medium br-bottom">Your Name</h2>
</div>';
$dom = new DOMDocument;
$dom->loadHTML($received_str);
$xpath = new DomXPath($dom);
$nodes = $xpath->query("//h2[@class='title-medium br-bottom']");
header("Content-type: text/plain");
foreach ($nodes as $i => $node) {
echo $node->nodeValue, "\n";
}
I referred to this question, but I want to iterate and get all the elements inside the html tag.
This is what I did:
$homepage = file_get_contents('http://www.example.com');
Which will print the following
<html>
<body>
<div class = "alpha">hey</div>
<div class = "beta">one</div>
<div class = "beta">two</div>
</body>
</html>
Here I need to get all the elements with the class beta.
How can I do this?
Here's the code that I tried so far:
$dom = new DOMDocument();
$dom->loadHTML($homepage);
foreach($dom->getAllElements as $element ){
if(!$element->hasClass('beta')){
echo $element;
}
}
But it says: DOMDocument::loadHTML(): Tag nav invalid in Entity
Try this
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML("<html>
<body>
<div class = 'alpha'>hey</div>
<div class = 'beta'>one</div>
<div class = 'beta'>two</div>
</body>
</html>");
libxml_clear_errors();
$classname="beta";
$finder = new DomXPath($dom);
$spaner = $finder->query("//*[contains(@class, '$classname')]");
foreach($spaner as $element ){
print_r($element);
}
?>
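One caveat with `contains(@class, ...)` is that it matches substrings, so a class such as `alphabeta` would also match `beta`. A common XPath 1.0 idiom is to pad the attribute with spaces and look for the padded token. The sketch below assumes a slightly extended sample document (the `wide` and `alphabeta` classes are made up for illustration).

```php
<?php
// Assumed sample markup; "beta wide" and "alphabeta" are added to show the difference.
$html = '<html><body>
<div class="alpha">hey</div>
<div class="beta">one</div>
<div class="beta wide">two</div>
<div class="alphabeta">three</div>
</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$classname = 'beta';
// Pad the class attribute with spaces so only whole class tokens match.
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";

$matches = [];
foreach ($xpath->query($query) as $element) {
    $matches[] = $element->textContent;
}

print_r($matches);
```

Here the element with class `alphabeta` is not selected, while `beta wide` is.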
I want to extract the HTML inside <p class="js-swt-text"> without simplehtmldom.
Below is the working example I did with simplehtmldom:
<?php
include('simple_html_dom.php');
$html = file_get_html("http://blabla.com");
foreach($html->find('.js-swt-text') as $ret) {
echo $ret;
}
?>
Please help me do this with DOMDocument:
<?php
$html = file_get_contents("http://blabla.com");
and echo the HTML of every <p class="js-swt-text">.
You can use:
$html = file_get_contents("http://blabla.com");
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$items = $xpath->query('//p[@class="js-swt-text"]');
foreach ($items as $item) {
echo $item->textContent . "\n";
}
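Note that `textContent` returns only the text, with any nested tags stripped. If the markup itself is wanted, as the question asks, passing the node to `DOMDocument::saveHTML()` serializes that element including its tags. A minimal sketch, using a made-up inline document instead of the remote page:

```php
<?php
// Assumed sample markup standing in for the downloaded page.
$html = '<html><body><p class="js-swt-text">Hello <b>world</b></p><p>skip</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$fragments = [];
foreach ($xpath->query('//p[@class="js-swt-text"]') as $item) {
    // saveHTML($node) serializes just this node, tags included.
    $fragments[] = $dom->saveHTML($item);
}

echo implode("\n", $fragments), "\n";
```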
Consider the following html:
<html>
<title>Xyz</title>
<body>
<div>
<div class='mycls'>
<div>1 Books</div>
<div>2 Papers</div>
<div>3 Pencils</div>
</div>
</div>
<body>
</html>
$dom = new DOMDocument();
$dom->loadHTML([loaded html of remote url through curl]);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('html/body/div[@class="mycls"]');
Up to here it works fine. I need to replace the node to get the following:
<body>
<div>
<span>
<div>1 Books</div>
<div>2 Papers</div>
<div>3 Pencils</div>
</span>
</div>
<body>
Something like the following should work for you:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$oldNode = $xpath->query('//div[@class="mycls"]')->item(0);
$span = $dom->createElement('span');
if ($oldNode->hasChildNodes()) {
$children = [];
foreach ($oldNode->childNodes as $child) {
$children[] = $child;
}
foreach ($children as $child) {
$span->appendChild($child->parentNode->removeChild($child));
}
}
$oldNode->parentNode->replaceChild($span, $oldNode);
echo htmlspecialchars($dom->saveHTML());
Demo: http://codepad.viper-7.com/WNTrR5
Note that in the demo I also have fixed your HTML which was utterly broken :-)
If your demo is really the HTML you are getting back from the cURL call and you cannot change it (no control over it), you can do:
$libxmlErrors = libxml_use_internal_errors(true); // at the start
and
libxml_use_internal_errors($libxmlErrors); // at the end
to prevent the errors from popping up.
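Put together, the save-and-restore pattern might look like the sketch below. The broken markup is invented for illustration, and which messages libxml reports (if any) depends on the installed libxml version, so the loop only prints whatever was collected.

```php
<?php
// Assumed, deliberately sloppy markup: the outer div is never closed.
$html = '<html><body><nav><div class="mycls"><div>1 Books</div></nav></body>';

// Collect parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Inspect the collected errors, then clear the buffer and restore the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```

Restoring the previous setting keeps the change local, so other code that relies on the default warning behaviour is not affected.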
How do I get a heading value such as h1 or h2 that lies inside a div with some class name, using a simple HTML DOM parser?
Example:
<html>
<body>
<div class="somename">
<h1>MyText</h1>
</div>
</body>
</html>
See the Document Object Model
$doc = new DOMDocument();
$doc->loadHTML('<html> <body> <div class="somename"> <h1>MyText</h1> </div> </body> </html>');
$els = $doc->getElementsByTagName('h1');
foreach ($els as $el) {
echo $el->nodeValue;
}
You can use XPath to locate the h1 elements, then print (and optionally remove) them in a loop like this:
$doc = ...; // your DOM document
$xpath = new DOMXPath($doc);
$elements = $xpath->query("//*[@class='somename']/h1");
// query() returns false (not null) when the expression is invalid
if ($elements !== false) {
foreach ($elements as $element){
echo $element->nodeValue;
$element->parentNode->removeChild($element); // optional: delete the element
}
}
NOTE: I wrote this code from memory, so please check it against the documentation and examples.
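One thing to watch when deleting nodes: the list returned by `getElementsByTagName()` is live, so removing elements while iterating over it directly can skip items. Copying the list into an array first sidesteps that. A minimal sketch, with a second `<h1>` added to the sample markup purely for illustration:

```php
<?php
$doc = new DOMDocument();
// Assumed sample markup; the second <h1> is added to show that none is skipped.
$doc->loadHTML('<html><body><div class="somename"><h1>MyText</h1><h1>Other</h1></div></body></html>');

$removed = [];
// iterator_to_array() takes a snapshot, so removals do not disturb the loop.
foreach (iterator_to_array($doc->getElementsByTagName('h1')) as $h1) {
    $removed[] = $h1->nodeValue;
    $h1->parentNode->removeChild($h1);
}

print_r($removed);
echo $doc->getElementsByTagName('h1')->length, "\n";
```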