How to use the DOM parser in PHP

I'm new to DOM parsing in PHP:
I have an HTML file that I'm trying to parse. It has a bunch of DIVs like this:
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox">
......
I'm trying to get the contents of the many div boxes using PHP.
How can I use the DOM parser to do this?
Thanks!

First, note that you can't use the same id on two different divs; that is what classes are for. Every element should have a unique id.
Code to get the contents of the div with id="interestingbox":
$html = '
<html>
<head></head>
<body>
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox2">a link</div>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);
// if you want to get the div with id=interestingbox
$elements = $dom_xpath->query("*/div[@id='interestingbox']");
// query() returns false (not null) when the expression is invalid
if ($elements !== false) {
foreach ($elements as $element) {
echo "\n[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
//OUTPUT
[div] {
Content1
Content2
}
Example with classes:
$html = '
<html>
<head></head>
<body>
<div class="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div class="interestingbox">a link</div>
</body>
</html>';
//the same as before.. just change the xpath
[...]
$elements = $dom_xpath->query("*/div[@class='interestingbox']");
[...]
//OUTPUT
[div] {
Content1
Content2
}
[div] {
a link
}
Refer to the DOMXPath page for more details.
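If you need the contents of each box kept separate, one option is to run a second, relative query for every box. This is only a sketch: the sample markup and the variable names ($boxes, $contents) are made up for illustration, and it assumes the boxes use a class instead of a repeated id.

```php
<?php
// Hypothetical sample markup; the class name "interestingbox" follows the answer above.
$html = '<html><body>
<div class="interestingbox">
  <div class="txtnormal"><div>Content1</div><div>Content2</div></div>
</div>
<div class="interestingbox"><div class="txtnormal"><div>Content3</div></div></div>
</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$boxes = [];
foreach ($xpath->query("//div[@class='interestingbox']") as $box) {
    $contents = [];
    // The second argument makes the query relative to the current box.
    foreach ($xpath->query("./div/div", $box) as $inner) {
        $contents[] = trim($inner->textContent);
    }
    $boxes[] = $contents;
}
print_r($boxes);
```

Each entry of $boxes then holds the text of the inner divs of one box, so the boxes do not get mixed together.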

I got this to work using simplehtmldom as a start:
$html = file_get_html('http://example.com');
foreach ($html->find('div[id=interestingbox]') as $result)
{
echo $result->innertext;
}

Very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue
function innerXML($node)
{
$doc = $node->ownerDocument;
$frag = $doc->createDocumentFragment();
foreach ($node->childNodes as $child)
{
$frag->appendChild($child->cloneNode(TRUE));
}
return $doc->saveXML($frag);
}
$dom = new DOMDocument();
$dom->loadXML('
<html>
<body>
<table>
<tr>
<td id="foo">
The first bit of Data I want
<br />The second bit of Data I want
<br />The third bit of Data I want
</td>
</tr>
</table>
</body>
</html>
');
$xpath = new DOMXPath($dom);
$node = $xpath->evaluate("/html/body//td[@id='foo']");
$dataString = innerXML($node->item(0));
// saveXML() serializes the empty element as <br/>, without a space
$dataArr = explode("<br/>", $dataString);
$dataUno = $dataArr[0];
$dataDos = $dataArr[1];
$dataTres = $dataArr[2];
echo "firstdata = $dataUno<br />seconddata = $dataDos<br />thirddata = $dataTres<br />";
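If the input is HTML rather than well-formed XML, a similar helper can be built on DOMDocument::saveHTML(), which accepts a node argument. The helper name innerHTML and the sample table are assumptions for illustration, not part of the linked forum post.

```php
<?php
// Hypothetical helper name; serializes each child with DOMDocument::saveHTML().
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><table><tr><td id="foo">first<br>second<br>third</td></tr></table></body></html>');
$xpath = new DOMXPath($dom);
$td = $xpath->query("//td[@id='foo']")->item(0);

// The HTML serializer writes the line break as <br>, so split on that.
$parts = array_map('trim', explode('<br>', innerHTML($td)));
print_r($parts);
```

Splitting on a serialized tag is fragile; iterating over the text nodes of the cell would avoid depending on how the serializer writes the break.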

WebExtractor: https://github.com/knyga/webextractor
It can parse a page with CSS, regex, or XPath selectors.
See the package and its tests for examples:
use WebExtractor\DataExtractor\DataExtractorFactory;
use WebExtractor\DataExtractor\DataExtractorTypes;
use WebExtractor\Client\Client;
$factory = DataExtractorFactory::getFactory();
$extractor = $factory->createDataExtractor(DataExtractorTypes::CSS);
$client = new Client;
$content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
$extractor->setContent($content);
$h1 = $extractor->setSelector('h1')->extract();

Related

Get h2 html using Simple HTML DOM Parser

I have the HTML web page with this code:
<div class="col-sm-9 xs-box2">
<h2 class="title-medium br-bottom">Your Name</h2>
</div>
Now I want to use Simple HTML DOM Parser to get the text value of h2 in this div.
My code is:
$name = $html->find('h2[class="title-medium br-bottom"]');
echo $name;
But it always returns an error:
Notice: Array to string conversion in C:\xampp\htdocs\index.php on line 21
Array
How can I fix this error?
With Simple HTML DOM, find() returns an array of elements, so loop over the result and echo the text of each element:
foreach($html->find('h2') as $element){
echo $element->plaintext;
}
There are other methods to parse
Method 1.
You can get the H2 tags using the following code snippet, using DOMDocument and getElementsByTagName
$received_str = '<div class="col-sm-9 xs-box2">
<h2 class="title-medium br-bottom">Your Name</h2>
</div>';
$dom = new DOMDocument;
$dom->loadHTML($received_str);
$h2tags = $dom->getElementsByTagName('h2');
foreach ($h2tags as $_h2){
echo $_h2->getAttribute('class');
echo $_h2->nodeValue;
}
Method 2.
You can also parse it using XPath
$received_str = '<div class="col-sm-9 xs-box2">
<h2 class="title-medium br-bottom">Your Name</h2>
</div>';
$dom = new DOMDocument;
$dom->loadHTML($received_str);
$xpath = new DomXPath($dom);
$nodes = $xpath->query("//h2[@class='title-medium br-bottom']");
header("Content-type: text/plain");
foreach ($nodes as $i => $node) {
echo $node->nodeValue;
}

Getting all elements between html tag in php

I referred to this question.
But I want to iterate and get all the elements between the html tags.
This is what I did:
$homepage = file_get_contents('http://www.example.com');
Which will print the following
<html>
<body>
<div class = "alpha">hey</div>
<div class = "beta">one</div>
<div class = "beta">two</div>
</body>
</html>
Here I need to get all the elements with the class beta.
How can I do this?
Here's the code that I have tried so far:
$dom = new DOMDocument();
$dom->loadHTML($homepage);
foreach($dom->getAllElements as $element ){
if(!$element->hasClass('beta')){
echo $element;
}
}
But it says: DOMDocument::loadHTML(): Tag nav invalid in Entity
Try this:
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML("<html>
<body>
<div class = 'alpha'>hey</div>
<div class = 'beta'>one</div>
<div class = 'beta'>two</div>
</body>
</html>");
libxml_clear_errors();
$classname="beta";
$finder = new DomXPath($dom);
$spaner = $finder->query("//*[contains(@class, '$classname')]");
foreach($spaner as $element ){
print_r($element);
}
?>
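One caveat with contains(@class, 'beta'): it is a substring test, so it would also match a class such as "alphabeta". A common XPath 1.0 workaround pads the attribute with spaces and looks for the padded token. The helper name classQuery and the extra sample divs below are made up for this sketch.

```php
<?php
// Hypothetical helper: builds an XPath that matches $class as a whole token.
function classQuery(string $class): string
{
    return "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body>
<div class="alpha">hey</div>
<div class="beta">one</div>
<div class="beta wide">two</div>
<div class="alphabeta">three</div>
</body></html>');
$xpath = new DOMXPath($dom);

$texts = [];
foreach ($xpath->query(classQuery('beta')) as $element) {
    // nodeValue gives the text of the element rather than a dump of the object.
    $texts[] = $element->nodeValue;
}
print_r($texts);
```

The div with class "alphabeta" is skipped, while "beta wide" still matches because beta is one of its tokens.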

extract html inside <p class=" without simplehtmldom

I want to extract the HTML inside <p class="js-swt-text"> without simplehtmldom.
Below is the working example I did with simplehtmldom:
<?php
include('simple_html_dom.php');
$html = file_get_html("http://blabla.com");
foreach($html->find('.js-swt-text') as $ret) {
echo $ret;
}
?>
Please help me do this with DOMDocument:
<?php
$html = file_get_contents("http://blabla.com");
and echo the HTML from every <p class="js-swt-text">
You can use:
$html = file_get_contents("http://blabla.com");
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$items = $xpath->query('//p[@class="js-swt-text"]');
foreach ($items as $item) {
echo $item->textContent . "\n";
}
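Note that textContent returns only the text. If you want the markup of each paragraph, as echoing the element in simplehtmldom gives you, passing the node to saveHTML() is one way to get it. The inline sample string stands in for the fetched page and is an assumption for this sketch.

```php
<?php
// Stand-in for the fetched page; the URL in the question is a placeholder.
$html = '<html><body><p class="js-swt-text">Hello <b>world</b></p><p>skip</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$fragments = [];
foreach ($xpath->query('//p[@class="js-swt-text"]') as $item) {
    // saveHTML() with a node argument serializes that element including its tags.
    $fragments[] = $dom->saveHTML($item);
}
echo implode("\n", $fragments), "\n";
```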

Php DOM and Xpath - Replace node but keep children of old node

Consider the following html:
<html>
<title>Xyz</title>
<body>
<div>
<div class='mycls'>
<div>1 Books</div>
<div>2 Papers</div>
<div>3 Pencils</div>
</div>
</div>
<body>
</html>
$dom = new DOMDocument();
$dom->loadHTML([loaded html of remote url through curl]);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('html/body/div[@class="mycls"]');
Up to here it works fine. I need to replace the node to get the following:
<body>
<div>
<span>
<div>1 Books</div>
<div>2 Papers</div>
<div>3 Pencils</div>
</span>
</div>
<body>
Something like the following should work for you:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$oldNode = $xpath->query('//div[@class="mycls"]')->item(0);
$span = $dom->createElement('span');
if ($oldNode->hasChildNodes()) {
$children = [];
foreach ($oldNode->childNodes as $child) {
$children[] = $child;
}
foreach ($children as $child) {
$span->appendChild($child->parentNode->removeChild($child));
}
}
$oldNode->parentNode->replaceChild($span, $oldNode);
echo htmlspecialchars($dom->saveHTML());
Demo: http://codepad.viper-7.com/WNTrR5
Note that in the demo I have also fixed your HTML, which was utterly broken :-)
If your demo is really the HTML you are getting back from the cURL call and you cannot change it (no control over it), you can do:
$libxmlErrors = libxml_use_internal_errors(true); // at the start
and
libxml_use_internal_errors($libxmlErrors); // at the end
to prevent errors from popping up.
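Put together, the save-and-restore pattern might look like the sketch below. The sloppy sample markup is invented for illustration, and whether the parser reports anything for it depends on the libxml2 version, so the error list may be empty on some systems.

```php
<?php
// Deliberately sloppy, made-up markup: an unclosed div inside a <nav>.
$html = '<html><body><nav><div class="mycls"><div>1 Books</div></nav></body></html>';

$previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
$errors = libxml_get_errors();                // array of LibXMLError objects
libxml_clear_errors();
libxml_use_internal_errors($previous);        // restore the earlier setting

foreach ($errors as $error) {
    echo trim($error->message), " (line {$error->line})\n";
}
```

Restoring the previous value rather than hard-coding false keeps the snippet from changing behaviour for other code that relies on the setting.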

How to get the html header values like h1 using html parser dom?

How can I get a header value such as h1 or h2 that lies inside a div with some class name, using Simple HTML DOM parser?
Example:
<html>
<body>
<div class="somename">
<h1>MyText</h1>
</div>
</body>
</html>
See the Document Object Model
$doc = new DOMDocument();
$doc->loadHTML('<html> <body> <div class="somename"> <h1>MyText</h1> </div> </body> </html>');
$els = $doc->getElementsByTagName('h1');
foreach ($els as $el) {
echo $el->nodeValue;
}
You can use XPath to locate the h1 elements and then, optionally, remove them by looping like this:
$doc = ...; // your DOM document
$xpath = new DOMXPath($doc);
$elements = $xpath->query("//*[@class='somename']/h1");
if ($elements !== false) {
foreach ($elements as $element){
echo $element->nodeValue;
$element->parentNode->removeChild($element); //you may also delete elements
}
}
NOTE: I've written the code off the top of my head, so please check the documentation and examples.
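To pick up both h1 and h2 inside the div in one query, the self axis can be used. This is a sketch with made-up sample markup; the array name $headings is only for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body>
<div class="somename"><h1>MyText</h1><h2>Subtitle</h2></div>
<div class="other"><h1>Ignored</h1></div>
</body></html>');
$xpath = new DOMXPath($doc);

$headings = [];
// Direct children of div.somename that are either h1 or h2.
foreach ($xpath->query("//div[@class='somename']/*[self::h1 or self::h2]") as $heading) {
    $headings[$heading->nodeName] = trim($heading->nodeValue);
}
print_r($headings);
```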
