Using DOMDocument to extract from HTML document by class

The DOMDocument class has methods to get elements by id and by tag name (getElementById and getElementsByTagName), but not by class. Is there a way to do this?
As an example, how would I select the div from the following markup?
<html>
...
<body>
...
<div class="foo">
...
</div>
...
</body>
</html>

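The simple answer is to use XPath:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//*[@class="foo"]')->item(0);
But that only matches when the class attribute is exactly "foo"; it fails when the element carries several space-separated classes. To select by one class out of a space-separated list, use this query (where class is the class name you are looking for):
//*[contains(concat(' ', normalize-space(@class), ' '), ' class ')]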
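A minimal sketch of how the space-separated class match could be wrapped for reuse. The function name findByClass and the sample markup are made up for illustration and are not part of DOMDocument; the sketch also assumes the class name contains no quote characters, since it is interpolated directly into the XPath string.

```php
<?php
// Hypothetical helper (not a DOMDocument method): returns every element whose
// class attribute contains $class as a whole, space-separated token.
// Assumption: $class contains no single quotes.
function findByClass(DOMXPath $xpath, string $class): DOMNodeList
{
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    return $xpath->query($query);
}

// Illustrative markup: two elements have the class "foo", one has "foobar".
$html = '<html><body>'
      . '<div class="foo bar">A</div>'
      . '<div class="foobar">B</div>'
      . '<div class="baz foo">C</div>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$matches = findByClass($xpath, 'foo');
foreach ($matches as $node) {
    echo $node->nodeValue, PHP_EOL; // prints A, then C
}
```

Padding both the attribute value and the class name with spaces is what keeps "foobar" from matching a search for "foo".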

$html = '<html><body><div class="foo">Test</div><div class="foo">ABC</div><div class="foo">Exit</div><div class="bar"></div></body></html>';
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$allClass = $xpath->query("//@class");
$allClassBar = $xpath->query("//*[@class='bar']");
echo "There are " . $allClass->length . " with a class attribute<br>";
echo "There are " . $allClassBar->length . " with a class attribute of 'bar'<br>";

In addition to ircmaxell's answer, if you need to select by one class out of a space-separated list:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$classname = 'foo';
$div = $xpath->query("//table[contains(@class, '$classname')]")->item(0);
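One caveat worth illustrating: a plain contains() on the class attribute is a substring test, so it also matches longer class names that merely include the search string. The sketch below compares it with the space-padded form on made-up markup; the variable names $loose and $strict are only for illustration.

```php
<?php
// Illustrative markup: one table has class "foobar", the other "foo wide".
$html = '<html><body>'
      . '<table class="foobar"><tr><td>X</td></tr></table>'
      . '<table class="foo wide"><tr><td>Y</td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Substring test: matches "foobar" as well as "foo wide".
$loose = $xpath->query("//table[contains(@class, 'foo')]");

// Token test: matches only elements that have "foo" as a whole class name.
$strict = $xpath->query(
    "//table[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]"
);

echo $loose->length, PHP_EOL;  // 2
echo $strict->length, PHP_EOL; // 1
```

If the class names in your document cannot overlap this way, the shorter query is fine; otherwise the padded form is the safer choice.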

Related

How can I add an element into the middle of a text node's text?

Given the following HTML:
$content = '<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
</body>
</html>';
How can I alter it to the following HTML:
<html>
<body>
<div>
<p>During the <span>interim</span> there shall be nourishment supplied</p>
</div>
</body>
</html>
I need to do this using DomDocument. Here's what I've tried:
$dom = new DomDocument();
$dom->loadHTML($content);
$dom->preserveWhiteSpace = false;
$xpath = new DOMXpath($dom);
$elements = $xpath->query("//*[contains(text(),'interim')]");
if (!is_null($elements)) {
foreach ($elements as $element) {
$text = $element->nodeValue;
$element->nodeValue = str_replace('interim','<span>interim</span>',$text);
}
}
echo $dom->saveHTML();
However, this outputs literal html entities so it renders like this in the browser:
During the <span>interim</span> there shall be nourishment supplied
I imagine one should use createElement and appendChild methods instead of assigning nodeValue directly but I can't see how to insert an element in the middle of a textNode string?

Marcus Harrison's answer using splitText is a good one, but it can be simplified and needs to use mb_* methods to work with UTF-8 input:
<?php
$html = <<<END
<html>
<meta charset="utf-8">
<body>
<div>
<p>During € the interim there shall be nourishment supplied</p>
</div>
</body>
</html>
END;
$replace = 'interim';
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query(sprintf('//text()[contains(., "%s")]', $replace));
foreach ($nodes as $node) {
$start = mb_strpos($node->textContent, $replace);
$end = $start + mb_strlen($replace);
$node->splitText($end); // do this first
$node->splitText($start); // do this last
$newnode = $doc->createElement('span');
$node->parentNode->insertBefore($newnode, $node->nextSibling);
$newnode->appendChild($newnode->nextSibling);
}
$doc->encoding = 'UTF-8';
print $doc->saveHTML($doc->documentElement);

Create a new DomDocument with modified element and replace the old one
foreach ($elements as $element) {
$text = $element->nodeValue;
$el = new DomDocument();
$el->loadHTML('<iframe>'. str_replace('interim','<span>interim</span>',$text) . '</iframe>');
$new = $dom->importNode($el->getElementsByTagName('iframe')->item(0), true);
unset($el);
$element->parentNode->replaceChild($new, $element);
}

In order to do this, you must use the splitText method of DOMText. This accepts an offset, which can be retrieved by using strpos:
$dom = new DomDocument();
$dom->loadHTML($content);
$dom->preserveWhiteSpace = false;
$xpath = new DOMXpath($dom);
$elements = $xpath->query("//*[contains(text(),'interim')]");
if (!is_null($elements)) {
foreach ($elements as $element) {
$text = $element->childNodes->item(0);
$text->splitText(strpos($text->textContent, "interim"));
$text2 = $element->childNodes->item(1);
$text2->splitText(strpos($text2->textContent, " "));
$element->removeChild($text2);
$span = $dom->createElement("span");
$span->appendChild($dom->createTextNode("interim"));
$element->insertBefore($span, $element->childNodes->item(1));
}
}
echo $dom->saveHTML();
Edits: having just tested it, I realise I hadn't removed the original "interim" in the second text node. Edited this answer to do that. I have also edited this code to be as compatible with old versions of PHP as I can think of making it: as I don't run an old version of PHP it isn't possible for me to test that.

How to extract the contents inside a div based on its class?

I tried this code:
$html= file_get_contents("page.html");
$dom = new DOMDocument;
$dom->loadHTML($html);
$div = $dom->getElementsByClassName('mydiv1');
$result = $dom->saveHTML($div);
echo $result;
page.html
<html>
<body>
<div id="test">
<div class="mydiv1">Hello</div>
<div class="mydiv2">How are you</div>
</div>
</body>
</html>
But it works when I use the id instead:
$html= file_get_contents("page.html");
$dom = new DOMDocument;
$dom->loadHTML($html);
$div = $dom->getElementById('test');
$result = $dom->saveHTML($div);
echo $result;
How can I get the content based on the class?

Try this code:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@class="mydiv1"]');
$div = $div->item(0);
$result = $dom->saveXML($div);
echo $result;

There is no getElementsByClassName in DOMDocument (yet), but the same result can be produced using DOMXPath:
$dom = new DomDocument();
$dom->loadHTMLFile($filePath); // load() parses XML; loadHTMLFile() is the HTML counterpart
$finder = new DomXPath($dom);
$nodes = $finder->query('//div[@class="mydiv1"]');
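To get from the matched nodes to the actual markup, one option is to serialize each node with saveHTML. The following is a sketch using the page.html markup from the question inlined as a string; the $outer and $inner arrays are illustrative names, and passing a node to saveHTML assumes PHP 5.3.6 or later.

```php
<?php
// Markup from the question, inlined so the example is self-contained.
$html = '<html><body><div id="test">'
      . '<div class="mydiv1">Hello</div>'
      . '<div class="mydiv2">How are you</div>'
      . '</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$outer = []; // each matching div, including its own tag
$inner = []; // only the content inside each matching div

foreach ($xpath->query('//div[@class="mydiv1"]') as $div) {
    $outer[] = $dom->saveHTML($div);

    // Serialize the children one by one to get the inner HTML.
    $content = '';
    foreach ($div->childNodes as $child) {
        $content .= $dom->saveHTML($child);
    }
    $inner[] = $content;
}

print_r($outer);
print_r($inner);
```

Use $outer when you want the div itself and $inner when you only want what it contains.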

How to get PHP DOM getElementsByTagName('body') with html tags

I'm getting the body content, but without the HTML tags inside the body (it is cleaned up). I need it with all the HTML tags inside the body. What do I need to change in my code?
$doc = new DOMDocument();
@$doc->loadHTMLFile($myURL);
$elements2 = $doc->getElementsByTagName('body');
foreach ($elements2 as $el2) {
echo $el2->nodeValue, PHP_EOL;
echo "<br/>";
}

You will need to save the body child nodes as HTML. I suggest using Xpath to fetch the nodes, this avoids the outer loop:
$html = <<<'HTML'
<html>
<body>
Foo
<p>Bar</p>
</body>
</html>
HTML;
$document = new DOMDocument();
$document->loadHtml($html);
$xpath = new DOMXpath($document);
$result = '';
foreach ($xpath->evaluate('//body/node()') as $node) {
$result .= $document->saveHtml($node);
}
var_dump($result);
Output:
string(29) "
Foo
<p>Bar</p>
"

DOM XPath Selector not grabbing classes

I was looking through the following stackoverflow question: Getting Dom Elements By Class name and it referenced that I can get class names with this code:
$text = '<html><body><div class="someclass someclass2">sometext</div></body></html>';
$dom = new DomDocument();
$dom->loadHTML($text);
$classname = 'someclass someclass2';
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
print "<pre>".print_r($nodes,true)."</pre>";
I also tried changing $classname to just one class:
$classname = 'someclass2';
I'm getting empty results. Any idea why?

You'll have to loop through the results, as print_r() will not print the members of a DOMNodeList. Like this:
$text = '<html><body><div class="someclass someclass2">sometext</div></body></html>';
$dom = new DomDocument();
$dom->loadHTML($text);
$classname = 'someclass someclass2';
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
// iterate through the result; print_r will not suffice
foreach($nodes as $node) {
echo $node->nodeValue;
}
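When debugging, it can help to check how many nodes matched before iterating. A small sketch under the same sample markup, searching for a single class; the $summary array is an illustrative name.

```php
<?php
$text = '<html><body><div class="someclass someclass2">sometext</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($text);
$finder = new DOMXPath($dom);

$classname = 'someclass2';
$nodes = $finder->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]"
);

// length tells you whether the query matched anything at all.
echo $nodes->length, PHP_EOL; // 1

// Collect something printable from each node instead of print_r on the list.
$summary = [];
foreach ($nodes as $node) {
    $summary[] = $node->nodeName . ': ' . $node->nodeValue;
}
print_r($summary);
```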

find class name of html source using php

I am new to PHP. I want to write code to find the id specified in the HTML below, which is 1123. Can anyone give me some ideas?
<span class="miniprofile-container /companies/1123?miniprofile="
data-tracking="NUS_CMPY_FOL-nhre"
data-li-getjs="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=dyt8o4nwtaujeutlgncuqe0dn&fc=2">
<strong>
<a href="http://www.linkedin.com/nus-trk?trkact=viewCompanyProfile&pk=biz-overview-public&pp=1&poster=&uid=5674666402166894592&ut=NUS_UNIU_FOLLOW_CMPY&r=&f=0&url=http%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fcompany%2F1123%3Ftrk%3DNUS_CMPY_FOL-nhre&urlhash=7qbc">
Bank of America
</a>
</strong>
</span> has a new Project Manager
Note: I don't need the content in the span class. I need the id in the span class name.
I tried the following:
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xmlElements = simplexml_import_dom($dom);
$id = $xmlElements->xpath("//span [@class='miniprofile-container /companies/$data_id?miniprofile=']");
... but I don't know how to proceed further.

Depending on your needs, you could do:
$matches = array();
preg_match('|<span class="miniprofile-container /companies/(\d+)\?miniprofile|', $html, $matches);
print_r($matches);
This is a very trivial regex, but it could serve as a first suggestion. If you want to go via DOMDocument or SimpleXML, you mustn't mix the two like you did in your example.
Which way do you prefer? We can narrow this down then.
//edit: pretty much what @fireeyedboy said, but this is what I just fiddled together:
<?php
$html = <<<EOD
<html><head></head>
<body>
<span class="miniprofile-container /companies/1123?miniprofile="
data-tracking="NUS_CMPY_FOL-nhre"
data-li-getjs="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=dyt8o4nwtaujeutlgncuqe0dn&fc=2">
<strong>
<a href="#">
Bank of America
</a>
</strong>
</span> has a new Project Manager
</body>
</html>
EOD;
$domDocument = new DOMDocument('1.0', 'UTF-8');
$domDocument->recover = TRUE;
$domDocument->loadHTML($html);
$xPath = new DOMXPath($domDocument);
$relevantElements = $xPath->query('//span[contains(@class, "miniprofile-container")]');
$foundId = NULL;
foreach($relevantElements as $match) {
$pregMatches = array();
if (preg_match('|/companies/(\d+)\?miniprofile|', $match->getAttribute('class'), $pregMatches)) {
if (isset($pregMatches[1])) {
$foundId = $pregMatches[1];
break;
}
}
}
echo $foundId;
?>

This should do what you are after:
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
/*
* the following xpath query will find all class attributes of span elements
* whose class attribute contain the strings " miniprofile-container " and " /companies/"
*/
$nodes = $xpath->query( "//span[contains(concat(' ', @class, ' '), ' miniprofile-container ') and contains(concat(' ', @class, ' '), ' /companies/')]/@class" );
foreach( $nodes as $node )
{
// extract the number found between "/companies/" and "?miniprofile" in the node's nodeValue
preg_match( '#/companies/(\d+)\?miniprofile#', $node->nodeValue, $matches );
var_dump( $matches[ 1 ] );
}
