DOMDocument removing html elements - php

Here is my code:
$text = '<div class="cgus_post"><div class="imgbox"><img src="/cgmedia/default.gif"></div>
<h2 id="post-15055">
Willie Nelson Celebrates 80th Birthday Stoned and Auditioning for Gandalf</h2>
<p>This video pretty much sums up why Willie Nelson is fucking awesome. Willie decided to celebrate his 80th birthday by recording an ‘audition’ for Peter Jackson. Willie wants to take the reigns from Ian McKellan in The Hobbit 2, and decided to show off his acting skills and give some of his own wizardly advice. The result is hilarious. Watch …</p>
<br class="clear">
</div>';
$dom = new DomDocument();
$dom->loadHTML($text);
$classname = 'cgus_post';
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
foreach($nodes as $node){
echo $node->nodeValue;
}
The problem I am having is that I am querying for the div with the class cgus_post and it's returning just the text. How do I get it to return the HTML elements as well?

Here's my innerHTML function that I always use:
function innerHTML(DOMNode $node, $trim = true, $decode = true) {
$innerHTML = '';
foreach ($node->childNodes as $inner_node) {
$temp_container = new DOMDocument();
$temp_container->appendChild($temp_container->importNode($inner_node, true));
$innerHTML .= ($trim ? trim($temp_container->saveHTML()) : $temp_container->saveHTML());
}
return ($decode ? html_entity_decode($innerHTML) : $innerHTML);
}
So then you do:
$dom = new DOMDocument();
$dom->loadHTML($html);
echo htmlentities(innerHTML($dom->documentElement->childNodes->item(0)->firstChild));
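If you want the matched element's own markup (outer HTML) rather than only its children, a shorter route may be to pass the node to saveHTML(), which is available since PHP 5.3.6. This is only a minimal sketch: the shortened $html string and the $outer variable are made up for illustration, and libxml may insert line breaks between block-level elements in the output.

```php
<?php
// Shortened stand-in for the markup in the question (an assumption for this sketch).
$html = '<div class="cgus_post"><h2 id="post-15055">Title</h2><p>Body text</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$classname = 'cgus_post';
$finder = new DOMXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");

$outer = '';
foreach ($nodes as $node) {
    // Passing a node to saveHTML() serializes that node together with its descendants.
    $outer .= $dom->saveHTML($node);
}
echo $outer, PHP_EOL;
```

Compared with nodeValue, which concatenates the text of all descendants, this keeps the h2 and p tags in the result.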

Related

Xpath nodeValue/textContent unable to see <BR> tag

HTML is as follows:
<a>ABC<BR>DEF</a>
However, both nodeValue and textContent attributes show "ABCDEF" as the value.
Any way to show or parse the <BR>?
Maybe this'll help you: DOMNode::C14N
It returns the canonicalized XML of the node, so the <BR> shows up in the result (serialized as <br></br>).
<?php
$a = '<a>ABC<BR>DEF</a>';
$doc = new DOMDocument();
@$doc->loadHTML($a);
$finder = new DomXPath($doc);
$nodes = $finder->query("//a");
foreach ($nodes as $node) {
var_dump($node->c14n());
}
I know you have already solved your problem, but I wanted to add a more direct way of solving it...
$a = '<a>ABC<BR>DEF</a>';
$doc = new DOMDocument();
$doc->loadHTML($a);
$xp = new DomXPath($doc);
$nodes = $xp->query("//a/node()");
$text = '';
foreach ($nodes as $node) {
$text .= $doc->saveHTML($node);
}
echo $text;
Outputs...
ABC<br>DEF

How can I add an element into the middle of a text node's text?

Given the following HTML:
$content = '<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
</body>
</html>';
How can I alter it to the following HTML:
<html>
<body>
<div>
<p>During the <span>interim</span> there shall be nourishment supplied</p>
</div>
</body>
</html>
I need to do this using DomDocument. Here's what I've tried:
$dom = new DomDocument();
$dom->loadHTML($content);
$dom->preserveWhiteSpace = false;
$xpath = new DOMXpath($dom);
$elements = $xpath->query("//*[contains(text(),'interim')]");
if (!is_null($elements)) {
foreach ($elements as $element) {
$text = $element->nodeValue;
$element->nodeValue = str_replace('interim','<span>interim</span>',$text);
}
}
echo $dom->saveHTML();
However, this outputs literal html entities so it renders like this in the browser:
During the <span>interim</span> there shall be nourishment supplied
I imagine one should use createElement and appendChild methods instead of assigning nodeValue directly but I can't see how to insert an element in the middle of a textNode string?
Marcus Harrison's answer using splitText is a good one, but it can be simplified and needs to use mb_* methods to work with UTF-8 input:
<?php
$html = <<<END
<html>
<meta charset="utf-8">
<body>
<div>
<p>During € the interim there shall be nourishment supplied</p>
</div>
</body>
</html>
END;
$replace = 'interim';
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query(sprintf('//text()[contains(., "%s")]', $replace));
foreach ($nodes as $node) {
$start = mb_strpos($node->textContent, $replace);
$end = $start + mb_strlen($replace);
$node->splitText($end); // do this first
$node->splitText($start); // do this last
$newnode = $doc->createElement('span');
$node->parentNode->insertBefore($newnode, $node->nextSibling);
$newnode->appendChild($newnode->nextSibling);
}
$doc->encoding = 'UTF-8';
print $doc->saveHTML($doc->documentElement);
Create a new DomDocument with the modified element and use it to replace the old one:
foreach ($elements as $element) {
$text = $element->nodeValue;
$el = new DomDocument();
$el->loadHTML('<iframe>'. str_replace('interim','<span>interim</span>',$text) . '</iframe>');
$new = $dom->importNode($el->getElementsByTagName('iframe')->item(0), true);
unset($el);
$element->parentNode->replaceChild($new, $element);
}
In order to do this, you must use the DOMText::splitText method. It accepts an offset, which can be retrieved using strpos:
$dom = new DomDocument();
$dom->loadHTML($content);
$dom->preserveWhiteSpace = false;
$xpath = new DOMXpath($dom);
$elements = $xpath->query("//*[contains(text(),'interim')]");
if (!is_null($elements)) {
foreach ($elements as $element) {
$text = $element->childNodes->item(0);
$text->splitText(strpos($text->textContent, "interim"));
$text2 = $element->childNodes->item(1);
$text2->splitText(strpos($text2->textContent, " "));
$element->removeChild($text2);
$span = $dom->createElement("span");
$span->appendChild($dom->createTextNode("interim"));
$element->insertBefore($span, $element->childNodes->item(1));
}
}
echo $dom->saveHTML();
Edits: having just tested it, I realise I hadn't removed the original "interim" in the second text node. Edited this answer to do that. I have also edited this code to be as compatible with old versions of PHP as I can think of making it: as I don't run an old version of PHP it isn't possible for me to test that.
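For reference, the splitText approach from the answers above can be folded into a small reusable function. This is only a sketch under some assumptions: wrapText() is a hypothetical helper name, the mbstring extension is available, and the search string contains no double quote (it is interpolated into the XPath expression as-is).

```php
<?php
// Hypothetical helper: wraps every occurrence of $needle found in text nodes in a <$tag> element.
function wrapText(DOMDocument $doc, $needle, $tag = 'span')
{
    if ($needle === '') {
        return 0; // nothing to search for
    }
    $xpath = new DOMXPath($doc);
    // Collect the matching text nodes up front, because the tree is modified below.
    $nodes = iterator_to_array($xpath->query('//text()[contains(., "' . $needle . '")]'));
    $count = 0;
    foreach ($nodes as $node) {
        // DOM strings are UTF-8 in PHP, so offsets are computed with mb_* functions.
        while (($start = mb_strpos($node->data, $needle, 0, 'UTF-8')) !== false) {
            $match = $node->splitText($start);                       // $match begins at the needle
            $rest  = $match->splitText(mb_strlen($needle, 'UTF-8')); // $rest is the text after it
            $wrapper = $doc->createElement($tag);
            $match->parentNode->replaceChild($wrapper, $match);      // put the wrapper where the match was
            $wrapper->appendChild($match);                           // move the match inside the wrapper
            $node = $rest;                                           // keep scanning the remainder
            $count++;
        }
    }
    return $count;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><div><p>During the interim there shall be nourishment supplied</p></div></body></html>');
$count  = wrapText($doc, 'interim');
$result = $doc->saveHTML($doc->getElementsByTagName('p')->item(0));
echo $result, PHP_EOL;
```

Because the element is created with createElement() instead of being assigned through nodeValue, the span ends up in the tree as a real element and is not escaped into entities.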

Get all elements by class name using DOMDocument

This question seems to have been answered numerous times, but I still can't seem to put the pieces together.
I would like to get the node value of every element with a given class name. For example:
<td class="thename"><strong>32</strong></td>
<td class="thename"><strong>12</strong></td>
I would like to grab the 32 and the 12. I assume this requires some sort of for loop, but I'm not sure exactly how to go about implementing it. Here's what I have so far:
$domain = "http://domain.com";
$dom = new DOMDocument();
$dom->loadHTMLFile($domain);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[@class="thename"]')->item(0);
$stuff = $div->textContent;
echo($stuff);
Is this what you are looking for?
$result = array();
$doc = <<< HTML
<html>
<body>
<div>1
<span>2</span>
</div>
<div>3</div>
<div>4
<span class="class1"><strong>5</strong></span>
<span class="class1"><strong>6</strong></span>
<span>7</span>
</div>
</body>
</html>
HTML;
$classname = "class1";
$domdocument = new DOMDocument();
$domdocument->loadHTML($doc);
$a = new DOMXPath($domdocument);
$spans = $a->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
for ($i = $spans->length - 1; $i > -1; $i--) {
$result[] = $spans->item($i)->firstChild->nodeValue;
}
echo "<pre>";
print_r($result);
exit();
I simply did this in PHP:
$dom = new DOMDocument('1.0');
$classname = "product-name";
@$dom->loadHTMLFile("http://shophive.com/".$query);
$nodes = array();
$nodes = $dom->getElementsByTagName("div");
foreach ($nodes as $element)
{
$classy = $element->getAttribute("class");
if (strpos($classy, "product") !== false)
{
echo $classy;
echo '<br>';
}
}
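Applied to the markup from the question, the loop could look roughly like this. It is a minimal sketch: the two cells are wrapped in a table row so the snippet is self-contained, and $values is just an illustrative variable name. The key point is to iterate over the whole node list instead of taking only item(0).

```php
<?php
// The two cells from the question, wrapped in a table row so the snippet runs on its own.
$html = '<table><tr>'
      . '<td class="thename"><strong>32</strong></td>'
      . '<td class="thename"><strong>12</strong></td>'
      . '</tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$values = array();
// query() returns every match in document order; loop over all of them.
foreach ($xpath->query('//*[@class="thename"]') as $cell) {
    $values[] = trim($cell->textContent);
}
print_r($values);
```

The exact-match test @class="thename" only works when the attribute holds that single class; for elements with several classes, the concat/normalize-space expression from the answer above is the safer choice.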

XPATH/PHP - Smarter way to accomplish this?

I have the following:
$html = '<a href="/path/to/page.html"><img src="path/to/image.jpg" alt="Alt name" />Page name</a>';
I need to extract the href and src attributes and the anchor text.
My solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
$href = $node->getAttribute('href');
$title = $node->nodeValue;
}
foreach ($dom->getElementsByTagName('img') as $node) {
$img = $node->getAttribute('src');
}
What would be the smarter way?
You can avoid the loops if you use DOMXPath to grab the elements directly:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXpath( $dom);
$a = $xpath->query( '//a')->item( 0); // Get the first <a> node
$img = $xpath->query( '//img', $a)->item( 0); // Get the <img> child of that <a>
Now, you can do:
echo $a->getAttribute('href');
echo $a->nodeValue;
echo $img->getAttribute('src');
This will print:
/path/to/page.html
Page name
path/to/image.jpg
Possible alternative approach:
$dom = new DOMDocument;
$dom->loadHTML($html); // loadHTML() is called on an instance; the static call is deprecated
$domXpath = new DOMXPath($dom);
$href = $domXpath->query('//a/@href')->item(0)->nodeValue;
$src = $domXpath->query('//img/@src')->item(0)->nodeValue;
Empty/null checks are up to you.
http://ca2.php.net/manual/en/function.preg-match.php - if you want to use regex
or
http://php.net/manual/en/book.simplexml.php
if you need to use xml parsing.
// Simple xml
$xml = simplexml_load_string($html);
$attr = $xml->attributes();
echo 'href: ' . $attr['href'] . PHP_EOL;
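Another option that sidesteps most of the null checks is DOMXPath::evaluate() with the XPath string() function, which yields an empty string when nothing matches. A minimal sketch, assuming the anchor markup implied by the expected output shown above (the $html value here is reconstructed for illustration):

```php
<?php
// Assumed input: an <a> wrapping an <img> plus the link text.
$html = '<a href="/path/to/page.html"><img src="path/to/image.jpg" alt="Alt name" />Page name</a>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// evaluate() returns a typed result; string() turns the first match into a plain PHP string.
$href  = $xpath->evaluate('string(//a/@href)');
$src   = $xpath->evaluate('string(//a/img/@src)');
$title = $xpath->evaluate('normalize-space(//a)');

echo $href, PHP_EOL, $title, PHP_EOL, $src, PHP_EOL;
```

If the document has no matching node, each of these variables is simply '', so there is no item(0) call that could fail on null.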

find class name of html source using php

I am new to PHP. I want to write code to find the id specified in the HTML below, which is 1123. Can anyone give me some ideas?
<span class="miniprofile-container /companies/1123?miniprofile="
data-tracking="NUS_CMPY_FOL-nhre"
data-li-getjs="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=dyt8o4nwtaujeutlgncuqe0dn&fc=2">
<strong>
<a href="http://www.linkedin.com/nus-trk?trkact=viewCompanyProfile&pk=biz-overview-public&pp=1&poster=&uid=5674666402166894592&ut=NUS_UNIU_FOLLOW_CMPY&r=&f=0&url=http%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fcompany%2F1123%3Ftrk%3DNUS_CMPY_FOL-nhre&urlhash=7qbc">
Bank of America
</a>
</strong>
</span> has a new Project Manager
Note: I don't need the content in the span class. I need the id in the span class name.
I tried the following:
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xmlElements = simplexml_import_dom($dom);
$id = $xmlElements->xpath("//span [@class='miniprofile-container /companies/$data_id?miniprofile=']");
... but I don't know how to proceed further.
Depending on your needs, you could do:
$matches = array();
preg_match('|<span class="miniprofile-container /companies/(\d+)\?miniprofile|', $html, $matches);
print_r($matches);
This is a very trivial regex, but it could serve as a first suggestion. If you want to go via DomDocument or SimpleXML, you shouldn't mix the two the way you did in your example.
Which approach do you prefer? We can narrow this down then.
//edit: pretty much what @fireeyedboy said, but this is what I just fiddled together:
<?php
$html = <<<EOD
<html><head></head>
<body>
<span class="miniprofile-container /companies/1123?miniprofile="
data-tracking="NUS_CMPY_FOL-nhre"
data-li-getjs="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=dyt8o4nwtaujeutlgncuqe0dn&fc=2">
<strong>
<a href="#">
Bank of America
</a>
</strong>
</span> has a new Project Manager
</body>
</html>
EOD;
$domDocument = new DOMDocument('1.0', 'UTF-8');
$domDocument->recover = TRUE;
$domDocument->loadHTML($html);
$xPath = new DOMXPath($domDocument);
$relevantElements = $xPath->query('//span[contains(@class, "miniprofile-container")]');
$foundId = NULL;
foreach($relevantElements as $match) {
$pregMatches = array();
if (preg_match('|/companies/(\d+)\?miniprofile|', $match->getAttribute('class'), $pregMatches)) {
if (isset($pregMatches[1])) {
$foundId = $pregMatches[1];
break;
}
};
}
echo $foundId;
?>
This should do what you are after:
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
/*
* the following xpath query will find all class attributes of span elements
* whose class attribute contain the strings " miniprofile-container " and " /companies/"
*/
$nodes = $xpath->query( "//span[contains(concat(' ', @class, ' '), ' miniprofile-container ') and contains(concat(' ', @class, ' '), ' /companies/')]/@class" );
foreach( $nodes as $node )
{
// extract the number found between "/companies/" and "?miniprofile" in the node's nodeValue
preg_match( '#/companies/(\d+)\?miniprofile#', $node->nodeValue, $matches );
var_dump( $matches[ 1 ] );
}
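If you would rather avoid the regex entirely, XPath 1.0's own string functions can cut the number out of the attribute. This is a sketch only, using a trimmed-down copy of the span from the question; $id is an illustrative variable name, and the expression assumes the class always has the "/companies/<id>?miniprofile" shape.

```php
<?php
// Trimmed-down copy of the span from the question (an assumption for this sketch).
$html = '<span class="miniprofile-container /companies/1123?miniprofile=" '
      . 'data-tracking="NUS_CMPY_FOL-nhre"><strong><a href="#">Bank of America</a></strong></span>'
      . ' has a new Project Manager';

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// substring-after() drops everything up to "/companies/", substring-before() cuts at the "?".
$id = $xpath->evaluate(
    "substring-before(substring-after(//span[contains(@class, 'miniprofile-container')]/@class, '/companies/'), '?')"
);

// Only trust the value if it is purely numeric.
if (ctype_digit($id)) {
    echo $id, PHP_EOL;
}
```

When several spans match, the expression only looks at the first one in document order, so the foreach-based answers above remain the better fit for collecting every id on a page.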
