Replace Tag in HTML with DOMDocument

I'm trying to edit HTML tags with DOMDocument::loadHTML in PHP. The HTML data is a fragment, not a whole page. I followed the approach described on this page (PHP - DOMDocument - need to change/replace an existing HTML tag w/ a new one).
This should convert pre tags into div tags, but it fails with "Fatal error: Uncaught exception 'DOMException' with message 'Not Found Error'."
<?php
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
foreach ($dom->getElementsByTagName("pre") as $nodePre) {
    $nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
    $dom->replaceChild($nodeDiv, $nodePre);
}
echo $dom->saveHTML();
?>
[Edit]
When I try to iterate over the node list backwards, I get this error: 'Notice: Trying to get property of non-object...'
<?php
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
$domPre = $dom->getElementsByTagName('pre');
$length = $domPre->length;
for ($i = $length; $i > -1; $i--) {
    $nodePre = $domPre->item($i);
    echo $nodePre->nodeValue . '<br />';
    // $nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
    // $dom->replaceChild($nodeDiv, $nodePre);
}
// echo $dom->saveHTML();
?>
[Edit]
OK, solved. Since the code in the answer had an error, I'm posting the working solution here. Thanks, all.
Solution:
<?php
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
$domPre = $dom->getElementsByTagName('pre');
$length = $domPre->length;
for ($i = $length - 1; $i > -1; $i--) {
    $nodePre = $domPre->item($i);
    $nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
    $nodePre->parentNode->replaceChild($nodeDiv, $nodePre);
}
echo $dom->saveHTML();
?>

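The problem is the call to replaceChild(). It has to be called on the parent of the node being replaced, and $dom is not the parent of the pre elements, because loadHTML() wraps the fragment in html and body elements. Rather than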
$dom->replaceChild($nodeDiv, $nodePre);
use
$nodePre->parentNode->replaceChild($nodeDiv, $nodePre);
Update:
Here is working code. The node list returned by getElementsByTagName() is live, so replacing nodes while iterating forward makes the loop skip elements (more info here: http://php.net/manual/en/domnode.replacechild.php). You'll have to loop in reverse to replace the elements.
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
$elements = $dom->getElementsByTagName("pre");
for ($i = $elements->length - 1; $i >= 0; $i--) {
    $nodePre = $elements->item($i);
    $nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
    $nodePre->parentNode->replaceChild($nodeDiv, $nodePre);
}
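
Because the node list is live, another option is to copy it into a plain array before replacing anything, which makes a forward foreach safe. The following is a minimal sketch rather than the code from the answer above: the sample markup is made up, iterator_to_array() is used to take the snapshot, and createTextNode() replaces the second argument of createElement() so that an ampersand in the text does not have to be escaped by hand.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<pre>hi</pre><pre>hello</pre><pre>bye</pre>');

// Snapshot the live node list so replacements cannot shift the iteration.
$pres = iterator_to_array($dom->getElementsByTagName('pre'));

foreach ($pres as $nodePre) {
    $nodeDiv = $dom->createElement('div');
    // A text node is escaped on output, unlike the value passed to createElement().
    $nodeDiv->appendChild($dom->createTextNode($nodePre->textContent));
    // replaceChild() must be called on the parent of the node being replaced.
    $nodePre->parentNode->replaceChild($nodeDiv, $nodePre);
}

$remaining = $dom->getElementsByTagName('pre')->length;
$divs = $dom->getElementsByTagName('div')->length;
echo $dom->saveHTML();
```

The reverse loop and the snapshot solve the same problem; the snapshot costs one extra array but keeps the loop body free of index arithmetic.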

Another way, using paquettg/php-html-parser (I didn't find a way to change the tag name, so I had to use a hack that re-binds $this):
use PHPHtmlParser\Dom;
use PHPHtmlParser\Dom\HtmlNode;
$dom = new Dom;
$dom->load($text);
/** @var HtmlNode[] $tags */
foreach ($dom->find('pre') as $tag) {
    $changeTag = function () {
        $this->name = 'div';
    };
    $changeTag->call($tag->tag);
}
echo (string)$dom;

Related

Change outerHTML of a php DOMElement?

How do I change the outerHTML of an element using PHP's DOMDocument class? No third-party library such as Simple PHP DOM should be used.
For example, I want to do something like this:
$dom = new DOMDocument;
$dom->loadHTML($html);
$tag = $dom->getElementsByTagName('h3');
foreach ($tag as $e) {
    $e->outerHTML = '<h5>Hello World</h5>';
}
libxml_clear_errors();
$html = $dom->saveHTML();
echo $html;
And the output should be like this:
Old Output: <h3>Hello World</h3>
But I need this new output: <p>Hello World</p>

You can copy the element's content and attributes into a new node (with the new name you need) and swap it in with replaceChild().
The code below only works for simple elements (text inside a node). If you have nested elements, you will also need to carry over the child nodes, for example with a recursive function.
$dom = new DOMDocument;
$dom->loadHTML($html);
$titles = $dom->getElementsByTagName('h3');
for ($i = $titles->length - 1; $i >= 0; $i--)
{
    $title = $titles->item($i);
    $titleText = $title->textContent; // get original content of the node
    $newTitle = $dom->createElement('h5'); // create a new node with the correct name
    $newTitle->textContent = $titleText; // copy the content of the original node
    // copy the attributes (class, style, ...)
    $attributes = $title->attributes;
    for ($j = $attributes->length - 1; $j >= 0; --$j)
    {
        $attributeName = $attributes->item($j)->nodeName;
        $attributeValue = $attributes->item($j)->nodeValue;
        $newAttribute = $dom->createAttribute($attributeName);
        $newAttribute->nodeValue = $attributeValue;
        $newTitle->appendChild($newAttribute);
    }
    $title->parentNode->replaceChild($newTitle, $title); // replace the original node with our copy
}
libxml_clear_errors();
$html = $dom->saveHTML();
echo $html;
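
For nested elements, one way to avoid writing a recursive copy is to move the existing child nodes into the new element. The sketch below is an illustration under assumptions, not part of the answer above: renameElement() is a hypothetical helper name, and the sample markup is made up.

```php
<?php
// Hypothetical helper: builds a replacement element with a new tag name,
// copies the attributes, moves the children across, and swaps it in.
function renameElement(DOMElement $old, string $newName): DOMElement
{
    $new = $old->ownerDocument->createElement($newName);

    // Copy every attribute (class, style, ...).
    foreach ($old->attributes as $attr) {
        $new->setAttribute($attr->nodeName, $attr->nodeValue);
    }

    // appendChild() moves a node, so firstChild advances on every pass.
    while ($old->firstChild !== null) {
        $new->appendChild($old->firstChild);
    }

    $old->parentNode->replaceChild($new, $old);
    return $new;
}

$dom = new DOMDocument();
$dom->loadHTML('<h3 class="title">Hello <em>World</em></h3>');

// Iterate in reverse because the node list is live.
$titles = $dom->getElementsByTagName('h3');
for ($i = $titles->length - 1; $i >= 0; $i--) {
    renameElement($titles->item($i), 'h5');
}

$result = $dom->saveHTML();
echo $result;
```

Moving the children keeps nested markup such as the em element intact, which copying textContent alone would flatten to plain text.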

Getting link tag via DOMDocument

I convert an atom feed into RSS using atom2rss.xsl. Works fine.
Then, using DOMDocument, I try to get the post title and URL:
$feed = new DOMDocument();
$feed->loadHTML('<?xml encoding="utf-8" ?>' . $html);
if (!empty($feed) && is_object($feed)) {
    foreach ($feed->getElementsByTagName("item") as $item) {
        echo 'url: ' . $item->getElementsByTagName("link")->item(0)->nodeValue;
        echo 'title' . $item->getElementsByTagName("title")->item(0)->nodeValue;
    }
    return;
}
But the post URL is empty.
See this eval which contains HTML. What am I doing wrong? I suspect I am not getting the link tag properly via $item->getElementsByTagName("link")->item(0)->nodeValue.

I think the problem is that there are several <link> elements in each item, and the one (I think) you're interested in is the one with rel="self" as an attribute. The quickest way (without messing around with XPath) is to loop over each <link> element, check for the right rel value, and then take the href attribute from that one:
if (!empty($feed) && is_object($feed)) {
    foreach ($feed->getElementsByTagName("item") as $item) {
        $url = "";
        // Look for the 'right' link tag and extract URL from that
        foreach ($item->getElementsByTagName("link") as $link) {
            if ($link->getAttribute("rel") == "self") {
                $url = $link->getAttribute("href");
                break;
            }
        }
        echo 'url: ' . $url;
        echo 'title' . $item->getElementsByTagName("title")->item(0)->nodeValue;
    }
    return;
}
which gives...
url: https://www.blogger.com/feeds/2984353310628523257/posts/default/1947782625877709813titleExtraordinary Genius - Cp274
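
For comparison, the same lookup can be written with XPath. This is only a sketch under assumptions: the feed fragment and its example.com URLs are made up, and it assumes the converted feed is well-formed XML that can be loaded with loadXML() instead of loadHTML().

```php
<?php
// Made-up feed fragment for illustration.
$xml = <<<XML
<rss><channel>
<item>
<title>First post</title>
<link rel="replies" href="https://example.com/comments/1"/>
<link rel="self" href="https://example.com/posts/1"/>
</item>
</channel></rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($xml);
$xpath = new DOMXPath($feed);

$results = [];
foreach ($feed->getElementsByTagName('item') as $item) {
    // string() makes evaluate() return a plain PHP string;
    // the second argument makes the expression relative to this item.
    $url = $xpath->evaluate('string(link[@rel="self"]/@href)', $item);
    $title = $xpath->evaluate('string(title)', $item);
    $results[] = ['url' => $url, 'title' => $title];
    echo 'url: ' . $url . ' title: ' . $title . PHP_EOL;
}
```

If no matching link exists, string() yields an empty string, so no null check is needed before echoing.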

function get_links($link)
{
    $ret = array();
    $dom = new DOMDocument();
    @$dom->loadHTML(file_get_contents($link));
    $dom->preserveWhiteSpace = false;
    $links = $dom->getElementsByTagName('a');
    foreach ($links as $tag) {
        $ret[$tag->getAttribute('href')] = $tag->childNodes->item(0)->nodeValue;
    }
    return $ret;
}
print_r(get_links('http://www.google.com'));
Or you can use DOMXPath:
$html = file_get_contents('http://www.google.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// take all links
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url . "\n";
}

PHP: Remove a hyperlink from element but retain the text and class

I need to process a DOM and remove all hyperlinks to a particular site while retaining the underlying text. Thus, something like <a href="...">text</a> changes into text. Taking a cue from this thread, I wrote this:
$as = $dom->getElementsByTagName('a');
for ($i = 0; $i < $as->length; $i++) {
    $node = $as->item($i);
    $link_href = $node->getAttribute('href');
    if (strpos($link_href, 'offendinglink.com') !== false) {
        $cl = $node->getAttribute('class');
        $text = new DomText($node->nodeValue);
        $node->parentNode->insertBefore($text, $node);
        $node->parentNode->removeChild($node);
        $i--;
    }
}
This works fine, except that I also need to retain the class of the offending <a> tag and maybe turn it into a <div> or a <span>. Thus, I need this:
<a href="..." class="nice">text</a>
to turn into this:
<div class="nice">text</div>
How do I access the new element after it's been added (like in my code snippet)?

To quote: "How do I access the new element after it's been added (like in my code snippet)?" Your element is in $text, I think. Anyway, this should work if you need to keep the class and the textContent but nothing else:
// Copy the live node list into an array first; removing nodes while
// iterating the live list directly would skip elements.
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $url) {
    if (parse_url($url->getAttribute("href"), PHP_URL_HOST) !== 'badsite.com') {
        continue;
    }
    $ele = $dom->createElement("div");
    $ele->textContent = $url->textContent;
    $ele->setAttribute("class", $url->getAttribute("class"));
    $url->parentNode->insertBefore($ele, $url);
    $url->parentNode->removeChild($url);
}

Tested solution:
<?php
$str = "<b>Dummy</b> <a href='http://google.com' target='_blank' class='nice' id='nicer'>Google.com</a> <a href='http://yandex.ru' target='_blank' class='nice' id='nicer'>Yandex.ru</a>";
$doc = new DOMDocument();
$doc->loadHTML($str);
$anchors = $doc->getElementsByTagName('a');
$l = $anchors->length;
for ($i = 0; $i < $l; $i++) {
    // The list is live: each replacement removes the first anchor, so item(0) is always the next one
    $anchor = $anchors->item(0);
    $link = $doc->createElement('div', $anchor->nodeValue);
    $link->setAttribute('class', $anchor->getAttribute('class'));
    $anchor->parentNode->replaceChild($link, $anchor);
}
echo preg_replace(['/^\<\!DOCTYPE.*?<html><body>/si', '!</body></html>$!si'], '', $doc->saveHTML());
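
The regular expression in the last line strips the doctype, html, and body wrapper that saveHTML() adds. A possible alternative, sketched here with made-up input, is to serialize only the children of body by passing each node to saveHTML(). The wrapping div in the sample string is an assumption chosen so that body has a single known child.

```php
<?php
// Made-up input: the links sit inside a block element.
$str = "<div id='wrap'><a href='http://google.com' class='nice'>Google.com</a> "
     . "<a href='http://yandex.ru' class='nice'>Yandex.ru</a></div>";

$doc = new DOMDocument();
$doc->loadHTML($str);

// Replace each anchor with a div that keeps the class and the text.
$anchors = $doc->getElementsByTagName('a');
for ($i = $anchors->length - 1; $i >= 0; $i--) {
    $anchor = $anchors->item($i);
    $link = $doc->createElement('div');
    $link->appendChild($doc->createTextNode($anchor->textContent));
    $link->setAttribute('class', $anchor->getAttribute('class'));
    $anchor->parentNode->replaceChild($link, $anchor);
}

// Serialize only what is inside body, skipping the generated wrapper.
$body = $doc->getElementsByTagName('body')->item(0);
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $doc->saveHTML($child);
}
echo $fragment;
```

This avoids depending on the exact text of the generated wrapper, which the regular expression has to match literally.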

Get Element by ClassName with DOMdocument() Method

Here is what I am trying to achieve: retrieve all products on a page and put them into an array. Here is the code I am using:
$page2 = curl_exec($ch);
$doc = new DOMDocument();
@$doc->loadHTML($page2);
$nodes = $doc->getElementsByTagName('title');
$noders = $doc->getElementsByClassName('productImage');
$title = $nodes->item(0)->nodeValue;
$product = $noders->item(0)->imageObject.src;
It works for $title but not for the product. For reference, the img tag in the HTML looks like this:
<img alt="" class="productImage" data-altimages="" src="xxxx">
I have been looking at this (PHP DOMDocument how to get element?) but I still don't understand how to make it work.
PS: I get this error:
Call to undefined method DOMDocument::getElementsByclassName()

I finally used the following solution:
$classname = "blockProduct";
$finder = new DomXPath($doc);
$spaner = $finder->query("//*[contains(@class, '$classname')]");
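
Note that contains(@class, ...) is a substring test, so it also matches longer class names that merely start with or include the same text. A common refinement, shown here as a sketch with made-up markup, pads the class attribute with spaces and matches the class as a whole token. The productImageThumb class is hypothetical and only there to show what gets excluded.

```php
<?php
// Made-up markup: the second image has a class that only contains the name as a substring.
$html = '<div><img alt="" class="productImage large" src="a.jpg">'
      . '<img alt="" class="productImageThumb" src="b.jpg"></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$finder = new DOMXPath($doc);

$classname = 'productImage';
// Surround the normalized class list with spaces, then look for " name ".
$query = "//img[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";

$sources = [];
foreach ($finder->query($query) as $img) {
    // Attributes are read with getAttribute(), not as object properties.
    $sources[] = $img->getAttribute('src');
}
print_r($sources);
```

Only the first image is collected, because productImageThumb is a different class token.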

https://stackoverflow.com/a/31616848/3068233
Linking this answer as it helped me the most with this problem.
function getElementsByClass(&$parentNode, $tagName, $className) {
    $nodes = array();
    $childNodeList = $parentNode->getElementsByTagName($tagName);
    for ($i = 0; $i < $childNodeList->length; $i++) {
        $temp = $childNodeList->item($i);
        if (stripos($temp->getAttribute('class'), $className) !== false) {
            $nodes[] = $temp;
        }
    }
    return $nodes;
}
That's the code, and here's the usage:
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($html);
$content_node=$dom->getElementById("content_node");
$div_a_class_nodes=getElementsByClass($content_node, 'div', 'a');

function getElementsByClassName($dom, $ClassName, $tagName = null) {
    if ($tagName) {
        $Elements = $dom->getElementsByTagName($tagName);
    } else {
        $Elements = $dom->getElementsByTagName("*");
    }
    $Matched = array();
    for ($i = 0; $i < $Elements->length; $i++) {
        if ($Elements->item($i)->attributes->getNamedItem('class')) {
            if ($Elements->item($i)->attributes->getNamedItem('class')->nodeValue == $ClassName) {
                $Matched[] = $Elements->item($i);
            }
        }
    }
    return $Matched;
}
// usage
$dom = new \DOMDocument('1.0');
@$dom->loadHTML($html);
$elementsByClass = getElementsByClassName($dom, $className, 'h1');

Print an array after DOM extraction?

I need to print out my array, but the print_r($test) at the end doesn't work...
Here is a simplified version of the code:
$code = '<html><head></head><body><div class="list"><img src="http://google.com/564308080517287.jpg" alt="my title"></div></body></html>'; // Code is simplified here, but imagine you've got much more contents inside
$doc = new DOMDocument();
$doc->loadHTML( $code );
//
$test = array();
foreach ($doc->getElementsByTagName('div') as $div) {
    if ($div->getAttribute('class') == "list") {
        $ads_count = $div->getElementsByTagName('a')->length;
        for ($i = 0; $i <= $ads_count; $i++) {
            $ad = $div->getElementsByTagName('a')->item($i);
            $ad_img = trim($ad->getElementsByTagName('img')->item(0)->getAttribute('src'));
            $test[$i]['img'] = $ad_img;
        }
    }
}
print_r($test); // doesn't work!
Any ideas?

<?php
$code = '<html><head></head><body><div class="list">
<img src="http://google.com/564308080517287.jpg" alt="my title"></div></body></html>'; // Code is simplified here, but imagine you've got much more contents inside
$dom = new DOMDocument();
$dom->loadHtml($code);
$selector = new DOMXPath($dom);
$parceiltable = $selector->query("//div[@class='list']/a/img");
$test = array();
foreach ($parceiltable as $key => $tds) {
    $test[]['img'] = $tds->getAttribute('src');
}
print_r($test);
?>
