PHP: Remove a hyperlink from element but retain the text and class - php

I need to process a DOM and remove all hyperlinks to a particular site while retaining the underlying text. Thus, something like <a href="http://offendinglink.com/">text</a> changes into text. Taking a cue from this thread, I wrote this:
$as = $dom->getElementsByTagName('a');
for ($i = 0; $i < $as->length; $i++) {
$node = $as->item($i);
$link_href = $node->getAttribute('href');
if (strpos($link_href,'offendinglink.com') !== false) {
$cl = $node->getAttribute('class');
$text = new DomText($node->nodeValue);
$node->parentNode->insertBefore($text, $node);
$node->parentNode->removeChild($node);
$i--;
}
}
This works fine, except that I also need to retain the class attribute of the offending <a> tag and maybe turn the tag into a <div> or a <span>. Thus, I need this:
<a href="http://offendinglink.com/" class="nice">text</a>
to turn into this:
<div class="nice">text</div>
How do I access the new element after it's been added (like in my code snippet)?

Regarding "How do I access the new element after it's been added (like in my code snippet)?": your new node is still in $text, I think. Anyway, this should work if you need to keep the class and the textContent but nothing else. The node list is copied into an array first, because removing nodes while iterating the live list can skip elements:
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $url) {
    if (parse_url($url->getAttribute("href"), PHP_URL_HOST) !== 'badsite.com') {
        continue;
    }
    $ele = $dom->createElement("div");
    $ele->textContent = $url->textContent;
    $ele->setAttribute("class", $url->getAttribute("class"));
    $url->parentNode->insertBefore($ele, $url);
    $url->parentNode->removeChild($url);
}
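The two approaches above can be combined into one small helper. This is only a sketch under a few assumptions: the function name unlinkHost is made up for illustration, the host is matched exactly (so www.offendinglink.com would not match), and the anchor's child nodes are moved into the new <span> so nested markup survives.

```php
<?php
// Hypothetical helper (not from the thread): replace every <a> that points at
// $host with a <span> that keeps the class attribute and the anchor's children.
function unlinkHost(DOMDocument $dom, string $host): int
{
    $replaced = 0;
    // Copy the live DOMNodeList into an array so replacements do not skip nodes.
    foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
        $linkHost = parse_url($a->getAttribute('href'), PHP_URL_HOST);
        if ($linkHost !== $host) {
            continue;
        }
        $span = $dom->createElement('span');
        if ($a->hasAttribute('class')) {
            $span->setAttribute('class', $a->getAttribute('class'));
        }
        // Move (not copy) the children, so nested markup such as <b> is kept.
        while ($a->firstChild !== null) {
            $span->appendChild($a->firstChild);
        }
        $a->parentNode->replaceChild($span, $a);
        $replaced++;
    }
    return $replaced;
}

$dom = new DOMDocument();
$dom->loadHTML('<p><a href="http://offendinglink.com/x" class="nice">te<b>x</b>t</a> '
    . '<a href="http://example.org/">keep</a></p>');
$count = unlinkHost($dom, 'offendinglink.com');
$html = $dom->saveHTML();
```

The helper returns the replaced element count, which makes it easy to check whether anything matched. To answer the original question directly: the new element is simply the variable you created ($span here), and it stays usable after it has been inserted.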

Tested solution:
<?php
$str = "<b>Dummy</b> <a href='http://google.com' target='_blank' class='nice' id='nicer'>Google.com</a> <a href='http://yandex.ru' target='_blank' class='nice' id='nicer'>Yandex.ru</a>";
$doc = new DOMDocument();
$doc->loadHTML($str);
$anchors = $doc->getElementsByTagName('a');
$l = $anchors->length;
for ($i = 0; $i < $l; $i++) {
$anchor = $anchors->item(0);
$link = $doc->createElement('div', $anchor->nodeValue);
$link->setAttribute('class', $anchor->getAttribute('class'));
$anchor->parentNode->replaceChild($link, $anchor);
}
echo preg_replace(['/^\<\!DOCTYPE.*?<html><body>/si', '!</body></html>$!si'], '', $doc->saveHTML());

Related

Change outerHTML of a php DOMElement?

How do I change the outerHTML of an element using PHP's DOMDocument class? No third-party library, such as Simple PHP Dom, should be used.
For example:
I want to do something like this.
$dom = new DOMDocument;
$dom->loadHTML($html);
$tag = $dom->getElementsByTagName('h3');
foreach ($tag as $e) {
$e->outerHTML = '<h5>Hello World</h5>';
}
libxml_clear_errors();
$html = $dom->saveHTML();
echo $html;
And the output should be like this:
Old Output: <h3>Hello World</h3>
But I need this new output: <h5>Hello World</h5>
You can copy the element's content and attributes into a new node (with the new name you need), and then swap it in with replaceChild().
The code below works only for simple elements (a node that contains just text); if you have nested elements, you will need to write a recursive function.
$dom = new DOMDocument;
$dom->loadHTML($html);
$titles = $dom->getElementsByTagName('h3');
for($i = $titles->length-1 ; $i >= 0 ; $i--)
{
$title = $titles->item($i);
$titleText = $title->textContent ; // get original content of the node
$newTitle = $dom->createElement('h5'); // create a new node with the correct name
$newTitle->textContent = $titleText ; // copy the content of the original node
// copy the attribute (class, style, ...)
$attributes = $title->attributes ;
for($j = $attributes->length-1 ; $j>= 0 ; --$j)
{
$attributeName = $attributes->item($j)->nodeName ;
$attributeValue = $attributes->item($j)->nodeValue ;
$newAttribute = $dom->createAttribute($attributeName);
$newAttribute->nodeValue = $attributeValue ;
$newTitle->appendChild($newAttribute);
}
$title->parentNode->replaceChild($newTitle, $title); // replace original node per our copy
}
libxml_clear_errors();
$html = $dom->saveHTML();
echo $html;
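For nested elements, one alternative to a recursive copy is to move the existing child nodes into the new element. The following is a minimal sketch of that idea; the helper name renameElement and the sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical helper: swap $old for a new element named $newName,
// carrying over all attributes and moving all child nodes.
function renameElement(DOMElement $old, string $newName): DOMElement
{
    $new = $old->ownerDocument->createElement($newName);
    foreach ($old->attributes as $attr) {
        $new->setAttribute($attr->nodeName, $attr->nodeValue);
    }
    // appendChild() moves a node, so nested elements are kept as they are.
    while ($old->firstChild !== null) {
        $new->appendChild($old->firstChild);
    }
    $old->parentNode->replaceChild($new, $old);
    return $new;
}

$dom = new DOMDocument();
$dom->loadHTML('<h3 class="t" id="a">Hello <em>World</em></h3><h3>Second</h3>');
// Iterate over a copy, because the node list shrinks as elements are replaced.
foreach (iterator_to_array($dom->getElementsByTagName('h3')) as $h3) {
    renameElement($h3, 'h5');
}
$html = $dom->saveHTML();
```

Because the children are moved rather than re-created, no recursion is needed and inline markup such as <em> is preserved.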

How to get text(file names) present between anchor tags in PHP?

I've following strings with me in which the file name is present in between anchor tags:
$test1 = 'test<div class="comment_attach_file">
<a class="comment_attach_file_link" href="http://52.1.47.143/feed/download/year_2015/month_04/file_3b701923a804ed6f28c61c4cdc0ebcb2.txt" >phase2 screen.txt</a><br>
<a class="comment_attach_file_link_dwl" href="http://52.1.47.143/feed/download/year_2015/month_04/file_3b701923a804ed6f28c61c4cdc0ebcb2.txt" >Download</a>
</div>';
$test2 = 'This is a holiday list.<div class="comment_attach_file">
<a class="comment_attach_file_link" href="http://52.1.47.143/feed/download/year_2015/month_04/file_2c96b997f03eefab317811e368731bb6.pdf" >Holiday List-2013.pdf</a><br>
<a class="comment_attach_file_link_dwl" href="http://52.1.47.143/feed/download/year_2015/month_04/file_2c96b997f03eefab317811e368731bb6.pdf" >Download</a>
</div>';
$test3 = '<div class="comment_attach_file">
<a class="comment_attach_file_link" href="http://52.1.47.143/feed/download/year_2015/month_04/file_8479c0b60867fdce35ae94a668dfbba9.docx" >sample2.docx</a><br>
</div>';
From the first string I want text(i.e. file name) "phase2 screen.txt"
From the second string I want text(i.e. file name) "Holiday List-2013.pdf"
From the third string I want text(i.e. file name) "sample2.docx"
How should I do in PHP using $dom = new DOMDocument;?
Please someone help me.
Thanks.
You can use DOMXPath to target the link that contains the text you want, using its class to select it:
$dom = new DOMDocument;
for($i = 1; $i <= 3; $i++) {
    @$dom->loadHTML(${"test{$i}"});
    $xpath = new DOMXpath($dom);
    $file_name = $xpath->evaluate('string(//a[@class="comment_attach_file_link"])');
    echo $file_name , '<br/>';
}
Or, if you don't want to use XPath, you can get the anchor elements and check each one's class; if it's the right one, read its ->nodeValue:
$dom = new DOMDocument;
for($i = 1; $i <= 3; $i++) {
@$dom->loadHTML(${"test{$i}"});
foreach($dom->getElementsByTagName('a') as $anchor) {
if($anchor->getAttribute('class') === 'comment_attach_file_link') {
echo $anchor->nodeValue, '<br/>';
break;
}
}
}
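As a variation on the XPath approach, parser warnings can be collected with libxml_use_internal_errors() instead of being silenced with the @ operator. This is a minimal sketch; the sample string is modelled on the question, with the href shortened to a made-up example.com URL.

```php
<?php
// Sample input modelled on the question; the URL is a shortened placeholder.
$test = 'Some text<div class="comment_attach_file">'
    . '<a class="comment_attach_file_link" href="http://example.com/f.txt">phase2 screen.txt</a><br>'
    . '<a class="comment_attach_file_link_dwl" href="http://example.com/f.txt">Download</a>'
    . '</div>';

// Collect parser warnings internally rather than suppressing them with @.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($test);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
// string(...) returns the text of the first matching node, or '' if none match.
$fileName = $xpath->evaluate('string(//a[@class="comment_attach_file_link"])');
echo $fileName, "\n";
```

Note that [@class="..."] compares the whole attribute value, so an anchor carrying several classes would need contains() or a similar test instead.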
If you want to get the value after the hash mark or anchor as shown in a user's browser: This isn't possible with "standard" HTTP as this value is never sent to the server (hence it won't be available in $_SERVER["REQUEST_URI"] or similar predefined variables). You would need some sort of JavaScript magic on the client side, e.g. to include this value as a POST parameter.
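When the complete URL is already available as a string (for example, taken from an href attribute rather than from the incoming request), the fragment can be read on the server with parse_url(). A small sketch, using a made-up URL:

```php
<?php
// The URL below is a made-up example; it does not come from the request.
$url = 'http://example.com/page.html?x=1#section2';
$fragment = parse_url($url, PHP_URL_FRAGMENT);
echo $fragment, "\n";
```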
With DOM, you can use a function like this to collect the links and adapt it to your needs:
function findAnchors($html)
{
$links = array();
$doc = new DOMDocument();
$doc->loadHTML($html);
$navbars = $doc->getElementsByTagName('div');
foreach ($navbars as $navbar) {
$id = $navbar->getAttribute('id');
if ($id === "anchors") {
$anchors = $navbar->getElementsByTagName('a');
foreach ($anchors as $a) {
$links[] = $doc->saveHTML($a);
}
}
}
return $links;
}

Extracting multiple strong tags using PHP Simple HTML DOM Parser

I have over 500 pages (static) containing content structures this way,
<section>
Some text
<strong>Dynamic Title (Different on each page)</strong>
<strong>Author name (Different on each page)</strong>
<strong>Category</strong>
(<b>Content</b> <b>MORE TEXT HERE</b>)
</section>
And I need to extract the data as formatted below, using PHP Simple HTML DOM Parser
$title = <strong>Dynamic Title (Different on each page)</strong>
$author = <strong>Author name (Different on each page)</strong>
$category = <strong>Category</strong>
$content = (<b>Content</b> <b>MORE TEXT HERE</b>)
I have failed so far and can't get my head around it, appreciate any advice or code snippet to help me going on.
EDIT 1,
I have now solved the part with strong tags using,
$html = file_get_html($url);
$links = array();
foreach($html->find('strong') as $a) {
$content[] = $a->innertext;
}
$title= $content[0];
$author= $content[1];
The only remaining issue: how do I extract the content within the parentheses using a similar method?
OK, first you want to get all of the <section> tags.
Then you want to search through those again for the <strong> tags and <b> tags.
Something like this:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all <section> elements
foreach ($html->find('section') as $section) {
    $strong = array();
    // Collect the <strong> tags inside this <section>
    foreach ($section->find('strong') as $tag) {
        $strong[] = $tag->innertext;
    }
    $title = $strong[0];
    $author = $strong[1];
    $category = $strong[2];
}
To get the parts in parentheses - just get the b tag text and then add the () brackets.
Or if you're asking how to get parts in between the brackets - use explode then remove the closing bracket:
$pieces = explode("(", $title);
$different_on_each_page = str_replace(")","",$pieces[1]);
$html_code = 'html';
$dom = new \DOMDocument();
$dom->loadHTML($html_code);
$xpath = new \DOMXPath($dom);
$nodelist = $xpath->query("//strong");
for($i = 0; $i < $nodelist->length; $i++){
$nodelist->item($i)->nodeValue; //gives you the text inside
}
My final code that works now looks like this.
$html = file_get_html($url);
$links = array();
foreach($html->find('strong') as $a) {
$content[] = $a->innertext;
}
$title= $content[0];
$author= $content[1];
$category = $content[2];
$details = file_get_html($url)->plaintext;
$input = $details;
preg_match_all("/\(.*?\)/", $input, $matches);
print_r($matches[0]);
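The same extraction can also be sketched with PHP's built-in DOMDocument instead of Simple HTML DOM. The sample markup below is a shortened stand-in for one of the pages, and the variable names are assumptions; it reads the three <strong> values and then takes the text between the first pair of parentheses in the section.

```php
<?php
// Shortened stand-in for one page; real pages will contain more markup.
$page = '<section>Some text <strong>Title</strong> <strong>Author</strong> '
    . '<strong>Category</strong> (<b>Content</b> <b>MORE TEXT HERE</b>)</section>';

libxml_use_internal_errors(true);   // <section> may trigger a parser warning
$dom = new DOMDocument();
$dom->loadHTML($page);
libxml_clear_errors();

$strong = [];
foreach ($dom->getElementsByTagName('strong') as $node) {
    $strong[] = $node->textContent;
}
list($title, $author, $category) = $strong;

// textContent drops the tags, so the parentheses can be matched on plain text.
$section = $dom->getElementsByTagName('section')->item(0);
$content = null;
if (preg_match('/\((.*?)\)/s', $section->textContent, $m)) {
    $content = $m[1];
}
```

If a title itself contains parentheses, the pattern would match those first, so on real pages it may be safer to search only the text that follows the last <strong> element.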

adding div to html code got with File_get_contents

I am using file_get_contents to get the HTML source of a remote page; the code I get back consists of many tables.
The code has many <td> elements like the one below:
<td colspan="2">
<b>Video </b>
<span class="section">Sports</span><b>: </b>
<span id="category466" class="category">Motor Sports</span>
</td>
I want to add the div below just before the closing </td>:
<div style="float: right; padding-right: 2px;"><a class="open_event_tab" target="_blank" href="page123.html" >open event</a></div>
My code now looks like this:
<?php
//Get the url
$url = "http://remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->loadHTML($html); // load HTML you can add $html
$elements = $doc->getElementsByTagName('td');
?>
I am stuck at getElementsByTagName; I don't know what to do next to add the div as described above.
Read the documentation!
The DOMDocument::getElementsByTagName() method returns an instance of DOMNodeList.
DOMNodeList implements the Traversable interface, which means that it can be used in a foreach loop. You can also loop over it using the DOMNodeList::$length property and the DOMNodeList::item($index) method.
Looping over the DOMNodeList you will be working with instances of DOMNode. The DOMNode class has a method called DOMNode::appendChild(), which, funnily enough, takes a DOMNode as its argument.
Now you just have to create the DOMNode and append it. It may not be intuitive to work with the DOM, but at least it is simple once you get acquainted with the documentation.
Put this page under your pillow.
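The steps described above can be sketched as follows. This is a minimal example under some assumptions: the HTML is an inline sample rather than the remote page, and the target cells are selected with an XPath expression that looks for a direct <span class="category"> child.

```php
<?php
// Inline sample standing in for the remote page.
$html = '<table><tr><td colspan="2"><b>Video </b>'
    . '<span class="section">Sports</span><b>: </b>'
    . '<span id="category466" class="category">Motor Sports</span></td>'
    . '<td>other</td></tr></table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Only cells that directly contain a span with class "category".
foreach ($xpath->query('//td[span[@class="category"]]') as $td) {
    $div = $doc->createElement('div');
    $div->setAttribute('style', 'float: right; padding-right: 2px;');
    $a = $doc->createElement('a', 'open event');
    $a->setAttribute('class', 'open_event_tab');
    $a->setAttribute('target', '_blank');
    $a->setAttribute('href', 'page123.html');
    $div->appendChild($a);
    // appendChild() puts the div last, i.e. just before the closing </td>.
    $td->appendChild($div);
}
$result = $doc->saveHTML();
```

Since loadHTML() is used, the input does not need to be XHTML compliant.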
This code now works with the updated HTML (below the code). It inserts the div at the places where you want them to be.
<?php
//Get the url
$url = "http://remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument('1.0'); // create DOMDocument
libxml_use_internal_errors(false);
$doc->loadXML($html); // load HTML you can add $html
$domxpath = new DOMXPath($doc);
$filtered = $domxpath->query("//td[@colspan='2']");
$nodeList = $doc->getElementsByTagName('td');
$length = $filtered->length;
$nodes = array();
for ($i = $length - 1; $i >= 0; --$i) {
$node = $filtered->item($i);
$lastChildHTML = $doc->saveXML($node->lastChild);
if (strpos($lastChildHTML, 'class="category"') !== false) {
$nodes[] = $node;
}
}
$allTDNodes = $doc->getElementsByTagName('td');
$tdNodes = array();
foreach ($allTDNodes as $tdNode) {
if (in_array($tdNode, $nodes, true)) {
$tdNodes[] = $tdNode;
}
}
$tdNodes = array_reverse($tdNodes);
$length = count($nodes, 0);
for ($i = 0; $i < $length; $i++) {
$replacement = $doc->createDocumentFragment();
$nodeContent = $doc->saveXML($tdNodes[$i]);
$replacement->appendXML($nodeContent);
$divNode = createDivNode($doc);
$replacement->firstChild->appendChild($divNode);
$tdNodes[$i]->appendChild($divNode);
}
echo $doc->saveXML();
function createDivNode($doc) {
$divNode = $doc->createElement('div');
$divNode->setAttribute('style', 'float: right; padding-right: 2px;');
$aNode = $doc->createElement('a', 'openEvent');
$aNode->setAttribute('class', 'open_event_tab');
$aNode->setAttribute('target', '_blank');
$aNode->setAttribute('href', 'page123.html');
$divNode->appendChild($aNode);
return $divNode;
}
I have updated the used HTML to make it XHTML compliant and fixed a style issue (the relevant areas had css property height: 0px attached to them).

Replace Tag in HTML with DOMDocument

I'm trying to edit html tags with DOMDocument::loadHTML in php. The html data is a part of html and not the whole page. I followed what this page (PHP - DOMDocument - need to change/replace an existing HTML tag w/ a new one) says.
This should convert pre tags into div tags but it gives "Fatal error: Uncaught exception 'DOMException' with message 'Not Found Error'."
<?php
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
foreach( $dom->getElementsByTagName("pre") as $nodePre ) {
$nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
$dom->replaceChild($nodeDiv, $nodePre);
}
echo $dom->saveHTML();
?>
[Edit]
While I'm trying to iterate the node object backwards, I get this error, 'Notice: Trying to get property of non-object...'
<?php
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
$domPre = $dom->getElementsByTagName('pre');
$length = $domPre->length;
For ($i = $length; $i > -1 ; $i--) {
$nodePre = $domPre->item($i);
echo $nodePre->nodeValue . '<br />';
// $nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
// $dom->replaceChild($nodeDiv, $nodePre);
}
// echo $dom->saveHTML();
?>
[Edit]
Okay, solved. Since the code in the answer has an error, I'm posting the solution here. Thanks, all.
Solution:
<?php
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
$domPre = $dom->getElementsByTagName('pre');
$length = $domPre->length;
For ($i = $length - 1; $i > -1 ; $i--) {
$nodePre = $domPre->item($i);
$nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
$nodePre->parentNode->replaceChild($nodeDiv, $nodePre);
}
echo $dom->saveHTML();
?>
The problem is the call to replaceChild(). Rather than
$dom->replaceChild($nodeDiv, $nodePre);
use
$nodePre->parentNode->replaceChild($nodeDiv, $nodePre);
update
Here is working code. getElementsByTagName() returns a live list, so replacing nodes while iterating forward makes the loop skip elements (more info here: http://php.net/manual/en/domnode.replacechild.php); you'll have to use a reverse loop to replace the elements.
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
$elements = $dom->getElementsByTagName("pre");
for ($i = $elements->length - 1; $i >= 0; $i --) {
$nodePre = $elements->item($i);
$nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
$nodePre->parentNode->replaceChild($nodeDiv, $nodePre);
}
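The reason a forward loop misbehaves is that the list returned by getElementsByTagName() is live. The sketch below illustrates this on the same three <pre> tags and shows one alternative to the reverse loop: always replacing item(0) until the list is empty. It also uses createTextNode(), since the value passed as the second argument of createElement() is not escaped.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<pre>hi</pre><pre>hello</pre><pre>bye</pre>');

$pres = $dom->getElementsByTagName('pre');
$before = $pres->length;

// Replacing one <pre> shrinks the live list immediately.
$first = $pres->item(0);
$first->parentNode->replaceChild($dom->createElement('div', 'hi'), $first);
$after = $pres->length;

// Draining the list from the front is an alternative to the reverse loop.
while ($pres->length > 0) {
    $pre = $pres->item(0);
    $div = $dom->createElement('div');
    // createTextNode() escapes special characters such as & for us.
    $div->appendChild($dom->createTextNode($pre->textContent));
    $pre->parentNode->replaceChild($div, $pre);
}
$html = $dom->saveHTML();
```

Here $before is 3 and $after is 2, which is why an index-based forward loop ends up skipping every other element.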
Another way, with paquettg/php-html-parser (I didn't find a way to change the tag name, so I had to use a hack that re-binds $this):
use PHPHtmlParser\Dom;
use PHPHtmlParser\Dom\HtmlNode;
$dom = new Dom;
$dom->load($text);
/** @var HtmlNode[] $tags */
foreach($dom->find('pre') as $tag) {
    // Bind the closure to the tag object so it can overwrite the name.
    $changeTag = function() {
        $this->name = 'div';
    };
    $changeTag->call($tag->tag);
}
echo (string)$dom;
