I'm being played by PHP and DOMDocument. Basically, I have some HTML saved in a DB, with anchor tags pointing to different URLs. I want to force the href of any anchor tag that is not in an allowed-URL list to be replaced with #.
E.g.
$allowed_url_basenames = array('viewprofile.php','viewalbum.php');
Sample content from the DB:
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top">
Edrine Kasasa has
</td>
<td valign="top">
invited 10 friend(s) to veepiz using the Invite Tool
</td>
</tr>
</tbody>
I want a PHP function that will leave the first anchor tag's href intact and change the second to href='#'.
This should be pretty straight-forward.
First, let's grab all of the anchor tags. $doc is the document you've created with your HTML as the source.
$anchors = $doc->getElementsByTagName('a');
Now we'll go through them one-by-one and inspect the href attribute. Let's pretend that the function contains_bad_url returns true when the passed string is on your blacklist. You'll need to write that yourself.
foreach($anchors as $anchor) {
if($anchor->hasAttribute('href') && contains_bad_url($anchor->getAttribute('href'))) {
$anchor->setAttribute('href', '#');
}
}
Tada. That should be all there is to it. You should be able to get the results back as an XML string and do whatever you need to do with the rest.
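As a minimal sketch of the helper the answer leaves to the reader: the version below takes the allowed list as a second argument (an assumption, the answer's contains_bad_url takes only the string) and compares just the script name, using the $allowed_url_basenames array from the question. clean_anchor_hrefs is a hypothetical wrapper name. Because only the basename is checked, a link to another host that happens to end in viewprofile.php would still pass.

```php
<?php
// Hypothetical helper: true when the href's script name is NOT in the allowed list.
function contains_bad_url(string $href, array $allowed_basenames): bool
{
    // parse_url() separates the path from the query string and fragment.
    $path = parse_url($href, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return true; // no usable path: treat as not allowed
    }
    return !in_array(basename($path), $allowed_basenames, true);
}

// Hypothetical wrapper: load the HTML, rewrite disallowed hrefs, return the markup.
function clean_anchor_hrefs(string $html, array $allowed_basenames): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')
            && contains_bad_url($anchor->getAttribute('href'), $allowed_basenames)) {
            $anchor->setAttribute('href', '#');
        }
    }
    return $doc->saveHTML();
}

$allowed_url_basenames = array('viewprofile.php', 'viewalbum.php');
$html = '<p><a href="viewprofile.php?id=5">profile</a> '
      . '<a href="http://example.com/evil.php">other</a></p>';
echo clean_anchor_hrefs($html, $allowed_url_basenames);
```

Note that saveHTML() returns a whole document, so the result is wrapped in the html and body tags that loadHTML() added.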
Thanks Charles, I came up with this:
function contains_bad_urls($href,$allowed_urls)
{
$x=pathinfo($href);
$bn=$x['filename'];
if (in_array($bn,$allowed_urls,true))
{
return false;
}
return true;
}
function CleanHtmlUrls($str)
{
$allow_urls = array('viewprofile','viewwall');//change these to whatever filename
$doc = new DOMDocument();
$doc->loadHTML($str);
$doc->formatOutput = true;
$anchors = $doc->getElementsByTagName('a');
foreach($anchors as $anchor)
{
$anchor->setAttribute('onclick','#');
if(contains_bad_urls($anchor->getAttribute('href'),$allow_urls))
{
$anchor->setAttribute('href', '#');
}
}
$ret=$doc->saveHTML();
return $ret;
}
Related
I am trying to work with DOMDocument but I am encountering some problems. I have a string like this:
Some Content to keep
<span class="ice-cts-1 ice-del" data-changedata="" data-cid="5" data-time="1414514760583" data-userid="1" data-username="Site Administrator" undefined="Site Administrator">
This content should remain, but span around it should be stripped
</span>
Keep this content too
<span>
<span class="ice-cts-1 ice-ins" data-changedata="" data-cid="2" data-time="1414512278297" data-userid="1" data-username="Site Administrator" undefined="Site Administrator">
This whole node should be deleted
</span>
</span>
What I want to do is, if the span has a class like ice-del keep the inner content but remove the span tags. If it has ice-ins, remove the whole node.
If it is just an empty span <span></span> remove it as well. This is the code I have:
//this gets the above mentioned string
$getVal = $array['body'][0][$a];
$dom = new DOMDocument;
$dom->loadHTML($getVal );
$xPath = new DOMXPath($dom);
$delNodes = $xPath->query('//span[@class="ice-cts-1 ice-del"]');
$insNodes = $xPath->query('//span[@class="ice-cts-1 ice-ins"]');
foreach($insNodes as $span){
//reject these changes, so remove whole node
$span->parentNode->removeChild($span);
}
foreach($delNodes as $span){
//accept these changes, so just strip out the tags but keep the content
}
$newString = $dom->saveHTML();
So, my code works to delete the entire span node, but how do I take a node and strip out its tags but keep its content?
Also, how would I delete just an empty span? I'm sure I could do this using regex or replace, but I would rather do it using the DOM.
Thanks
No, I wouldn't recommend regex. I strongly recommend building on what you have right now with this beautiful HTML parser. You could use ->replaceChild in this case:
$dom = new DOMDocument;
$dom->loadHTML($getVal);
$xPath = new DOMXPath($dom);
$spans = $xPath->query('//span');
foreach ($spans as $span) {
$class = $xPath->evaluate('string(./@class)', $span);
if(strpos($class, 'ice-ins') !== false || $class == '') {
$span->parentNode->removeChild($span);
} elseif(strpos($class, 'ice-del') !== false) {
$span->parentNode->replaceChild(new DOMText($span->nodeValue), $span);
}
}
$newString = $dom->saveHTML();
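One thing to be aware of: $span->nodeValue is plain text, so any markup nested inside an ice-del span is flattened. A minimal sketch of an alternative is below; it moves the span's child nodes in front of it and then removes the empty wrapper. unwrap_node and apply_tracked_changes are hypothetical names, and treating a span as "empty" when it has no element children and only whitespace text is an assumption about what the question wants. The spans are processed in reverse document order so that a wrapper left empty by a removed ice-ins child is caught too.

```php
<?php
// Hypothetical helper: replace an element with its own child nodes.
function unwrap_node(DOMNode $node): void
{
    $parent = $node->parentNode;
    if ($parent === null) {
        return;
    }
    // insertBefore() moves an existing node, so this loop drains the wrapper.
    while ($node->firstChild !== null) {
        $parent->insertBefore($node->firstChild, $node);
    }
    $parent->removeChild($node);
}

// Hypothetical wrapper: drop ice-ins spans, unwrap ice-del spans, remove empty spans.
function apply_tracked_changes(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xPath = new DOMXPath($dom);

    // Reverse document order: inner spans are handled before their wrappers.
    $spans = array_reverse(iterator_to_array($xPath->query('//span')));
    foreach ($spans as $span) {
        $class = $span->getAttribute('class');
        if (strpos($class, 'ice-ins') !== false) {
            $span->parentNode->removeChild($span);
        } elseif (strpos($class, 'ice-del') !== false) {
            unwrap_node($span);
        } elseif ($span->getElementsByTagName('*')->length === 0
            && trim($span->textContent) === '') {
            $span->parentNode->removeChild($span);
        }
    }
    return $dom->saveHTML();
}

echo apply_tracked_changes(
    '<div>Keep <span class="ice-cts-1 ice-del">this <b>bold</b> text</span> and '
    . '<span><span class="ice-cts-1 ice-ins">drop me</span></span></div>'
);
```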
A more generic solution, to delete any HTML tag from a DOM tree, is this:
$dom = new DOMDocument;
$dom->loadHTML($getVal);
$xPath = new DOMXPath($dom);
$tagName = $xPath->query('//table'); //use what you want like div, span etc.
foreach ($tagName as $t) {
$t->parentNode->removeChild($t);
}
$newString = $dom->saveHTML();
Example html:
<html>
<head></head>
<body>
<table>
<tr><td>Hello world</td></tr>
</table>
</body>
</html>
Output after processing:
<html>
<head></head>
<body></body>
</html>
I'm working with a row of html table cells that look like:
<td align="left" class="info">my message goes here</td>
<td align="left" class="info">my message goes here</td>
I would like to modify these cells by inserting a clickable anchor tag into each.
I have written the following function:
public function modifyAttribute2($domDoc) {
//We use xpath to search ChildElement:
$domXPath = new DOMXPath($domDoc);
$items = $domXPath->query("//td[@class='moreinfo']");
foreach ($items as $item) {
echo $item->nodeValue . "\n";
$item->nodeValue = "hi";
$doc = new DOMDocument();
$valid_elem = $doc->createElement('a');
$valid_attr = $doc->createAttribute('href');
$valid_attr->value = base_url();
$valid_elem->appendChild($valid_attr);
// We insert the new element as root (child of the document)
$xmlcontent = $domDoc->importNode($valid_elem, true);
$item->appendChild($xmlcontent);
$domDoc->saveXML($item); }
echo $domDoc->saveXML();
exit;
}
addendum:
I'm trying to create and import a new domdocument node into my original document $domDoc as suggested but I do not see any sign of the imported node after saving and inspecting the html. What am I doing wrong?
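A minimal sketch of one way to approach this, not a confirmed diagnosis: the anchor is created on the same document that owns the cells, so no second DOMDocument and no importNode() call are needed, and the function returns how many cells matched. That count may be worth checking, since the sample cells above carry class="info" while the query looks for 'moreinfo'; if nothing matches, the loop body never runs. link_cells is a hypothetical name, and the URL is passed in as a parameter instead of calling base_url().

```php
<?php
// Hypothetical helper: wrap the text of every matching <td> in an <a> element.
function link_cells(DOMDocument $domDoc, string $class, string $url): int
{
    $xpath = new DOMXPath($domDoc);
    // $class is assumed to be a trusted literal (no quotes inside it).
    $cells = $xpath->query("//td[@class='" . $class . "']");
    foreach ($cells as $cell) {
        $text = $cell->textContent;
        // Create the anchor on the SAME document, so nothing has to be imported.
        $a = $domDoc->createElement('a');
        $a->setAttribute('href', $url);
        $a->appendChild($domDoc->createTextNode($text));
        // Empty the cell, then put the anchor in its place.
        while ($cell->firstChild !== null) {
            $cell->removeChild($cell->firstChild);
        }
        $cell->appendChild($a);
    }
    return $cells->length; // 0 means the XPath expression matched nothing
}

$domDoc = new DOMDocument();
$domDoc->loadHTML('<table><tr><td align="left" class="info">my message goes here</td></tr></table>');
$count = link_cells($domDoc, 'info', 'http://example.com/');
echo $count, "\n", $domDoc->saveHTML();
```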
<!-- This Div repeated in HTML with different properties value -->
<div style="position:absolute; overflow:hidden; left:220px; top:785px; width:347px; height:18px; z-index:36">
<!-- Only Unique Thing is This in few pages -->
<a href="http://link.domain.com/?id=123" target="_parent">
<!-- OR in some pages Only Unique Thing is This, ending with mp3 extension -->
<a href="http://domain.com/song-title.mp3" target="_parent">
<!-- This Div also repeated multiple in HTML -->
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
</FONT>
</a>
</DIV>
We have very dirty HTML markup, generated by some program or application. We want to extract the URLs from this code, as well as the text.
The href attributes use two types of URL. URL 1 pattern: 'http://link.domain.com/?id=123'. URL 2 pattern: 'http://domain.com/song-title.mp3'.
The first type has a specific pattern, but the second has no pattern other than ending with the '.mp3' extension.
Is there some function to extract the URL and the link text from this markup?
Note: without DOM, is there any way to match the href and the text in between with a regular expression, e.g. preg_match?
Make use of DOMDocument Class and proceed like this.
$dom = new DOMDocument;
$dom->loadHTML($html); //<------- Pass your HTML source here
foreach ($dom->getElementsByTagName('a') as $tag) {
echo $tag->getAttribute('href');
echo $tag->nodeValue; // to get the content in between of tags...
}
Expanding on @Shankar Damodaran's answer:
$html = file_get_contents('source.htm');
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
if (strstr($tag->getAttribute('href'),'?id=') !== false) {
echo $tag->getAttribute('href') . "<br>\n";
}
}
Then do the same for the MP3:
$html = file_get_contents('source.htm');
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
if (strstr($tag->getAttribute('href'),'.mp3') !== false) {
echo $tag->getAttribute('href') . "<br>\n";
}
}
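The two loops above can also be folded into a single pass that returns both the URL and the link text. The following is a rough sketch under a few assumptions: extract_links is a hypothetical name, the ".mp3" test looks only at the path part of the URL, and whitespace inside the link text is collapsed because the nested FONT and DIV tags leave line breaks behind.

```php
<?php
// Hypothetical helper: collect href + text for "?id=" and ".mp3" links in one pass.
function extract_links(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // the markup is messy; keep warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $tag) {
        $href = $tag->getAttribute('href');
        $path = (string) parse_url($href, PHP_URL_PATH);
        $isIdLink  = strpos($href, '?id=') !== false;
        $isMp3Link = strtolower(substr($path, -4)) === '.mp3';
        if ($isIdLink || $isMp3Link) {
            // Collapse runs of whitespace left over from the nested tags.
            $text = trim(preg_replace('/\s+/', ' ', $tag->textContent));
            $links[] = array('href' => $href, 'text' => $text);
        }
    }
    return $links;
}

$html = '<div><a href="http://link.domain.com/?id=123" target="_parent">'
      . '<font><b>Harjaiyaan</b> - Nandini Srikar</font></a></div>';
print_r(extract_links($html));
```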
When a link in A.html is clicked :
<tr>
<td class="book"><a class="booklink" href="../collection?file=book1.pdf">
Good Read
</a>
-Blah blah blah
</td>
</tr>
?file=book1.pdf is passed to B.html:
<?php
$src = $_GET['file'];
?>
<iframe src="<?php echo $src; ?>" >
</iframe>
QUESTION: How can I retrieve the text "Good Read-Blah blah blah" from A.html and put it into the meta description in B.html using Simple HTML DOM? (Note that there are thousands of entries listed in the table in A.html.)
Thank you.
Use DOM to load your HTML document and XPath to search it.
// note: if the HTML to parse has bad syntax, use: libxml_use_internal_errors(true);
$doc = new DOMDocument;
// loadHTML() returns false on failure; $doc itself is always an object
if (!$doc->loadHTML(file_get_contents('A.html'))) {
throw new RuntimeException('Could not load HTML');
}
$xpath = new DOMXPath($doc);
$xpathResult = $xpath->query("//a[@href = '../collection?file={$_GET['file']}']/..");
if ($xpathResult === false) {
throw new LogicException('Something went wrong querying the document!');
}
foreach ($xpathResult as $domNode) {
echo 'Link text: ' . htmlentities($domNode->textContent) . PHP_EOL;
}
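Since $_GET['file'] comes straight from the request, a quote character in it would break the XPath string above. A minimal sketch of a variant that compares the href in PHP instead is shown here; find_description is a hypothetical name, and the '../collection?file=' prefix is taken from the sample link in the question.

```php
<?php
// Hypothetical helper: return the text of the table cell whose link points at $file.
function find_description(string $html, string $file): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $wanted = '../collection?file=' . $file;
    foreach ($doc->getElementsByTagName('a') as $a) {
        // Compare in PHP rather than interpolating user input into an XPath string.
        if ($a->getAttribute('href') === $wanted && $a->parentNode !== null) {
            return trim(preg_replace('/\s+/', ' ', $a->parentNode->textContent));
        }
    }
    return null; // no matching link found
}

$html = '<table><tr><td class="book"><a class="booklink" href="../collection?file=book1.pdf">
Good Read
</a>
-Blah blah blah
</td></tr></table>';
$description = find_description($html, 'book1.pdf');
// Escape before placing the text inside a meta tag's content attribute.
echo htmlspecialchars((string) $description, ENT_QUOTES, 'UTF-8'), "\n";
```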
Tasks:
All occurrences of /desktop-img/ in image paths must be replaced with the value of the constant DEVICE, which I have already predefined using some other function, BUT the code wrapped within the <pre> and <code> tags must not be changed.
This task needs to be accomplished without the use of any JavaScript.
Sample Mark-up:
<div class="wrapper">
<img src="img/demo/desktop-img/sample1.jpg" width="200" height="200" alt="image">
<img src="img/demo/desktop-img/sample2.jpg" width="200" height="200" alt="image">
<pre>
<img src="img/demo/desktop-img/sample1.jpg" width="200" height="200" alt="image">
<img src="img/demo/desktop-img/sample2.jpg" width="200" height="200" alt="image">
</pre>
<code>
<img src="img/demo/desktop-img/sample1.jpg" width="200" height="200" alt="image">
<img src="img/demo/desktop-img/sample2.jpg" width="200" height="200" alt="image">
</code>
</div>
I tried the following but it is replacing the occurrences within the <pre> and <code> tags too and that is exactly what should not happen. I am using PHP as my programming language.
$content = str_replace('/desktop-img/', '/' . DEVICE . '-img/', $content);
I even tried the following but no luck
<?php
function changeImagePaths ($content) { // content here comes as a string via some other function.
$dom = new DOMDocument;
$dom->loadHTML($content);
$nodes = $dom->getElementsByTagName('[self::img][not(ancestor::pre) and not(ancestor::code)]');
foreach( $nodes as $node ) {
// What is the correct way to retrieve all image-paths
// that are not wrapped within <pre> or <code> tags and how to use them in the code?
$node->nodeValue = str_replace('desktop-img', DEVICE, $node->nodeValue);
}
$content = $dom->saveHTML();
return $content;
}
?>
Challenge:
I think it should be possible using the DOM method but I am not able to figure out the correct syntax.
I am very new to programming, hence still in the learning process. Please be gentle and illustrate your answer with code examples for easy understanding.
My Questions:
What should be the correct syntax for DOM method?
Is using DOM the correct way?
Are there going to be any challenges / performance hits, using the DOM method?
Is there any other better way of doing it without using JavaScript?
This is a quick piece of work, so it may not look nice, but hopefully you get the idea.
function walkDOM($node)
{
if($node->nodeName=="pre" || $node->nodeName=="code")
{
return;
}
elseif($node->nodeName=="img")
{
// skip <img> tags without a src attribute instead of dereferencing null
if($node->hasAttribute("src"))
{
$node->setAttribute("src",str_replace('desktop-img','mobile-img',$node->getAttribute("src")));
}
}
elseif($node->hasChildNodes())
{
foreach($node->childNodes as $child)
{
walkDOM($child);
}
}
}
function changeImagePath($html)
{
$dom=new DOMDocument;
$dom->preserveWhiteSpace=true;
$dom->loadHTML($html);
$root=$dom->documentElement;
walkDOM($root);
$dom->formatOutput=false;
return $dom->saveHTML($root);
}
The idea is to walk the DOM tree recursively, skip every <pre> and <code>, and change every <img> encountered.
This is pretty straightforward, but you may notice that because the input is treated as a full HTML document, DOM automatically adds some tags (such as <html> and <body>) to complete it, and formats the output in a (IMO) rather strange manner.
Xpath is your friend:
$xpath = new DOMXpath($dom);
foreach($xpath->query('//img[not(ancestor::pre) and not(ancestor::code)]') as $img){
$img->setAttribute('src', 'foo');
}
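Putting that XPath expression together with the replacement from the question might look like the sketch below. The 'mobile' value given to DEVICE is only a placeholder for whatever the asker's own code defines, and the function keeps the changeImagePaths name from the question.

```php
<?php
// DEVICE is defined elsewhere in the question; this value is only a placeholder.
if (!defined('DEVICE')) {
    define('DEVICE', 'mobile');
}

function changeImagePaths(string $content): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup quietly
    $dom->loadHTML($content);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Select only <img> elements with no <pre> or <code> ancestor.
    $query = '//img[not(ancestor::pre) and not(ancestor::code)]';
    foreach ($xpath->query($query) as $img) {
        $src = $img->getAttribute('src');
        $img->setAttribute('src', str_replace('/desktop-img/', '/' . DEVICE . '-img/', $src));
    }
    return $dom->saveHTML();
}

echo changeImagePaths(
    '<div class="wrapper"><img src="img/demo/desktop-img/sample1.jpg" alt="image">'
    . '<pre><img src="img/demo/desktop-img/sample2.jpg" alt="image"></pre></div>'
);
```

The query result is a static node list, and only attributes are changed, so iterating and modifying in the same loop is safe here.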
$path1='C:/ff/ss.html';
$file=file_get_contents($path1);
$dom = new DOMDocument;
@$dom->loadHTML($file);
$links = $dom->getElementsByTagName('img');
//extract img src path from the html page
foreach ($links as $link) {
$re= $link->getAttribute('src');
$a[]=$re;
}
$oldpth=explode('/',$path1);
$c=count($oldpth)-1;
$fname=$oldpth[$c];
$pth=array_slice($oldpth,0,$c);
$cpth=implode('/',$pth);
foreach($a as $v) {
if(is_file ($cpth.'/'.$v)) {
$c=explode('/',$v);
$c[0]="xyz007";
$f=implode('/',$c);
$file=str_replace ($v,$f,$file);
}
}
$path2='D:/mail/newpath/';
$wnew=fopen($path2.$fname,'w+');
fwrite($wnew,$file);