How to remove consecutive links from a webpage?

I wish to remove consecutive links from a webpage.
Here is a sample:
<div style="font-family: Arial;">
<br>
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Google is a search
engine
In the above HTML, I want to remove the first two <a> tags but not the third one (my script should only remove consecutive tags).

Don't use a regex for this. Regexes are extremely powerful, but they are not suited to finding this kind of "consecutive" tag.
I suggest you use DOM, which lets you browse the HTML as a tree.
Here is an example (not tested):
$doc = new DOMDocument();
// avoid blank nodes when parsing
$doc->preserveWhiteSpace = false;
// reads HTML from a string; loadHTMLFile() also exists
$doc->loadHTML($html);
// find all "a" tags
$links = $doc->getElementsByTagName('a');
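// grab both links first: the node list is live and re-indexes after a removal
$first = $links->item(0);
$second = $links->item(1);
// remove the first link
$first->parentNode->removeChild($first);
// test the node following the second link
$next = $second->nextSibling;
if ($next !== null && $next->nodeType != XML_TEXT_NODE) {
// delete this node ...
}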
// print the modified HTML
// See DOMDocument's attributes if you want to format the output
echo $doc->saveHTML();
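The idea above can be sketched as a complete function. This is only a sketch under stated assumptions: the original sample lost its markup, so "consecutive" is taken here to mean that two <a> elements are siblings separated by nothing but whitespace. The names removeConsecutiveLinks(), nearestSibling() and isLink() are hypothetical helpers, not part of DOMDocument.

```php
<?php
// Hypothetical helper: nearest sibling in the given direction
// ('previousSibling' or 'nextSibling'), skipping whitespace-only text nodes.
function nearestSibling(DOMNode $node, string $direction): ?DOMNode
{
    for ($n = $node->$direction; $n !== null; $n = $n->$direction) {
        $isBlankText = $n->nodeType === XML_TEXT_NODE && trim($n->nodeValue) === '';
        if (!$isBlankText) {
            return $n;
        }
    }
    return null;
}

// Hypothetical helper: true when the node is an <a> element.
function isLink(?DOMNode $node): bool
{
    return $node !== null
        && $node->nodeType === XML_ELEMENT_NODE
        && $node->nodeName === 'a';
}

// Assumption: a link is "consecutive" when its nearest non-blank sibling,
// on either side, is another link.
function removeConsecutiveLinks(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Decide first, delete afterwards: removing nodes while walking
    // the live node list would shift its indexes.
    $toRemove = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        if (isLink(nearestSibling($link, 'previousSibling'))
            || isLink(nearestSibling($link, 'nextSibling'))) {
            $toRemove[] = $link;
        }
    }
    foreach ($toRemove as $link) {
        $link->parentNode->removeChild($link);
    }
    return $doc->saveHTML();
}

// Made-up sample: two adjacent links, then a link inside running text.
$sample = '<div><a href="#a">AAAA</a> <a href="#b">BBBB</a> '
        . 'Google is a <a href="#c">search engine</a></div>';
echo removeConsecutiveLinks($sample);
```

If the links on the real page are separated by <br> elements rather than whitespace, nearestSibling() would need to skip those as well.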

Related

Prepend HTML text using DOMDocument without parent container

Let's say I have <p>Text</p>
I'd like to create a function using DOMDocument to be able to insert text, eg:
insertText('<p>Text</p>', '<strong>1.</strong> ')
So that the result was <p><strong>1.</strong> Text</p>
I'm already accessing this paragraph tag, so I think I'm almost there, I just cannot figure out how to append plain text that can be read as HTML
$dom = new DOMDocument();
$question_paragraphs = array();
$dom->loadHTML($str);
$par = $dom->getElementsByTagName('p');
if ($par->length == 1) {
$par->item(0)->setAttribute("class", "first last");
###
### How do I do this here?
###
}
Is it possible to inject text this way?

You can use the insertBefore method (see the official documentation) as follows:
First, create your strong element.
Then insert this node before the text node.
$strong = $dom->createElement('strong', '1.');
$par->item(0)->insertBefore($strong, $par->item(0)->firstChild);
Please note that the second parameter of insertBefore is the reference node: the new node is inserted immediately before that child. Here you can use firstChild, because your <p> contains only the text.
This will finally output
<p class="first last"><strong>1.</strong>Text</p>
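To prepend a whole HTML snippet (including the trailing space) rather than a single element, one option is a document fragment. The sketch below is one possible shape for the insertText() function the question asks for. It assumes the prefix is well-formed XML, since DOMDocumentFragment::appendXML() parses XML rather than HTML, and that the input holds at least one <p>.

```php
<?php
// Hypothetical helper matching the signature sketched in the question:
// prepend an HTML fragment to the first <p> found in $html.
function insertText(string $html, string $prefix): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $p = $dom->getElementsByTagName('p')->item(0);
    if ($p === null) {
        return $html; // nothing to prepend to
    }

    // appendXML() turns the markup string into real nodes;
    // it returns false when the string is not well-formed XML.
    $fragment = $dom->createDocumentFragment();
    if (!$fragment->appendXML($prefix)) {
        return $html;
    }

    // Insert the fragment's nodes before the paragraph's first child.
    $p->insertBefore($fragment, $p->firstChild);

    // Passing a node to saveHTML() serializes only that element,
    // without the doctype, <html> and <body> that loadHTML() adds.
    return $dom->saveHTML($p);
}

$result = insertText('<p>Text</p>', '<strong>1.</strong> ');
echo $result, PHP_EOL;
```

The fragment approach keeps the call site close to what the question sketched, at the cost of requiring XML-clean input (for example, <br/> rather than <br>).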

Changing a tag <a> to <div> with DOMDocument on WordPress

I'm a beginner in PHP and I would like to set up several functions to replace specific code bits on WordPress (including plugin elements that I can't edit directly).
Below is an example (first line: initial result, second line: desired result):
<span class="fn" itemprop="name">Gael Beyries</span>
<div class="vcard author"><span class="fn" itemprop="name">Gael Beyries</span></div>
PS: I came across this topic: Parsing WordPress post content, but the example is too complicated for what I want to do. Could you give me example code that solves this problem, so I can try to adapt it to other HTML elements?

Although I'm not sure how this fits into WP, I have basically taken the code from the linked answer and adapted it to your requirements.
I've assumed you want to find the <a> tags with class="vcard author" and this is the basis of the XPath expression. The code in the foreach() loop just copies the data into a new node and replaces the old one...
function replaceAWithDiv($content){
$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$aTags = $xpath->query('//a[@class="vcard author"]');
foreach($aTags as $a){
// Create replacement element
$div = $dom->createElement("div");
$div->setAttribute("class", "vcard author");
// Move contents from a tag to div; childNodes is live, so take
// firstChild repeatedly instead of iterating with foreach
while ($a->firstChild !== null) {
$div->appendChild($a->firstChild);
}
// Replace a tag with div
$a->parentNode->replaceChild($div, $a);
}
return $dom->saveHTML();
}
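Since the question asks for something that can be adapted to other elements, here is a more general sketch of the same technique. The function name replaceTag() and its parameters are hypothetical. It assumes you only want to carry over the class attribute, and it returns just the contents of <body> rather than the full document that loadHTML() builds.

```php
<?php
// Hypothetical helper: replace every element matched by $query with a new
// element named $newTag, keeping the class attribute and all child nodes.
function replaceTag(string $content, string $query, string $newTag): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect real-world markup
    $dom->loadHTML($content);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $matches = $xpath->query($query);
    if ($matches === false) {
        return $content; // invalid XPath expression
    }

    // XPath results are a snapshot, so replacing nodes inside this loop is safe.
    foreach ($matches as $old) {
        $new = $dom->createElement($newTag);
        if ($old->hasAttribute('class')) {
            $new->setAttribute('class', $old->getAttribute('class'));
        }
        // Move children one at a time; firstChild changes after each move.
        while ($old->firstChild !== null) {
            $new->appendChild($old->firstChild);
        }
        $old->parentNode->replaceChild($new, $old);
    }

    // Serialize only what sits inside <body>.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Made-up input: the href value is a placeholder.
$html = '<a class="vcard author" href="#"><span class="fn" itemprop="name">Gael Beyries</span></a>';
echo replaceTag($html, '//a[@class="vcard author"]', 'div');
```

How this hooks into WordPress (for example through a content filter) is left out, as in the answer above.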

Extract text from HTML <p> with a particular title

I have a huge file with lots of entries that have one thing in common: the first line. I want to extract all of the text from a paragraph where the first line is:
Type of document: Contract Notice
The HTML code I am working on is here:
<!-- other HTML -->
<p>
<b>Type of document:</b>
" Contract Notice" <br>
<b>Country</b> <br>
... rest of text ...
</p>
<!-- other HTML -->
I have put the HTML into a DOM like this:
$dom = new DOMDocument;
$dom->loadHTML($content);
I need to return all of the text in the paragraph node where the first line is 'Type of document: Contract Notice'. I am sure there is a simple way of doing this using DOM methods or XPath. Please advise!

Speaking of XPath, try the following expression, which selects <p> elements:
whose first <b> child element has the value Type of document:
whose first following sibling text node contains the text Contract Notice
//p[
b[1][.="Type of document:"]
/following-sibling::text()[1][contains(., "Contract Notice")]
]
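To show how this expression could be applied, here is a minimal sketch using DOMXPath. The $content string is a made-up sample modelled on the question's HTML (the second paragraph, with "Award Notice", is invented to show a non-match). Collapsing whitespace in the result is an extra step that the question did not ask for.

```php
<?php
// Made-up sample: one matching paragraph and one that should be skipped.
$content = <<<HTML
<p><b>Type of document:</b> Contract Notice <br><b>Country</b> <br>... rest of text ...</p>
<p><b>Type of document:</b> Award Notice <br>other text</p>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect markup
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$query = '//p[b[1][.="Type of document:"]'
       . '/following-sibling::text()[1][contains(., "Contract Notice")]]';

$texts = [];
foreach ($xpath->query($query) as $p) {
    // textContent concatenates every text node below the paragraph;
    // runs of whitespace are collapsed to single spaces for readability.
    $texts[] = trim(preg_replace('/\s+/', ' ', $p->textContent));
}

print_r($texts);
```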

With this XPath expression, you select the text nodes inside the child elements of the p element (text placed directly in the p, such as Contract Notice, is not included):
//b[text()="Type of document:"]/parent::p/*/text()

I don't like using DOMDocument unless I need to parse a document heavily, but if you want to do so, it could look something like this:
//Using DomDocument
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
$matchedDoms = $xpath->query('//b[text()="Type of document:"]/parent::p//text()');
$data = '';
foreach($matchedDoms as $domMatch) {
$data .= $domMatch->data . ' ';
}
var_dump($data);
I would prefer a single regex to do it all. After all, you are looking for just one piece of the document:
//Using a Regular Expression
preg_match('/<p>.*<b>Type of document:<\/b>.*Contract Notice(?<data>.*)<\/p>/si', $content, $matches);
var_dump($matches['data']); //If you want everything in there
var_dump(strip_tags($matches['data'])); //If you just want the text

php dom document remove some html tags but keep inner tags and text

I need to remove some tags (e.g. <div></div>) in HTML document and keep inner tags and text.
I managed to do that with Simple HTML Dom Parser. But it can't process big files due to huge memory requirements.
I would prefer to use native PHP tools like DOMDocument, because I read that it is better optimized and quicker at processing HTML documents.
But I struggle at the first stage - how to remove some tags while keeping inner text and tags.
Source HTML sample is:
<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>
I tried this code:
$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
foreach ($oldnodes as $node) {
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
}
echo $doc->saveHTML();
It produces the output:
<html><body>00000aaaaa<div>bbbbbbccc<a>link</a>cccdddddd</div>eeeee<div>1111</div></body></html>
I need the following:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
Could someone please help me with proper code for the task?

You can use the strip_tags function in PHP.
$htmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo strip_tags($htmltext, '<html><body><a>');
This removes all tags except html, body and a.
The output is:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
EDIT:
If the input comes from users, it is better for security reasons to whitelist the allowed tags than to blacklist the unwanted ones.

If your code contains only simple HTML tags without any attributes, you can keep it simple:
$value = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
$pattern = '/<[\/]*(div|h1)>/';
$removedTags = preg_replace($pattern, '', $value);
Since you wrote in your comment that you want to remove more than just div tags, I added h1 to the pattern as an example of removing a second tag.
This snippet only suits simple markup, but it fits your HTML input and output example.

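Try this.
Just replace the foreach loop with the code below. It always works on the first remaining <div>, because the node list is live and shrinks as nodes are removed.
while ($oldnodes->length > 0) {
$node = $oldnodes->item(0);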
$children = $node->childNodes;
$string = "";
foreach($children as $child) {
$childString = $doc->saveXML($child);
$string = $string."".$childString;
}
$fragment = $doc->createDocumentFragment();
$fragment->appendXML($string);
$node->parentNode->insertBefore($fragment,$node);
$node->parentNode->removeChild($node);
}

I found a way to make it work.
The code in the question does not work because manipulating nodes that belong to the node list corrupts the list: it is live, so it re-indexes as nodes are removed. As a result, the foreach loop visits only 2 of the 4 items in the list and skips the other 2.
So I had to process only the first element of the list, then rebuild the list, and repeat while any items remain.
The code is:
$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo "<!--
".$htmltext."
-->
";
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
while ($oldnodes->length>0){
$node=$oldnodes->item(0);
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
$oldnodes = $doc->getElementsByTagName('div');
}
echo $doc->saveHTML();
I hope this will be helpful for anyone who runs into the same difficulty.
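An alternative to re-querying the list is to copy it into a plain array first and then move each wrapper's children in front of it. The sketch below illustrates that variant. The function name unwrapTags() is hypothetical, and the approach relies on moving nodes rather than copying them, so nested wrappers stay valid while the loop runs.

```php
<?php
// Hypothetical helper: remove every <$tagName> element from $doc
// but keep its children in place.
function unwrapTags(DOMDocument $doc, string $tagName): void
{
    // Copy the live list into an array so removals cannot shift it mid-loop.
    $nodes = iterator_to_array($doc->getElementsByTagName($tagName));

    foreach ($nodes as $node) {
        // Move each child in front of the wrapper, preserving order.
        while ($node->firstChild !== null) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        // The wrapper is now empty and can be dropped.
        $node->parentNode->removeChild($node);
    }
}

$htmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
libxml_clear_errors();

unwrapTags($doc, 'div');
echo $doc->saveHTML();
```

Because children are moved and not re-parsed, this variant avoids serializing and re-reading markup, which may matter for the large files mentioned in the question.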

Both Copy and Remove Node with PHP DOMDocument

I have a seemingly unique situation in which I want to use DOMDocument to find a node on a page, store its value in a variable (working), then remove it from the output. I cannot figure out how to remove the node from the DOMDocument output while still saving its value first.
I am able to either remove the node completely first, which means nothing is stored in the variable, or I receive a 'Not Found Error' when trying to remove the node.
There is only one node (<h6>) on the page that needs to be removed. The code I have so far (with not found error) is below.
// Strip Everything Before and After Header Tags
$domdoc = new DOMDocument;
$docnew = new DOMDocument;
// Disable errors for <article> tag
libxml_use_internal_errors(true);
$domdoc->loadHTML(file_get_contents($file));
libxml_clear_errors();
$body = $domdoc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child){
$docnew->appendChild($docnew->importNode($child, true));
}
// Get the Page Title
$ppretitle = $docnew->getElementsByTagName('h6')->item(0);
$pagetitle = $ppretitle->nodeValue;
// Remove Same Element From Output
$trunctitl = $docnew->removeChild($ppretitle);
// Save Cleaned Output In Var
$pagecontent = $docnew->saveHTML();

The h6 element might not be a direct child of the node you call removeChild() on, and removeChild() only accepts direct children. Try $ppretitle->parentNode->removeChild($ppretitle) instead of $trunctitl = $docnew->removeChild($ppretitle);
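A minimal sketch of that order of operations (read the value first, then remove the node through its own parent) could look like this. The $html string is a made-up stand-in for the file contents, and the document is loaded directly rather than copied into a second DOMDocument as in the question.

```php
<?php
// Made-up stand-in for file_get_contents($file).
$html = '<body><article><h6>Page Title</h6><p>Body text</p></article></body>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings about the <article> tag
$doc->loadHTML($html);
libxml_clear_errors();

$heading = $doc->getElementsByTagName('h6')->item(0);

$pagetitle = '';
if ($heading !== null) {
    // Read the value while the node is still attached.
    $pagetitle = $heading->nodeValue;
    // Remove it via its actual parent, wherever it sits in the tree.
    $heading->parentNode->removeChild($heading);
}

$pagecontent = $doc->saveHTML();

echo $pagetitle, PHP_EOL;
echo $pagecontent;
```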
