Replace all matches that are not part of HTML code

I have input such as:
<h2 class="role">He played and an important role</h2>
I need to replace the word "role" in the text, but not in the class attribute.
The tricky part is that the attribute may be class="group role something" or similar, so I essentially want to search only the real text and not the HTML, but I need to get the complete markup back.
I'm using PHP and don't have a good starting point ...

Better not to use the preg_ functions for parsing HTML; use DOM instead:
$input = '<h2 class="role">He played and an important role</h2>';
$dom = new DOMDocument('1.0', 'utf-8');
$dom->preserveWhiteSpace = false; // set before loading; it has no effect afterwards
$dom->loadHTML($input);
$elements = $dom->getElementsByTagName('h2'); // <--- change tag name as appropriate
$value = $elements->item(0)->nodeValue; // text content only, without the tag or its attributes
// change $value here...
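
The snippet above reads the text but does not yet write the changed markup back. A minimal sketch of the full round trip follows, assuming the replacement should touch text nodes only. replaceInText() and the wrapper <div> are hypothetical names introduced for illustration, and the input is assumed to be plain ASCII (non-ASCII input would need an explicit charset hint for loadHTML()).

```php
<?php
// Sketch: replace a word in text nodes only, leaving tags and attributes alone.
// replaceInText() is a hypothetical helper, not part of the original answer.
function replaceInText(string $html, string $search, string $replace): string
{
    $dom = new DOMDocument('1.0', 'utf-8');
    libxml_use_internal_errors(true); // keep parser warnings quiet
    // Wrap the fragment so it can be found again after loadHTML() adds <html>/<body>.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $xpath = new DOMXPath($dom);

    // text() selects text nodes only, so attribute values are never visited.
    foreach ($xpath->query('.//text()', $wrapper) as $textNode) {
        $textNode->data = str_replace($search, $replace, $textNode->data);
    }

    // Serialize the wrapper's children, which gives the fragment back without the wrapper.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo replaceInText('<h2 class="role">He played and an important role</h2>', 'role', 'new role');
```

Because only text nodes are modified, a value such as class="group role something" is left untouched.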

It is better to use the DOM to manipulate HTML, but here is a regex solution.
It skips a match when a > appears before the next < further ahead in the string, because that means the match sits inside a tag.
$input = '<h2 class="role">He played and an important role</h2>';
$input = preg_replace( '/role(?![^<>]*>[^<>]*(?:<|$))/', 'new role', $input );
echo $input;
// <h2 class="role">He played and an important new role</h2>

Related

How can I be 100% sure to find JS inside of an HTML tag?

I need to save some data with some HTML tags, so I cannot use strip_tags on the whole text, and I cannot use htmlentities because the tags must still be able to format the text. To protect other users against XSS, I must remove any JavaScript from inside the tags.
What is the best way to do this?

If you need to save HTML tags in your database and later print them back to the browser, there is no 100% secure way to achieve this using only built-in PHP functions. It's easy when there are no HTML tags: with plain text you can use the built-in PHP functions to clean it.
There are some functions that strip XSS from text, but they are not 100% secure and there is always a way for XSS to go unnoticed. Your regex example is fine, but what if I use, say, < script>alert('xss')</script>, or some other combination that the regex could miss and the browser would execute?
The best way to do this is to use something like HTML Purifier.
Also note that there are two levels of security: the first is when data goes into your database, and the second is when it comes out of your database.
Hope this helps!

You have to parse the HTML if you want to allow specific tags.
There is already a nice library for that purpose: HTML Purifier (Opensource under LGPL)

I suggest that you use DOMDocument (with loadHTML) to load said HTML, remove every kind of tag and every attribute you don't want to see, and save back the HTML (using saveXML or saveHTML). You can do that by recursively iterating over the children of the document's root, and replacing tags you don't want by their inner contents. Since loadHTML loads code in a similar way browsers do, it's a much safer way to do it than using regular expressions.
EDIT Here's a "purifying" function I made:
<?php
function purifyNode($node, $whitelist)
{
$children = array();
// copy childNodes since we're going to iterate over it and modify the collection
foreach ($node->childNodes as $child)
$children[] = $child;
foreach ($children as $child)
{
if ($child->nodeType == XML_ELEMENT_NODE)
{
purifyNode($child, $whitelist);
if (!isset($whitelist[strtolower($child->nodeName)]))
{
while ($child->childNodes->length > 0)
$node->insertBefore($child->firstChild, $child);
$node->removeChild($child);
}
else
{
$attributes = $whitelist[strtolower($child->nodeName)];
// copy attributes since we're going to iterate over it and modify the collection
$childAttributes = array();
foreach ($child->attributes as $attribute)
$childAttributes[] = $attribute;
foreach ($childAttributes as $attribute)
{
if (!isset($attributes[$attribute->name]) || !preg_match($attributes[$attribute->name], $attribute->value))
$child->removeAttribute($attribute->name);
}
}
}
}
}
function purifyHTML($html, $whitelist)
{
$doc = new DOMDocument();
$doc->loadHTML($html);
// make sure <html> doesn't have any attributes
while ($doc->documentElement->hasAttributes())
$doc->documentElement->removeAttributeNode($doc->documentElement->attributes->item(0));
purifyNode($doc->documentElement, $whitelist);
$html = $doc->saveHTML();
$fragmentStart = strpos($html, '<html>') + 6; // 6 is the length of <html>
return substr($html, $fragmentStart, -8); // 8 is the length of </html> + 1
}
?>
You would call purifyHTML with an unsafe HTML string and a predefined whitelist of tags and attributes. The whitelist format is 'tag' => array('attribute' => 'regex'). Tags that don't exist in the whitelist are stripped, with their contents inlined in the parent tag. Attributes that don't exist for a given tag in the whitelist are removed as well; and attributes that exist in the whitelist, but that don't match the regex, are removed as well.
Here's an example:
<?php
$html = <<<HTML
<p>This is a paragraph.</p>
<p onclick="alert('xss')">This is an evil paragraph.</p>
<p><a href="javascript:evil()">Evil link</a></p>
<p><script>evil()</script></p>
<p>This is an evil image: <img src="error.png" onerror="evil()"/></p>
<p>This is nice <b>bold text</b>.</p>
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p>
HTML;
// whitelist format: tag => array(attribute => regex)
$whitelist = array(
'b' => array(),
'i' => array(),
'u' => array(),
'p' => array(),
'img' => array('src' => '#\Ahttp://.+\Z#', 'alt' => '#.*#'),
'a' => array('href' => '#\Ahttp://.+\Z#')
);
$purified = purifyHTML($html, $whitelist);
echo $purified;
?>
The result is:
<p>This is a paragraph.</p>
<p>This is an evil paragraph.</p>
<p><a>Evil link</a></p>
<p>evil()</p>
<p>This is an evil image: <img></p>
<p>This is nice <b>bold text</b>.</p>
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p>
Obviously, you don't want to allow any on* attribute, and I would advise against style because of weird proprietary properties like behavior. Make sure all URL attributes are validated with a decent regex that matches the full string (\Aregex\Z).
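
To illustrate why the full-string anchors matter, here is a small sketch. The hostile value is made up for the example, and the unanchored pattern is shown only for contrast.

```php
<?php
// A hypothetical hostile attribute value that merely *contains* an http:// URL.
$value = 'javascript:alert(1)//http://example.org/';

$unanchored = '#http://.+#';      // may match anywhere inside the string
$anchored   = '#\Ahttp://.+\Z#';  // must match from the very start of the string

var_dump(preg_match($unanchored, $value)); // int(1): the attribute would be kept
var_dump(preg_match($anchored, $value));   // int(0): the attribute would be removed
```

Note that \Z still tolerates a single trailing newline; \z is the stricter variant if that matters for your input.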

I wrote this code for that purpose; you can pass it a list of tags and attributes to remove.
function RemoveTagAttribute($Dom,$Name){
$finder = new DomXPath($Dom);
if(!is_array($Name))$Name=array($Name);
foreach($Name as $Attribute){
$Attribute=strtolower($Attribute);
do{
$tag=$finder->query("//*[@".$Attribute."]");
//print_r($tag);
foreach($tag as $T){
if($T->hasAttribute($Attribute)){
$T->removeAttribute($Attribute);
}
}
}while($tag->length>0);
}
return $Dom;
}
function RemoveTag($Dom,$Name){
if(!is_array($Name))$Name=array($Name);
foreach($Name as $tagName){
$tagName=strtolower($tagName);
do{
$tag=$Dom->getElementsByTagName($tagName);
//print_r($tag);
foreach($tag as $T){
//
$T->parentNode->removeChild($T);
}
}while($tag->length>0);
}
return $Dom;
}
example:
$dom= new DOMDocument;
$HTML = str_replace("&", "&amp;", $HTML); // disguise &s going IN to loadXML()
// $dom->substituteEntities = true; // collapse &s going OUT to transformToXML()
$dom->recover = TRUE;
@$dom->loadHTML('<?xml encoding="UTF-8">' .$HTML);
// dirty fix
foreach ($dom->childNodes as $item)
if ($item->nodeType == XML_PI_NODE)
$dom->removeChild($item); // remove hack
$dom->encoding = 'UTF-8'; // insert proper
$dom=RemoveTag($dom,"script");
$dom=RemoveTagAttribute($dom,array("onmousedown","onclick"));
echo $dom->saveHTML();

strip_tags disallow some tags

Based on the strip_tags documentation, the second parameter takes the allowable tags. However, in my case I want to do the reverse. Say I want to keep every tag that strip_tags would normally handle, but strip only the <script> tag. Is there any possible way to do this?
I don't mean somebody to code it for me, but rather an input of possible ways on how to achieve this (if possible) is greatly appreciated.

EDIT
To use the HTML Purifier HTML.ForbiddenElements config directive, it seems you would do something like:
require_once '/path/to/HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.ForbiddenElements', array('script','style','applet'));
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);
http://htmlpurifier.org/docs
HTML.ForbiddenElements should be set to an array. What I don't know is what form the array members should take:
array('script','style','applet')
Or:
array('<script>','<style>','<applet>')
Or... Something else?
I think it's the first form, without delimiters; HTML.AllowedElements uses a form of configuration string somewhat common to TinyMCE's valid elements syntax:
tinyMCE.init({
...
valid_elements : "a[href|target=_blank],strong/b,div[align],br",
...
});
So my guess is it's just the term, and no attributes should be provided (since you're banning the element... although there is a HTML.ForbiddenAttributes, too). But that's a guess.
I'll add this note from the HTML.ForbiddenAttributes docs, as well:
Warning: This directive complements %HTML.ForbiddenElements,
accordingly, check out that directive for a discussion of why you
should think twice before using this directive.
Blacklisting is just not as "robust" as whitelisting, but you may have your reasons. Just beware and be careful.
Without testing, I'm not sure what to tell you. I'll keep looking for an answer, but I will likely go to bed first. It is very late. :)
Although I think you really should use HTML Purifier and utilize its HTML.ForbiddenElements configuration directive, I think a reasonable alternative, if you really, really want to use strip_tags(), is to derive a whitelist from the blacklist. In other words, remove what you don't want and then use what's left.
For instance:
function blacklistElements($blacklisted = '', &$errors = array()) {
if ((string)$blacklisted == '') {
$errors[] = 'Empty string.';
return array();
}
$html5 = array(
"<menu>","<command>","<summary>","<details>","<meter>","<progress>",
"<output>","<keygen>","<textarea>","<option>","<optgroup>","<datalist>",
"<select>","<button>","<input>","<label>","<legend>","<fieldset>","<form>",
"<th>","<td>","<tr>","<tfoot>","<thead>","<tbody>","<col>","<colgroup>",
"<caption>","<table>","<math>","<svg>","<area>","<map>","<canvas>","<track>",
"<source>","<audio>","<video>","<param>","<object>","<embed>","<iframe>",
"<img>","<del>","<ins>","<wbr>","<br>","<span>","<bdo>","<bdi>","<rp>","<rt>",
"<ruby>","<mark>","<u>","<b>","<i>","<sup>","<sub>","<kbd>","<samp>","<var>",
"<code>","<time>","<data>","<abbr>","<dfn>","<q>","<cite>","<s>","<small>",
"<strong>","<em>","<a>","<div>","<figcaption>","<figure>","<dd>","<dt>",
"<dl>","<li>","<ul>","<ol>","<blockquote>","<pre>","<hr>","<p>","<address>",
"<footer>","<header>","<hgroup>","<aside>","<article>","<nav>","<section>",
"<body>","<noscript>","<script>","<style>","<meta>","<link>","<base>",
"<title>","<head>","<html>"
);
$list = trim(strtolower($blacklisted));
$list = preg_replace('/[^a-z ]/i', '', $list);
$list = '<' . str_replace(' ', '> <', $list) . '>';
$list = array_map('trim', explode(' ', $list));
return array_diff($html5, $list);
}
Then run it:
$blacklisted = '<html> <bogus> <EM> em li ol';
$errors = array(); // filled by reference when the input is empty
$whitelist = blacklistElements($blacklisted, $errors);
if (count($errors)) {
    echo "There were errors.\n";
    print_r($errors);
    echo "\n";
} else {
    // Do strip_tags() ...
}
http://codepad.org/LV8ckRjd
So if you pass in what you don't want to allow, it will give you back the HTML5 element list in array form, which you can then feed into strip_tags() after joining it into a string:
$stripped = strip_tags($html, implode('', $whitelist));
Caveat Emptor
Now, I've kind of hacked this together and I know there are some issues I haven't thought out yet. For instance, from the strip_tags() man page for the $allowable_tags argument:
Note:
This parameter should not contain whitespace. strip_tags() sees a tag
as a case-insensitive string between < and the first whitespace or >.
It means that strip_tags("<br/>", "<br>") returns an empty string.
It's late and for some reason I can't quite figure out what this means for this approach. So I'll have to think about that tomorrow. I also compiled the HTML element list in the function's $html5 array from this MDN documentation page. Sharp-eyed readers might notice all of the tags are in this form:
<tagName>
I'm not sure how this will affect the outcome, or whether I need to take into account variations such as the short tag <tagName/> and some of the, ahem, odder variations. And, of course, there are more tags out there.
So it's probably not production ready. But you get the idea.

First, see what others have said on this topic:
Strip <script> tags and everything in between with PHP?
and
remove script tag from HTML content
It seems you have two choices. One is a regex solution; both of the links above provide one. The second is to use HTML Purifier.
If you are stripping the script tag for some other reason than sanitation of user content, the Regex could be a good solution. However, as everyone has warned, it is a good idea to use HTML Purifier if you are sanitizing input.

PHP (5 or greater) solution:
If you want to remove <script> tags (or any other tags) together with the content inside them, you can use:
OPTION 1 (simplest):
$text = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $text);
OPTION 2 (more versatile):
<?php
$html = "<p>Your HTML code</p><script>With malicious code</script>";
$dom = new DOMDocument();
$dom->loadHTML($html);
$scripts = $dom->getElementsByTagName('script');
// copy the live node list first; removing nodes while iterating it directly skips elements
foreach (iterator_to_array($scripts) as $item)
{
    $item->parentNode->removeChild($item);
}
$html = $dom->saveHTML();
Then $html will contain:
<p>Your HTML code</p>
without the script, wrapped in the doctype, <html> and <body> elements that loadHTML() adds.

This is what I use to strip out a list of forbidden tags. It can remove tags that wrap content (keeping the content) as well as tags together with their content, and it trims off leftover whitespace.
$description = trim(preg_replace([
# Strip tags around content
'/\<(.*)doctype(.*)\>/i',
'/\<(.*)html(.*)\>/i',
'/\<(.*)head(.*)\>/i',
'/\<(.*)body(.*)\>/i',
# Strip tags and content inside
'/\<(.*)script(.*)\>(.*)<\/script>/i',
], '', $description));
Input example:
$description = '<html>
<head>
</head>
<body>
<p>This distinctive Mini Chopper with Desire styling has a powerful wattage and high capacity which makes it a very versatile kitchen accessory. It also comes equipped with a durable glass bowl and lid for easy storage.</p>
<script type="application/javascript">alert("Hello world");</script>
</body>
</html>';
Output result:
<p>This distinctive Mini Chopper with Desire styling has a powerful wattage and high capacity which makes it a very versatile kitchen accessory. It also comes equipped with a durable glass bowl and lid for easy storage.</p>

I use the following:
function strip_tags_with_forbidden_tags($input, $forbidden_tags)
{
foreach (explode(',', $forbidden_tags) as $tag) {
$tag = preg_replace(array('/^</', '/>$/'), array('', ''), $tag);
$input = preg_replace(sprintf('/<%s[^>]*>([^<]+)<\/%s>/', $tag, $tag), '$1', $input);
}
return $input;
}
Then you can do:
echo strip_tags_with_forbidden_tags('<cancel>abc</cancel>xpto<p>def></p><g>xyz</g><t>xpto</t>', 'cancel,g');
Output: 'abcxpto<p>def></p>xyz<t>xpto</t>'
echo strip_tags_with_forbidden_tags('<cancel>abc</cancel> xpto <p>def></p> <g>xyz</g> <t>xpto</t>', 'cancel,g');
Outputs: 'abc xpto <p>def></p> xyz <t>xpto</t>'

Remove empty HTML from a document

I need some help stripping empty tags in my HTML. There is a solution here:
Remove empty tags using RegEx
But I can't use JS, and I should never use Regular expressions to parse HTML.
I need to clean inputs with PHP, and I also need to get more than just empty tags.
I also need to catch tags like this:
<p> </p> (variable whitespace with nothing in the tag)
<p>&nbsp;</p>
<p><br/><p>
<p><br /></p>
What can I do to catch bad markup like that before it makes it to the database (WYSIWYGs)?

Parse it with a document object model parser, check the text content of nodes, remove nodes that don't meet your criteria (parses as a script tag, contains whitespace, is an iframe, etc).
Quite a lot of sample code in the comments section as well.
Here's a bunch of code that does something like that (adapted from random cut+paste on php.net):
<?php
$sampleHTML = "
<p> </p>
<p> <p>
<p><br/></p>
<p><br /></p>
<span>Non-empty span<p id='NestedEmptyElement'></p></span>
";
$doc = new DOMDocument();
$doc->loadHTML($sampleHTML);
$domNodeList = $doc->getElementsByTagname('*');
$domElemsToRemove = array();
foreach ( $domNodeList as $domElement ) {
$domElement->normalize();
if (trim($domElement->textContent, "\xc2\xa0 \n \t ") == "") {
$domElemsToRemove[] = $domElement;
}
}
foreach( $domElemsToRemove as $domElement ){
try {
$domElement->parentNode->removeChild($domElement);
} catch (Exception $e) {
//node was already deleted.
//There's a better way to do this, it's recursive.
}
}
$domNodeList = $doc->getElementsByTagname('body')->item(0);
$childNodes = $domNodeList->childNodes;
foreach ( $childNodes as $domElement ) {
echo trim($domElement->C14N());
}
echo "\n\n";
Then we run..
$ php foo.php -v
<span>Non-empty span</span>
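
The comment in the code above mentions that a recursive approach would be cleaner. A minimal sketch of that idea follows, assuming "empty" means "contains no text other than whitespace or non-breaking spaces". removeEmptyElements() and the wrapper <div> are hypothetical, and like the original this also drops text-less elements such as <img>, so real use would need a list of tags to keep.

```php
<?php
// Sketch: remove elements without visible text, visiting children before parents.
// removeEmptyElements() is a hypothetical name, not taken from the answer above.
function removeEmptyElements(DOMNode $node): void
{
    // Iterate over a copy: removing nodes from the live childNodes list skips siblings.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        removeEmptyElements($child); // post-order, so empty children disappear first
        // Drop ASCII whitespace and non-breaking spaces before testing for emptiness.
        $text = preg_replace('/[\s\x{00A0}]+/u', '', $child->textContent);
        if ($text === '') {
            $node->removeChild($child);
        }
    }
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
$doc->loadHTML("<div><p> </p><p><br/></p><span>Non-empty span<p id='NestedEmptyElement'></p></span></div>");
libxml_clear_errors();

$root = $doc->getElementsByTagName('div')->item(0);
removeEmptyElements($root);

// Serialize what is left inside the wrapper.
$result = '';
foreach ($root->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

Because each parent is checked only after its children, no try/catch is needed for nodes that were already removed.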

That matches your examples and a little more:
^<p>\s*(?:(?:&nbsp;|<br\s*/>)\s*)*</p>$
But are you looking only for p tags? Can there be several per line?
Yet another use of normal* (special normal*)* with:
normal: \s,
special: (&nbsp;|<br\s*/>)
(with non capturing groups)

I worked on this for about a day and saw a lot of "don't use regex", which I agree with.
However, I had huge problems with DOMDocument messing with my HTML entities. I would carefully filter text so that all TM symbols were converted to HTML entities such as &trade;, but it would convert them back to the TM symbol.
I battled with preventing this behavior for some time. There were some hacks mentioned for this. After a day of battling I thought "why should I work so hard to hack it to work? It should just work..." and then I wrote this function using simplehtmldom in about 10 minutes:
function stripEmptyTags($html){
$dom = new simple_html_dom();
$dom->load($html);
foreach($dom->find("*") as $e)
if( trim( str_replace( array(' ','&nbsp;'), "", $e->innertext )) == "" )
$e->outertext = "";
$dom->load($dom->save());
return $dom->save();
}

Ignore html tags in preg_replace

How do I ignore html tags in this preg_replace.
I have a foreach function for a search, so if someone searches for "apple span" the preg_replace also applies a span to the span and the html breaks:
preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);
Thanks in advance!

I suggest basing your function on DOMDocument and DOMXPath rather than on regular expressions. Even though regular expressions are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with them.
The general saying is: Don't parse HTML with regular expressions.
It's a good rule to keep in mind, and although, as with any rule, it does not always apply, it's worth making up one's mind about it.
XPath allows you to find all texts that contain the search terms within texts only, ignoring all XML elements.
Then you only need to wrap those texts into the <span> and you're done.
Edit: Finally some code ;)
First it makes use of xpath to locate elements that contain the search text. My query looks like this, this might be written better, I'm not a super xpath pro:
'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'
$search contains the text to search for, not containing any " (quote) character (this would break it, see Cleaning/sanitizing xpath attributes for a workaround if you need quotes).
This query will return all parents that contain textnodes which put together will be a string that contain your search term.
As such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It is useful to do string-operations on a list of textnodes as if they were one string.
This is the base skeleton of the routine:
$str = '...'; # some XML
$search = 'text that span';
printf("Searching for: (%d) '%s'\n", strlen($search), $search);
$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);
$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
throw new Exception('Anchor element not found.');
}
// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
throw new Exception('XPath failed.');
}
// process search results
foreach($r as $i => $node)
{
$textNodes = $xp->query('.//child::text()', $node);
// extract $search textnode ranges, create fitting nodes if necessary
$range = new TextRange($textNodes);
$ranges = array();
while(FALSE !== $start = strpos($range, $search))
{
$base = $range->split($start);
$range = $base->split(strlen($search));
$ranges[] = $base;
};
// wrap every matching textnode
foreach($ranges as $range)
{
foreach($range->getNodes() as $node)
{
$span = $doc->createElement('span');
$span->setAttribute('class', 'search_hightlight');
$node = $node->parentNode->replaceChild($span, $node);
$span->appendChild($node);
}
}
}
For my example XML:
<html>
<body>
This is some <span>text</span> that span across a page to search in.
and more text that span</body>
</html>
It produces the following result:
<html>
<body>
This is some <span><span class="search_hightlight">text</span></span><span class="search_hightlight"> that span</span> across a page to search in.
and more <span class="search_hightlight">text that span</span></body>
</html>
This shows that the approach can even find text that is distributed across multiple tags. That is not easily possible with regular expressions at all.
You can find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class that I have left out of the answer's example).
It's not working properly on the viper codepad because of an older LIBXML version that site is using. It works fine for my LIBXML version 20707. I created a related question about this issue: XPath query result order.
A note of warning: This example uses binary string search (strpos) and the related offsets for splitting textnodes with the DOMText::splitText function. That can lead to wrong offsets, as the function needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.
The example works anyway because it only uses US-ASCII, which has the same offsets as UTF-8 for the example data.
For a real-life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:
while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))
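
For the simpler case where each match lies inside a single text node, a much shorter sketch is possible. This is not the TextRange code from the answer: highlightText() and the wrapper <div> are hypothetical, matches that span several tags are not found, and splitting with a /u regex sidesteps the byte-versus-character offset problem described above.

```php
<?php
// Sketch only: wraps matches found within one text node; tags and attributes are never touched.
function highlightText(string $html, string $keyword): string
{
    if ($keyword === '') {
        return $html;
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // The XML declaration is a common trick to make loadHTML() read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    $root = $doc->getElementsByTagName('div')->item(0);
    $xp = new DOMXPath($doc);
    $pattern = '/(' . preg_quote($keyword, '/') . ')/iu';

    foreach ($xp->query('.//text()', $root) as $text) {
        // With DELIM_CAPTURE, odd indexes hold the matches and even indexes the text between them.
        $parts = preg_split($pattern, $text->data, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue; // no match in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            $piece = $doc->createTextNode($part);
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'search_hightlight');
                $span->appendChild($piece);
                $piece = $span;
            }
            $fragment->appendChild($piece);
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlightText('<p class="span">apple span</p>', 'span');
```

Running it once per keyword is safe with respect to attributes, because only text nodes are ever inspected; the class="span" attribute in the usage line is left as it is.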

PHP DOM - stripping span tags, leaving their contents

I am looking to take markup like:
<span class="test">Some text that is <strong>bolded</strong> and contains a link.</span>
and find the best method in PHP for stripping the span so that what is left is this:
Some text that is <strong>bolded</strong> and contains a link.
I have read many of the other questions regarding parsing HTML using PHP DOM instead of regex, but have been unable to figure out a way to strip the spans with PHP DOM, leaving the HTML contents intact. The ultimate goal is to be able to strip the document of all span tags, leaving their contents. Can this be done with PHP DOM? Is there a method that provides better performance and does not rely on string parsing instead of DOM parsing?
I've used regex to do so, without any issues thus far:
/<(\/)?(span)[^>]*>/i
But my interest here is in becoming a better PHP programmer. And since it is always possible to trip up a regex with badly formatted markup, I'm looking for a better way. I have also considered using strip_tags(), doing something like the following:
public function strip_tags( $content, $tags_to_strip = array() )
{
// All Valid XHTML tags
$valid_tags = array(
'a','abbr','acronym','address','area','b','base','bdo','big','blockquote','body','br','button','caption','cite',
'code','col','colgroup','dd','del','dfn','div','dl','DOCTYPE','dt','em','fieldset','form','h1','h2','h3','h4',
'h5','h6','head','html','hr','i','img','input','ins','kbd','label','legend','li','link','map','meta','noscript',
'object','ol','optgroup','option','p','param','pre','q','samp','script','select','small','span','strong','style',
'sub','sup','table','tbody','td','textarea','tfoot','th','thead','title','tr','tt','ul','var'
);
// Remove each tag to strip from the valid_tags array
foreach ( $tags_to_strip as $tag ){
$ndx = array_search( $tag, $valid_tags );
if ( $ndx !== false ){
unset( $valid_tags[ $ndx ] );
}
}
// convert valid_tags array into param for strip_tags
$valid_tags = implode( '><', $valid_tags );
$valid_tags = "<$valid_tags>";
$content = strip_tags( $content, $valid_tags );
return $content;
}
But this is still parsing the string, not the DOM. So if the text is malformed, it is possible to strip too much. Many people are quick to suggest using Simple HTML DOM Parser, but looking at the source code, it seems to be using regex to parse the HTML as well.
Can this be done with PHP5's DOM, or is there a better way to strip tags while leaving their contents intact? Would it be bad practice to use Tidy or HTML Purifier to clean the text and then use regex or Simple HTML DOM Parser on it?
Libraries like phpQuery seem to be too heavyweight for what seems like it should be a simple task.

I use the following function to remove a node without removing its children:
function DOMRemove(DOMNode $from) {
    // move every child in front of the node; this also copes with a node that has no children
    while ($from->firstChild) {
        $from->parentNode->insertBefore($from->firstChild, $from);
    }
    $from->parentNode->removeChild($from);
}
For example:
$dom = new DOMDocument;
$dom->loadHTMLFile('myhtml.html'); // load() expects well-formed XML; loadHTMLFile() parses HTML
// getElementsByTagName() returns a live list, so copy it before removing nodes
$nodes = iterator_to_array($dom->getElementsByTagName('span'));
foreach ($nodes as $node) {
    DOMRemove($node);
}
echo $dom->saveHTML();
Would give you:
Some text that is <strong>bolded</strong> and contains a link.
While this:
$nodes = $dom->getElementsByTagName('a');
foreach ($nodes as $node) {
DOMRemove($node);
}
echo $dom->saveHTML();
Would give you:
<span class="test">Some text that is <strong>bolded</strong> and contains a link.</span>

Well,
In my experience, every time I have worked with DOM, I lose a little performance compared with simple string operations.
With your function, you tried to filter strictly the valid XHTML tags, but you don't need a loop with manual comparison, since you can hand this whole task to the PHP interpreter through native functions.
Of course, your combination already achieves very good performance (for me, 0.0002 milliseconds), but you could combine functions in a single line, letting each function do its own natural job.
Take a look and you will understand what I'm talking about:
$text = '<span class="test">Some text that is <strong>bolded</strong> and contains a link.</span>';
$validTags = array( 'a','abbr','acronym','address','area','b','base','bdo','big','blockquote','body','br','button','caption','cite',
'code','col','colgroup','dd','del','dfn','div','dl','DOCTYPE','dt','em','fieldset','form','h1','h2','h3','h4',
'h5','h6','head','html','hr','i','img','input','ins','kbd','label','legend','li','link','map','meta','noscript',
'object','ol','optgroup','option','p','param','pre','q','samp','script','select','small','span','strong','style',
'sub','sup','table','tbody','td','textarea','tfoot','th','thead','title','tr','tt','ul','var'
);
$tagsToStrip = array( 'span' );
var_dump( strip_tags( $text, sprintf( '<%s>', implode( '><', array_diff( $validTags, $tagsToStrip ) ) ) ) );
I used your own list, but I combined sprintf(), implode() and array_diff() so that each does one specific task and together they achieve the goal.
Hope it helped.
