PHP: Removing duplicate words from between quotes - php

How can I remove the duplicates from between class="" in the following string?
<li class="active active">Sample Page</li>
Please note that the classes shown can change and be in different positions.

You can use a DOM parser, then explode and array_unique:
$html = '<li class="active active">
Sample Page</li>';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//li");
for($i=0; $i < $nodelist->length; $i++) {
$node = $nodelist->item($i);
$tok = explode(' ', $node->getAttribute('class'));
$tok = array_unique($tok);
$node->setAttribute('class', implode(' ', $tok));
}
$html = $doc->saveHTML();
echo $html;
OUTPUT:
<html><body>
<li class="active">Sample Page</li>
</body></html>
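The same idea can be wrapped in a reusable function. The following is only a sketch: the name dedupeClasses is made up for illustration, it targets every element with a class attribute (not just li), and it assumes the libxml flags LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD are available so that no html/body wrapper is added.

```php
<?php
// Hypothetical helper: collapse repeated class names on every element
// that carries a class attribute.
function dedupeClasses($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@class]') as $node) {
        // Split on any whitespace run so tabs and double spaces are handled too.
        $classes = preg_split('/\s+/', $node->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
        $node->setAttribute('class', implode(' ', array_unique($classes)));
    }
    return trim($doc->saveHTML());
}

echo dedupeClasses('<ul><li class="active  nav active">Sample Page</li></ul>');
```

array_unique keeps the first occurrence of each name, so the original order of the classes is preserved.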

With regex you could use a lookbehind and lookahead for finding duplicates:
$pattern = '/(?<=class=")(?:([-\w]+) (?=\1[ "]))+/i';
This would replace multiple instances of capture group 1 ([-\w]+) in a sequence.
$str = '<li class="active active">';
echo preg_replace($pattern, "", $str);
output:
<li class="active">
EDIT 08.04.2014
To remove duplicates that are not directly after the lookbehind (?<=class="):
The problem is that a lookbehind assertion can only be of fixed length, so something like (?<=class="[^"]*?) is not possible. As an alternative, \K could be used, which resets the beginning of the match. A pattern could be:
$pattern = '/class="[^"]*?\K(?<=[ "])(?:([-\w]+) (?=\1[ "]))+/i';
You could imagine everything before \K as a virtual lookbehind of variable length.
This regex, like the first one, would only replace multiple instances of one duplicate in a sequence.
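A short sketch of how the \K pattern behaves may help. The sample strings are made up for illustration; the pattern is the one given above.

```php
<?php
// \K discards everything matched so far, so only the duplicate run is replaced.
$pattern = '/class="[^"]*?\K(?<=[ "])(?:([-\w]+) (?=\1[ "]))+/i';

// A duplicate that does not sit directly after class=" is removed.
$adjacent = preg_replace($pattern, '', '<li class="nav active active">');

// Duplicates separated by another class are NOT adjacent, so nothing changes.
$separated = preg_replace($pattern, '', '<li class="a b a">');

echo $adjacent, "\n";
echo $separated, "\n";
```

The second call shows the limitation mentioned above: only a run of the same class name is collapsed.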
EDIT 11.09.2014
Finally, I think a single regex that would strip out all the different duplicates gets rather complex:
/(?>(?<=class=")|(?!^)\G)(?>\b([-\w]++)\b(?=[^"]*?\s\1[\s"])\s+|[-\w]+\s+\K)/
This one uses continuous matching as soon as class=" is found.
A simpler way using regex would be preg_replace_callback():
$html = '<li class="a1 a1 li li-home active li li active a1">';
$html = preg_replace_callback('/\sclass="\K[^"]+/', function ($m) {
return trim(implode(" ",array_unique(preg_split('~\s+~', $m[0]))));
}, $html);
Note that older PHP versions (before 5.3) don't support anonymous functions; if so, use a named function instead.
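For reference, here is a sketch of the same callback approach with a named function, which also runs where anonymous functions are unavailable. The function names uniqueClassList and dedupeClassAttributes are hypothetical, and the pattern is extended (my assumption, not part of the answer above) to accept single-quoted attributes as well.

```php
<?php
// Named callback, usable where anonymous functions are unavailable.
function uniqueClassList($m)
{
    // $m[0] holds only the attribute value, because \K dropped the prefix.
    $classes = preg_split('~\s+~', $m[0], -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_unique($classes));
}

function dedupeClassAttributes($html)
{
    // (["\']) remembers the opening quote; (?:(?!\1).)+ runs up to the matching one.
    return preg_replace_callback('/\sclass=(["\'])\K(?:(?!\1).)+/', 'uniqueClassList', $html);
}

echo dedupeClassAttributes('<li class="a1 a1 li li-home active li li active a1">'), "\n";
echo dedupeClassAttributes("<li class='nav  nav active'>"), "\n";
```

Unlike the lookaround patterns, this removes every duplicate regardless of position, because the filtering happens in PHP rather than in the regex.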

One way would be to collect the class names into an array and filter out the duplicates. Here is how it can be done:
<?php
$htmlString = '<li class="active active">Sample Page</li>';
// Capture the contents of the class attribute in group 1.
if (preg_match('/class="([A-Za-z0-9 ]+)"/', $htmlString, $result)) {
    $classes = array_unique(explode(" ", $result[1]));
    echo "<li class=\"".implode(" ",$classes)."\">Sample Page</li>";
}
?>

Related

Validation like Strip_tags for HTML Attributes [duplicate]

I have this html code:
<p style="padding:0px;">
<strong style="padding:0;margin:0;">hello</strong>
</p>
How can I remove attributes from all tags? I'd like it to look like this:
<p>
<strong>hello</strong>
</p>
Adapted from my answer on a similar question
$text = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p>';
echo preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si",'<$1$2>', $text);
// <p><strong>hello</strong></p>
The RegExp broken down:
/ # Start Pattern
< # Match '<' at beginning of tags
( # Start Capture Group $1 - Tag Name
[a-z] # Match 'a' through 'z'
[a-z0-9]* # Match 'a' through 'z' or '0' through '9' zero or more times
) # End Capture Group
[^>]*? # Match anything other than '>', zero or more times, not greedy (won't eat the /)
(\/?) # Capture Group $2 - '/' if it is there
> # Match '>'
/is # End Pattern - Case Insensitive (i) & dot also matches newlines (s)
Add some quoting and use the replacement text <$1$2>; it should strip any text after the tag name up to the end of the tag (/> or just >).
Please note: this isn't necessarily going to work on ALL input, as the anti-HTML-plus-RegExp crowd will tell you. There are a few pitfalls, most notably <p style=">"> would end up as <p>">, plus a few other broken cases. I would recommend looking at Zend_Filter_StripTags as a more foolproof tag/attribute filter in PHP.
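The pitfall is easy to reproduce. The sketch below runs the pattern from this answer on a made-up input with '>' inside an attribute value, and compares it with a parser-based removal (DOMDocument with the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags, assumed to be available).

```php
<?php
$pattern = "/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si";
$input = '<p style=">">text</p>';

// Regex: stops at the first '>' it sees, which sits inside the attribute value.
$viaRegex = preg_replace($pattern, '<$1$2>', $input);

// Parser: knows where the attribute value ends.
$dom = new DOMDocument();
$dom->loadHTML($input, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($dom->getElementsByTagName('p') as $p) {
    $p->removeAttribute('style');
}
$viaDom = trim($dom->saveHTML());

echo $viaRegex, "\n";
echo $viaDom, "\n";
```

The regex result keeps the stray "> from the attribute value, while the parser result is the clean tag.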
Here is how to do it with native DOM:
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($html); // load HTML into it
$xpath = new DOMXPath($dom); // create a new XPath
$nodes = $xpath->query('//*[@style]'); // Find elements with a style attribute
foreach ($nodes as $node) { // Iterate over found elements
$node->removeAttribute('style'); // Remove style attribute
}
echo $dom->saveHTML(); // output cleaned HTML
If you want to remove all possible attributes from all possible tags, do
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*');
foreach ($nodes as $node) {
$node->parentNode->removeAttribute($node->nodeName);
}
echo $dom->saveHTML();
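If some attributes should survive (closer to how strip_tags takes a list of allowed tags), the same XPath query can be combined with a whitelist. This is a sketch only: the function name stripAttributesExcept and the sample markup are made up, and it assumes the libxml flags used here are available.

```php
<?php
// Hypothetical helper: drop every attribute whose name is not in $allowed.
function stripAttributesExcept($html, array $allowed = array())
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // '//@*' selects every attribute node in the document.
    foreach ($xpath->query('//@*') as $attr) {
        if (!in_array(strtolower($attr->nodeName), $allowed, true)) {
            $attr->ownerElement->removeAttributeNode($attr);
        }
    }
    return trim($dom->saveHTML());
}

echo stripAttributesExcept(
    '<p style="padding:0px;"><a href="page.html" onclick="track()">hello</a></p>',
    array('href')
);
```

Calling it with an empty whitelist behaves like the snippet above and removes all attributes.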
I would avoid using regex as HTML is not a regular language and instead use a html parser like Simple HTML DOM
You can get a list of attributes that the object has by using attr. For example:
$html = str_get_html('<div id="hello">World</div>');
var_dump($html->find("div", 0)->attr);
/*
array(1) {
["id"]=>
string(5) "hello"
}
*/
foreach ( $html->find("div", 0)->attr as &$value ){
$value = null;
}
print $html;
//<div>World</div>
$html_text = '<p>Hello <b onclick="alert(123)" style="color: red">world</b>. <i>Its beautiful day.</i></p>';
$strip_text = strip_tags($html_text, '<b>');
$result = preg_replace('/<(\w+)[^>]*>/', '<$1>', $strip_text);
echo $result;
// Result
string 'Hello <b>world</b>. Its beautiful day.'
Another way to do it using PHP's DOMDocument class (without XPath) is to iterate over the attributes on a given node. Please note that, due to the way PHP handles the DOMNamedNodeMap class, you must iterate backward over the collection if you plan on altering it. This behaviour has been discussed elsewhere and is also noted in the documentation comments. The same applies to the DOMNodeList class when it comes to removing or adding elements. To be on the safe side, I always iterate backwards with these objects.
Here is a simple example:
function scrubAttributes($html) {
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
for ($els = $dom->getElementsByTagname('*'), $i = $els->length - 1; $i >= 0; $i--) {
for ($attrs = $els->item($i)->attributes, $ii = $attrs->length - 1; $ii >= 0; $ii--) {
$els->item($i)->removeAttribute($attrs->item($ii)->name);
}
}
return $dom->saveHTML();
}
Here's a demo: https://3v4l.org/M2ing
Optimized regular expression from the top rated answer on this issue:
$text = '<div width="5px">a is less than b: a<b, ya know?</div>';
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);
// <div>a is less than b: a<b, ya know?</div>
UPDATE:
It works better when only some tags are allowed with PHP's strip_tags() function. Let's say we want to allow only <br>, <b> and <i> tags:
$text = '<i style=">">Italic</i>';
$text = strip_tags($text, '<br><b><i>');
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);
//<i>Italic</i>
As we can see it fixes flaws connected with tag symbols in attribute values.
Regexes are too fragile for HTML parsing. In your example, the following would strip out your attributes:
echo preg_replace(
"|<(\w+)([^>/]+)?|",
"<$1",
"<p style=\"padding:0px;\">\n<strong style=\"padding:0;margin:0;\">hello</strong>\n</p>\n"
);
Update
Make the second capture optional and do not strip '/' from closing tags:
|<(\w+)([^>]+)| to |<(\w+)([^>/]+)?|
Demonstrate this regular expression works:
$ phpsh
Starting php
type 'h' or 'help' to see instructions & features
php> $html = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello<br/></strong></p>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<p><strong>hello<br/></strong></p>
php> $html = '<strong>hello</strong>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<strong>hello</strong>
Hope this helps. It may not be the fastest way to do it, especially for large blocks of html.
If anyone has any suggestions as to make this faster, let me know.
function StringEx($str, $start, $end)
{
$str_low = strtolower($str);
$pos_start = strpos($str_low, $start);
$pos_end = strpos($str_low, $end, ($pos_start + strlen($start)));
if($pos_end==0) return false;
if ( ($pos_start !== false) && ($pos_end !== false) )
{
$pos1 = $pos_start + strlen($start);
$pos2 = $pos_end - $pos1;
$RData = substr($str, $pos1, $pos2);
if($RData=='') { return true; }
return $RData;
}
return false;
}
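$S = '<';
$E = '>';
while ($RData = StringEx($DATA, $S, $E)) {
    // StringEx returns true (not a string) for an empty "<>"; compare strictly,
    // because any non-empty string is loosely equal to true.
    if ($RData === true) { $RData = ''; }
    $DATA = str_ireplace($S.$RData.$E, '||||||', $DATA);
}
$DATA = str_ireplace('||||||', $S.$E, $DATA);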
To do SPECIFICALLY what andufo wants, it's simply:
$html = preg_replace( "#(<[a-zA-Z0-9]+)[^\>]+>#", "\\1>", $html );
That is, he wants to strip anything but the tag name out of the opening tag. It won't work for self-closing tags of course.
Here's an easy way to get rid of attributes. It handles malformed html pretty well.
<?php
$string = '<p style="padding:0px;">
<strong style="padding:0;margin:0;">hello</strong>
</p>';
//get all html elements on a line by themselves
$string_html_on_lines = str_replace (array("<",">"),array("\n<",">\n"),$string);
//find lines starting with a '<' and any letters or numbers up to the first space. throw everything after the space away.
$string_attribute_free = preg_replace("/\n(<[\w123456]+)\s.+/i","\n$1>",$string_html_on_lines);
echo $string_attribute_free;
?>

PHP regex - Find the highest value

I need to find the highest number in a string like this:
Example
<div id='pages'>
<a href='pages.php?start=0&end=20'>Page 1</a>
<a href='pages.php?start=20&end=40'>Page 2</a>
<a href='pages.php?start=40&end=60'>Page 3</a>
<a href='pages.php?start=60&end=80'>Page 4</a>
<a href='pages.php?start=80&end=89'>Page 5</a>
</div>
In this example, I should get 89, because it's the highest number among the "end" values.
I think I should use regex, but I don't know how :(
Any help would be very appreciated!
You shouldn't be doing this with a regex. In fact, I don't even know how you would. You should be using an HTML parser, parsing out the end parameter from each <a> tag's href attribute with parse_str(), and then finding the max() of them, like this:
$doc = new DOMDocument;
$doc->loadHTML( $str); // All & should be encoded as &amp;
$xpath = new DOMXPath( $doc);
$end_vals = array();
foreach( $xpath->query( '//div[@id="pages"]/a') as $a) {
parse_str( $a->getAttribute( 'href'), $params);
$end_vals[] = $params['end'];
}
echo max( $end_vals);
The above will print 89, as seen in this demo.
Note that this assumes your HTML entities are properly escaped, otherwise DOMDocument will issue a warning.
One optimization you can do is instead of keeping an array of end values, just compare the max value seen with the current value. However this will only be useful if the number of <a> tags grows larger.
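That optimization could look roughly like the sketch below. The function name highestEnd is made up, the sample markup is a shortened and reordered version of the question's, and parse_url() is used to isolate the query string before parse_str(), which is a small deviation from the code above.

```php
<?php
// Hypothetical helper: track the largest "end" value while iterating,
// instead of collecting all values in an array first.
function highestEnd($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $max = null;
    foreach ($xpath->query('//div[@id="pages"]/a') as $a) {
        // Keep only the part after '?', then split it into parameters.
        $query = parse_url($a->getAttribute('href'), PHP_URL_QUERY);
        parse_str((string) $query, $params);
        if (isset($params['end']) && ($max === null || (int) $params['end'] > $max)) {
            $max = (int) $params['end'];
        }
    }
    return $max; // null when no link carries an "end" parameter
}

$str = "<div id='pages'>
<a href='pages.php?start=0&amp;end=20'>Page 1</a>
<a href='pages.php?start=80&amp;end=89'>Page 5</a>
<a href='pages.php?start=20&amp;end=40'>Page 2</a>
</div>";
echo highestEnd($str);
```

Because every link is compared, this version does not depend on the highest value being in the last link.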
Edit: As DaveRandom points out, if we can make the assumption that the <a> tag that holds the highest end value is the last <a> tag in this list, simply due to how paginated links are presented, then we don't need to iterate or keep a list of other end values, as shown in the following example.
$doc = new DOMDocument;
$doc->loadHTML( $str);
$xpath = new DOMXPath( $doc);
parse_str( $xpath->evaluate( 'string(//div[@id="pages"]/a[last()]/@href)'), $params);
echo $params['end'];
To find the highest number in the entire string, regardless of position, you can use
preg_split — Split string by a regular expression
max — Find highest value
Example (demo)
echo max(preg_split('/\D+/', $html, -1, PREG_SPLIT_NO_EMPTY)); // prints 89
This works by splitting the string by anything that is not a number, leaving you with an array containing all the numbers in the string and then fetching the highest number from that array.
First extract all the numbers from the links, then apply the max function:
$str = "<div id='pages'>
<a href='pages.php?start=0&end=20'>Page 1</a>
<a href='pages.php?start=20&end=40'>Page 2</a>
<a href='pages.php?start=40&end=60'>Page 3</a>
<a href='pages.php?start=60&end=80'>Page 4</a>
<a href='pages.php?start=80&end=89'>Page 5</a>
</div>";
if(preg_match_all("/href=['][^']+end=([0-9]+)[']/i", $str, $matches))
{
$maxVal = max($matches[1]);
echo $maxVal;
}
function getHighest($html) {
$my_document = new DOMDocument();
$my_document->loadHTML($html);
$nodes = $my_document->getElementsByTagName('a');
$numbers = array();
foreach ($nodes as $node) {
if (preg_match('/\d+$/', $node->getAttribute('href'), $match) == 1) {
$numbers[]= intval($match[0]);
}
}
return max($numbers);
}

How to remove link with preg_replace();?

I'm not sure how to explain this, so I'll show it on my code.
First and
<a class="delete" href="url2.html">Second</a> and
Third
How can I delete the opening <a> and closing </a> tags but not the rest?
I'm asking about preg_replace() specifically; I'm not looking for DOMDocument or other methods. I just want to see an example using preg_replace().
How is it achievable?
Only pick the groups you want to preserve:
$pattern = '~(<a [^>]*class="delete"[^>]*>)([^<]*)(</a>)~';
//            1                             2      3
$result = preg_replace($pattern, '$2', $subject);
You find more examples on the preg_replace manual page.
Since you asked me in the comments to show any method of doing this, here it is.
$html =<<<HTML
First and
<a class="delete" href="url2.html">Second</a> and
Third
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$elems = $xpath->query("//a[@class='delete']");
foreach ($elems as $elem) {
$elem->parentNode->removeChild($elem);
}
echo $dom->saveHTML();
Note that saveHTML() saves a complete document even if you only parsed a fragment.
As of PHP 5.3.6 you can pass a $node argument to saveHTML() to specify the fragment it should return; something like $xpath->query("/*/body")->item(0) would work.
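A minimal sketch of the difference, using a made-up fragment. It serialises each child of body separately so the body tag itself is not included in the result.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<p>Hello <b>world</b></p>');

// Without an argument, saveHTML() serialises the whole document, wrappers included.
$full = $dom->saveHTML();

// With a node argument, only that node is serialised; here each child of <body>.
$body = $dom->getElementsByTagName('body')->item(0);
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $dom->saveHTML($child);
}
echo $fragment;
```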
$pattern = '/<a (.*?)href=[\"\'](.*?)\/\/(.*?)[\"\'](.*?)>(.*?)<\/a>/i';
$new_content = preg_replace($pattern, '$5', $content);
$pattern = '/<a[^<>]*?class="delete"[^<>]*?>(.*?)<\/a>/';
$test = 'First and Second and Third';
echo preg_replace($pattern, '$1', $test)."\n";
$test = 'First and <b class="delete">seriously</b> and Third';
echo preg_replace($pattern, '$1', $test)."\n";
$test = 'First and <b class="delete">seriously</b> and Third';
echo preg_replace($pattern, '$1', $test)."\n";
$test = 'First and <a class="delete" href="url2.html">Second</a> and Third';
echo preg_replace($pattern, '$1', $test)."\n";
preg_replace('#<a class="delete" href="[^"]*">(.+?)</a>#', '$1', $html_string);
It is important to understand this is not an ideal solution. First, it requires markup in this exact format. Second, if there were, say, a nested anchor tag (albeit unlikely) this would fail. These are some of the many reasons why Regular Expressions should not be used for parsing/manipulating HTML.
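For comparison, a parser-based sketch that removes the opening and closing tags but keeps the link text. Everything here is an assumption for illustration: the function name unwrapLinks, the temporary div wrapper used to parse a fragment, and a $class value that contains no quote characters.

```php
<?php
// Hypothetical helper: replace each <a> carrying $class by its own children.
function unwrapLinks($html, $class)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    // Wrap the fragment so the document has exactly one root element.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Token-wise class test, so "delete" does not match "undelete".
    $query = sprintf('//a[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $class);
    foreach ($xpath->query($query) as $a) {
        // Move the children out in front of the link, then drop the empty link.
        while ($a->firstChild) {
            $a->parentNode->insertBefore($a->firstChild, $a);
        }
        $a->parentNode->removeChild($a);
    }

    // Serialise the children of the wrapper, leaving the wrapper itself out.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo unwrapLinks('First and <a class="delete" href="url2.html">Second</a> and Third', 'delete');
```

The attribute order and any extra attributes on the link no longer matter, which is exactly where the regex versions are brittle.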

Regex match full hyperlink only with certain class

I have a string that contains some hyperlinks. I want to match only a certain link among them with a regex. I can't know whether the href or the class comes first; it may vary.
For example, this is such a string:
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
I want to select from the above string only the link that has the class nextpostslink.
So, the match in this example should return this -
»eee
This regex is the most close I could get -
/<a\s?(href=)?('|")(.*)('|") class=('|")nextpostslink('|")>.{1,6}<\/a>/
But it is selecting the links from the start of the string.
I think my problem is in the (.*) , but I can't figure out how to change this to select only the needed link.
I would appreciate your help.
It's much better to use a genuine HTML parser for this. Abandon all attempts to use regular expressions on HTML.
Use PHP's DOMDocument instead:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
foreach ($dom->getElementsByTagName('a') as $link) {
$classes = explode(' ', $link->getAttribute('class'));
if (in_array('nextpostslink', $classes)) {
// $link has the class "nextpostslink"
}
}
Not sure if that's what you're after, but anyway: it's a bad idea to parse HTML with regex. Use an XPath implementation to reach the desired elements. The following XPath expression would give you all the 'a' elements with class "nextpostslink":
//a[contains(@class,"nextpostslink")]
There is loads of XPath info around. Since you didn't mention your programming language, here is a quick XPath tutorial using Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html
Edit:
php + xpath + html: http://dev.juokaz.com/php/web-scraping-with-php-and-xpath
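In PHP, the XPath approach might be sketched as follows. The function name findLinksByClass and the sample markup are made up for illustration. Note that the sketch swaps the plain contains(@class, ...) test for a token-wise one, so that a class such as "notnextpostslink" is not matched by accident.

```php
<?php
// Hypothetical helper: return href and text of every <a> carrying $class.
function findLinksByClass($html, $class)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Pad the class list with spaces and look for " name " as a whole token.
    $query = sprintf('//a[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $class);

    $links = array();
    foreach ($xpath->query($query) as $a) {
        $links[] = array('href' => $a->getAttribute('href'), 'text' => $a->textContent);
    }
    return $links;
}

$html = "<div class='wp-pagenavi'>
<a href='/page/2' class='page'>2</a>
<a class='nextpostslink big' href='/page/2'>next</a>
<a class=\"notnextpostslink\" href='/page/9'>other</a>
</div>";
print_r(findLinksByClass($html, 'nextpostslink'));
```

The attribute order and the quote style are irrelevant here, which is the part the regex attempts struggle with.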
This would work in php:
/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m
This is of course assuming that the class attribute always comes after the href attribute.
This is a code snippet:
$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;
$regexp = "/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m";
$matches = array();
if(preg_match($regexp, $html, $matches)) {
echo "URL: " . $matches[2] . "\n";
echo "Text: " . $matches[6] . "\n";
}
I would however suggest first matching the link and then getting the url so that the order of the attributes doesn't matter:
<?php
$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;
$regexp = "/(<a[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>)/m";
$matches = array();
if(preg_match($regexp, $html, $matches)) {
$link = $matches[0];
$text = $matches[4];
$regexp = "/href=(\"|')([^'\"]*)(\"|')/";
$matches = array();
if(preg_match($regexp, $link, $matches)) { // search only inside the matched link, not the whole document
$url = $matches[2];
echo "URL: $url\n";
echo "Text: $text\n";
}
}
You could of course extend the regexp to match either of the two variants (class first vs. href first), but it would be very long and I don't think it would improve performance.
Just as a proof of concept I created a regexp that doesn't care about the order:
/<a[^>]+(href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')|class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')[^>]+href=(\"|')([^\"']*)('|\"))[^>]*>(.{1,6})<\/a>/m
The text will be in group 12 and the URL will be in either group 3 or group 10 depending on the order.
As the question is how to get it by regex, here is how: <a\s[^>]*class=["|']nextpostslink["|'][^>]*>(.*)<\/a>
The order of the attributes doesn't matter, and it also handles single or double quotes.
Check the regex online: https://regex101.com/r/DX03KD/1/
I replaced the (.*) with [^'"]+ as follows:
<a\s*(href=)?('|")[^'"]+('|") class=('|")nextpostslink('|")>.{1,6}</a>
Note: I tried this with RegEx Buddy, so I didn't need to escape the <>'s or /
