I need to strip the span tags of class tip.
That means <span class="tip">, the corresponding </span>, and everything in between.
I suspect a regular expression is needed, but I'm terrible at writing them.
Laugh...
<?php
$string = 'April 15, 2003';
$pattern = '/(\w+) (\d+), (\d+)/i';
$replacement = '${1}1,$3';
echo preg_replace($pattern, $replacement, $string);
?>
Gives no error... But
<?php
$str = preg_replace('<span class="tip">.+</span>', "", '<span class="rss-title"></span><span class="rss-link">linkylink</span><span class="rss-id"></span><span class="rss-content"></span><span class=\"rss-newpost\"></span>');
echo $str;
?>
Gives me the error:
Warning: preg_replace() [function.preg-replace]: Unknown modifier '.' in <A FILE> on line 4
Previously the error was at the ); on the second line, but now it is this one. >.>
This is the "proper" method (adapted from this answer).
Input:
<?php
$str = '<div>lol wut <span class="tip">remove!</span><span>don\'t remove!</span></div>';
?>
Code:
<?php
function recurse($doc, $parent) {
    if (!$parent->hasChildNodes())
        return;
    for ($i = 0; $i < $parent->childNodes->length; ) {
        $elm = $parent->childNodes->item($i);
        // Only elements carry attributes; getAttribute() returns '' when
        // the span has no class, so a plain <span> no longer triggers an error
        if ($elm instanceof DOMElement && $elm->nodeName == "span"
                && $elm->getAttribute("class") == "tip") {
            // The following siblings shift down by one, so $i stays put
            $parent->removeChild($elm);
            continue;
        }
        recurse($doc, $elm);
        $i++;
    }
}
// Load in the DOM (remembering that XML requires one root node)
$doc = new DOMDocument();
$doc->loadXML("<document>" . $str . "</document>");
// Iterate the DOM
recurse($doc, $doc->documentElement);
// Output the result
foreach ($doc->childNodes->item(0)->childNodes as $node) {
echo $doc->saveXML($node);
}
?>
Output:
<div>lol wut <span>don't remove!</span></div>
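For comparison, the same removal can be written with an XPath query instead of a hand-rolled recursion. This is only a sketch under the same assumption as the code above, namely that the fragment is well-formed XML. The function name removeTipSpans is made up for this example, and the query matches only spans whose class attribute is exactly "tip".

```php
<?php
// Hypothetical helper: removes every <span class="tip"> (and its contents)
// from a well-formed XML/XHTML fragment.
function removeTipSpans(string $fragment): string
{
    $doc = new DOMDocument();
    // XML requires a single root node, so wrap the fragment
    $doc->loadXML('<document>' . $fragment . '</document>');

    $xpath = new DOMXPath($doc);
    // Take a snapshot of the matches before removing anything
    $tips = iterator_to_array($xpath->query('//span[@class="tip"]'), false);
    foreach ($tips as $span) {
        $span->parentNode->removeChild($span);
    }

    // Serialize the children of the wrapper, not the wrapper itself
    $out = '';
    foreach ($doc->documentElement->childNodes as $node) {
        $out .= $doc->saveXML($node);
    }
    return $out;
}

echo removeTipSpans('<div>lol wut <span class="tip">remove!</span><span>don\'t remove!</span></div>');
```

If the class attribute can hold several class names, the XPath expression would need to be adapted; that case is not covered here.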
A simple regular expression like:
<span class="tip">.+</span>
won't work. The issue is that if another span is opened and closed inside the tip span, the match ends at the wrong </span>: a lazy .+? stops at the inner span's closing tag, while the greedy .+ shown here runs on to the last </span> in the whole string. DOM-based tools like the one linked in the comments will provide a more reliable answer.
As per my comment below, you need to add pattern delimiters when working with regular expressions in PHP.
<?php
$str = preg_replace('~<span class="tip">.+</span>~', "", '<span class="rss-title"></span><span class="rss-link">linkylink</span><span class="rss-id"></span><span class="rss-content"></span><span class=\"rss-newpost\"></span>');
echo $str;
?>
may be moderately more successful. Please take a look at the documentation page for the function in question.
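To make the nested-span problem concrete, here is a small demonstration. The sample string is made up for this illustration; the snippet only shows how the greedy and lazy quantifiers behave and is not meant as a recommended solution.

```php
<?php
// Made-up sample: a tip span containing another span, followed by a span
// that should survive.
$html = 'a <span class="tip">x <span>y</span> z</span> b <span>keep</span> c';

// Greedy: .+ runs to the LAST </span> in the string, so "keep" is lost too.
$greedy = preg_replace('~<span class="tip">.+</span>~s', '', $html);

// Lazy: .+? stops at the FIRST </span>, which belongs to the inner span.
$lazy = preg_replace('~<span class="tip">.+?</span>~s', '', $html);

echo $greedy, "\n"; // a  c
echo $lazy, "\n";   // a  z</span> b <span>keep</span> c
```

Neither result is what was asked for, which is why the parser-based answers are the safer route once nesting is possible.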
Now without regexp, and without heavy XML parsing:
$html = ' ... <span class="tip"> hello <span id="x"> man </span> </span> ... ';
$tag = '<span class="tip">';
$tag_close = '</span>';
$tag_familly = '<span';
$tag_len = strlen($tag);
$p1 = -1;
$p2 = 0;
while ( ($p2!==false) && (($p1=strpos($html, $tag, $p1+1))!==false) ) {
// the tag is found, now we will search for its corresponding closing tag
$level = 1;
$p2 = $p1;
$continue = true;
while ($continue) {
$p2 = strpos($html, $tag_close, $p2+1);
if ($p2===false) {
// error in the html contents, the analysis cannot continue
echo "ERROR in html contents";
$continue = false;
$p2 = false; // will stop the loop
} else {
$level = $level -1;
$x = substr($html, $p1+$tag_len, $p2-$p1-$tag_len);
$n = substr_count($x, $tag_familly);
if ($level+$n<=0) $continue = false;
}
}
if ($p2!==false) {
// delete the pair of tags, the farthest one first
$html = substr_replace($html, '', $p2, strlen($tag_close));
$html = substr_replace($html, '', $p1, $tag_len);
}
}
How can I use php to strip all/any attributes from a tag, say a paragraph tag?
<p class="one" otherrandomattribute="two"> to <p>
Although there are better ways, you could actually strip arguments from html tags with a regular expression:
<?php
function stripArgumentFromTags( $htmlString ) {
$regEx = '/([^<]*<\s*[a-z](?:[0-9]|[a-z]{0,9}))(?:(?:\s*[a-z\-]{2,14}\s*=\s*(?:"[^"]*"|\'[^\']*\'))*)(\s*\/?>[^<]*)/i'; // match any start tag
$chunks = preg_split($regEx, $htmlString, -1, PREG_SPLIT_DELIM_CAPTURE);
$chunkCount = count($chunks);
$strippedString = '';
for ($n = 1; $n < $chunkCount; $n++) {
$strippedString .= $chunks[$n];
}
return $strippedString;
}
?>
The above could probably be written in fewer characters, but it does the job (quick and dirty).
Strip attributes using SimpleXML (Standard in PHP5)
<?php
// define allowable tags
$allowable_tags = '<p><a><img><ul><ol><li><table><thead><tbody><tr><th><td>';
// define allowable attributes
$allowable_atts = array('href','src','alt');
// strip collector
$strip_arr = array();
// load XHTML with SimpleXML
$data_sxml = simplexml_load_string('<root>'. $data_str .'</root>', 'SimpleXMLElement', LIBXML_NOERROR | LIBXML_NOXMLDECL);
if ($data_sxml ) {
// loop all elements with an attribute
foreach ($data_sxml->xpath('descendant::*[@*]') as $tag) {
// loop attributes
foreach ($tag->attributes() as $name=>$value) {
// check for allowable attributes
if (!in_array($name, $allowable_atts)) {
// set attribute value to empty string
$tag->attributes()->$name = '';
// collect attribute patterns to be stripped
$strip_arr[$name] = '/ '. $name .'=""/';
}
}
}
}
// strip unallowed attributes and root tag
$data_str = strip_tags(preg_replace($strip_arr,array(''),$data_sxml->asXML()), $allowable_tags);
?>
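A variant of the same idea using the DOM extension directly, so that no regular expression is needed to clean up afterwards. This is a sketch only: stripAllAttributes is a hypothetical name, and it assumes the input is a well-formed XHTML fragment, just like the SimpleXML version above.

```php
<?php
// Hypothetical helper: removes every attribute except the ones listed in
// $keep from all elements of a well-formed XHTML fragment.
function stripAllAttributes(string $xhtml, array $keep = []): string
{
    $doc = new DOMDocument();
    $doc->loadXML('<root>' . $xhtml . '</root>');

    foreach ($doc->getElementsByTagName('*') as $element) {
        // Collect the names first: the attribute map changes as we remove
        $names = [];
        foreach ($element->attributes as $attribute) {
            $names[] = $attribute->nodeName;
        }
        foreach ($names as $name) {
            if (!in_array($name, $keep, true)) {
                $element->removeAttribute($name);
            }
        }
    }

    // Return the fragment without the temporary <root> wrapper
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}

echo stripAllAttributes('<p class="one" otherrandomattribute="two">text</p>'); // <p>text</p>
```

Passing array('href') as the second argument would keep link targets while dropping everything else.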
Here is one function that will let you strip all attributes except ones you want:
function stripAttributes($s, $allowedattr = array()) {
if (preg_match_all("/<[^>]*\\s([^>]*)\\/*>/msiU", $s, $res, PREG_SET_ORDER)) {
foreach ($res as $r) {
$tag = $r[0];
$attrs = array();
preg_match_all("/\\s.*=(['\"]).*\\1/msiU", " " . $r[1], $split, PREG_SET_ORDER);
foreach ($split as $spl) {
$attrs[] = $spl[0];
}
$newattrs = array();
foreach ($attrs as $a) {
$tmp = explode("=", $a);
if (trim($a) != "" && (!isset($tmp[1]) || (trim($tmp[0]) != "" && !in_array(strtolower(trim($tmp[0])), $allowedattr)))) {
} else {
$newattrs[] = $a;
}
}
$attrs = implode(" ", $newattrs);
$rpl = str_replace($r[1], $attrs, $tag);
$s = str_replace($tag, $rpl, $s);
}
}
return $s;
}
For example:
echo stripAttributes('<p class="one" otherrandomattribute="two">');
or, if you want to keep the class attribute:
echo stripAttributes('<p class="one" otherrandomattribute="two">', array('class'));
Or, say you are sending a message to an inbox and the message was composed with CKEditor. You can run the submitted text through the function and assign the result to $message before sending. Note that stripAttributes() removes the unnecessary attributes, not the tags themselves. I tried it and it works fine: only the formatting I had added, such as bold, was left.
$message = stripAttributes($_POST['message']);
You can also echo $message; for a preview.
I honestly think that the only sane way to do this is to use a tag and attribute whitelist with the HTML Purifier library. Example script here:
<html><body>
<?php
require_once '../includes/htmlpurifier-4.5.0-lite/library/HTMLPurifier/Bootstrap.php';
spl_autoload_register(array('HTMLPurifier_Bootstrap', 'autoload'));
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,b,a[href],i,br,img[src]');
$config->set('URI.Base', 'http://www.example.com');
$config->set('URI.MakeAbsolute', true);
$purifier = new HTMLPurifier($config);
$dirty_html = "
<a href=\"http://www.google.de\">broken a href link</a
fnord
<x>y</z>
<b>c</p>
<script>alert(\"foo!\");</script>
Anzahl besuchter Seiten
<img src=\"www.example.com/bla.gif\" />
<a href=\"http://www.google.de\">missing end tag
ende
";
$clean_html = $purifier->purify($dirty_html);
print "<h1>dirty</h1>";
print "<pre>" . htmlentities($dirty_html) . "</pre>";
print "<h1>clean</h1>";
print "<pre>" . htmlentities($clean_html) . "</pre>";
?>
</body></html>
This yields the following clean, standards-conforming HTML fragment:
broken a href linkfnord
y
<b>c
<a>Anzahl besuchter Seiten</a>
<img src="http://www.example.com/www.example.com/bla.gif" alt="bla.gif" /><a href="http://www.google.de">missing end tag
ende
</a></b>
In your case the whitelist would be:
$config->set('HTML.Allowed', 'p');
HTML Purifier is one of the better tools for sanitizing HTML with PHP.
You might also look into HTML Purifier. True, it's quite bloated and might not fit your needs if this specific example is your only concern, but it offers more or less 'bulletproof' purification of possibly hostile HTML. You can also choose to allow or disallow certain attributes (it's highly configurable).
http://htmlpurifier.org/
I have a huge HTML code in a PHP variable like :
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
I want to display only the first 500 characters of this code. The character count must consider only the text inside the HTML tags and exclude the HTML tags and attributes themselves when measuring the length.
While trimming the code, it should not break the DOM structure of the HTML.
Is there any tutorial or working example available?
If it's only the text you want, you can also do this:
substr(strip_tags($html_code),0,500);
I can't give you the exact code off the top of my head, but you want to load the text you've got into a DOMDocument
http://www.php.net/manual/en/class.domdocument.php
then grab the text from the entire document node (as a DOMNode http://www.php.net/manual/en/class.domnode.php)
This won't be exactly right, but hopefully this will steer you onto the right track.
Try something like:
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$dom = new DOMDocument();
$dom->loadHTML($html_code);
$text_to_strip = $dom->textContent;
$stripped = mb_substr($text_to_strip,0,500);
echo "$stripped"; // The Sameple text.Another sample text.....
edit ok... that should work. just tested locally
edit2
Now that I understand you want to keep the tags but limit the text, let's see. You're going to want to loop over the content until you get to 500 characters. This is probably going to take a few edits and passes for me to get right, but hopefully I can help. (Sorry I can't give it my undivided attention.)
First case is when the text is less than 500 characters. Nothing to worry about. Starting with the above code we can do the following.
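if (mb_strlen($text_to_strip) > 500) {
    // this is where we do our work.
    $characters_so_far = 0;
    // loadHTML() wraps the fragment in <html><body>, so start from <body>
    $body = $dom->getElementsByTagName('body')->item(0);
    // copy the live node list so that removing nodes doesn't disturb the loop
    foreach (iterator_to_array($body->childNodes) as $ChildNode) {
        // should check if $ChildNode->hasChildNodes();
        // probably put some of this stuff into a function
        $characters_in_next_node = mb_strlen($ChildNode->textContent);
        if ($characters_so_far + $characters_in_next_node > 500) {
            // remove the node that takes us over the limit
            $ChildNode->parentNode->removeChild($ChildNode);
        }
        $characters_so_far += $characters_in_next_node;
    }
    // note: this prints the whole document, including doctype/html/body
    $final_out = $dom->saveHTML();
} else {
    $final_out = $html_code;
}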
I'm pasting below a PHP class I wrote a long time ago, but I know it works. It's not exactly what you're after, as it deals with words instead of a character count, but I figure it's pretty close and someone might find it useful.
class HtmlWordManipulator
{
var $stack = array();
function truncate($text, $num=50)
{
if (preg_match_all('/\s+/', $text, $junk) <= $num) return $text;
$text = preg_replace_callback('/(<\/?[^>]+\s+[^>]*>)/', array($this, '_truncateProtect'), $text);
$words = 0;
$out = array();
$text = str_replace('<',' <',str_replace('>','> ',$text));
$toks = preg_split('/\s+/', $text);
foreach ($toks as $tok)
{
if (preg_match_all('/<(\/?[^\x01>]+)([^>]*)>/',$tok,$matches,PREG_SET_ORDER))
foreach ($matches as $tag) $this->_recordTag($tag[1], $tag[2]);
$out[] = trim($tok);
if (! preg_match('/^(<[^>]+>)+$/', $tok))
{
if (!strpos($tok,'=') && !strpos($tok,'<') && strlen(trim(strip_tags($tok))) > 0)
{
++$words;
}
else
{
/*
echo '<hr />';
echo htmlentities('failed: '.$tok).'<br /)>';
echo htmlentities('has equals: '.strpos($tok,'=')).'<br />';
echo htmlentities('has greater than: '.strpos($tok,'<')).'<br />';
echo htmlentities('strip tags: '.strip_tags($tok)).'<br />';
echo str_word_count($text);
*/
}
}
if ($words > $num) break;
}
$truncate = $this->_truncateRestore(implode(' ', $out));
return $truncate;
}
function restoreTags($text)
{
foreach ($this->stack as $tag) $text .= "</$tag>";
return $text;
}
private function _truncateProtect($match)
{
return preg_replace('/\s/', "\x01", $match[0]);
}
private function _truncateRestore($strings)
{
return preg_replace('/\x01/', ' ', $strings);
}
private function _recordTag($tag, $args)
{
// XHTML
if (strlen($args) and $args[strlen($args) - 1] == '/') return;
else if ($tag[0] == '/')
{
$tag = substr($tag, 1);
for ($i=count($this->stack) -1; $i >= 0; $i--) {
if ($this->stack[$i] == $tag) {
array_splice($this->stack, $i, 1);
return;
}
}
return;
}
else if (in_array($tag, array('p', 'li', 'ul', 'ol', 'div', 'span', 'a')))
$this->stack[] = $tag;
else return;
}
}
truncate is what you want: you pass it the HTML and the number of words you want it trimmed down to. It ignores HTML while counting words, but then rewraps everything in HTML, even closing the tags left open by the truncation.
Please don't judge me on the complete lack of OOP principles. I was young and stupid.
Edit:
So it turns out the usage is more like this:
$content = $manipulator->restoreTags($manipulator->truncate($myHtml,$numOfWords));
A stupid design decision, although it allowed me to inject HTML inside the unclosed tags.
I'm not up to coding a real solution, but if someone wants to, here's what I'd do (in pseudo-PHP):
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$aggregate = '';
$document = XMLParser($html_code);
foreach ($document->getElementsByTagName('*') as $element) {
$aggregate .= $element->text(); // This is the text, not HTML. It doesn't
// include the children, only the text
// directly in the tag.
}
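For what it's worth, the pseudo-code above can be turned into something runnable roughly as follows. Treat it as a sketch rather than a finished solution: truncateHtmlText is a hypothetical name, the wrapping <div> is just a trick to get the fragment back out of the document, and the input is assumed to be ASCII or entity-encoded (DOMDocument::loadHTML() does not assume UTF-8 by default). It walks the text nodes in document order, keeps text until the character budget is used up, and drops whatever follows.

```php
<?php
// Hypothetical helper: keeps at most $limit characters of text while
// leaving the surrounding tags intact.
function truncateHtmlText(string $html, int $limit): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings quiet
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $remaining = $limit;
    $walk = function (DOMNode $node) use (&$walk, &$remaining) {
        // Copy the child list first: nodes may be removed while looping
        foreach (iterator_to_array($node->childNodes, false) as $child) {
            if ($remaining <= 0) {
                $node->removeChild($child);     // budget used up
            } elseif ($child instanceof DOMText) {
                $length = mb_strlen($child->nodeValue);
                if ($length > $remaining) {
                    $child->nodeValue = mb_substr($child->nodeValue, 0, $remaining);
                }
                $remaining -= min($length, $remaining);
            } else {
                $walk($child);                  // descend into elements
            }
        }
    };

    // The first <div> in document order is the wrapper added above
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $walk($wrapper);

    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo truncateHtmlText('<div class="c">The sample text.</div><br><span>Another sample text.</span>', 20);
```

With a limit of 20 the first element keeps all 16 characters of its text and the span keeps only the first 4 of its own, while both elements stay properly closed. The serializer may insert line breaks between block elements, so compare results with that in mind.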
The function below is designed to apply rel="nofollow" attributes to all external links and no internal links unless the path matches a predefined root URL defined as $my_folder below.
So given the variables...
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
And the content...
internal
internal cloaked link
external
The end result, after replacement should be...
internal
internal cloaked link
external
Notice that the first link is not altered, since it's an internal link.
The link on the second line is also an internal link, but since it matches our $my_folder string, it gets the nofollow too.
The third link is the easiest: since it does not match the blog_url, it's obviously an external link.
However, in the script below, ALL of my links are getting nofollow. How can I fix the script to do what I want?
function save_rseo_nofollow($content) {
$my_folder = $rseo['nofollow_folder'];
$blog_url = get_bloginfo('url');
preg_match_all('~<a.*>~isU',$content["post_content"],$matches);
for ( $i = 0; $i <= sizeof($matches[0]); $i++){
if ( !preg_match( '~nofollow~is',$matches[0][$i])
&& (preg_match('~' . $my_folder . '~', $matches[0][$i])
|| !preg_match( '~'.$blog_url.'~',$matches[0][$i]))){
$result = trim($matches[0][$i],">");
$result .= ' rel="nofollow">';
$content["post_content"] = str_replace($matches[0][$i], $result, $content["post_content"]);
}
}
return $content;
}
Here is the DOMDocument solution...
$str = 'internal
internal cloaked link
external
external
external
external
';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML($str);
$a = $dom->getElementsByTagName('a');
$host = strtok($_SERVER['HTTP_HOST'], ':');
foreach($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
if (preg_match('/^https?:\/\/' . preg_quote($host, '/') . '/', $href)) {
continue;
}
$noFollowRel = 'nofollow';
$oldRelAtt = $anchor->attributes->getNamedItem('rel');
if ($oldRelAtt == NULL) {
$newRel = $noFollowRel;
} else {
$oldRel = $oldRelAtt->nodeValue;
$oldRel = explode(' ', $oldRel);
if (in_array($noFollowRel, $oldRel)) {
continue;
}
$oldRel[] = $noFollowRel;
$newRel = implode(' ', $oldRel);
}
$newRelAtt = $dom->createAttribute('rel');
$noFollowNode = $dom->createTextNode($newRel);
$newRelAtt->appendChild($noFollowNode);
$anchor->appendChild($newRelAtt);
}
var_dump($dom->saveHTML());
Output
string(509) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
internal
internal cloaked link
external
external
external
external
</body></html>
"
Try to make it more readable first, and only afterwards make your if rules more complex:
function save_rseo_nofollow($content) {
$content["post_content"] =
preg_replace_callback('~<(a\s[^>]+)>~isU', "cb2", $content["post_content"]);
return $content;
}
function cb2($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/hostgator"; // re-add quirky config here
$blog_url = "http://localhost/";
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
Gives following output:
[post_content] =>
internal
<a href="http://localhost/mytest/go/hostgator" rel=nofollow>internal cloaked link</a>
<a href="http://cnn.com" rel=nofollow>external</a>
The problem in your original code might have been $rseo which wasn't declared anywhere.
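To spell out why an undefined $rseo would make every link match: the folder pattern collapses to an empty regular expression, and an empty expression matches any string. The snippet below is a small illustration; the URL in it and the helper name tag_is_in_folder are made up for the example.

```php
<?php
// Inside save_rseo_nofollow() the array $rseo is never defined, so
// $rseo['nofollow_folder'] evaluates to null (PHP also emits a warning).
$my_folder = null;

// The pattern collapses to '~~', an empty regex, which matches any string.
$matches_everything = preg_match('~' . $my_folder . '~', '<a href="http://localhost/mytest/about">');
var_dump($matches_everything); // int(1)

// Hypothetical guard: only test the folder when it is actually set, and
// quote it so that characters such as "." are taken literally.
function tag_is_in_folder(?string $folder, string $tag): bool
{
    if ($folder === null || $folder === '') {
        return false;
    }
    return preg_match('~' . preg_quote($folder, '~') . '~', $tag) === 1;
}
```

Because the first alternative of the || in the original condition is then always true, every link that lacks a nofollow gets one.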
Try this one (PHP 5.3+). It can skip a selected address, and it leaves a manually set rel parameter alone.
The code:
function nofollow($html, $skip = null) {
return preg_replace_callback(
"#(<a[^>]+?)>#is", function ($mach) use ($skip) {
return (
!($skip && strpos($mach[1], $skip) !== false) &&
strpos($mach[1], 'rel=') === false
) ? $mach[1] . ' rel="nofollow">' : $mach[0];
},
$html
);
}
Examples:
echo nofollow('something');
// stays the same because it already contains a rel parameter
echo nofollow('something');
// adds the rel="nofollow" parameter to the anchor
echo nofollow('something', 'localhost');
// skips this link as an internal link
Using regular expressions to do this job properly would be quite complicated. It would be easier to use an actual parser, such as the one from the DOM extension. DOM isn't very beginner-friendly, so what you can do is load the HTML with DOM then run the modifications with SimpleXML. They're backed by the same library, so it's easy to use one with the other.
Here's what it can look like:
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
$html = '<html><body>
internal
internal cloaked link
external
</body></html>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$sxe = simplexml_import_dom($dom);
// grab all <a> nodes with an href attribute
foreach ($sxe->xpath('//a[@href]') as $a)
{
if (substr($a['href'], 0, strlen($blog_url)) === $blog_url
&& substr($a['href'], 0, strlen($my_folder)) !== $my_folder)
{
// skip all links that start with the URL in $blog_url, as long as they
// don't start with the URL from $my_folder;
continue;
}
if (empty($a['rel']))
{
$a['rel'] = 'nofollow';
}
else
{
$a['rel'] .= ' nofollow';
}
}
$new_html = $dom->saveHTML();
echo $new_html;
As you can see, it's really short and simple. Depending on your needs, you may want to use preg_match() in place of the substr() comparisons, for example:
// change the regexp to your own rules, here we match everything under
// "http://localhost/mytest/" as long as it's not followed by "go"
if (preg_match('#^http://localhost/mytest/(?!go)#', $a['href']))
{
continue;
}
Note
I missed the last code block in the OP when I first read the question. The code I posted (and basically any solution based on DOM) is better suited to processing a whole page than an HTML block. Otherwise, DOM will attempt to "fix" your HTML and may add a <body> tag, a DOCTYPE, etc.
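If only a fragment needs processing, one way around the added <body> and DOCTYPE is to wrap the fragment in a container element and serialize just that container's children afterwards. The following is a sketch under that assumption: addNofollowToFragment is a hypothetical name, the URLs in the usage line are examples only, and the internal/cloaked rule is the same prefix comparison as in the code above.

```php
<?php
// Hypothetical helper: adds rel="nofollow" to external and "cloaked" links
// in an HTML fragment and returns the fragment without html/body wrappers.
function addNofollowToFragment(string $fragment, string $blogUrl, string $myFolder): string
{
    $dom = new DOMDocument();
    $dom->loadHTML('<div>' . $fragment . '</div>');

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Internal: starts with the blog URL but not with the cloaking folder
        $internal = strpos($href, $blogUrl) === 0 && strpos($href, $myFolder) !== 0;
        if ($href === '' || $internal) {
            continue;
        }
        // Keep existing rel values and append nofollow only once
        $rel = preg_split('/\s+/', $a->getAttribute('rel'), -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('nofollow', $rel, true)) {
            $rel[] = 'nofollow';
            $a->setAttribute('rel', implode(' ', $rel));
        }
    }

    // The first <div> in document order is the wrapper added above
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $node) {
        $out .= $dom->saveHTML($node);
    }
    return $out;
}

echo addNofollowToFragment(
    '<a href="http://localhost/mytest/page">internal</a> '
    . '<a href="http://localhost/mytest/go/x">cloaked</a> '
    . '<a href="http://example.com/">external</a>',
    'http://localhost/mytest',
    'http://localhost/mytest/go/'
);
```

The first link comes back untouched, while the other two gain rel="nofollow".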
Thanks @alex for your nice solution. However, I was having a problem with Japanese text, which I fixed in the following way. This code can also skip multiple domains via the $whiteList array.
public function addRelNoFollow($html, $whiteList = [])
{
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$a = $dom->getElementsByTagName('a');
/** @var \DOMElement $anchor */
foreach ($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
$domain = parse_url($href, PHP_URL_HOST);
// Skip whiteList domains
if (in_array($domain, $whiteList, true)) {
continue;
}
// Check & get existing rel attribute values
$noFollow = 'nofollow';
$rel = $anchor->attributes->getNamedItem('rel');
if ($rel) {
$values = explode(' ', $rel->nodeValue);
if (in_array($noFollow, $values, true)) {
continue;
}
$values[] = $noFollow;
$newValue = implode(' ', $values);
} else {
$newValue = $noFollow;
}
// Create new rel attribute
$rel = $dom->createAttribute('rel');
$node = $dom->createTextNode($newValue);
$rel->appendChild($node);
$anchor->appendChild($rel);
}
// There is a problem with saveHTML() and saveXML(), both of them do not work correctly in Unix.
// They do not save UTF-8 characters correctly when used in Unix, but they work in Windows.
// So we need to do as follows. @see https://stackoverflow.com/a/20675396/1710782
return $dom->saveHTML($dom->documentElement);
}
<?php
$str='internal
internal cloaked link
external';
function test($x){
if (preg_match('#localhost/mytest/(?!go/)#i',$x[0])>0) return $x[0];
return 'rel="nofollow" '.$x[0];
}
echo preg_replace_callback('/href=[\'"][^\'"]+/i', 'test', $str);
?>
Here is another solution, which has a whitelist option and can add a target="_blank" attribute.
It also checks whether there is already a rel attribute before adding a new one.
function Add_Nofollow_Attr($Content, $Whitelist = [], $Add_Target_Blank = true)
{
$Whitelist[] = $_SERVER['HTTP_HOST'];
foreach ($Whitelist as $Key => $Link)
{
$Host = preg_replace('#^https?://#', '', $Link);
$Host = "https?://". preg_quote($Host, '/');
$Whitelist[$Key] = $Host;
}
if(preg_match_all("/<a .*?>/", $Content, $matches, PREG_SET_ORDER))
{
foreach ($matches as $Anchor_Tag)
{
$IS_Rel_Exist = $IS_Follow_Exist = $IS_Target_Blank_Exist = $Is_Valid_Tag = false;
if(preg_match_all("/(\w+)\s*=\s*['|\"](.*?)['|\"]/",$Anchor_Tag[0],$All_matches2))
{
foreach ($All_matches2[1] as $Key => $Attr_Name)
{
if($Attr_Name == 'href')
{
$Is_Valid_Tag = true;
$Url = $All_matches2[2][$Key];
// bypass #.. or internal links like "/"
if(preg_match('/^\s*[#|\/].*/', $Url))
{
continue 2;
}
foreach ($Whitelist as $Link)
{
if (preg_match("#$Link#", $Url)) {
continue 3;
}
}
}
else if($Attr_Name == 'rel')
{
$IS_Rel_Exist = true;
$Rel = $All_matches2[2][$Key];
preg_match("/[n|d]ofollow/", $Rel, $match, PREG_OFFSET_CAPTURE);
if( count($match) > 0 )
{
$IS_Follow_Exist = true;
}
else
{
$New_Rel = 'rel="'. $Rel . ' nofollow"';
}
}
else if($Attr_Name == 'target')
{
$IS_Target_Blank_Exist = true;
}
}
}
$New_Anchor_Tag = $Anchor_Tag;
if(!$IS_Rel_Exist)
{
$New_Anchor_Tag = str_replace(">",' rel="nofollow">',$Anchor_Tag);
}
else if(!$IS_Follow_Exist)
{
$New_Anchor_Tag = preg_replace("/rel=[\"|'].*?[\"|']/",$New_Rel,$Anchor_Tag);
}
if($Add_Target_Blank && !$IS_Target_Blank_Exist)
{
$New_Anchor_Tag = str_replace(">",' target="_blank">',$New_Anchor_Tag);
}
$Content = str_replace($Anchor_Tag,$New_Anchor_Tag,$Content);
}
}
return $Content;
}
To use it:
$Page_Content = 'internal
internal
google
example
stackoverflow';
$Whitelist = ["http://yoursite.com","http://localhost"];
echo Add_Nofollow_Attr($Page_Content,$Whitelist,true);
WordPress solution:
function replace__method($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/articles"; // re-add quirky config here
$blog_url = 'https://'.$_SERVER['SERVER_NAME'];
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
add_filter( 'the_content', 'add_nofollow_to_external_links', 1 );
function add_nofollow_to_external_links( $content ) {
$content = preg_replace_callback('~<(a\s[^>]+)>~isU', "replace__method", $content);
return $content;
}
A good script that adds nofollow automatically and keeps the other attributes:
function nofollow(string $html, ?string $baseUrl = null) {
return preg_replace_callback(
'#<a([^>]*)>(.+)</a>#isU', function ($mach) use ($baseUrl) {
list ($a, $attr, $text) = $mach;
if (preg_match('#href=["\']([^"\']*)["\']#', $attr, $url)) {
$url = $url[1];
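if (is_null($baseUrl) || !str_starts_with($url, $baseUrl)) {
// start clean so that a missing rel attribute leaves nothing undefined
$relAttr = null;
$relValue = '';
if (preg_match('#rel=["\']([^"\']*)["\']#', $attr, $rel)) {
$relAttr = $rel[0];
$relValue = $rel[1];
}
// strpos() returns 0 for rel="nofollow", so compare against false explicitly
if (strpos($relValue, 'nofollow') === false) {
$relValue = trim($relValue . ' nofollow');
}
$newRel = 'rel="' . $relValue . '"';
$attr = $relAttr !== null ? str_replace($relAttr, $newRel, $attr) : $attr . ' ' . $newRel;
$a = '<a ' . $attr . '>' . $text . '</a>';
}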
}
return $a;
},
$html
);
}
The situation: a function returns a string like this:
<p>This is some text and here is a <strong>bold text then the post stop here....</p>
Because the function returns a teaser (summary) of the text, it stops after a certain number of words. In this case the strong tag is left unclosed, although the whole string is wrapped in a paragraph.
Is it possible to convert the above result/output to the following:
<p>This is some text and here is a <strong>bold text then the post stop here....</strong></p>
I do not know where to begin. I found a function on the web that does this with a regex, but it puts the closing tag after the end of the string, so the result won't validate: I want all open/close tags to stay inside the paragraph tags. This is the wrong output that function produces:
<p>This is some text and here is a <strong>bold text then the post stop here....</p></strong>
The tag could be strong, italic, anything, which is why I cannot simply append a fixed closing tag manually in the function. Is there any pattern that can do this for me?
Here is a function i've used before, which works pretty well:
function closetags($html) {
preg_match_all('#<(?!meta|img|br|hr|input\b)\b([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
$openedtags = $result[1];
preg_match_all('#</([a-z]+)>#iU', $html, $result);
$closedtags = $result[1];
$len_opened = count($openedtags);
if (count($closedtags) == $len_opened) {
return $html;
}
$openedtags = array_reverse($openedtags);
for ($i=0; $i < $len_opened; $i++) {
if (!in_array($openedtags[$i], $closedtags)) {
$html .= '</'.$openedtags[$i].'>';
} else {
unset($closedtags[array_search($openedtags[$i], $closedtags)]);
}
}
return $html;
}
Personally though, I would not do it using regexp but a library such as Tidy. This would be something like the following:
$str = '<p>This is some text and here is a <strong>bold text then the post stop here....</p>';
$tidy = new Tidy();
$clean = $tidy->repairString($str, array(
'output-xml' => true,
'input-xml' => true
));
echo $clean;
A small modification to the original answer. While the original answer handled the tags correctly, I found that my truncation could leave me with chopped-up tags. For example:
This text has some <b>in it</b>
Truncating at character 21 results in:
This text has some <
The following code builds on the next best answer and fixes this.
function truncateHTML($html, $length)
{
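$head = substr($html, 0, $length);
$tail = (string) substr($html, $length);
// The cut is inside a tag only when the last "<" before it is still unclosed
$lastOpen = strrpos($head, "<");
$lastClose = strrpos($head, ">");
$pos = strpos($tail, ">");
if($lastOpen !== false && ($lastClose === false || $lastClose < $lastOpen) && $pos !== false)
{
// extend the cut to the end of the tag it landed in
$html = $head . substr($tail, 0, $pos + 1);
}
else
{
$html = $head;
}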
preg_match_all('#<(?!meta|img|br|hr|input\b)\b([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
$openedtags = $result[1];
preg_match_all('#</([a-z]+)>#iU', $html, $result);
$closedtags = $result[1];
$len_opened = count($openedtags);
if (count($closedtags) == $len_opened)
{
return $html;
}
$openedtags = array_reverse($openedtags);
for ($i=0; $i < $len_opened; $i++)
{
if (!in_array($openedtags[$i], $closedtags))
{
$html .= '</'.$openedtags[$i].'>';
}
else
{
unset($closedtags[array_search($openedtags[$i], $closedtags)]);
}
}
return $html;
}
$str = "This text has <b>bold</b> in it</b>";
print "Test 1 - Truncate with no tag: " . truncateHTML($str, 5) . "<br>\n";
print "Test 2 - Truncate at start of tag: " . truncateHTML($str, 20) . "<br>\n";
print "Test 3 - Truncate in the middle of a tag: " . truncateHTML($str, 16) . "<br>\n";
print "Test 4: - Truncate with less text: " . truncateHTML($str, 300) . "<br>\n";
Hope it helps someone out there.
And what about using PHP's native DOMDocument class? It inherently parses HTML and corrects syntax errors...
E.g.:
$fragment = "<article><h3>Title</h3><p>Unclosed";
$doc = new DOMDocument();
$doc->loadHTML($fragment);
$correctFragment = $doc->getElementsByTagName('body')->item(0)->C14N();
echo $correctFragment;
However, there are several disadvantages of this approach.
Firstly, it wraps the original fragment within the <body> tag. You can get rid of it easily by something like (preg_)replace() or by substituting the ...->C14N() function by some custom innerHTML() function, as suggested for example at http://php.net/manual/en/book.dom.php#89718.
The second pitfall is that PHP throws an 'invalid tag in Entity' warning if HTML5 or custom tags are used (nevertheless, it will still proceed correctly).
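Both drawbacks can be worked around without regular expressions. The sketch below (closeTagsWithDom is a hypothetical name) collects the libxml warnings instead of printing them and returns only the nodes that ended up inside <body>. It relies on the parser's own repair rules, so the exact result for badly broken input may vary.

```php
<?php
// Hypothetical helper: lets libxml repair the fragment, then returns only
// what ended up inside <body>.
function closeTagsWithDom(string $fragment): string
{
    // Collect parser warnings (e.g. for HTML5 tags) instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize the children of <body> one by one, skipping the wrapper
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo closeTagsWithDom('<p>This is some text and here is a <strong>bold text then the post stop here....</p>');
```

For the teaser string from the question, the unclosed strong element is closed before the paragraph ends rather than after it.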
This PHP method always worked for me. It will close all un-closed HTML tags.
function closetags($html) {
preg_match_all('#<([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
$openedtags = $result[1];
preg_match_all('#</([a-z]+)>#iU', $html, $result);
$closedtags = $result[1];
$len_opened = count($openedtags);
if (count($closedtags) == $len_opened) {
return $html;
}
$openedtags = array_reverse($openedtags);
for ($i=0; $i < $len_opened; $i++) {
if (!in_array($openedtags[$i], $closedtags)){
$html .= '</'.$openedtags[$i].'>';
} else {
unset($closedtags[array_search($openedtags[$i], $closedtags)]);
}
}
return $html;
}
There are numerous other variables that need to be addressed to give a full solution, but are not covered by your question.
However, I would suggest using something like HTML Tidy, in particular its repairFile or repairString methods.
If the tidy module is installed, use the PHP tidy extension:
tidy_repair_string($html)
Using a regular expression isn't an ideal approach for this. You should use an html parser instead to create a valid document object model.
As a second option, depending on what you want, you could use a regex to remove any and all html tags from your string before you put it in the <p> tag.
I've written this code, which does the job quite well.
It's old school but efficient, and I've added a flag to remove unfinished tags such as " blah blah http://stackoverfl"
public function getOpennedTags(&$string, $removeInclompleteTagEndTagIfExists = true) {
$tags = array();
$tagOpened = false;
$tagName = '';
$tagNameLogged = false;
$closingTag = false;
foreach (str_split($string) as $c) {
if ($tagOpened && $c == '>') {
$tagOpened = false;
if ($closingTag) {
array_pop($tags);
$closingTag = false;
$tagName = '';
}
if ($tagName) {
array_push($tags, $tagName);
}
}
if ($tagOpened && $c == ' ') {
$tagNameLogged = true;
}
if ($tagOpened && $c == '/') {
if ($tagName) {
//orphan tag
$tagOpened = false;
$tagName = '';
} else {
//closingTag
$closingTag = true;
}
}
if ($tagOpened && !$tagNameLogged) {
$tagName .= $c;
}
if (!$tagOpened && $c == '<') {
$tagNameLogged = false;
$tagName = '';
$tagOpened = true;
$closingTag = false;
}
}
if ($removeInclompleteTagEndTagIfExists && $tagOpened) {
// a tag has been cut, for example ' blabh blah <a href="sdfoefzofk', so closing the tag will not help...
// let's remove this ugly piece of tag
$pos = strrpos($string, '<');
$string = substr($string, 0, $pos);
}
return $tags;
}
Usage example :
$tagsToClose = $stringHelper->getOpennedTags($val);
$tagsToClose = array_reverse($tagsToClose);
foreach ($tagsToClose as $tag) {
$val .= "</$tag>";
}
This works for me to close any open HTML tags in a script.
<?php
function closetags($html) {
preg_match_all('#<([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
$openedtags = $result[1];
preg_match_all('#</([a-z]+)>#iU', $html, $result);
$closedtags = $result[1];
$len_opened = count($openedtags);
if (count($closedtags) == $len_opened) {
return $html;
}
$openedtags = array_reverse($openedtags);
for ($i=0; $i < $len_opened; $i++) {
if (!in_array($openedtags[$i], $closedtags)) {
$html .= '</'.$openedtags[$i].'>';
} else {
unset($closedtags[array_search($openedtags[$i], $closedtags)]);
}
}
return $html;
}
An up-to-date solution that parses the HTML would be:
function fix_html($html) {
$dom = new DOMDocument();
$dom->loadHTML( mb_convert_encoding( $html, 'HTML-ENTITIES', 'UTF-8' ), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
return $dom->saveHTML();
}
LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD is needed to keep the parser from adding an implied doctype, html and body. The rest looks pretty obvious :)
UPDATE:
After some testing I noticed that the solution above repeatedly ruins an otherwise correct layout. The following works well, though:
function fix_html($html) {
$dom = new DOMDocument();
$dom->loadHTML( mb_convert_encoding( $html, 'HTML-ENTITIES', 'UTF-8' ) );
$return = '';
foreach ( $dom->getElementsByTagName( 'body' )->item(0)->childNodes as $v ) {
$return .= $dom->saveHTML( $v );
}
return $return;
}
How can I use php to strip all/any attributes from a tag, say a paragraph tag?
<p class="one" otherrandomattribute="two"> to <p>
Although there are better ways, you could actually strip attributes from HTML tags with a regular expression:
<?php
function stripArgumentFromTags( $htmlString ) {
$regEx = '/([^<]*<\s*[a-z](?:[0-9]|[a-z]{0,9}))(?:(?:\s*[a-z\-]{2,14}\s*=\s*(?:"[^"]*"|\'[^\']*\'))*)(\s*\/?>[^<]*)/i'; // match any start tag
$chunks = preg_split($regEx, $htmlString, -1, PREG_SPLIT_DELIM_CAPTURE);
$chunkCount = count($chunks);
$strippedString = '';
for ($n = 1; $n < $chunkCount; $n++) {
$strippedString .= $chunks[$n];
}
return $strippedString;
}
?>
The above could probably be written in fewer characters, but it does the job (quick and dirty).
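One of the "better ways" is to let a parser find the attributes instead of a regex. A minimal sketch using DOMDocument and DOMXPath, assuming the input is an HTML fragment and mostly ASCII. The function name strip_all_attributes is made up for illustration and is not part of any answer here.

```php
<?php
// Hypothetical helper: removes every attribute from every element of a fragment.
// Caveat: loadHTML() treats input without a charset declaration as ISO-8859-1,
// so non-ASCII input would need an encoding hint first.
function strip_all_attributes(string $html): string
{
    $dom = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every element that carries at least one attribute.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//*[@*]') as $element) {
        // The attribute map is live, so keep removing the first entry.
        while ($element->attributes->length > 0) {
            $element->removeAttribute($element->attributes->item(0)->nodeName);
        }
    }

    // Serialize only the children of <body> to get a fragment back.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo strip_all_attributes('<p class="one" otherrandomattribute="two">text</p>');
```

To keep a whitelist such as href or src, the inner loop could skip those names instead of removing everything.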
Strip attributes using SimpleXML (Standard in PHP5)
<?php
// define allowable tags
$allowable_tags = '<p><a><img><ul><ol><li><table><thead><tbody><tr><th><td>';
// define allowable attributes
$allowable_atts = array('href','src','alt');
// strip collector
$strip_arr = array();
// load XHTML with SimpleXML
$data_sxml = simplexml_load_string('<root>'. $data_str .'</root>', 'SimpleXMLElement', LIBXML_NOERROR | LIBXML_NOXMLDECL);
if ($data_sxml ) {
// loop all elements with an attribute
foreach ($data_sxml->xpath('descendant::*[@*]') as $tag) {
// loop attributes
foreach ($tag->attributes() as $name=>$value) {
// check for allowable attributes
if (!in_array($name, $allowable_atts)) {
// set attribute value to empty string
$tag->attributes()->$name = '';
// collect attribute patterns to be stripped
$strip_arr[$name] = '/ '. $name .'=""/';
}
}
}
}
// strip unallowed attributes and root tag
$data_str = strip_tags(preg_replace($strip_arr,array(''),$data_sxml->asXML()), $allowable_tags);
?>
Here is one function that will let you strip all attributes except ones you want:
function stripAttributes($s, $allowedattr = array()) {
if (preg_match_all("/<[^>]*\\s([^>]*)\\/*>/msiU", $s, $res, PREG_SET_ORDER)) {
foreach ($res as $r) {
$tag = $r[0];
$attrs = array();
preg_match_all("/\\s.*=(['\"]).*\\1/msiU", " " . $r[1], $split, PREG_SET_ORDER);
foreach ($split as $spl) {
$attrs[] = $spl[0];
}
$newattrs = array();
foreach ($attrs as $a) {
$tmp = explode("=", $a);
if (trim($a) != "" && (!isset($tmp[1]) || (trim($tmp[0]) != "" && !in_array(strtolower(trim($tmp[0])), $allowedattr)))) {
} else {
$newattrs[] = $a;
}
}
$attrs = implode(" ", $newattrs);
$rpl = str_replace($r[1], $attrs, $tag);
$s = str_replace($tag, $rpl, $s);
}
}
return $s;
}
For example:
echo stripAttributes('<p class="one" otherrandomattribute="two">');
Or, if you want to keep, say, the "class" attribute:
echo stripAttributes('<p class="one" otherrandomattribute="two">', array('class'));
Or, suppose you are sending a message to an inbox and composed it with CKEditor. You can run the message through the function and assign the result to the $message variable before sending. Note that stripAttributes() strips the unnecessary attributes from the HTML tags. I tried it and it works fine; I only saw the formatting I had added, such as bold.
$message = stripAttributes($_POST['message']);
You can also echo $message; for a preview.
I honestly think that the only sane way to do this is to use a tag and attribute whitelist with the HTML Purifier library. Example script here:
<html><body>
<?php
require_once '../includes/htmlpurifier-4.5.0-lite/library/HTMLPurifier/Bootstrap.php';
spl_autoload_register(array('HTMLPurifier_Bootstrap', 'autoload'));
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,b,a[href],i,br,img[src]');
$config->set('URI.Base', 'http://www.example.com');
$config->set('URI.MakeAbsolute', true);
$purifier = new HTMLPurifier($config);
$dirty_html = "
<a href=\"http://www.google.de\">broken a href link</a
fnord
<x>y</z>
<b>c</p>
<script>alert(\"foo!\");</script>
Anzahl besuchter Seiten
<img src=\"www.example.com/bla.gif\" />
<a href=\"http://www.google.de\">missing end tag
ende
";
$clean_html = $purifier->purify($dirty_html);
print "<h1>dirty</h1>";
print "<pre>" . htmlentities($dirty_html) . "</pre>";
print "<h1>clean</h1>";
print "<pre>" . htmlentities($clean_html) . "</pre>";
?>
</body></html>
This yields the following clean, standards-conforming HTML fragment:
broken a href linkfnord
y
<b>c
<a>Anzahl besuchter Seiten</a>
<img src="http://www.example.com/www.example.com/bla.gif" alt="bla.gif" /><a href="http://www.google.de">missing end tag
ende
</a></b>
In your case the whitelist would be:
$config->set('HTML.Allowed', 'p');
HTML Purifier is one of the better tools for sanitizing HTML with PHP.
You might also look into HTML Purifier. True, it's quite bloated and might not fit your needs if this specific example is your only concern, but it offers more or less 'bulletproof' purification of potentially hostile HTML. You can also choose to allow or disallow certain attributes (it's highly configurable).
http://htmlpurifier.org/