Change specific words into links within HTML using PHP [duplicate] - php

I need your help here.
I want to turn this:
sometext sometext http://www.somedomain.com/index.html sometext sometext
into:
sometext sometext <a href="http://www.somedomain.com/index.html" target="_blank">www.somedomain.com/index.html</a> sometext sometext
I have managed it by using this regex:
preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i", '<a href="$1" target="_blank">$3</a>$4', $text);
The problem is it’s also replacing the img URL, for example:
sometext sometext <img src="http://domain.com/image.jpg"> sometext sometext
is turned into:
sometext sometext <img src="<a href="http://domain.com/image.jpg" target="_blank">domain.com/image.jpg</a>"> sometext sometext
Please help.

Streamlined version of Gumbo's answer:
$html = <<< HTML
<html>
<body>
<p>
This is a text with a link
and another <a href="http://example.com/2">http://example.com/2</a>
and also another http://example.com with the latter being the
only one that should be replaced. There is also images in this
text, like <img src="http://example.com/foo"/> but these should
not be replaced either. In fact, only URLs in text that is no
a descendant of an anchor element should be converted to a link.
</p>
</body>
</html>
HTML;
Let's use an XPath that only fetches those elements that actually are textnodes containing http:// or https:// or ftp:// and that are not themselves textnodes of anchor elements.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$texts = $xPath->query(
'/html/body//text()[
not(ancestor::a) and (
contains(.,"http://") or
contains(.,"https://") or
contains(.,"ftp://") )]'
);
The XPath above will give us a TextNode with the following data:
and also another http://example.com with the latter being the
only one that should be replaced. There is also images in this
text, like
Since PHP 5.3 we could also call PHP functions from inside the XPath expression and use the regex pattern to select our nodes instead of the three calls to contains().
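As a rough sketch of that idea (assuming PHP 5.3 or later; the sample markup and the simplified scheme-only pattern are made up for illustration), DOMXPath::registerPhpFunctions() makes preg_match available to the expression through the php: namespace:

```php
<?php
// Sample markup for illustration only.
$html = '<p><a href="http://example.com/1">http://example.com/1</a> and http://example.com/2</p>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$xPath = new DOMXPath($dom);
// The php: prefix has to be bound to this namespace before PHP functions can be called.
$xPath->registerNamespace('php', 'http://php.net/xpath');
$xPath->registerPhpFunctions('preg_match');

// php:functionString() hands the node's string value to preg_match();
// the simplified pattern only checks for a URL scheme.
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and
        php:functionString("preg_match", "~(?:https?|ftp)://~i", .) > 0
    ]'
);

foreach ($texts as $text) {
    echo trim($text->data), PHP_EOL;
}
```

Only the text node outside the anchor is selected here. The explicit "> 0" comparison keeps the numeric return value of preg_match from being read as a position test.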
Instead of splitting the text nodes apart in the standards-compliant way, we will use a document fragment and just replace the entire text node with the fragment. Non-standard in this case only means that the method we will be using for this, appendXML(), is not part of the W3C specification of the DOM API.
foreach ($texts as $text) {
$fragment = $dom->createDocumentFragment();
$fragment->appendXML(
preg_replace(
"~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i",
'<a href="$1">$1</a>',
$text->data
)
);
$text->parentNode->replaceChild($fragment, $text);
}
echo $dom->saveXML($dom->documentElement);
and this will then output:
<html><body>
<p>
This is a text with a link
and another <a href="http://example.com/2">http://example.com/2</a>
and also another <a href="http://example.com">http://example.com</a> with the latter being the
only one that should be replaced. There is also images in this
text, like <img src="http://example.com/foo"/> but these should
not be replaced either. In fact, only URLs in text that is no
a descendant of an anchor element should be converted to a link.
</p>
</body></html>
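For comparison, the standards-compliant splitting mentioned above could look roughly like the following sketch. It relies on DOMText::splitText() and assumes the mbstring extension is loaded; the helper name linkifyTextNode, the simplified URL pattern and the sample markup are hypothetical and not part of the answer.

```php
<?php
// Hypothetical helper: wraps every URL found in one text node in an <a> element.
function linkifyTextNode(DOMText $text, string $pattern): void
{
    while (preg_match($pattern, $text->data, $m, PREG_OFFSET_CAPTURE)) {
        [$url, $byteOffset] = $m[0];
        // preg_match() reports byte offsets, splitText() counts characters.
        $charOffset = mb_strlen(substr($text->data, 0, $byteOffset), 'UTF-8');

        $urlNode = $text->splitText($charOffset);                 // text before the URL stays in $text
        $rest    = $urlNode->splitText(mb_strlen($url, 'UTF-8')); // text after the URL moves to $rest

        // Put the URL text node inside a new anchor element.
        $a = $text->ownerDocument->createElement('a');
        $a->setAttribute('href', $url);
        $urlNode->parentNode->replaceChild($a, $urlNode);
        $a->appendChild($urlNode);

        $text = $rest; // keep scanning the remainder
    }
}

$dom = new DOMDocument;
$dom->loadHTML('<p>see http://example.com/x now</p>');
$xPath = new DOMXPath($dom);
foreach ($xPath->query('//text()[not(ancestor::a)]') as $text) {
    linkifyTextNode($text, '~(?:https?|ftp)://[^\s<>"\']+~i');
}
$result = $dom->saveHTML($dom->getElementsByTagName('p')->item(0));
echo $result, PHP_EOL;
```

Because the URL never passes through an XML parser, characters such as & in the text need no extra escaping with this variant.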

You shouldn’t do that with regular expressions, or at least not with regular expressions only. Use a proper HTML DOM parser like the one in PHP’s DOM library instead. You can then iterate over the nodes, check whether each one is a text node, run the regular expression search on it and replace the text node accordingly.
Something like this should do it:
$pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
$doc = new DOMDocument();
$doc->loadHTML($str);
// for every element in the document
foreach ($doc->getElementsByTagName('*') as $elem) {
// for every child node in each element
foreach ($elem->childNodes as $node) {
if ($node->nodeType === XML_TEXT_NODE) {
// split the text content to get an array of 1+2*n elements for n URLs in it
$parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
$n = count($parts);
if ($n > 1) {
$parentNode = $node->parentNode;
// insert for each pair of non-URL/URL parts one DOMText and DOMElement node before the original DOMText node
for ($i=1; $i<$n; $i+=2) {
$a = $doc->createElement('a');
$a->setAttribute('href', $parts[$i]);
$a->setAttribute('target', '_blank');
$a->appendChild($doc->createTextNode($parts[$i]));
$parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
$parentNode->insertBefore($a, $node);
}
// insert the last part before the original DOMText node
$parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
// remove the original DOMText node
$node->parentNode->removeChild($node);
}
}
}
}
OK, since the DOMNodeLists returned by getElementsByTagName and childNodes are live, every change in the DOM is reflected in those lists, so you cannot use foreach, which would also iterate over the newly added nodes. You need to use for loops instead and keep track of the nodes added, so that the index pointers and, ideally, the pre-calculated array boundaries are adjusted accordingly.
But since that is quite difficult in a somewhat complex algorithm like this one (you would need one index pointer and one array boundary for each of the three for loops), using a recursive algorithm is more convenient:
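// Applies $callback to every text node below $node.
// Returns the number of nodes added to the parent's child list.
function mapOntoTextNodes(DOMNode $node, $callback) {
if ($node->nodeType === XML_TEXT_NODE) {
return $callback($node);
}
// childNodes is live, so the boundary ($n) and the index ($i) are tracked by hand
for ($i=0, $n=$node->childNodes->length; $i<$n; ++$i) {
$nodesChanged = 0;
switch ($node->childNodes->item($i)->nodeType) {
case XML_ELEMENT_NODE:
// changes inside a child element do not alter this node's own child list
mapOntoTextNodes($node->childNodes->item($i), $callback);
break;
case XML_TEXT_NODE:
$nodesChanged = $callback($node->childNodes->item($i));
break;
}
if ($nodesChanged !== 0) {
$n += $nodesChanged;
$i += $nodesChanged;
}
}
return 0;
}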
function foo(DOMText $node) {
$pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
$parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
$n = count($parts);
if ($n > 1) {
$parentNode = $node->parentNode;
$doc = $node->ownerDocument;
for ($i=1; $i<$n; $i+=2) {
$a = $doc->createElement('a');
$a->setAttribute('href', $parts[$i]);
$a->setAttribute('target', '_blank');
$a->appendChild($doc->createTextNode($parts[$i]));
$parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
$parentNode->insertBefore($a, $node);
}
$parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
$parentNode->removeChild($node);
}
return $n-1;
}
$str = '<div>sometext http://www.somedomain.com/index.html sometext <img src="http//domain.com/image.jpg"> sometext sometext</div>';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elems = $doc->getElementsByTagName('body');
mapOntoTextNodes($elems->item(0), 'foo');
Here mapOntoTextNodes is used to map a given callback function onto every DOMText node in a DOM document. You can pass either the whole DOMDocument node or just a specific DOMNode (in this case just the BODY node).
The function foo then finds and replaces the plain URLs in the DOMText node’s content. It splits the content string into non-URL/URL parts using preg_split while capturing the delimiter, which results in an array of 1+2·n items. The non-URL parts become new DOMText nodes and the URL parts become new A elements; all of them are inserted before the original DOMText node, which is removed at the end. Since mapOntoTextNodes walks the tree recursively, it suffices to call it once on a specific DOMNode.
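Another way around the live-list problem, offered here only as a sketch, is to collect the text nodes with DOMXPath first: the result of an XPath query is a static snapshot, so a plain foreach is safe while nodes are being inserted. The helper name linkifyDocument and the simplified URL pattern are hypothetical.

```php
<?php
// Hypothetical helper: links every plain URL that is not already inside an <a>.
function linkifyDocument(DOMDocument $doc): void
{
    $pattern = '~((?:https?|ftp)://[^\s<>"\']+)~i'; // simplified pattern for illustration
    $xpath = new DOMXPath($doc);

    // The query result does not change when the DOM does.
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $parent = $node->parentNode;
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                // odd indexes hold the captured URLs
                $new = $doc->createElement('a');
                $new->setAttribute('href', $part);
                $new->appendChild($doc->createTextNode($part));
            } else {
                $new = $doc->createTextNode($part);
            }
            $parent->insertBefore($new, $node);
        }
        $parent->removeChild($node);
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div>sometext http://www.somedomain.com/index.html sometext <img src="http://domain.com/image.jpg"> sometext</div>');
linkifyDocument($doc);
$result = $doc->saveHTML($doc->getElementsByTagName('div')->item(0));
echo $result, PHP_EOL;
```

The img src attribute is left alone because attribute values are not text nodes.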

Thanks for the reply, but it still doesn't work. I have fixed it using this function:
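function livelinked ($text){
// the e modifier was removed in PHP 7, so it is dropped here
preg_match_all("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)|^(jpg)#i", $text, $ccs);
$old = $new = array();
foreach ($ccs[3] as $cc) {
// strict comparison, because strpos() returns 0 for a match at the very start
if (strpos($cc,"jpg")===false && strpos($cc,"gif")===false && strpos($cc,"png")===false ) {
$old[] = "http://".$cc;
$new[] = '<a href="http://'.$cc.'" target="_blank">'.$cc.'</a>';
}
}
return str_replace($old,$new,$text);
}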

If you'd like to keep using a regex (and in this case, a regex is quite appropriate), you can have the regex match only URLs that "stand alone". A word boundary (\b) is not enough for that, because it also matches between a quote character and http. A negative lookbehind, (?<!\S), lets the regex match only where http is immediately preceded by whitespace or the beginning of the text:
preg_replace("#(?<!\S)((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i", '<a href="$1" target="_blank">$3</a>$4', $text);
// the (?<!\S) at the start does the work
Thus, "http://..." won't match, but http:// as its own word will.
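A quick way to check how a given anchor behaves on a quoted URL is sketched below. The sample text and the shortened URL pattern are assumptions made for the illustration; it compares a word boundary with a whitespace lookbehind.

```php
<?php
$text = 'see http://a.example/x and <img src="http://b.example/y.jpg">';

// \b sits between any non-word character (including a quote) and "h".
preg_match_all('~\bhttps?://[^\s"<>]+~', $text, $withBoundary);

// (?<!\S) only allows whitespace or the start of the string before "h".
preg_match_all('~(?<!\S)https?://[^\s"<>]+~', $text, $withLookbehind);

print_r($withBoundary[0]);   // both URLs, including the one inside src="..."
print_r($withLookbehind[0]); // only the stand-alone URL
```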

DOMDocument is more mature and runs much faster, so this is only an alternative for anyone who wants to use PHP Simple HTML DOM Parser:
<?php
require_once('simple_html_dom.php');
$html = str_get_html('sometext sometext http://www.somedomain.com/index.html sometext sometext
http://www.somedomain.com/index.html
sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext');
foreach ($html->find('text') as $element)
{
// you can add any tag into the array to exclude from replace
if (!in_array($element->parent()->tag, array('a')))
$element->innertext = preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i", '<a href="$1" target="_blank">$3</a>$4', $element->innertext);
}
echo $html;

You can try my code from this question:
echo preg_replace('/<a href="([^"]*)([^<\/]*)<\/a>/i', "$1", 'sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext');
If you wanna turn some other tags - that's easy enough:
echo preg_replace('/<img src="([^"]*)([^\/><]*)>/i', "$1", 'sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext');

Match a whitespace character (\s) at the start and end of the URL string. This ensures that
"http://url.com"
is not matched, while
http://url.com
is matched.

Related

Difficulties with the function preg_match_all

I would like to get back the number which is between span HTML tags. The number may change!
<span class="topic-count">
::before
"
24
"
::after
</span>
I've tried the following code:
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
But it doesn't work.
Entire code:
$result=array();
$page = 201;
while ($page>=1) {
$source = file_get_contents ("http://www.jeuxvideo.com/forums/0-27047-0-1-0-".$page."-0-counter-strike-global-offensive.htm");
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
$result = array_merge($result, $nombre[$i][1]);
print("Page : ".$page ."\n");
$page-=25;
}
print_r ($nombre);
You can do this with
preg_match_all(
'#<span class="topic-count">[^\d]*(\d+)[^\d]*?</span>#s',
$html,
$matches
);
which would capture any digits before the end of the span.
However, note that this regex will only work for exactly this piece of html. If there is a slight variation in the markup, for instance, another class or another attribute, the pattern will not work anymore. Writing reliable regexes for HTML is hard.
Hence the recommendation to use a DOM parser instead, e.g.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.jeuxvideo.com/forums/0-27047-0-1-0-1-0-counter-strike-global-offensive.htm');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//span[contains(@class, "topic-count")]') as $node) {
if (preg_match_all('#\d+#s', $node->nodeValue, $topics)) {
echo $topics[0][0], PHP_EOL;
}
}
DOM will parse the entire page into a tree of nodes, which you can then query conveniently via XPath. Note the expression
//span[contains(@class, "topic-count")]
which will give you all the span elements with a class attribute containing the string topic-count. Then if any of these nodes contain a digit, echo it.
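Since contains() is a plain substring test, it would also match a class such as topic-count-total. A stricter variant, sketched here on a made-up inline snippet rather than the live page, pads the class attribute with spaces and looks for the whole token:

```php
<?php
// Inline sample markup (an assumption for illustration, not the real page).
$html = '<div><span class="topic-count"> 24 </span><span class="topic-count-total">99</span></div>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Match "topic-count" only as a complete class token.
$query = '//span[contains(concat(" ", normalize-space(@class), " "), " topic-count ")]';

$counts = [];
foreach ($xpath->query($query) as $node) {
    if (preg_match('/\d+/', $node->nodeValue, $m)) {
        $counts[] = (int) $m[0];
    }
}
print_r($counts);
```

Only the first span contributes a number here, because the second one carries a different class token.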

Validation like Strip_tags for HTML Attributes [duplicate]

I have this html code:
<p style="padding:0px;">
<strong style="padding:0;margin:0;">hello</strong>
</p>
How can I remove attributes from all tags? I'd like it to look like this:
<p>
<strong>hello</strong>
</p>
Adapted from my answer on a similar question
$text = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p>';
echo preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si",'<$1$2>', $text);
// <p><strong>hello</strong></p>
The RegExp broken down:
/ # Start Pattern
< # Match '<' at beginning of tags
( # Start Capture Group $1 - Tag Name
[a-z] # Match 'a' through 'z'
[a-z0-9]* # Match 'a' through 'z' or '0' through '9' zero or more times
) # End Capture Group
[^>]*? # Match anything other than '>', Zero or More times, not-greedy (won't eat the /)
(\/?) # Capture Group $2 - '/' if it is there
> # Match '>'
/is # End Pattern - Case Insensitive & dot also matches newlines
With some quoting added and the replacement text <$1$2>, it should strip any text after the tag name up to the end of the tag (/> or just >).
Please note: this isn't necessarily going to work on ALL input, as the anti-HTML-plus-RegExp crowd will tell you. There are a few pitfalls, most notably <p style=">"> would end up as <p>">, and a few other broken cases. I would recommend looking at Zend_Filter_StripTags as a more foolproof tags/attributes filter in PHP.
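The pitfall mentioned above can be reproduced with a short sketch that runs the same regex next to a DOM-based cleanup on the problematic input. The variable names are hypothetical; the DOM part simply removes every attribute node it finds.

```php
<?php
$text = '<p style=">">x</p>';

// Regex from the answer: the ">" inside the attribute value ends the match early.
$viaRegex = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si", '<$1$2>', $text);

// DOM parser: a quoted attribute value is read as one unit.
$dom = new DOMDocument;
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//@*') as $attr) {
    $attr->ownerElement->removeAttributeNode($attr);
}
$viaDom = $dom->saveHTML($dom->getElementsByTagName('p')->item(0));

echo $viaRegex, PHP_EOL; // <p>">x</p>
echo $viaDom, PHP_EOL;   // <p>x</p>
```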
Here is how to do it with native DOM:
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($html); // load HTML into it
$xpath = new DOMXPath($dom); // create a new XPath
$nodes = $xpath->query('//*[@style]'); // Find elements with a style attribute
foreach ($nodes as $node) { // Iterate over found elements
$node->removeAttribute('style'); // Remove style attribute
}
echo $dom->saveHTML(); // output cleaned HTML
If you want to remove all possible attributes from all possible tags, do
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*');
foreach ($nodes as $node) {
$node->parentNode->removeAttribute($node->nodeName);
}
echo $dom->saveHTML();
I would avoid using regex, as HTML is not a regular language, and use an HTML parser like Simple HTML DOM instead.
You can get a list of attributes that the object has by using attr. For example:
$html = str_get_html('<div id="hello">World</div>');
var_dump($html->find("div", 0)->attr);
/*
array(1) {
["id"]=>
string(5) "hello"
}
*/
foreach ( $html->find("div", 0)->attr as &$value ){
$value = null;
}
print $html;
//<div>World</div>
$html_text = '<p>Hello <b onclick="alert(123)" style="color: red">world</b>. <i>Its beautiful day.</i></p>';
$strip_text = strip_tags($html_text, '<b>');
$result = preg_replace('/<(\w+)[^>]*>/', '<$1>', $strip_text);
echo $result;
// Result
string 'Hello <b>world</b>. Its beautiful day.'
Another way to do it using php's DOMDocument class (without xpath) is to iterate over the attributes on a given node. Please note, due to the way php handles the DOMNamedNodeMap class, you must iterate backward over the collection if you plan on altering it. This behaviour has been discussed elsewhere and is also noted in the documentation comments. The same applies to the DOMNodeList class when it comes to removing or adding elements. To be on the safe side, I always iterate backwards with these objects.
Here is a simple example:
function scrubAttributes($html) {
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
for ($els = $dom->getElementsByTagname('*'), $i = $els->length - 1; $i >= 0; $i--) {
for ($attrs = $els->item($i)->attributes, $ii = $attrs->length - 1; $ii >= 0; $ii--) {
$els->item($i)->removeAttribute($attrs->item($ii)->name);
}
}
return $dom->saveHTML();
}
Here's a demo: https://3v4l.org/M2ing
Optimized regular expression from the top rated answer on this issue:
$text = '<div width="5px">a is less than b: a<b, ya know?</div>';
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);
// <div>a is less than b: a<b, ya know?</div>
UPDATE:
It works better when you allow only some tags with PHP's strip_tags() function. Let's say we want to allow only <br>, <b> and <i> tags, then:
$text = '<i style=">">Italic</i>';
$text = strip_tags($text, '<br><b><i>');
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);
//<i>Italic</i>
As we can see it fixes flaws connected with tag symbols in attribute values.
Regexes are too fragile for HTML parsing. In your example, the following would strip out your attributes:
echo preg_replace(
"|<(\w+)([^>/]+)?|",
"<$1",
"<p style=\"padding:0px;\">\n<strong style=\"padding:0;margin:0;\">hello</strong>\n</p>\n"
);
Update
Make the second capture optional and do not strip '/' from closing tags:
|<(\w+)([^>]+)| to |<(\w+)([^>/]+)?|
Demonstrate this regular expression works:
$ phpsh
Starting php
type 'h' or 'help' to see instructions & features
php> $html = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello<br/></strong></p>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<p><strong>hello<br/></strong></p>
php> $html = '<strong>hello</strong>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<strong>hello</strong>
Hope this helps. It may not be the fastest way to do it, especially for large blocks of html.
If anyone has any suggestions as to make this faster, let me know.
function StringEx($str, $start, $end)
{
$str_low = strtolower($str);
$pos_start = strpos($str_low, $start);
$pos_end = strpos($str_low, $end, ($pos_start + strlen($start)));
if($pos_end==0) return false;
if ( ($pos_start !== false) && ($pos_end !== false) )
{
$pos1 = $pos_start + strlen($start);
$pos2 = $pos_end - $pos1;
$RData = substr($str, $pos1, $pos2);
if($RData=='') { return true; }
return $RData;
}
return false;
}
$S = '<'; $E = '>'; while($RData=StringEx($DATA, $S, $E)) { if($RData==true) {$RData='';} $DATA = str_ireplace($S.$RData.$E, '||||||', $DATA); } $DATA = str_ireplace('||||||', $S.$E, $DATA);
To do SPECIFICALLY what andufo wants, it's simply:
$html = preg_replace( "#(<[a-zA-Z0-9]+)[^\>]+>#", "\\1>", $html );
That is, he wants to strip anything but the tag name out of the opening tag. It won't work for self-closing tags of course.
Here's an easy way to get rid of attributes. It handles malformed html pretty well.
<?php
$string = '<p style="padding:0px;">
<strong style="padding:0;margin:0;">hello</strong>
</p>';
//get all html elements on a line by themselves
$string_html_on_lines = str_replace (array("<",">"),array("\n<",">\n"),$string);
//find lines starting with a '<' and any letters or numbers upto the first space. throw everything after the space away.
$string_attribute_free = preg_replace("/\n(<[\w123456]+)\s.+/i","\n$1>",$string_html_on_lines);
echo $string_attribute_free;
?>

Regex / DOMDocument - match and replace text not in a link

I need to find and replace all text matches in a case insensitive way, unless the text is within an anchor tag - for example:
<p>Match this text and replace it</p>
<p><a>Don't match this text</a></p>
<p>We still need to match this text and replace it</p>
Searching for 'match this text' would only replace the first instance and last instance.
[Edit] As per Gordon's comment, it may be preferred to use DOMDocument in this instance. I'm not at all familiar with the DOMDocument extension, and would really appreciate some basic examples for this functionality.
Here is a UTF-8-safe solution, which works not only with properly formatted documents but also with document fragments.
The mb_convert_encoding is needed because loadHtml() seems to have a bug with UTF-8 encoding (see here and here).
The mb_substr trims the body tag from the output, so you get back your original content without any additional markup.
<?php
$html = '<p>Match this text and replace it</p>
<p><a>Don\'t match this text</a></p>
<p>We still need to match this text and replace itŐŰ</p>
<p><a>This is a link <span>with <strong>don\'t match this text</strong> content</span></a></p>';
$dom = new DOMDocument();
// loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
$replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
$newNode = $dom->createDocumentFragment();
$newNode->appendXML($replaced);
$node->parentNode->replaceChild($newNode, $node);
}
// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
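On PHP 8.2 and later, passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated. A commonly used alternative, sketched here as an assumption rather than as part of the answer above, is to prefix the markup with an XML-style encoding hint so that libxml reads the input as UTF-8. The snippet assumes the source file itself is saved as UTF-8.

```php
<?php
$html = '<p>We still need to match this text and replace itŐŰ</p>';

$dom = new DOMDocument();
// The leading pseudo declaration only serves as an encoding hint for the parser.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```

If the hint is picked up, the accented characters come back unchanged; checking this on the target PHP and libxml versions is advisable.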
References:
1. find and replace keywords by hyperlinks in an html fragment, via php dom
2. Regex / DOMDocument - match and replace text not in a link
3. php problem with russian language
4. Why Does DOM Change Encoding?
I read dozens of answers on the subject, so I am sorry if I forgot somebody (please leave a comment and I will add yours as well).
Thanks to Gordon and stillstanding for commenting on my other answer.
Try this one:
$dom = new DOMDocument;
$dom->loadHTML($html_content);
function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
if (!empty($dom->childNodes)) {
foreach ($dom->childNodes as $node) {
if ($node instanceof DOMText &&
!in_array($node->parentNode->nodeName, $excludeParents))
{
$node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
}
else
{
preg_replace_dom($regex, $replacement, $node, $excludeParents);
}
}
}
}
preg_replace_dom('/match this text/i', 'IT WORKS', $dom->documentElement, array('a'));
This is the stackless non-recursive approach using pre-order traversal of the DOM tree.
libxml_use_internal_errors(TRUE);
$dom=new DOMDocument('1.0','UTF-8');
$dom->substituteEntities=FALSE;
$dom->recover=TRUE;
$dom->strictErrorChecking=FALSE;
$dom->loadHTMLFile($file);
$root=$dom->documentElement;
$node=$root;
$flag=FALSE;
for (;;) {
if (!$flag) {
if ($node->nodeType==XML_TEXT_NODE &&
$node->parentNode->tagName!='a') {
$node->nodeValue=preg_replace(
'/match this text/is',
$replacement, $node->nodeValue
);
}
if ($node->firstChild) {
$node=$node->firstChild;
continue;
}
}
if ($node->isSameNode($root)) break;
if ($flag=$node->nextSibling)
$node=$node->nextSibling;
else
$node=$node->parentNode;
}
echo $dom->saveHTML();
libxml_use_internal_errors(TRUE); and the 3 lines of code after $dom=new DOMDocument; should be able to handle any malformed HTML.
$a='<p>Match this text and replace it</p>
<p><a>Don\'t match this text</a></p>
<p>We still need to match this text and replace it</p>';
echo preg_replace('~match this text(?![^<]*</a>)~i','replacement',$a);
The negative lookahead ensures the replacement happens only if the next tag is not a closing link tag (</a>). It works fine with your example, though it won't work if you happen to use other tags inside your links.
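Both the working case and the limitation can be seen in a small sketch. The two sample strings are made up for illustration, with a bare a element standing in for a real link:

```php
<?php
$pattern = '~match this text(?![^<]*</a>)~i';

// Link text directly followed by </a>: the lookahead blocks the replacement.
$plain = '<p>Match this text</p><p><a>Don\'t match this text</a></p>';

// Another tag inside the link: the next tag is </b>, so the text is replaced anyway.
$nested = '<p><a>Don\'t <b>match this text</b> here</a></p>';

$plainResult  = preg_replace($pattern, 'X', $plain);
$nestedResult = preg_replace($pattern, 'X', $nested);

echo $plainResult, PHP_EOL;
echo $nestedResult, PHP_EOL;
```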
You can use PHP Simple HTML DOM Parser. It is similar to DOMDocument, but in my opinion it's simpler to use.
Here is the alternative in parallel with Netcoder's DomDocument solution:
function replaceWithSimpleHtmlDom($html_content, $search, $replace, $excludedParents = array()) {
require_once('simple_html_dom.php');
$html = str_get_html($html_content);
foreach ($html->find('text') as $element) {
if (!in_array($element->parent()->tag, $excludedParents))
$element->innertext = str_ireplace($search, $replace, $element->innertext);
}
return (string)$html;
}
I have just profiled this code against my DomDocument solution (which prints exactly the same output), and DomDocument is (not surprisingly) much faster (~4 ms against ~77 ms).
<?php
$a = '<p>Match this text and replace it</p>
<p>Don\'t match this text</p>
<p>We still need to match this text and replace it</p>
';
$res = preg_replace("#[^<a.*>]match this text#",'replacement',$a);
echo $res;
?>
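This works for the example. I assume you really want a case-sensitive match, so it only matches the lowercase text. Be aware that [^<a.*>] is a character class: it consumes the single character before the text and only checks that it is not one of <, a, ., * or >.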
Parsing HTML with regexes is a huge challenge; the patterns can very easily end up too complex and take up loads of memory. I would say the best way is to replace everything first and then put the original text back inside links:
$text = preg_replace('/match this text/i', 'replacement text', $text);
$text = preg_replace('/(<a[^>]*>[^<]*)replacement text(.*?<\/a)/is', '$1match this text$2', $text);
If your replacement text is something which might occur otherwise, you might want to add an intermediate step with some unique identifier.
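A minimal sketch of that intermediate step might look like the following. The helper name `replace_outside_links`, its signature and the marker format are all assumptions made for illustration; it inherits the usual caveats of regex-based HTML handling and does not preserve the original letter case of text inside links.

```php
<?php
// Hypothetical helper: name, signature and marker format are assumptions.
function replace_outside_links(string $html, string $search, string $replacement): string
{
    // Generate a marker that does not already occur in the input.
    do {
        $token = '[[' . bin2hex(random_bytes(8)) . ']]';
    } while (strpos($html, $token) !== false);

    // Step 1: swap every occurrence (case-insensitively) for the marker.
    $html = str_ireplace($search, $token, $html);

    // Step 2: inside <a>...</a>, turn the marker back into the search text.
    // Note: the original letter case inside links is not restored.
    $html = preg_replace_callback(
        '~<a\b[^>]*>.*?</a>~is',
        function (array $m) use ($token, $search) {
            return str_replace($token, $search, $m[0]);
        },
        $html
    );

    // Step 3: any markers that remain are outside links.
    return str_replace($token, $replacement, $html);
}

// The replacement deliberately contains the search text.
$in = '<p>match this text</p><a href="#">match this text</a>';
echo replace_outside_links($in, 'match this text', 'match this text twice');
```

Because the marker never collides with real content, the restore step cannot accidentally touch replacement text that was already present in the document.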

Remove all attributes from html tags

I have this html code:
<p style="padding:0px;">
<strong style="padding:0;margin:0;">hello</strong>
</p>
How can I remove attributes from all tags? I'd like it to look like this:
<p>
<strong>hello</strong>
</p>
Adapted from my answer on a similar question
$text = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p>';
echo preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si",'<$1$2>', $text);
// <p><strong>hello</strong></p>
The RegExp broken down:
/ # Start Pattern
< # Match '<' at beginning of tags
( # Start Capture Group $1 - Tag Name
[a-z] # Match 'a' through 'z'
[a-z0-9]* # Match 'a' through 'z' or '0' through '9' zero or more times
) # End Capture Group
[^>]*? # Match anything other than '>', zero or more times, not greedy (won't eat the /)
(\/?) # Capture Group $2 - '/' if it is there
> # Match '>'
/si # End Pattern - s lets '.' match newlines (DOTALL), i makes it case-insensitive
With some quoting and the replacement text <$1$2>, it should strip any text after the tag name up to the end of the tag (/> or just >).
Please note: this isn't necessarily going to work on ALL input, as the Anti-HTML + RegExp camp will tell you. There are a few pitfalls, most notably <p style=">"> would end up as <p>">, plus a few other edge cases. I would recommend looking at Zend_Filter_StripTags as a more foolproof tag/attribute filter in PHP.
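The pitfall with a '>' inside an attribute value can be reproduced with a short sketch like this one. The sample inputs and the `$broken` / `$selfClosing` variable names are assumptions for illustration only.

```php
<?php
// Same pattern as in the answer above, written with single quotes.
$pattern = '/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si';

// The tag is cut at the first '>' inside the attribute value,
// leaving '">' behind as stray text.
$broken = preg_replace($pattern, '<$1$2>', '<p style=">">text</p>');
echo $broken, "\n";

// A self-closing tag keeps its trailing slash thanks to capture group $2.
$selfClosing = preg_replace($pattern, '<$1$2>', '<br class="x"/>');
echo $selfClosing, "\n";
```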
Here is how to do it with native DOM:
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($html); // load HTML into it
$xpath = new DOMXPath($dom); // create a new XPath
$nodes = $xpath->query('//*[@style]'); // Find elements with a style attribute
foreach ($nodes as $node) { // Iterate over found elements
$node->removeAttribute('style'); // Remove style attribute
}
echo $dom->saveHTML(); // output cleaned HTML
If you want to remove all possible attributes from all possible tags, do
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*');
foreach ($nodes as $node) {
$node->parentNode->removeAttribute($node->nodeName);
}
echo $dom->saveHTML();
I would avoid using regex as HTML is not a regular language and instead use a html parser like Simple HTML DOM
You can get a list of attributes that the object has by using attr. For example:
$html = str_get_html('<div id="hello">World</div>');
var_dump($html->find("div", 0)->attr);
/*
array(1) {
["id"]=>
string(5) "hello"
}
*/
foreach ( $html->find("div", 0)->attr as &$value ){
$value = null;
}
print $html;
//<div>World</div>
$html_text = '<p>Hello <b onclick="alert(123)" style="color: red">world</b>. <i>Its beautiful day.</i></p>';
$strip_text = strip_tags($html_text, '<b>');
$result = preg_replace('/<(\w+)[^>]*>/', '<$1>', $strip_text);
echo $result;
// Result: Hello <b>world</b>. Its beautiful day.
Another way to do it using PHP's DOMDocument class (without XPath) is to iterate over the attributes on a given node. Please note, due to the way PHP handles the DOMNamedNodeMap class, you must iterate backward over the collection if you plan on altering it. This behaviour has been discussed elsewhere and is also noted in the documentation comments. The same applies to the DOMNodeList class when it comes to removing or adding elements. To be on the safe side, I always iterate backwards with these objects.
Here is a simple example:
function scrubAttributes($html) {
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
for ($els = $dom->getElementsByTagName('*'), $i = $els->length - 1; $i >= 0; $i--) {
for ($attrs = $els->item($i)->attributes, $ii = $attrs->length - 1; $ii >= 0; $ii--) {
$els->item($i)->removeAttribute($attrs->item($ii)->name);
}
}
return $dom->saveHTML();
}
Here's a demo: https://3v4l.org/M2ing
Optimized regular expression from the top rated answer on this issue:
$text = '<div width="5px">a is less than b: a<b, ya know?</div>';
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);
// <div>a is less than b: a<b, ya know?</div>
UPDATE:
It works better when only some tags are allowed via PHP's strip_tags() function. Let's say we want to allow only <br>, <b> and <i> tags:
$text = '<i style=">">Italic</i>';
$text = strip_tags($text, '<br><b><i>');
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);
//<i>Italic</i>
As we can see it fixes flaws connected with tag symbols in attribute values.
Regexes are too fragile for HTML parsing. In your example, the following would strip out your attributes:
echo preg_replace(
"|<(\w+)([^>/]+)?|",
"<$1",
"<p style=\"padding:0px;\">\n<strong style=\"padding:0;margin:0;\">hello</strong>\n</p>\n"
);
Update
Make the second capture optional and do not strip '/' from closing tags:
|<(\w+)([^>]+)| to |<(\w+)([^>/]+)?|
Demonstrate this regular expression works:
$ phpsh
Starting php
type 'h' or 'help' to see instructions & features
php> $html = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello<br/></strong></p>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<p><strong>hello<br/></strong></p>
php> $html = '<strong>hello</strong>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<strong>hello</strong>
Hope this helps. It may not be the fastest way to do it, especially for large blocks of html.
If anyone has any suggestions as to make this faster, let me know.
function StringEx($str, $start, $end)
{
$str_low = strtolower($str);
$pos_start = strpos($str_low, $start);
$pos_end = strpos($str_low, $end, ($pos_start + strlen($start)));
if($pos_end==0) return false;
if ( ($pos_start !== false) && ($pos_end !== false) )
{
$pos1 = $pos_start + strlen($start);
$pos2 = $pos_end - $pos1;
$RData = substr($str, $pos1, $pos2);
if($RData=='') { return true; }
return $RData;
}
return false;
}
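$S = '<';
$E = '>';
// Replace each tag's contents with a placeholder, then restore empty brackets.
while ($RData = StringEx($DATA, $S, $E)) {
// StringEx returns boolean true for an empty tag; compare strictly,
// because any non-empty string is loosely equal to true.
if ($RData === true) { $RData = ''; }
$DATA = str_ireplace($S.$RData.$E, '||||||', $DATA);
}
$DATA = str_ireplace('||||||', $S.$E, $DATA);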
To do SPECIFICALLY what andufo wants, it's simply:
$html = preg_replace( "#(<[a-zA-Z0-9]+)[^\>]+>#", "\\1>", $html );
That is, he wants to strip anything but the tag name out of the opening tag. It won't work for self-closing tags of course.
Here's an easy way to get rid of attributes. It handles malformed html pretty well.
<?php
$string = '<p style="padding:0px;">
<strong style="padding:0;margin:0;">hello</strong>
</p>';
//get all html elements on a line by themselves
$string_html_on_lines = str_replace (array("<",">"),array("\n<",">\n"),$string);
//find lines starting with a '<' and any letters or numbers up to the first space. throw everything after the space away.
$string_attribute_free = preg_replace("/\n(<[\w123456]+)\s.+/i","\n$1>",$string_html_on_lines);
echo $string_attribute_free;
?>
