How to remove link with preg_replace();? - php

I'm not sure how to explain this, so I'll show it on my code.
First and
Second and
Third
how can I delete opening and closing but not the rest?
I'm asking for preg_replace(); and I'm not looking for DomDocument or others methods to do it. I just want to see example on preg_replace();
how is it achievable?

Only pick the groups you want to preserve:
$pattern = '~()([^<]*)()~';
// 1 2 3
$result = preg_replace($pattern, '$2', $subject);
You find more examples on the preg_replace manual page.

Since you asked me in the comments to show any method of doing this, here it is.
$html =<<<HTML
First and
Second and
Third
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$elems = $xpath->query("//a[#class='delete']");
foreach ($elems as $elem) {
$elem->parentNode->removeChild($elem);
}
echo $dom->saveHTML();
Note that saveHTML() saves a complete document even if you only parsed a fragment.
As of PHP 5.3.6 you can add a $node parameter to specify the fragment it should return - something like $xpath->query("/*/body")[0] would work.

$pattern = '/<a (.*?)href=[\"\'](.*?)\/\/(.*?)[\"\'](.*?)>(.*?)<\/a>/i';
$new_content = preg_replace($pattern, '$5', $content);

$pattern = '/<a[^<>]*?class="delete"[^<>]*?>(.*?)<\/a>/';
$test = 'First and Second and Third';
echo preg_replace($pattern, '$1', $test)."\n";
$test = 'First and <b class="delete">seriously</b> and Third';
echo preg_replace($pattern, '$1', $test)."\n";
$test = 'First and <b class="delete">seriously</b> and Third';
echo preg_replace($pattern, '$1', $test)."\n";
$test = 'First and <a class="delete" href="url2.html">Second</a> and Third';
echo preg_replace($pattern, '$1', $test)."\n";

preg_replace('#(.+?)#', '$1', $html_string);
It is important to understand this is not an ideal solution. First, it requires markup in this exact format. Second, if there were, say, a nested anchor tag (albeit unlikely) this would fail. These are some of the many reasons why Regular Expressions should not be used for parsing/manipulating HTML.

Related

Add target=_blank to outside links

I want to add target="_blank" to <a> tags to open that link in a new page, so I found this RegEx :
$content = preg_replace('/(<a href[^<>]+)>/is', '\\1 target="_blank">', $content);
This will work without any problem, but this code will add target="_blank" to all links, I want to add just to links which href will start with http://
How can I do this?
You've asked for a regular expression here, but it's not the right tool for the job.
$doc = new DOMDocument;
$doc->loadHTML($html); // Load your HTML
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a[starts-with(#href, "http://")]');
foreach($links as $link) {
$link->setAttribute('target', '_blank');
}
echo $doc->saveHTML();
If you want to exclude internal links as suggested in the comments, you can do:
$links = $xpath->query('//a[starts-with(#href, "http://") and
not(starts-with(#href, "http://yoursite.com")) and
not(starts-with(#href, "http://www.yoursite.com))]');
You can use this regex:
(<a\b[^<>]*href=['"]?http[^<>]+)>
See demo.
I have added \b[^<>]* to account for any other attributes before href.
Sample code:
$re = "/(<a\\b[^<>]*href=['\"]?http[^<>]+)>/is";
$str = "<a href=\"do.com\">\n<a href=\"do.com\">\n<a another=\"val\" href=\"http://do.com\">\n";
$subst = "$1 target=\"_blank\">";
$result = preg_replace($re, $subst, $str);

Add id attribute to hyperlinks through PHP Regular Expressions

I am still relatively new to Regular Expressions and feel My code is being too greedy. I am trying to add an id attribute to existing links in a piece of code. My functions is like so:
function addClassHref($str) {
//$str = stripslashes($str);
$preg = "/<[\s]*a[\s]*href=[\s]*[\"\']?([\w.-]*)[\"\']?[^>]*>(.*?)<\/a>/i";
preg_match_all($preg, $str, $match);
foreach ($match[1] as $key => $val) {
$pattern[] = '/' . preg_quote($match[0][$key], '/') . '/';
$replace[] = "<a id='buttonRed' href='$val'>{$match[2][$key]}</a>";
}
return preg_replace($pattern, $replace, $str);
}
This adds the id tag like I want but it breaks the hyperlink. For example:
If the original code is : Link
Instead of <a id="class" href="http://www.google.com">Link</a>
It is giving
<a id="class" href="http">Link</a>
Any suggestions or thoughts?
Do not use regular expressions to parse XML or HTML.
$doc = new DOMDocument();
$doc->loadHTML($html);
$all_a = $doc->getElementsByTagName('a');
$firsta = $all_a->item(0);
$firsta->setAttribute('id', 'idvalue');
echo $doc->saveHTML($firsta);
You've got some overcomplications in your regex :)
Also, there's no need for the loop as preg_replace() will hit all the instances of the search pattern in the relevant string. The first regex below will take everything in the a tag and simply add the id attribute on at the end.
$str = 'Link' . "\n" .
'Link' . "\n" .
'Link';
$p = "{<\s*a\s*(href=[^>]*)>([^<]*)</a>}i";
$r = "<a $1 id=\"class\">$2</a>";
echo preg_replace($p, $r, $str);
If you only want to capture the href attribute you could do the following:
$p = '{<\s*a\s*href=["\']([^"\']*)["\'][^>]*>([^<]*)</a>}i';
$r = "<a href='$1' id='class'>$2</a>";
Your first subpattern ([\w.-]*) doesn't match :, thus it stops at "http".
Couldn't you just use a simple str_replace() for this? Regex seems like overkill if this is all you're doing.
$str = str_replace('<a ', '<a id="someID" ', $str);

preg_replace - How to remove contents inside a tag?

Say I have this.
$string = "<div class=\"name\">anyting</div>1234<div class=\"name\">anyting</div>abcd";
$regex = "#([<]div)(.*)([<]/div[>])#";
echo preg_replace($regex,'',$string);
The output is
abcd
But I want
1234abcd
How do I do it?
Like this:
preg_replace('/(<div[^>]*>)(.*?)(<\/div>)/i', '$1$3', $string);
If you want to remove the divs too:
preg_replace('/<div[^>]*>.*?<\/div>/i', '', $string);
To replace only the content in the divs with class name and not other classes:
preg_replace('/(<div.*?class="name"[^>]*>)(.*?)(<\/div>)/i', '$1$3', $string);
$string = "<div class=\"name\">anything</div>1234<div class=\"name\">anything</div>abcd";
echo preg_replace('%<div.*?</div>%i', '', $string); // echo's 1234abcd
Live example:
http://codepad.org/1XEC33sc
add ?, it will find FIRST occurence
preg_replace('~<div .*?>(.*?)</div>~','', $string);
http://sandbox.phpcode.eu/g/c201b/3
This might be a simple example, but if you have a more complex one, use an HTML/XML parser. For example with DOMDocument:
$doc = DOMDocument::loadHTML($string);
$xpath = new DOMXPath($doc);
$query = "//body/text()";
$nodes = $xpath->query($query);
$text = "";
foreach($nodes as $node) {
$text .= $node->wholeText;
}
Which query you have to use or whether you have to process the DOM tree in some other way, depends on the particular content you have.

Remove all text within specific tags

I am interesting in removing all the text within the following tags:
<p class="wp-caption-text">Remove this text</p>
Can anybody give me an idea of how this can be done in php?
Thank you very much
Get rid of the tag and content inside of it:
$content = preg_replace('/<p\sclass=\"wp\-caption\-text\">[^<]+<\/p>/i', '', $content);
or if you want to preserve the tags:
$content = preg_replace('/(<p\sclass=\"wp\-caption\-text\">)[^<]+(<\/p>)/i', '$1$2', $content);
As bit higher-level alternative to regular expressions.
You can process with DOM. You can match all nodes you're looking for with XPath //p[#class="wp-caption-text"].
For example:
$doc = new DOMDocument();
$doc->loadHTML($yourHTMLasString);
$xpath = new DOMXPath($doc);
$query = '//p[#class="wp-caption-text"]';
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$entry->textContent = '';
}
echo $doc->saveHTML();
Try this:
$string = '<p class="wp-caption-text">Remove this text</p>';
$pattern = '/(.*<p .*>).*(<\/p>.*)/';
$replacement = '$1$2';
echo preg_replace($pattern, $replacement, $string);
if its always the same tag you could simply do search for the string. use the position resulting to substring from it to the closing tag.
Or you could use a regular expression, there are good ones posted here that can help you.

Using regex to remove HTML tags

I need to convert
$text = 'We had <i>fun</i>. Look at this photo of Joe';
[Edit] There could be multiple links in the text.
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
All HTML tags are to be removed and the href value from <a> tags needs to be added like above.
What would be an efficient way to solve this with regex? Any code snippet would be great.
First do a preg_replace to keep the link. You could use:
preg_replace('(.*?)', '$\2 ($\1)', $str);
Then use strip_tags which will finish off the rest of the tags.
try an xml parser to replace any tag with it's inner html and the a tags with its href attribute.
http://www.php.net/manual/en/book.domxml.php
The DOM solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[#href]') as $node) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
and the same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $node) {
if($node->hasAttribute('href')) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
}
echo strip_tags($dom->saveHTML());
All it does is load any HTML into a DomDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, and gets all links with an href attribute. It then creates a text node element from the innerHTML and the href attribute and replaces the link. The second version just uses the DOM API and no Xpath.
Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.
I've done things like this using variations of substring and replace. I'd probably use regex today but you wanted an alternative so:
For the <i> tags, I'd do something like:
$text = replace($text, "<i>", "");
$text = replace($text, "</i>", "");
(My php is really rusty, so replace may not be the right function name -- but the idea is what I'm sharing.)
The <a> tag is a bit more tricky. But, it can be done. You need to find the point that <a starts and that the > ends with. Then you extract the entire length and replace the closing </a>
That might go something like:
$start = strrpos( $text, "<a" );
$end = strrpos( $text, "</a>", $start );
$text = substr( $text, $start, $end );
$text = replace($text, "</a>", "");
(I don't know if this will work, again the idea is what I want to communicate. I hope the code fragments help but they probably don't work "out of the box". There are also a lot of possible bugs in the code snippets depending on your exact implementation and environment)
Reference:
strrpos - http://www.php.net/manual/en/function.strrpos.php
replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php
It's also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at this photo of Joe');
$a = $html->find('a');
$a[0]->outertext = "{$a[0]->innertext} ( {$a[0]->href} )";
echo strip_tags($html);
And that produces the code you want in your test case.

Categories