How to remove a link from content using PHP?

$text = file_get_contents('http://www.example.com/file.php?id=name');
echo preg_replace('#<a.*?>.*?</a>#i', '', $text);
The URL returns this content:
text text text. <br><a href='http://www.example.com' target='_blank' title='title' style='text-decoration:none;'>name</a>
What is the problem with this script?

You can't parse HTML with regular expressions. Use an XML/HTML parser.

Tempted to flag your question, but there's no option for "Report user for summoning Cthulhu"
I'd recommend reading: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Regular expressions are a poor fit for parsing HTML and were never intended for it. That's why there are HTML parsing libraries. Find and use one for PHP. :)
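
As a rough illustration of the parser route suggested above, here is a minimal sketch using PHP's bundled DOM extension. The sample string stands in for the fetched page, and the wrapper div is an assumption added only so the fragment has a single root element to read back from.

```php
<?php
// Hypothetical sample input standing in for the content returned by file_get_contents().
$html = "text text text. <br><a href='http://www.example.com' target='_blank'>name</a>";

$doc = new DOMDocument();
libxml_use_internal_errors(true);              // keep warnings about sloppy markup quiet
$doc->loadHTML('<div>' . $html . '</div>');    // wrapper div is an assumption of this sketch
libxml_clear_errors();

// Copy the live node list into an array before removing nodes from the tree.
$links = iterator_to_array($doc->getElementsByTagName('a'));
foreach ($links as $link) {
    $link->parentNode->removeChild($link);     // drops the link and its text
}

// Serialize only the children of the wrapper div.
$root = $doc->getElementsByTagName('div')->item(0);
$result = '';
foreach ($root->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

This removes each a element together with its text, which matches what the regex in the question was trying to do.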

Use <a[^>]+>[^<]*</a> (this works fine as long as there is only text and no tags inside the a element).

Use strip_tags this way:
$t = 'http://yoururl.com/test1.php';
$t1 = file_get_contents($t);
$text = strip_tags($t1);
This strips every tag from the page you are reading, including the <a> tags of the links, although the text inside the links is kept. See the reference as well, since it may not work for complicated elements: http://php.net/manual/en/function.strip-tags.php
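
To see what strip_tags actually does to a link, here is a small sketch using a hypothetical inline string instead of a fetched page. Note that the link text survives; only the markup goes away.

```php
<?php
// Hypothetical sample input; in the answer above this would come from file_get_contents().
$html = "text text text. <br><a href='http://www.example.com'>name</a>";

// Strip every tag: the <br> and the <a> markup disappear, the word "name" stays.
$plain = strip_tags($html);
echo $plain, "\n";      // text text text. name

// The second argument lists tags to keep.
$keepBr = strip_tags($html, '<br>');
echo $keepBr, "\n";     // text text text. <br>name
```

So strip_tags is a fit when you want to keep the link text, and not a fit when the text has to go as well.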

Related

preg_replace script, link tag not working

I used the following code to remove script and link tags from my string:
$contents='<script>inside tag</script>hfgkdhgjh<script>inside 2</script>';
$ss=preg_replace('#<script(.*?)>(.*?)</script>#is', '', $contents);
echo htmlspecialchars($ss);
It works fine. But can I use something closer to HTML parsing rather than preg_replace for this?
Here are a few things you can do:
htmlspecialchars() escapes those tags so they are rendered harmless
strip_tags() removes all HTML tags
But the technique you are using is the correct one. However, here is an improved version:
echo preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', "", $contents);
HTML Purifier is always a good choice. phpQuery has also come in handy a few times.
If you are sanitizing content, it's very easy to make mistakes with regular expressions... read this post. It just depends what you're trying to achieve.
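
For the parser-based alternative the question asks about, a minimal sketch with the built-in DOM extension might look like this. It reuses the $contents string from the question; the wrapper div and the variable names are assumptions of this sketch, and it is not a substitute for a real sanitizer such as HTML Purifier.

```php
<?php
$contents = '<script>inside tag</script>hfgkdhgjh<script>inside 2</script>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);                  // tolerate imperfect markup
$doc->loadHTML('<div>' . $contents . '</div>');    // wrapper div is an assumption of this sketch
libxml_clear_errors();

// Select every script and link element, then remove each one from its parent.
$xpath = new DOMXPath($doc);
foreach (iterator_to_array($xpath->query('//script | //link')) as $node) {
    $node->parentNode->removeChild($node);
}

// Read back only what is left inside the wrapper.
$root = $doc->getElementsByTagName('div')->item(0);
$clean = '';
foreach ($root->childNodes as $child) {
    $clean .= $doc->saveHTML($child);
}
echo htmlspecialchars($clean), "\n";
```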

preg_replace on text only and not inside href's

I have this code to do some ugly inline text style color formatting on html content.
But this breaks anything inside tags, such as links and email addresses.
I've figured out halfway how to prevent it from formatting links, but it still doesn't prevent the replacement when the text is info#mytext.com
$text = preg_replace('/(?<!\.)mytext(?!\/)/', '<span style="color:#DD1E32">my</span><span style="color:#002d6a">text</span>', $text);
What would be a better approach to replace only text and leave links untouched?
A better approach would be to use XML functions instead.
Your lookbehind assertion only tests one character, so it's insufficient to assert that a match is outside of HTML tags. This is a case where regular expressions aren't the best option. You can, however, get an approximation like:
preg_replace("/(>[^<]*)(?<![#.])(mytext)/", "$1<span>$2</span>", $text);
This would overlook the first occurrence of mytext if it's not preceded by an HTML tag. So it works best if $text = "<div>$text</div>" or something similar.
[edited] Oh, I see you solved the href problem.
To solve your email problem, change all occurrences of info#mytext.com to [email_safeguard] with str_replace before working on the text, and change them back when you're finished. :)
$text = str_replace('info#mytext.com','[email_safeguard]',$text);
//work on the text with preg_replace()
$text = str_replace('[email_safeguard]','info#mytext.com',$text);
That should do the trick. :)
But as people have mentioned before, you'd better avoid mixing HTML and regex, or you will suffer the wrath of Cthulhu.
See this instead

php - preg_match string not within the href attribute

I find regex kind of confusing, so I got stuck with this problem:
I need to insert <b> tags around certain keywords in a given text. The problem is that if the keyword is within the href attribute, it results in a broken link.
The code goes like this:
$text = preg_replace('/(\b'.$keyword.'\b)/i','<b>\1</b>',$text);
So for cases like
this keyword here
I end up with:
this <b>keyword</b> here
I tried all sorts of combinations but I still couldn't get the right pattern.
Thanks!
You can't do that with regex alone. Regular expressions are powerful, but they can't parse a recursive grammar like HTML.
Instead you should properly parse the HTML using an existing HTML parser. You just echo the HTML as-is until you encounter a text node. In that case, you run your preg_replace on the text before echoing it.
If your HTML is valid XHTML, you can use the xml_parse function. If it's not, then use whatever HTML parser is available.
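
One way the "only touch text nodes" idea could look with PHP's DOM extension is sketched below. The sample markup, the wrapper div and the variable names are assumptions made for illustration. Attributes such as href are never visited, because the XPath query selects text nodes only.

```php
<?php
// Hypothetical sample input: the keyword appears in an href and in plain text.
$html = '<a href="keyword.php">this keyword here</a> and another keyword';
$keyword = 'keyword';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<div>' . $html . '</div>');   // wrapper div is an assumption of this sketch
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$pattern = '/(\b' . preg_quote($keyword, '/') . '\b)/i';

foreach (iterator_to_array($xpath->query('//text()')) as $textNode) {
    // With one capture group, odd-indexed parts are the matched keywords.
    $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($parts) < 2) {
        continue;                              // keyword not present in this text node
    }
    $fragment = $doc->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $b = $doc->createElement('b');
            $b->appendChild($doc->createTextNode($part));
            $fragment->appendChild($b);
        } elseif ($part !== '') {
            $fragment->appendChild($doc->createTextNode($part));
        }
    }
    $textNode->parentNode->replaceChild($fragment, $textNode);
}

$root = $doc->getElementsByTagName('div')->item(0);
$result = '';
foreach ($root->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

The href stays intact while both visible occurrences of the keyword get wrapped in b tags.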
You can use preg_replace again after the first replacement to remove b tags from href:
$text=preg_replace('#(href="[^"]*)<b>([^"]*)</b>#i',"$1$2",$text);
Yes, you can use regex like that, but the code might become a little convoluted.
Here is a quick example
$string = '<a href="keyword.php">link text with keyword and stuff</a>';
$keyword = 'keyword';
$text = preg_replace(
    '/(<a href=")('.$keyword.')(.php">)(.*)(<\/a>)/',
    "$1$2$3<b>$4</b>$5",
    $string
);
echo $string."\n";
echo $text."\n";
The contents of each () group are stored in $1, $2 ... $n, so I don't have to type them again. The match can also be made more generic to handle different kinds of URL syntax if needed.
Seeing this solution, you might want to rethink the way you plan to match keywords in your code. :)
Output:
<a href="keyword.php">link text with keyword and stuff</a>
<a href="keyword.php"><b>link text with keyword and stuff</b></a>

Find Links and Remove them from HTML

How can I look for links in HTML and remove them?
$html = '<p>Test Title 1</p>';
$html .= '<p>Test Title 2</p>';
$html .= '<p>Test Title 3</p>';
$match = '<a href="javascript:doThis(\'Test Title 2\')">';
I want to remove the anchor but display the text. See below.
Test Title 1
Test Title 2
Test Title 3
I've never used regular expressions before, but maybe I can avoid them here. Let me know if I'm not clear.
Thanks
Mark
EDIT: It's not a client-side thing. I can't use JavaScript for this. I have a custom CMS and want to edit HTML stored in a database.
You may try the simplest thing:
echo strip_tags($html, '<p>');
This strips all tags except <p>
If you really like regexp:
echo preg_replace('=</?a(\s[^>]*)?>=ims', '', $html);
EDIT:
Delete the a tag AND its surrounding tags (the code gets messy and doesn't work with broken (X)HTML):
echo preg_replace('=<([a-z]+)[^>]*>\s*<a(\s[^>]*)?>(.*?)</a>\s*</\\1>=ims', '$3', $html);
However, if your problem is that complicated, I recommend that you try XPath.
You could see if Simple HTML DOM does the trick.
You might have some joy with Beautiful Soup - http://www.crummy.com/software/BeautifulSoup/ (Python HTML parsing / manipulation API)
sed -i -e 's/<a.*<\/a>//g' filename.html
Note that using regular expressions for hacking HTML is a... dubious proposition, but it might just work in practice ;-)
You can use
var foo = document.getElementsByTagName('a');
to fetch all the link tags. No need for regular expressions here...
EDIT: I'm just learning to read... ;) Go with PHP's DOM or XML abilities. It should be pretty easy using those.
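
A minimal sketch of the PHP DOM route mentioned above, assuming the stored HTML looks roughly like the hypothetical sample below (each title wrapped in an anchor). Each a element is unwrapped: its children are moved in front of it, then the empty anchor is removed, so the text stays.

```php
<?php
// Hypothetical sample input; the real HTML would come from the CMS database.
$html  = '<p><a href="javascript:doThis(\'Test Title 1\')">Test Title 1</a></p>';
$html .= '<p><a href="javascript:doThis(\'Test Title 2\')">Test Title 2</a></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<div>' . $html . '</div>');   // wrapper div is an assumption of this sketch
libxml_clear_errors();

foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
    // Move every child of the anchor in front of it, then drop the empty anchor.
    while ($a->firstChild) {
        $a->parentNode->insertBefore($a->firstChild, $a);
    }
    $a->parentNode->removeChild($a);
}

$root = $doc->getElementsByTagName('div')->item(0);
$result = '';
foreach ($root->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

To unwrap only one specific link, the loop could compare $a->getAttribute('href') against the wanted value before unwrapping.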
Open the HTML file in Microsoft Expression.
Press Ctrl+F and then choose to replace tags or tag attribute contents.
It's an easy and quick solution.
Thanks
Shomaail

php anchor tag regex

I have a bunch of strings, each containing an anchor tag and a URL.
Example string:
here is a link http://www.google.com. enjoy!
I want to strip out the anchor tags and everything in between.
Example result:
here is a link. enjoy!
The URLs in the href= portion don't always match the link text, however (sometimes there are shortened URLs, sometimes just descriptive text).
I'm having an extremely difficult time figuring out how to do this with either regular expressions or PHP functions. How can I remove an entire anchor tag/link from a string?
Thanks!
Looking at your result example, it seems like you're just removing the tags/content - did you want to keep what you stripped out or no? If not you might be looking for strip_tags().
You shouldn't use regex to parse HTML; use an HTML parser instead.
But if you must use regex, and the inner contents of your anchor tags are guaranteed to be free of HTML like </a>, and each string is guaranteed to contain only one anchor tag as in the example case, then, and only then, you can use something like:
Replace /^(.+)<a.+<\/a>(.+)$/ with $1$2
Since your problem seems to be very specific, I think this should do it:
$str = preg_replace('#\s?<a.*/a>#', '', $str);
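
As a caution about the greedy .* in that pattern, here is a small sketch with a hypothetical string that contains two links. The greedy version swallows everything between the first <a and the last /a>, while a lazy variant removes each link separately.

```php
<?php
// Hypothetical input with two anchors; the question's strings contain only one.
$str = 'one <a href="#a">A</a> two <a href="#b">B</a> three';

// Greedy: matches from the first "<a" to the last "/a>", taking " two" with it.
$greedy = preg_replace('#\s?<a.*/a>#', '', $str);
echo $greedy, "\n";   // one three

// Lazy: stops at the first closing tag, so each link is removed on its own.
$lazy = preg_replace('#\s?<a\b[^>]*>.*?</a>#s', '', $str);
echo $lazy, "\n";     // one two three
```

With exactly one anchor per string, as in the question, both patterns give the same result.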
Just use your normal PHP string functions.
$str='here is a link http://www.google.com. enjoy!';
$s = explode("</a>",$str);
foreach($s as $a=>$b){
    if( strpos( $b ,"href")!==FALSE ){
        $m=strpos("$b","<a");
        echo substr($b,0,$m);
    }
}
print end($s);
output
$ php test.php
here is a link . enjoy!
$string = 'here is a link http://www.google.com. enjoy!';
$text = strip_tags($string);
echo $text; //Outputs "here is a link . enjoy!"
