Removal of bad hyperlinks and the content inside of them - php

Ok, basically I have an array of bad urls and I would like to search through a string and strip them out. I want to strip everything from the opening tag to the closing tag, but only if the url in the hyperlink is in the array of bad urls. Here is how I would picture it working but I don't understand regular expressions well.
foreach($bad_urls as $bad_url){
$pattern = "/<a*$bad_url*</a>/";
$replacement = ' ';
preg_replace($pattern, $replacement, $content);
}
Thanks in advance.

Assuming that your 'bad urls' are properly formatted URLs, I would suggest doing something like this:
foreach ($bad_urls as $bad_url) {
    // Match an <a> tag whose href is exactly the bad URL, up to the nearest closing </a>.
    $pattern = '/<a\s[^>]*href="' . convert_to_pattern($bad_url) . '"[^>]*>.*<\/a>/isU';
    $replacement = ' ';
    $content = preg_replace($pattern, $replacement, $content);
}
and separately
function convert_to_pattern($url)
{
    // preg_quote() escapes every regex metacharacter, plus the "/" delimiter.
    return preg_quote($url, '/');
}

Please do not try to parse HTML using regular expressions. Just load up the HTML in a DOM, find all the <a> tags and check the href property. Much simpler and fool-proof.
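A minimal sketch of that DOM approach, using PHP's bundled DOMDocument. The function name remove_bad_links and the example URLs are made up for illustration, and the sketch assumes the href must match a bad URL exactly:

```php
<?php
// Hypothetical helper: removes every <a> element (and its contents)
// whose href is exactly one of the URLs in $badUrls.
function remove_bad_links(string $html, array $badUrls): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    // The XML prolog is a common trick to make loadHTML treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><body>' . $html . '</body>');
    libxml_clear_errors();

    // Copy the live node list first: removing nodes while iterating it skips elements.
    $links = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($links as $link) {
        if (in_array($link->getAttribute('href'), $badUrls, true)) {
            $link->parentNode->removeChild($link);
        }
    }

    // Serialize only the children of <body>, since loadHTML adds wrapper elements.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling remove_bad_links($content, $bad_urls) would then replace the whole foreach loop from the question. Matching on the parsed href also means attribute order, quoting style and tag case stop mattering.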

Related

PHP - Inner HTML recursive replace

I need to perform a recursive str_replace on a portion of HTML (with recursive I mean inner nodes first), so I wrote:
$str = //get HTML;
$pttOpen = '(\w+) *([^<]{1,100}?)';
$pttClose = '\w+';
$pttHtml = '(?:(?!(?:<x-)).+)';
while (preg_match("%<x-(?:$pttOpen)>($pttHtml)*</x-($pttClose)>%m", $str, $match)) {
list($outerHtml, $open, $attributes, $innerHtml, $close) = $match;
$newHtml = //some work....
str_replace($outerHtml, $newHtml, $str);
}
The idea is to first replace non-nested x-tags.
But it only works if innerHtml is on the same line as the opening tag (so I guess I misunderstood what the /m modifier does). I don't want to use a DOM library, because I just need simple string replacement. Any help?
Try this regex:
%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*)>(?P<innerHtml>.*)</x-(?P=open)>%s
Demo
http://regex101.com/r/nA2zO5
Sample code
$str = // get HTML
$pattern = '%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*)>(?P<innerHtml>.*)</x-(?P=open)>%s';
while (preg_match($pattern, $str, $matches)) {
$newHtml = sprintf('<ns:%1$s>%2$s</ns:%1$s>', $matches['open'], $matches['innerHtml']);
$str = str_replace($matches[0], $newHtml, $str);
}
echo htmlspecialchars($str);
Output
Initially, $str contained this text:
<x-foo>
sdfgsdfgsd
<x-bar>
sdfgsdfg
</x-bar>
<x-baz attr1='5'>
sdfgsdfg
</x-baz>
sdfgsdfgs
</x-foo>
It ends up with:
<ns:foo>
sdfgsdfgsd
<ns:bar>
sdfgsdfg
</ns:bar>
<ns:baz>
sdfgsdfg
</ns:baz>
sdfgsdfgs
</ns:foo>
Since I didn't know what work is done on $newHtml, I mimicked it by replacing x- with ns: and removing any attributes.
Thanks to @Alex I came up with this:
%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*?)>(?P<innerHtml>((?!<x-).)*)</x-(?P=open)>%is
Without the ((?!<x-).)* in the innerHtml pattern it won't work with nested tags (it will match the outer ones first, which isn't what I wanted). This way the innermost ones are matched first. Hope this helps.
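To see the innermost-first behaviour, here is a small sketch that records the order in which tags are matched. The sample string and the $order array are illustrative, and the replacement just renames x- to ns: as in the earlier answer:

```php
<?php
$pattern = '%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*?)>(?P<innerHtml>((?!<x-).)*)</x-(?P=open)>%is';
$str = '<x-foo>a<x-bar>b</x-bar>c</x-foo>';

$order = []; // tag names in the order they were matched
while (preg_match($pattern, $str, $m)) {
    $order[] = $m['open'];
    $newHtml = sprintf('<ns:%1$s>%2$s</ns:%1$s>', $m['open'], $m['innerHtml']);
    $str = str_replace($m[0], $newHtml, $str);
}

echo implode(',', $order), "\n"; // bar,foo
echo $str, "\n";                 // <ns:foo>a<ns:bar>b</ns:bar>c</ns:foo>
```

The outer x-foo cannot match on the first pass because its content still contains "<x-"; it only matches once x-bar has been rewritten.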
I don't know exactly what kind of changes you are trying to make, but this is how I would proceed:
$pattern = <<<'EOD'
~
<x-(?<tagName>\w++) (?<attributes>[^>]*+) >
(?<content>(?>[^<]++|<(?!/?x-))*) #by far more efficient than (?:(?!</?x-).)*
</x-\k<tagName>> # \k<name> is a backreference; \g<name> would be a subroutine call matching any tag name
~x
EOD;
function callback($m) { // example function
return '<n-' . $m['tagName'] . $m['attributes'] . '>' . $m['content']
. '</n-' . $m['tagName'] . '>';
};
do {
$code = preg_replace_callback($pattern, 'callback', $code, -1, $count);
} while ($count);
echo htmlspecialchars(print_r($code, true));

How to use preg_replace to remove part of url address?

I have some HTML code like this:
<a href="http://mysite.com/documentos/Servicios/SUCRE/sucDoc19.pdf&sa=U&ei=sf0JUrmjIc3Nswb154CgDQ&ved=0CCkQFjAA&usg=AFQjCNGfXg_9x83U3pYr6JfkJcWuXv8X0Q">
I need to clean my code to get something like this
<a href="http://mysite.com/documentos/Servicios/SUCRE/sucDoc19.pdf">
using preg_replace.
My code is the following:
$serp = preg_replace('&sa=(.*)" ', '" ', $serp);
and it doesn't work.
BTW, I need the match to stop at the FIRST quote, i.e. I need to replace everything from &sa= to the FIRST ", but right now it matches from &sa= to the LAST "...
You're missing the regex delimiters.
$serp = preg_replace('/&sa=(.*)" /', '" ', $serp);
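That makes the pattern valid. Note that (.*) is greedy, so it still runs to the last quote-plus-space on the line rather than the first.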
You missed the delimiter.
So your code looks like:
$serp = preg_replace('/&sa=(.*)" /', '" ', $serp);
Okay, if you want to delete everything up to the first quote, you can try the following instead of a regex:
$start = strpos($serp, '&sa=');
$end = strpos($serp, '"', $start);
$temp = substr($serp, $start, $end - $start); // third argument is a length, not an end position
$serp = str_replace($temp, "", $serp);
Just another regex to do it :)
$text = '<a href="http://mysite.com/documentos/Servicios/SUCRE/sucDoc19.pdf&sa=U&ei=sf0JUrmjIc3Nswb154CgDQ&ved=0CCkQFjAA&usg=AFQjCNGfXg_9x83U3pYr6JfkJcWuXv8X0Q" target="_blank">';
$text = preg_replace('/(&sa=[^"]*)/', '', $text);
echo $text;
// Output:
<a href="http://mysite.com/documentos/Servicios/SUCRE/sucDoc19.pdf" target="_blank">
You can try it HERE (thks to hjpotter92 for this tool)
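To make the greedy versus bounded difference concrete, here is a short comparison. The shortened URL is a stand-in for the one in the question:

```php
<?php
// Shortened, made-up stand-in for the tag in the question.
$html = '<a href="http://mysite.com/doc.pdf&sa=U&ei=abc" target="_blank">';

// Greedy: .* runs to the LAST double quote, so the target attribute is lost too.
$greedy = preg_replace('/&sa=.*"/', '"', $html);

// Bounded: [^"]* cannot cross a double quote, so it stops at the FIRST one.
$bounded = preg_replace('/&sa=[^"]*/', '', $html);

echo $greedy, "\n";  // <a href="http://mysite.com/doc.pdf">
echo $bounded, "\n"; // <a href="http://mysite.com/doc.pdf" target="_blank">
```

A lazy quantifier (.*?) followed by the quote would also stop at the first quote, but the negated class states the intent directly and avoids backtracking.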

Remove javascript links

I'm looking for a regex that will be able to replace all links like Link with a warning. I've been having a play but no success so far! I've always been bad with regex, can someone point me in the right direction? I have this so far:
Edit: To those saying not to use regex: the HTML will be the output of a markdown parser with all HTML tags in the markdown stripped. I therefore know that all links in the output will be formatted as stated above, so regex should be a good tool in this particular situation. I am not allowing users to enter pure HTML. SO has done something very similar: try creating a javascript link, and it will be removed.
<?php
//Javascript link filter test
if(isset($_POST['jsfilter'])){
$html = " JS Link ";
$pattern = "/ href\\s*?=\\s*?[\"']\\s*?(javascript)\\s*?(:).*?([\"']) /is";
$replacement = "\"javascript: alert('Javascript links have been blocked');\"";
$html = preg_replace($pattern, $replacement, $html);
echo $html;
}
?>
<form method="post">
<input type="text" name="jsfilter" />
<button type="submit">Submit</button>
</form>
The right regex should be :
$pattern = '/href="javascript:[^"]+"/';
$replacement = 'href="javascript:alert(\'Javascript links have been blocked\')"';
Use strip_tags() and htmlspecialchars() to display user-generated content. If you want to let users use specific tags, consider BBCode.
You should test quote and double quotes, handle white spaces, etc...
$html = preg_replace( '/href\s*=\s*"javascript:[^"]+"/i' , 'href="#"' , $html );
$html = preg_replace( '/href\s*=\s*\'javascript:[^\']+\'/i' , 'href=\'#\'' , $html );
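The two calls can also be folded into one by capturing the opening quote and requiring the same quote to close the value. This is only a sketch: the function name neutralize_js_links is made up, and it assumes every href value is quoted, as the markdown parser output in the question is said to be:

```php
<?php
// Hypothetical helper: replaces any quoted javascript: href with href="#".
function neutralize_js_links(string $html): string
{
    // (["\']) captures the opening quote; \1 demands the same quote at the end,
    // so a double quote inside a single-quoted value does not end the match early.
    return preg_replace('/href\s*=\s*(["\'])\s*javascript:.*?\1/is', 'href="#"', $html);
}

echo neutralize_js_links("<a href='javascript:alert(\"x\")'>JS Link</a>"), "\n";
// <a href="#">JS Link</a>
```

Ordinary links are left alone because the pattern only matches when the value starts with javascript:.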
Try this code. I think, this would help.
<?php
//Javascript link filter test
if(isset($_POST['jsfilter'])){
$html = " JS Link ";
$pattern = '/a href="javascript:(.*?)"/i';
$replacement = 'a href="javascript: alert(\'Javascript links have been blocked\');"';
$html = preg_replace($pattern, $replacement, $html);
echo $html;
}
?>

change youtube url to embed url in php

I found this code (Swap all youtube urls to embed via preg_replace()) to swap youtube urls (http://www.youtube.com/watch?v=CfDQ92vOfdc, or http://www.youtube.com/v/CfDQ92vOfdc) into youtube embed urls (http://www.youtube.com/embed/CfDQ92vOfdc) but it doesn't seem to be working? Any ideas? I don't know much about regular expression.
Here's the code:
$string = 'http://www.youtube.com/watch?v=CfDQ92vOfdc';
$search = '#<a (?:.*?)href=["\\\']http[s]?:\/\/(?:[^\.]+\.)*youtube\.com\/(?:v\/|watch\?(?:.*?\&)?v=|embed\/)([\w\-\_]+)["\\\']#ixs';
$replace = 'http://www.youtube.com/embed/$2';
$url = preg_replace($search,$replace,$string);
but it's still displaying as:
http://www.youtube.com/watch?v=CfDQ92vOfdc
instead of:
http://www.youtube.com/embed/CfDQ92vOfdc
Thanks in advance.
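One problem is that your expression expects an <a href="..."> tag around the address, and your string is a bare URL.
Another issue is that your $replace string references $2, but the pattern has only one capturing group, so it should be $1. (The quoting style is not the problem: preg_replace resolves $1 itself.)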
This simpler expression should work:
$string = 'http://www.youtube.com/watch?v=CfDQ92vOfdc';
$search = '/youtube\.com\/watch\?v=([\w-]+)/i'; // video IDs may contain "-" and "_"
$replace = "youtube.com/embed/$1";
$url = preg_replace($search,$replace,$string);
echo $url;
Either change
$string = 'http://www.youtube.com/watch?v=CfDQ92vOfdc';
to
$string = '<a href="http://www.youtube.com/watch?v=CfDQ92vOfdc" ></a>';
OR
$search = '#<a (?:.*?)href=["\\\']http[s]?:\/\/(?:[^\.]+\.)*youtube\.com\/(?:v\/|watch\?(?:.*?\&)?v=|embed\/)([\w\-\_]+)["\\\']#ixs';
to
$search = '#(.*?)(?:href="https?://)?(?:www\.)?(?:youtu\.be/|youtube\.com(?:/embed/|/v/|/watch\?.*?v=))([\w\-]{10,12}).*#x';
If anyone is still looking for a more direct solution, here is one I arrived at by experimenting with your code:
$string = $content;
$search = '/www\.youtube\.com\/watch\?v=([\w-]+)/i';
$replace = "<iframe width='560' height='315' src='https://youtube.com/embed/$1' frameborder='0' allowfullscreen></iframe>";
$content = preg_replace($search, $replace, $string);
NOTE: to choose which links are processed, edit the $search pattern. To process only www.youtube.com links, use
$search = '/www\.youtube\.com\/watch\?v=([\w-]+)/i';
To process bare youtube.com links as well, remove the www\. prefix:
$search = '/youtube\.com\/watch\?v=([\w-]+)/i';
Here is a function I wrote; echo its return value to see the result:
function youtube_url_to_embed($youtube_url) {
    // [\w-]+ also accepts the "-" and "_" characters that appear in video IDs
    $search = '/youtube\.com\/watch\?v=([\w-]+)/i';
    $replace = 'youtube.com/embed/$1';
    return preg_replace($search, $replace, $youtube_url);
}
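For completeness, a sketch that accepts the watch, /v/, /embed/ and youtu.be forms in one pattern. The function name youtube_embed_url is made up, and the 11-character ID length is an assumption based on how IDs currently look, not a documented guarantee:

```php
<?php
// Hypothetical helper: returns the embed URL, or null if no video ID is found.
function youtube_embed_url(string $url): ?string
{
    // Assumption: IDs are 11 characters drawn from letters, digits, "-" and "_".
    $pattern = '~^(?:https?://)?(?:www\.)?'
             . '(?:youtu\.be/|youtube\.com/(?:watch\?(?:[^#]*?&)?v=|v/|embed/))'
             . '([\w-]{11})~i';
    if (preg_match($pattern, $url, $m)) {
        return 'https://www.youtube.com/embed/' . $m[1];
    }
    return null;
}

echo youtube_embed_url('http://www.youtube.com/watch?v=CfDQ92vOfdc'), "\n";
// https://www.youtube.com/embed/CfDQ92vOfdc
```

Using preg_match and rebuilding the URL, instead of preg_replace on part of it, avoids leaving fragments of the original address around the result.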

PHP Regex to convert text before colon to link

I need to find the first occurrence of a colon ':' and take the complete string before that and append it to a link.
e.g.
username: #twitter nice site! RT www.google.com : visited!
needs to be converted to:
username: nice site! RT www.google.com : visited!
I've already got the following regex that converts the string #twitter to a clickable URL:
E.g.
$description = preg_replace("/#(\w+)/", "#\\1", $description);
Any ideas : )
I'd use string manipulation for this, rather than regex, using strstr, substr and strlen:
$username = strstr($description, ':', true);
$description = '' . $username . ''
. substr($description, strlen($username));
$regEx = "/^([^:\s]*)(.*?:)/";
$replacement = '\1\2'; // single quotes: in double quotes PHP reads \1 and \2 as octal escapes
I have not tested the code, but it should work as is. Basically you need to capture after #twitter too.
$description = preg_replace("%([^:]+): #twitter (.+)%i",
"#\\1: \\2",
$description);
The following should work -
$description = preg_replace("/^(.+?):\s#twitter\s(.+?)$/", "#\\1: \\2", $description);
Direct answer to your question:
$string = preg_replace('/^(.*?):/', '$1:', $string);
But I assume that you are parsing twitter RSS or something similar. So you can just use /^(\w+)/.
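A fuller sketch of the strstr approach suggested above. The link markup did not survive in the examples, so the target URL (https://twitter.com/ followed by the username) and the function name link_username are assumptions for illustration only:

```php
<?php
// Hypothetical helper: wraps everything before the first colon in a link.
// Assumption: the link should point at https://twitter.com/<username>.
function link_username(string $description): string
{
    $username = strstr($description, ':', true); // text before the first ":"
    if ($username === false) {
        return $description; // no colon, nothing to link
    }
    $link = '<a href="https://twitter.com/' . rawurlencode($username) . '">'
          . htmlspecialchars($username, ENT_QUOTES) . '</a>';
    // Keep the rest of the string, starting at the colon, untouched.
    return $link . substr($description, strlen($username));
}

echo link_username('username: nice site! RT www.google.com : visited!'), "\n";
// <a href="https://twitter.com/username">username</a>: nice site! RT www.google.com : visited!
```

Only the first colon matters here, so the later " : visited!" is left as it is.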
