Omitting a specific pattern from a regex statement - php

I've spent the last couple of days trying to figure out how to resolve this particular issue and posting on SO, but no dice so far. I think this is probably easier than I've been making it out to be, but I need some help.
Here is a pretty basic regex statement that linkifies pretty much any link. It's not the only regex pattern I have, so I've included a piece that skips over the link if it includes the specific pattern "img.youtube.com/vi/". It works great:
$message = preg_replace("#(((f|ht)tp(s)?://)?!(img.youtube.com/vi/)[-a-zA-Z?-??-?()0-9#:%_+.~\#?&;//=,])+#i", "<a href=$1 target='_blank'><b>$1</b></a>", $message);
I do not want this to linkify any URL ending in .jpeg, .jpg, .gif, or any other popular image format. I have another expression that embeds those kinds of links (and it works fine, too), so I need to find a way to get this expression to reject them.
I've gotten advice on negative lookarounds and on matching specific strings, but none of it has worked so far. I need to find a way to get this regex to ignore any URL that ends with .jpeg and so forth.
So, the regex statement above already has an example of a string that disqualifies certain URLs - ?!(img.youtube.com/vi/). This seems like that's all I need to do, but where do I put it and how does it look? The + symbol in the statement makes it so that the regex will scrutinize the string all the way to the end of it, using the matching characters of [-a-zA-Z?-??-?()0-9#:%_+.~#?&;//=,]. So, this matching string should probably be put somewhere before the + symbol. Does it go in "?!(img.youtube.com/vi/)" ? In my mind, it should probably look like this;
$message = preg_replace("#(((f|ht)tp(s)?://)?!(img.youtube.com/vi/|/^\.jpeg$/|/^\.jpg$/|/^\gif$/)[-a-zA-Z?-??-?()0-9#:%_+.~\#?&;//=,])+#i",
"<a href=$1 target='_blank'><b>$1</b></a>", $message);
Any help is appreciated.

Here is an answer that also cleans up your regexp:
(?i)((?:f|ht)tps?://((?!img|jpe?g|gif|png|bmp))(?:([-a-z0-9()#:%_+.~#?&;/=,])(?2))+(?!(?3)))
Now the img etc. that you don't want is in the negative lookahead, and you can add other things you don't want to match.
$good="http://www.google.com/";
$bad="http://img.google.com/";
$r="#(?i)((?:f|ht)tps?://((?!img|jpe?g|gif|png|bmp))(?:([-a-z0-9()#:%_+.~\#?&;/=,])(?2))+(?!(?3)))#";
$rep="<a href=$1 target='_blank'><b>$1</b></a>";
echo preg_replace($r,$rep,$good);
echo preg_replace($r,$rep,$bad);
You can try here http://ideone.com/419yfm

Just remove this part of the regex:
img|
<?php
$good="http://www.google.com/";
$bad="http://img.google.com/";
$r="#(?i)((?:f|ht)tps?://((?!jpe?g|gif|png|bmp))(?:([-a-z0-9()#:%_+.~\#?&;/=,])(?2))+(?!(?3)))#";
$rep="<a href=$1 target='_blank'><b>$1</b></a>";
echo preg_replace($r,$rep,$good); echo "\n";
echo preg_replace($r,$rep,$bad);
?>
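A different way to get at the original goal (skipping image URLs) is to match every URL and decide in a callback whether to linkify it. Note that in the patterns above the lookahead sits directly after ://, so it tests the start of the host name rather than the end of the URL. What follows is only a sketch: the function name linkify_non_images, the extension list, and the htmlspecialchars escaping are my own assumptions, not something from the answers above.

```php
<?php
// Hypothetical helper (not from the thread): linkify URLs, but leave image
// links and YouTube thumbnail links untouched.
function linkify_non_images(string $message): string
{
    // Character class trimmed from the question's pattern; "#" is escaped
    // because it is also the pattern delimiter.
    $urlPattern = '#\b(?:f|ht)tps?://[-a-z0-9()\#:%_+.~?&;/=,]+#i';

    $result = preg_replace_callback($urlPattern, function (array $m): string {
        $url = $m[0];
        // Skip thumbnails and anything whose path ends in an image extension
        // (optionally followed by a query string or fragment).
        if (stripos($url, 'img.youtube.com/vi/') !== false
            || preg_match('#\.(?:jpe?g|gif|png|bmp)(?:[?\#].*)?$#i', $url)) {
            return $url;
        }
        $safe = htmlspecialchars($url, ENT_QUOTES);
        return "<a href='$safe' target='_blank'><b>$safe</b></a>";
    }, $message);

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $message;
}

echo linkify_non_images('see http://www.google.com/ now'), "\n";
echo linkify_non_images('pic http://example.com/a.jpg'), "\n";
```

Because the character class includes "." and ",", punctuation that directly follows a URL in running text gets pulled into the match, so real input may need a trailing-punctuation trim.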

Related

make link clickables with preg_replace

I want to make my links automatically clickables, but it doesn't work.
Here's my code:
$val['message'] = preg_replace('#https?://(w{3}.)?([a-zA-Z0-9_-]{1,20}(.[a-zA-Z0-9_-]{1,10}))(/[a-zA-Z0-9_-]{1,12}(/[a-zA-Z0-9_-]{1,12}))?(/([a-zA-Z0-9_-]{1,20})(.[a-zA-Z0-9_-]{1,7}))?(\?[a-zA-Z0-9_-]{1,7}=[a-zA-Z0-9_-]{1,7}(&[a-zA-Z0-9_-]{1,7}=[a-zA-Z0-9_-]{1,7}))?#is', '$0', $val['message']);
(here is my preg thing, but with lines:)
'https?://
(w{3}.)?
([a-zA-Z0-9_-]{1,20}(.[a-zA-Z0-9_-]{1,10}))
(/[a-zA-Z0-9_-]{1,12}(/[a-zA-Z0-9_-]{1,12}))?
(/([a-zA-Z0-9_-]{1,20})
(.[a-zA-Z0-9_-]{1,7}))?
(\?[a-zA-Z0-9_-]{1,7}=[a-zA-Z0-9_-]{1,7}
(&[a-zA-Z0-9_-]{1,7}=[a-zA-Z0-9_-]{1,7}))?
I also tried this:
$val['message'] = preg_replace("#(([\w]+?://[\w#$%&~.-;:=,?#[]+])(/[\w#$%&~/.-;:=,?#[]+])?)#is", "$1", $val['message']);
but it doesn't work with links like https://www.youtube.com/watch?v=videolink
Try this regex, worked for me:
(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?
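For reference, this is roughly how that pattern could be dropped into preg_replace. It is only a sketch: the sample text and the anchor markup in the replacement are my own assumptions, and the matched URL is not HTML-escaped here.

```php
<?php
// The pattern from the answer above, wrapped in "/" delimiters.
$pattern = '/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/';

// Made-up sample text using the URL from the question.
$text = 'Watch https://www.youtube.com/watch?v=videolink today';

// $0 is the whole match; the <a> markup is an assumption for illustration.
$linked = preg_replace($pattern, '<a href="$0">$0</a>', $text);
echo $linked, "\n";
```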
Why does everyone like to try to make their own regex for this? Linkifying links is hard work with lots of edge cases, not to mention what should or shouldn't be included in the link, e.g.
Are you talking about youtube.com?
I like the ASP.net language
I wonder what www.stackoverflow.com counts as a link
Parentheses are a particular pain in the butt (example: http://example.com/?auth=gH;2($Hd)DA0;QAb)
Aside: in the last line above, StackOverflow's preview section links everything until the last closing bracket, but after submission it only links up to the first punctuation mark bracket. Helps prove my point about how hard this is to get right and consistent though!
Best to use something established, example:
https://github.com/misd-service-development/php-linkify
For something a bit more quick n dirty:
http://buildinternet.com/2010/05/how-to-automatically-linkify-text-with-php-regular-expressions/

Trying to stop regex at a tag

I know there are other posts with a similar name but I've looked through them and they haven't helped me resolve this.
I'm trying to get my head around regex and preg_match. I am going through a body of text and each time a link exists I want it to be extracted. I'm currently using the following:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
which works fine until it finds one that has <br after it. Then I get the url plus the <br which means it doesn't work correctly. How can I have it so that it stops at the < without including it?
Also, I have been looking everywhere for a clear explanation of using regex and I'm still confused by it. Has anyone any good guides on it for future reference?
\S* is too broad. In particular, I could inject into your code with a URL like:
http://hax.hax/"><script>alert('HAAAAAAAX!');</script>
You should only allow characters that are allowed in URLs:
[-A-Za-z0-9._~:/?#[\]@!$&'()*+,;=]*
Some of these characters are only allowed in specific places (such as ?), so if you want better validation you will need more cleverness.
Instead of \S exclude the open tag char from the class:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^<]*)?/";
You might even want to be more restrictive by only allowing characters valid in URLs:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[a-zA-Z_\-\.%\?&]*)?/";
(or some more characters)
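To see the effect of restricting the path characters, here is a small sketch. The sample text is made up, and the path character class is my assumption (the RFC 3986 reserved and unreserved characters plus "%"); the host part is the one from the question.

```php
<?php
// Made-up input: one URL followed by <br, and one injection attempt.
$text = 'Visit http://example.com/page?id=1<br />or http://hax.hax/"><script>alert(1)</script>';

// Host part as in the question; the path only accepts URL characters,
// so the match stops at "<" or at a double quote.
$pattern = '~(?:https?|ftps?)://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,3}'
         . '(?:/[-A-Za-z0-9._\~:/?#\[\]@!$&\'()*+,;=%]*)?~';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
```

Note that a single quote is a legal URL character and stays in the class, so the extracted URL should still be escaped with htmlspecialchars before it goes into an attribute.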
You could use the one presented here:
http://regex101.com/r/zV1uI7
At the bottom of that page it is explained step by step.

PHP / RegEx - Convert URLs to links by detecting .com/.net/.org/.edu etc

I know there have been many questions asking for help converting URLs to clickable links in strings, but I haven't found quite what I'm looking for.
I want to be able to match any of the following examples and turn them into clickable links:
http://www.domain.com
https://www.domain.net
http://subdomain.domain.org
www.domain.com/folder
subdomain.domain.net
subdomain.domain.edu/folder/subfolder
domain.net
domain.com/folder
I do not want to match random.stuff.separated.with.periods.
EDIT: Please keep in mind that these URLs need to be found within larger strings of 'normal' text. For example, I want to match 'domain.net' in "Hello! Come check out domain.net!".
I think this could be accomplished with a regex that can determine whether the matching url contains .com, .net, .org, or .edu followed by either a forward slash or whitespace. Other than a user typo, I can't imagine any other case in which a valid URL would have one of those followed by anything else.
I realize there are many valid domain extensions out there, but I don't need to support them all. I can just choose which to support with something like (com|net|org|edu) in the regex. Unfortunately, I'm not skilled enough with regex yet to know how to properly implement this.
I'm hoping someone can help me find a regular expression (for use with PHP's preg_replace) that can match URLs based on just about any text connected by one or more dots and either ending with one of the specified extensions followed by whitespace OR containing one of the specified extensions followed by a slash and possibly folders.
I did several searches and so far have not found what I'm looking for. If there already exists a SO post that answers this, I apologize.
Thanks in advance.
--- EDIT 3 ---
After days of trial and error and some help from SO, here's what works:
preg_replace_callback('#(\s|^)((https?://)?(\w|-)+(\.(\w+|-)*)+(?<=\.net|org|edu|com|cc|br|jp|dk|gs|de)(\:[0-9]+)?(?:/[^\s]*)?)(?=\s|\b)#is',
create_function('$m', 'if (!preg_match("#^(https?://)#", $m[2]))
return $m[1]."".$m[2].""; else return $m[1]."".$m[2]."";'),
$event_desc);
This is a modified version of anubhava's code below and so far seems to do exactly what I want. Thanks!
You can use this regex:
#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is
Code:
$arr = array(
'http://www.domain.com/?foo=bar',
'http://www.that"sallfolks.com',
'This is really cool site: https://www.domain.net/ isn\'t it?',
'http://subdomain.domain.org',
'www.domain.com/folder',
'Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself',
'subdomain.domain.net',
'subdomain.domain.edu/folder/subfolder',
'Hello! Check out my site at domain.net!',
'welcome.to.computers',
'Hello.Come visit oursite.com!',
'foo.bar',
'domain.com/folder',
);
foreach($arr as $url) {
$link = preg_replace_callback('#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is',
create_function('$m', 'if (!preg_match("#^(https?://)#", $m[2]))
return $m[1]."".$m[2].""; else return $m[1]."".$m[2]."";'),
$url);
echo $link . "\n";
}
OUTPUT:
http://www.domain.com/?foo=bar
http://www.that"sallfolks.com
This is really cool site: https://www.domain.net/ isn't it?
http://subdomain.domain.org
www.domain.com/folder
Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself
subdomain.domain.net
subdomain.domain.edu/folder/subfolder
Hello! Check out my site at domain.net!
welcome.to.computers
Hello.Come visit oursite.com!
foo.bar
domain.com/folder
PS: This regex only supports http and https scheme in URL. So eg: if you want to support ftp also then you need to modify the regex a little.
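Since create_function was removed in PHP 8.0, the same idea can be written with an anonymous function. This is only a sketch: the regex is the one from this answer, but the function name linkify_known_tlds is hypothetical, and the anchor markup is my assumption, because the callback strings shown above appear to have lost their HTML.

```php
<?php
// Hypothetical helper: linkify URLs ending in .net/.org/.edu/.com, adding
// "http://" to the href when the text has no scheme.
function linkify_known_tlds(string $text): string
{
    $pattern = '#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // $m[1] is the leading whitespace (or empty), $m[2] is the URL itself.
        $href = preg_match('#^https?://#i', $m[2]) ? $m[2] : 'http://' . $m[2];
        // Assumed markup: a plain <a> tag with escaped href and link text.
        return $m[1] . '<a href="' . htmlspecialchars($href) . '">'
             . htmlspecialchars($m[2]) . '</a>';
    }, $text);

    // Fall back to the input if the regex engine reports an error.
    return $result ?? $text;
}

echo linkify_known_tlds('Hello! Check out my site at domain.net!'), "\n";
echo linkify_known_tlds('welcome.to.computers'), "\n";
```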
'/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/]*/'
That works for your examples. You might want to add extra characters support for "-", "&", "?", ":", etc in the last bracket.
'/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/\?=&-;]*/'
This will support parameters and port numbers.
eg.: www.foo.ca:8888/test?param1=val1&param2=val2
Thanks a ton. I modified his final solution to allow all domains (.ca, .co.uk), not just the specified ones.
$html = preg_replace_callback('#(\s|^)((https?://)?(\w|-)+(\.[a-z]{2,3})+(\:[0-9]+)?(?:/[^\s]*)?)(?=\s|\b)#is',
create_function('$m', 'if (!preg_match("#^(https?://)#", $m[2])) return $m[1]."".$m[2].""; else return $m[1]."".$m[2]."";'),
$url);

regular expression for replacing all links but css and js

I want to download a site and replace all links on that site with internal links.
That's easy:
$page=file_get_contents($url);
$local=$_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF'];
$page=preg_replace('/href="(.+?)"/','href="http://'.$local.'?href=\\1"',$page);
but i want to exclude all css files and js files from replacing, so i tried:
$regex='/href="(.+?(?!(\.js|\.css)))"/';
$page=preg_replace($regex,'href="http://'.$local.'?href=\\1"',$page);
but that didn't work. What am I doing wrong?
I thought ?! is a negative lookahead.
To answer your regex question, you need a lookbehind there and better limit the match with a character class:
$regex = '/href="([^"]+(?<!\.js|\.css))"/';
The charclass first matches the whole link content, then asserts that this didn't end in .js or .css.
You might want to augment the whole match with <a\s[^>]*? even, so it really just finds anything that looks like a link.
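Here is a quick sketch of the lookbehind version in action. The sample markup and the value of $local are made up for illustration; in the question, $local is built from $_SERVER.

```php
<?php
// Assumed stand-in for $_SERVER['HTTP_HOST'] . $_SERVER['PHP_SELF'].
$local = 'localhost/proxy.php';

// Made-up page fragment with one stylesheet, one page link, and one script link.
$page = '<link href="style.css"><a href="page.html">x</a><a href="app.js">y</a>';

// [^"]+ consumes the whole attribute value; the lookbehind then rejects
// values that end in .js or .css.
$regex = '/href="([^"]+(?<!\.js|\.css))"/';
$page  = preg_replace($regex, 'href="http://' . $local . '?href=$1"', $page);
echo $page, "\n";
```

Only the page.html link is rewritten. The greedy [^"]+ cannot stop early to dodge the lookbehind, because the closing quote has to follow immediately.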
Another option would be using domdocument or querypath for such tasks, which is usually tedious and more code, but simpler to add programmatic conditions to:
htmlqp->find("a") FOREACH $a->attr("href", "http:/...".$a->attr("href"))
// would need a real foreach and an if and stuff..

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
Starts with http or an anchor tag. I believe that the URLS in [CODE] tags could be processed the same as the plain text URLS and it's fine if the replacement ends up inside the [CODE] tag afterward.
Could contain any number of any characters before the domain/word
Has the domain somewhere in the middle
Could contain any number of any characters after the domain
Ends with a number of extentions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and inspect the associative array it returns. That allows you to filter by domain or apply even finer-grained control.
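A minimal sketch of that idea, assuming the blocked entries are full domain names such as domain1.com. The helper name is_blocked_url is hypothetical, and finding the URLs inside a post is left out here.

```php
<?php
// Hypothetical helper: true when the URL's host is a blocked domain
// or a subdomain of one.
function is_blocked_url(string $url, array $blockedDomains): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host component, or the URL could not be parsed
    }
    $host = strtolower($host);

    foreach ($blockedDomains as $domain) {
        $domain = strtolower($domain);
        // Exact match, or the host ends with ".<domain>".
        if ($host === $domain
            || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }
    return false;
}

var_dump(is_blocked_url('http://www.domain1.com/file.zip', ['domain1.com']));
var_dump(is_blocked_url('http://notdomain1.com/', ['domain1.com']));
```

Comparing on the dot boundary keeps a look-alike host such as notdomain1.com from being treated as a subdomain.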
I think you can avoid that overhead by using the built-in filter_var function.
It has been available since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: you put $filterthese directly inside a single-quoted string, and single quotes don't allow variable substitution. Also, $filterthese is an array, which should first be joined:
$filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but those points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(([a-z0-9-]+\.)*)(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';
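For a version that can be run on its own, here is a reduced sketch that handles plain-text URLs only. The preg_quote step, the TLD part, the path character class, and the sample message are my assumptions. Anchor tags are left out, and $this->post['message'] from the plug-in is replaced by a local variable.

```php
<?php
$filterthese = ['domain1', 'domain2', 'domain3'];
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';

// Escape each name so characters like "." cannot act as regex syntax.
$quoted = array_map(function ($d) { return preg_quote($d, '!'); }, $filterthese);

// Scheme, optional subdomains, a blocked name, one or more TLD labels,
// then an optional path that stops at whitespace, quotes, and brackets.
$regex = '!https?://(?:[a-z0-9-]+\.)*(?:' . implode('|', $quoted) . ')'
       . '(?:\.[a-z]{2,})+(?:/[^\s"\'<>\[\]]*)?!i';

// Made-up sample message.
$message  = 'Get it at http://www.domain1.com/files/a.zip or see http://example.com/';
$filtered = preg_replace($regex, $replacement, $message);
echo $filtered, "\n";
```

Stopping the path at "[" and "]" is meant to keep a closing [/CODE] tag out of the match, at the cost of truncating URLs that legitimately contain brackets.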
