I want to make my links automatically clickable, but it doesn't work.
Here's my code:
$val['message'] = preg_replace('#https?://(w{3}.)?([a-zA-Z0-9_-]{1,20}(.[a-zA-Z0-9_-]{1,10}))(/[a-zA-Z0-9_-]{1,12}(/[a-zA-Z0-9_-]{1,12}))?(/([a-zA-Z0-9_-]{1,20})(.[a-zA-Z0-9_-]{1,7}))?(\?[a-zA-Z0-9_-]{1,7}=[a-zA-Z0-9_-]{1,7}(&[a-zA-Z0-9_-]{1,7}=[a-zA-Z0-9_-]{1,7}))?#is', '$0', $val['message']);
(here is my preg thing, but with lines:)
'https?://
(w{3}.)?
([a-zA-Z0-9_-]{1,20}(.[a-zA-Z0-9_-]{1,10}))
(/[a-zA-Z0-9_-]{1,12}(/[a-zA-Z0-9_-]{1,12}))?
(/([a-zA-Z0-9_-]{1,20})
(.[a-zA-Z0-9_-]{1,7}))?
(\?[a-zA-Z0-9_-]{1,7}=[a-zA-Z0-9_-]{1,7}
(&[a-zA-Z0-9_-]{1,7}=[a-zA-Z0-9_-]{1,7}))?
I also tried this:
$val['message'] = preg_replace("#(([\w]+?://[\w#$%&~.-;:=,?#[]+])(/[\w#$%&~/.-;:=,?#[]+])?)#is", "$1", $val['message']);
but it doesn't work with links like https://www.youtube.com/watch?v=videolink
Try this regex, worked for me:
(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?
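For reference, here is a minimal sketch of how that pattern could be plugged into PHP. The helper name linkify_simple is made up for illustration, and the htmlspecialchars call is my own addition, not part of the pattern above.

```php
<?php
// Hypothetical helper: wraps every URL matched by the pattern above in an <a> tag.
function linkify_simple(string $text): string
{
    // Same pattern as quoted above, with # as the delimiter.
    $pattern = '#(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?#';

    return preg_replace_callback($pattern, function (array $m): string {
        // Escape the match before putting it into HTML (an addition, not in the original).
        $url = htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
        return '<a href="' . $url . '">' . $url . '</a>';
    }, $text);
}

echo linkify_simple('See https://www.youtube.com/watch?v=videolink now'), "\n";
// See <a href="https://www.youtube.com/watch?v=videolink">https://www.youtube.com/watch?v=videolink</a> now
```

Note that {2,3} limits the top-level domain to two or three letters, so longer TLDs will not match.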
Why does everyone like to try to make their own regex for this? Linkifying links is hard work with lots of edge cases, not to mention what should or shouldn't be included in the link, e.g.
Are you talking about youtube.com?
I like the ASP.net language
I wonder what www.stackoverflow.com counts as a link
Parentheses are a particular pain in the butt (example: http://example.com/?auth=gH;2($Hd)DA0;QAb)
Aside: in the last line above, StackOverflow's preview section links everything until the last closing bracket, but after submission it only links up to the first punctuation mark bracket. Helps prove my point about how hard this is to get right and consistent though!
Best to use something established, example:
https://github.com/misd-service-development/php-linkify
For something a bit more quick n dirty:
http://buildinternet.com/2010/05/how-to-automatically-linkify-text-with-php-regular-expressions/
I've spent the last couple of days trying to figure out how to resolve this particular issue and posting on SO, but no dice so far. I think this is probably easier than I've been making it to be, but I need some help;
Here is a pretty basic regex statement that linkifies pretty much any link. It's not the only regex pattern I have, so I've included a piece that skips over the link if it includes the specific pattern "img.youtube.com/vi/" It works great;
$message = preg_replace("#(((f|ht)tp(s)?://)?!(img.youtube.com/vi/)[-a-zA-Z?-??-?()0-9#:%_+.~\#?&;//=,])+#i", "<a href=$1 target='_blank'><b>$1</b></a>", $message);
I do not want this to linkify any url with .jpeg, jpg, gif, or any popular image format, I have another expression that will embed those kinds of links (and it works fine, too). So, I need to find a way to get this expression to reject those kinds of links.
I've gotten advice on negative lookarounds, matching to specific strings, but none of them seem to work so far. I need to find a way to get this regex to ignore any URL that ends with .jpeg and so forth;
So, the regex statement above already has an example of a string that disqualifies certain URLs - ?!(img.youtube.com/vi/). This seems like that's all I need to do, but where do I put it and how does it look? The + symbol in the statement makes it so that the regex will scrutinize the string all the way to the end of it, using the matching characters of [-a-zA-Z?-??-?()0-9#:%_+.~#?&;//=,]. So, this matching string should probably be put somewhere before the + symbol. Does it go in "?!(img.youtube.com/vi/)" ? In my mind, it should probably look like this;
$message = preg_replace("#(((f|ht)tp(s)?://)?!(img.youtube.com/vi/|/^\.jpeg$/|/^\.jpg$/|/^\gif$/)[-a-zA-Z?-??-?()0-9#:%_+.~\#?&;//=,])+#i",
"<a href=$1 target='_blank'><b>$1</b></a>", $message);
Any help is appreciated.
Here is an answer, with your regexp cleaned up as well:
(?i)((?:f|ht)tps?://((?!img|jpe?g|gif|png|bmp))(?:([-a-z0-9()#:%_+.~#?&;/=,])(?2))+(?!(?3)))
Now img and the other strings you don't want are in the negative lookahead, and you can add anything else you want to exclude.
$good="http://www.google.com/";
$bad="http://img.google.com/";
$r="#(?i)((?:f|ht)tps?://((?!img|jpe?g|gif|png|bmp))(?:([-a-z0-9()#:%_+.~\#?&;/=,])(?2))+(?!(?3)))#";
$rep="<a href=$1 target='_blank'><b>$1</b></a>";
echo preg_replace($r,$rep,$good);
echo preg_replace($r,$rep,$bad);
You can try here http://ideone.com/419yfm
Just remove this part of the regex:
img|
<?php
$good="http://www.google.com/";
$bad="http://img.google.com/";
$r="#(?i)((?:f|ht)tps?://((?!jpe?g|gif|png|bmp))(?:([-a-z0-9()#:%_+.~\#?&;/=,])(?2))+(?!(?3)))#";
$rep="<a href=$1 target='_blank'><b>$1</b></a>";
echo preg_replace($r,$rep,$good); echo "\n";
echo preg_replace($r,$rep,$bad);
?>
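If the lookahead gets hard to maintain, another option is to match every URL and decide in a callback whether to link it. This is only a sketch under my own assumptions: the function name linkify_non_images is hypothetical, the character class is a simplified version of the one in the question, and the extension check uses parse_url and pathinfo instead of a lookahead.

```php
<?php
// Hypothetical helper: linkify URLs, but leave image URLs untouched so a
// separate embedding step can handle them.
function linkify_non_images(string $message): string
{
    // Simplified character class; "!" is the delimiter because it is not in the class.
    $pattern = '!\b(?:f|ht)tps?://[-a-z0-9()#:%_+.~?&;/=,]+!i';

    return preg_replace_callback($pattern, function (array $m): string {
        $url  = $m[0];
        $path = (string) parse_url($url, PHP_URL_PATH);
        $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));

        // Skip image files and the YouTube thumbnail host mentioned in the question.
        if (in_array($ext, ['jpeg', 'jpg', 'gif', 'png', 'bmp'], true)
            || strpos($url, 'img.youtube.com/vi/') !== false) {
            return $url;
        }

        $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
        return '<a href="' . $safe . '" target="_blank"><b>' . $safe . '</b></a>';
    }, $message);
}

echo linkify_non_images('http://www.google.com/'), "\n";
echo linkify_non_images('http://example.com/pic.jpg'), "\n";
```

Because the check only looks at the extension of the path, a URL such as http://img.google.com/ is still linked, while http://example.com/pic.jpg?x=1 is left alone.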
I know there are other posts with a similar name but I've looked through them and they haven't helped me resolve this.
I'm trying to get my head around regex and preg_match. I am going through a body of text and each time a link exists I want it to be extracted. I'm currently using the following:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
which works fine until it finds one that has <br after it. Then I get the url plus the <br which means it doesn't work correctly. How can I have it so that it stops at the < without including it?
Also, I have been looking everywhere for a clear explanation of using regex and I'm still confused by it. Has anyone any good guides on it for future reference?
\S* is too broad. In particular, I could inject into your code with a URL like:
http://hax.hax/"><script>alert('HAAAAAAAX!');</script>
You should only allow characters that are allowed in URLs:
[-A-Za-z0-9._~:/?#[\]@!$&'()*+,;=]*
Some of these characters are only allowed in specific places (such as ?), so if you want better validation you will need something more clever.
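To illustrate the point, here is a small sketch that runs the injection example through a pattern built on that character class. The variable names are mine, and I have added % to the class on the assumption that percent-encoded URLs should match too. Since ' is an allowed character, the match is still escaped before output.

```php
<?php
// Character class limited to characters that may appear in a URL
// (% added here as an assumption, to allow percent-encoding).
$allowed = '/https?:\/\/[-A-Za-z0-9._~:\/?#\[\]@!$&\'()*+,;=%]+/';

$input = 'http://hax.hax/"><script>alert(\'HAAAAAAAX!\');</script>';

preg_match($allowed, $input, $m);
$url = $m[0]; // the match stops at the double quote

// Escape before output anyway: a single quote is a legal URL character.
$safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
echo '<a href="' . $safe . '">' . $safe . '</a>', "\n";
// <a href="http://hax.hax/">http://hax.hax/</a>
```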
Instead of \S exclude the open tag char from the class:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^<]*)?/";
You might even want to be more restrictive by only allowing characters valid in URLs:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[a-zA-Z_\-\.%\?&]*)?/";
(or some more characters)
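One caveat with [^<]*: unlike \S*, it also matches spaces, so a match can run on through the following words until the next <. A possible variation, sketched below with made-up sample text, is to exclude both whitespace and < from the class.

```php
<?php
// Sample text is invented for illustration.
$text = 'Visit http://example.com/page<br>or ftp://files.example.org/a.zip today';

// Same pattern as above, but the path stops at whitespace or "<".
$reg_exUrl = '/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^\s<]*)?/';

preg_match_all($reg_exUrl, $text, $matches);

// $matches[0] holds the full matches:
// http://example.com/page and ftp://files.example.org/a.zip
print_r($matches[0]);
```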
You could use the one presented at:
http://regex101.com/r/zV1uI7
At the bottom of that page it is explained step by step.
I'm a newbie here. I'm facing a weird problem in using regex in PHP.
$result = "some very long long string with different kind of links";
$regex='/<.*?href.*?="(.*?net.*?)"/'; //this is the regex rule
preg_match_all($regex,$result,$parts);
In this code I'm trying to get the links from the result string, but it only gives me the links that contain .net. I also want the links that contain .com, so I tried this:
$regex='/<.*?href.*?="(.*?net|com.*?)"/';
But it shows nothing.
Sorry for my bad English.
Thanks in advance.
Update 1 :
Now I'm using this:
$regex='/<.*?href.*?="(.*?)"/';
This rule grabs all the links from the string, but it is not perfect, because it also grabs other substrings such as "javascript".
The | character applies to everything within the capturing group, so (.*?net|com.*?) will match either .*?net or com.*?. I think what you want is (.*?(net|com).*?).
If you do not want the extra capturing group, you can use (.*?(?:net|com).*?).
You could also use (.*?net.*?|.*?com.*?), but this is not recommended because of the unnecessary repetition.
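A quick way to see the difference is to run both forms against the same attribute. This is just a sketch with a made-up sample tag, using a shortened version of the pattern from the question.

```php
<?php
// Sample input invented for illustration.
$tag = '<a href="http://example.com/b">B</a>';

// Alternation splits the whole group: either ".*?net" or "com.*?".
$broken = '/href="(.*?net|com.*?)"/';
// Alternation limited to the two words by a non-capturing group.
$fixed  = '/href="(.*?(?:net|com).*?)"/';

var_dump(preg_match($broken, $tag));     // int(0): the value neither ends in "net" nor starts with "com"
var_dump(preg_match($fixed, $tag, $m));  // int(1)
echo $m[1], "\n";                        // http://example.com/b
```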
Your regex gets interpreted as .*?net or com.*?. You'll want (.*?(net|com).*?).
Try this:
$regex='/<.*?href.*?="(.*?\.(?:net|com)\b.*?)"/i';
or better:
$regex='/<a .*?href\s*+=\s*+"\K.*?\.(?:net|com)\b[^"]*+/i';
<.*?href
is a problem. This will match from the first < on the current line to the first href, regardless of whether they belong to the same tag.
Generally, it's unwise to try and parse HTML with regexes; if you absolutely insist on doing that, at least be a bit more specific (but still not perfect):
$regex='/<[^<>]*href[^<>=]*="([^"]*(?:net|com)[^"]*)"/';
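If a parser is an option, the same job could be sketched with PHP's bundled DOM extension. The function name extract_links and the sample markup are hypothetical, and the sketch assumes ext-dom is enabled. It filters on the host's last label, so "javascript" hrefs and relative links are skipped.

```php
<?php
// Hypothetical helper: collect href values whose host ends in one of the given TLDs.
function extract_links(string $html, array $tlds = ['net', 'com']): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);
        if (!is_string($host)) {
            continue; // no host: javascript:, mailto:, relative links, ...
        }
        $labels = explode('.', $host);
        $tld = strtolower(end($labels));
        if (in_array($tld, $tlds, true)) {
            $links[] = $href;
        }
    }
    return $links;
}

$result = '<a href="http://example.net/a">A</a>'
        . '<a href="javascript:void(0)">J</a>'
        . '<a href="http://example.com/b">B</a>'
        . '<a href="http://example.org/c">C</a>';

print_r(extract_links($result));
```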
Let's assume I do preg_replace as follows:
preg_replace("/<my_tag>(.*)<\/my_tag>/U", "<my_new_tag>$1</my_new_tag>", $source);
That works, but I also want to capture the attributes of my_tag. How would I do that with input like this:
<my_tag my_attribute_that_know_the_name_of="some_value">tra-la-la</my_tag>
You don't use regex. You use a real parser, because this stuff cannot be parsed with regular expressions. You'll never know if you've got all the corner cases quite right and then your regex has turned into a giant bloated monster and you'll wish you'd just taken fredley's advice and used a real parser.
For a humorous take, see this famous post.
preg_replace('#<my_tag\b([^>]*)>(.*?)</my_tag>#',
'<my_new_tag$1>$2</my_new_tag>', $source)
The ([^>]*) captures anything after the tag name and before the closing >. Of course, > is legal inside HTML attribute values, so watch out for that (but I've never seen it in the wild). The \b prevents matches of tag names that happen to start with my_tag, preventing bogus matches like this:
<my_tag_xyz>ooga-booga</my_tag_xyz><my_tag>tra-la-la</my_tag>
But that will still break on <my_tag> elements wrapped in other <my_tag> elements, yielding results like this:
<my_tag><my_tag>tra-la-la</my_tag>
If you know you'll never need to match tags with other tags inside them, you can replace the (.*?) with ([^<>]++).
I get pretty tired of the glib "don't do that" answers too, but as you can see, there are good reasons behind them--I could come up with this many more without having to consult any references. When you ask "How do I do this?" with no background or qualification, we have no idea how much of this you already know.
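For completeness, this is roughly how the replacement above behaves on the sample from the question. The s modifier is my own addition so that (.*?) can span line breaks; everything else is taken from the pattern as posted.

```php
<?php
$source = '<my_tag my_attribute_that_know_the_name_of="some_value">tra-la-la</my_tag>';

$result = preg_replace(
    '#<my_tag\b([^>]*)>(.*?)</my_tag>#s', // "s" added so "." also matches newlines
    '<my_new_tag$1>$2</my_new_tag>',
    $source
);

echo $result, "\n";
// <my_new_tag my_attribute_that_know_the_name_of="some_value">tra-la-la</my_new_tag>
```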
Forget regexes, use this instead:
http://simplehtmldom.sourceforge.net/
I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
1. Starts with http or an anchor tag. I believe that the URLS in [CODE] tags could be processed the same as the plain text URLS and it's fine if the replacement ends up inside the [CODE] tag afterward.
2. Could contain any number of any characters before the domain/word
3. Has the domain somewhere in the middle
4. Could contain any number of any characters after the domain
5. Ends with one of a number of extensions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and inspect the associative array it returns. That lets you filter by domain, or apply even finer-grained control.
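As a rough sketch of that idea: the function name is_blocked_url is hypothetical, and I'm assuming the blocklist holds full domain names such as domain1.com, not the bare words used in the question.

```php
<?php
// Hypothetical helper: true if the URL's host is a blocked domain or a subdomain of one.
function is_blocked_url(string $url, array $blockedDomains): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host component, nothing to compare
    }
    $host = strtolower($host);

    foreach ($blockedDomains as $domain) {
        $domain = strtolower($domain);
        // Exact match, or the host ends with ".domain" (a subdomain).
        if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }
    return false;
}

var_dump(is_blocked_url('http://www.domain1.com/file.rar', ['domain1.com'])); // bool(true)
var_dump(is_blocked_url('http://notdomain1.com/', ['domain1.com']));          // bool(false)
```

You would still need a regex (or similar) to find the URLs in the post text first; this only decides whether a URL you have already found should be filtered.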
I think you can avoid the overhead of this by using the built-in filter_var function.
It has been available since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: you put $filterthese directly inside a single-quoted string, and single quotes don't allow variable substitution. Also, $filterthese is an array, which should first be joined:
$filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but those points seem worth checking to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(([a-z0-9-]+\.)*)(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';
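Here is a compact, runnable variation on the same idea, offered as a sketch and not a drop-in vBulletin plugin. It assumes the bare domain words from the question, adds preg_quote so that special characters in a domain cannot break the pattern, and consumes the whole <a>...</a> element when the URL sits inside one. The sample strings are invented.

```php
<?php
$filterthese = ['domain1', 'domain2', 'domain3'];
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';

// Quote each domain for use inside a pattern delimited by "!".
$domains = implode('|', array_map(function (string $d): string {
    return preg_quote($d, '!');
}, $filterthese));

$regex = '!(?:<a\b[^>]*\bhref=["\']?)?'            // optional opening anchor tag up to the URL
       . 'https?://'                               // offending link
       . '(?:[a-z0-9-]+\.)*'                       // possible subdomains
       . '(?:' . $domains . ')(?![a-z0-9-])'       // blocked domain, not a prefix of a longer name
       . '[^\s"\'<>]*'                             // rest of the URL (TLD, path, query)
       . '(?:["\'][^>]*>.*?</a>)?!is';             // optional rest of the anchor element

$plain  = 'Get it at http://www.domain1.com/file.rar now';
$linked = 'x <a href="http://domain2.net/a.zip">here</a> y';

echo preg_replace($regex, $replacement, $plain), "\n";
echo preg_replace($regex, $replacement, $linked), "\n";
```

Both sample strings come out with the URL (or the whole anchor) replaced by the message, while a URL on an unlisted domain is left as it is.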