Extracting links using regex via php - php

I need a little help with a regexp I'm writing. I have some text where links are encoded like
[NameLink ->link] or [NameLink->link]
The link can be without http:// or www.
Tried to get it using this pattern
/\[.{1,}\-\>.{1,}\]/
but if there are two such encoded links in a row, it doesn't separate them and also takes the content between the two links. Can someone tell me what the problem is? Thank you

Use +? instead of {1,}. Also, have a read on greedy vs. nongreedy.
You may want to allow optional spaces with \s* around your .+? so that the pattern handles both [NameLink -> link] and [NameLink ->link].
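As a rough sketch of that advice (the sample text and the variable names $text, $pattern and $links are made up for illustration), a non-greedy pattern with optional whitespace around both parts could look like this:

```php
<?php
// Hypothetical sample input containing two encoded links in a row.
$text = 'See [Home ->example.com] and [Docs->www.example.org/docs] for details.';

// .+? is non-greedy, so each match stops at the first "->" and the first "]".
// \s* swallows optional spaces around the name and the link.
$pattern = '/\[\s*(.+?)\s*->\s*(.+?)\s*\]/';

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    $links[] = ['name' => $m[1], 'link' => $m[2]];
}

print_r($links);
```

With the greedy .{1,} from the question, the same input would produce a single match running from the first [ to the last ].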

Related

Trying to stop regex at a tag

I know there are other posts with a similar name but I've looked through them and they haven't helped me resolve this.
I'm trying to get my head around regex and preg_match. I am going through a body of text and each time a link exists I want it to be extracted. I'm currently using the following:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
which works fine until it finds one that has <br after it. Then I get the url plus the <br which means it doesn't work correctly. How can I have it so that it stops at the < without including it?
Also, I have been looking everywhere for a clear explanation of using regex and I'm still confused by it. Has anyone any good guides on it for future reference?
\S* is too broad. In particular, I could inject into your code with a URL like:
http://hax.hax/"><script>alert('HAAAAAAAX!');</script>
You should only allow characters that are allowed in URLs:
[-A-Za-z0-9._~:/?#\[\]@!$&'()*+,;=%]*
Some of these characters are only allowed in specific places (such as ?), so if you want better validation you will need more cleverness.
Instead of \S, use a class that excludes both whitespace and the open tag char:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^<\s]*)?/";
You might even want to be more restrictive by only allowing characters valid in URLs:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[a-zA-Z_\-\.%\?&]*)?/";
(or some more characters)
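A minimal sketch of the idea, assuming the path should stop at whitespace as well as at < (the sample text and the $urls variable are invented for the example):

```php
<?php
// Hypothetical input: one URL is immediately followed by a <br /> tag.
$text = 'Visit http://example.com/path/page.html<br />or ftp://files.example.org/pub/ today.';

// The path part uses [^<\s]* instead of \S*, so it ends at "<" or at whitespace.
$reg_exUrl = '/(?:https?|ftps?):\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(?:\/[^<\s]*)?/';

preg_match_all($reg_exUrl, $text, $m);
$urls = $m[0];

print_r($urls);
```

This only deals with the <br problem from the question; it does not make the match safe to echo back into HTML, so the result should still be escaped with htmlspecialchars() before output.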
You could use the one presented here:
http://regex101.com/r/zV1uI7
At the bottom of that page it is explained step by step.

Can't use OR( | ) in php Regular expression

I'm a newbie here. I'm facing a weird problem in using regex in PHP.
$result = "some very long long string with different kind of links";
$regex='/<.*?href.*?="(.*?net.*?)"/'; //this is the regex rule
preg_match_all($regex,$result,$parts);
In this code I'm trying to get the links from the result string, but it only gives me the links that contain .net. I also want the links that contain .com. For this I tried this code:
$regex='/<.*?href.*?="(.*?net|com.*?)"/';
But it shows nothing.
Sorry for my bad English.
Thanks in advance.
Update 1 :
Now I'm using this:
$regex='/<.*?href.*?="(.*?)"/';
This rule grabs all the links from the string, but it is not perfect because it also grabs other substrings like "javascript".
The | character applies to everything within the capturing group, so (.*?net|com.*?) will match either .*?net or com.*?. I think what you want is (.*?(net|com).*?).
If you do not want the extra capturing group, you can use (.*?(?:net|com).*?).
You could also use (.*?net.*?|.*?com.*?), but this is not recommended because of the unnecessary repetition.
Your regex gets interpreted as .*?net or com.*?. You'll want (.*?(net|com).*?).
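To see the difference in isolation, here is a small sketch that anchors both variants with ^ and $ and runs them against a few made-up strings (the names $ungrouped, $grouped, $samples and $results are illustrative only):

```php
<?php
// Alternation spans the whole group: "anything ending in net" OR "com followed by anything".
$ungrouped = '/^(.*?net|com.*?)$/';

// Alternation is limited to the inner non-capturing group.
$grouped = '/^(.*?(?:net|com).*?)$/';

$samples = ['http://example.net', 'http://example.com/page', 'javascript:void(0)'];

$results = [];
foreach ($samples as $s) {
    $results[$s] = [
        'ungrouped' => preg_match($ungrouped, $s),
        'grouped'   => preg_match($grouped, $s),
    ];
}

print_r($results);
```

The ungrouped pattern only accepts the first sample, because the .com URL neither ends in net nor starts with com; the grouped pattern accepts both URLs and rejects the javascript: string.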
Try this:
$regex='/<.*?href.*?="(.*?\.(?:net|com)\b.*?)"/i';
or better:
$regex='/<a .*?href\s*+=\s*+"\K.*?\.(?:net|com)\b[^"]*+/i';
<.*?href
is a problem. This will match from the first < on the current line to the first href, regardless of whether they belong to the same tag.
Generally, it's unwise to try and parse HTML with regexes; if you absolutely insist on doing that, at least be a bit more specific (but still not perfect):
$regex='/<[^<>]*href[^<>=]*="([^"]*(?:net|com)[^"]*)"/';
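A short sketch of the tag-restricted approach, with the capturing group placed around the whole attribute value so that the link itself ends up in $parts[1] (the $html sample is invented; this is still not a substitute for a real HTML parser):

```php
<?php
// Hypothetical markup with a .net link, a .com link and a javascript: pseudo-link.
$html = '<a href="http://a.example.net/x">A</a> '
      . '<a href="http://b.example.com/y">B</a> '
      . '<a href="javascript:void(0)">C</a>';

// [^<>]* cannot cross a tag boundary, so "<" and "href" must belong to the same tag.
$regex = '/<[^<>]*href[^<>=]*="([^"]*(?:net|com)[^"]*)"/';

preg_match_all($regex, $html, $parts);
$links = $parts[1];

print_r($links);
```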

Complex PHP/Perl regular expression for emoticons

I've checked google for help on this subject but all the answers keep overlooking a fatal flaw in the replacement method.
Essentially I have a set of emoticons such as :) LocK :eek and so on and need to replace them with image tags. The problem I'm having is identifying that a particular emoticon is not part of a word and is alone on a line. For example, on our site we allow 'quick links', which are not included in the smiley replacement and take the format go:forum, user:Username and so on. Pretty much all the answers I've read don't allow for this possibility and as such break these links (i.e. go<img src="image.gif" />orum). I've tried experimenting with different ways to get around this by checking for the start of the line, spaces/newline characters and so on, but I've not had much luck.
Any help with this problem would be greatly appreciated. Oh also I'm using PHP 5 and the preg_% functions.
Thanks,
Rupert S.
Edit 18/04/2011:
Thanks for your help peeps :) I've created the final regex and thought I'd share it with everyone. I had a couple of problems with special whitespace characters, including newlines, but it's now working like a dream. The final regex is:
(?<=\s|\A|\n|\r|\t|\v|\<br \/\>|\<br\>)(:S)(?=\s|\Z|$|\n|\r|\t|\v|\<br \/\>|\<br\>)
To complete the comment into an answer: The simplest workaround would be to assert that the emoticons are always surrounded by whitespace.
(?<=\s|^)[<:-}]+(?=\s|$)
The \s covers normal spaces and line breaks. Just to be safe ^ and $ cover occurrences at the start or very end of the text subject. The assertions themselves do not match, so can be ignored in the replacement string/callback.
If you want to do all the replace in one single preg_replace, try this:
preg_replace('/(?<=^|\s)(:\)|:eek)(?=$|\s)/e'
,"'$1'==':)'?'<img src=\"smile.gif\"/>':('$1'==':eek'?'<img src=\"eek.gif\"/>':'$1')"
,$input);
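Note that the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0. On newer versions the same idea can be sketched with preg_replace_callback; the $emoticons map, the image file names and the sample input below are assumptions made for the example:

```php
<?php
// Hypothetical emoticon-to-image map.
$emoticons = [':)' => 'smile.gif', ':eek' => 'eek.gif'];

// Escape each emoticon so characters like ")" are treated literally.
$alternatives = implode('|', array_map(function ($e) {
    return preg_quote($e, '/');
}, array_keys($emoticons)));

// Only match emoticons surrounded by whitespace or the start/end of the text.
$pattern = '/(?<=^|\s)(?:' . $alternatives . ')(?=$|\s)/';

$input = 'user:eek says hi :) and :eek';

$output = preg_replace_callback($pattern, function ($m) use ($emoticons) {
    return '<img src="' . $emoticons[$m[0]] . '" />';
}, $input);

echo $output, "\n";
```

The quick link user:eek is left alone because the colon is preceded by a letter rather than whitespace.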

extracting one or more urls from a string in php

I'm trying to extract one or more urls from a plain text string in php. Here's some examples
"mydomain.com has hit the headlines again"
extract "http://www.mydomain.com"
"this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net"
extract "http://www.domain.com" , "http://www.anotherdomain.co.uk" , "http://www.thirddomain.net"
There are two special cases I need to handle. I'm thinking regex, but I don't fully understand them:
1) all symbols like '(' or ')' and spaces (excluding hyphens) need to be removed
2) the word dot needs to be replaced with the symbol . , so dot com would be .com
P.S. I'm aware of PHP validation/regex for URLs but can't work out how I would use it to achieve the end goal.
Thanks
In this case it will be hard to get 100% correct results.
Depending on the input you may try to force matching just most popular first level domains (add more to it):
(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b
You may need to remove the word boundary (\b) to get different results.
You can test it here:
http://bit.ly/dlrgzQ
EDIT: about your cases
1) remove from what?
2) this could be done in php like:
$result = preg_replace('/\s+dot\s+(?=(com|org|net|biz|edu|and_ect))/', '.', $input);
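Putting the two pieces together, a rough sketch might first rewrite " dot com" and then collect the matches. The sample input, the short TLD list and the final step that forces an http://www. prefix are assumptions based on the desired output in the question:

```php
<?php
// Hypothetical input mixing a spelled-out "dot" and a full URL.
$input = 'mydomain dot com has hit the headlines and http://thirddomain.net too';

// Illustrative list of top-level domains; extend as needed.
$tlds = 'com|org|net|biz|edu|uk|ly|gov';

// Step 1: turn " dot com" into ".com" when a known TLD follows.
$normalized = preg_replace('/\s+dot\s+(?=(?:' . $tlds . ')\b)/i', '.', $input);

// Step 2: extract anything that looks like a domain, with or without a scheme.
preg_match_all('~(?:https?://)?[a-zA-Z0-9\-.]+\.(?:' . $tlds . ')\b~', $normalized, $matches);

// Step 3 (assumption): normalise every hit to the http://www. form.
$urls = array_map(function ($u) {
    $host = preg_replace('~^(?:https?://)?(?:www\.)?~i', '', $u);
    return 'http://www.' . $host;
}, $matches[0]);

print_r($urls);
```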
But I have a few important notes:
These regexes are more like guidance than actual production code.
Working with such loose rules on text is wacky, to say the least, and adding more special cases will make it even loonier. Consider this: even Stack Overflow doesn't do that; it auto-links:
http://example.org
but not!
example.org
It would be easier if you said what you are trying to achieve. If you want to process text that later ends up somewhere on the web, this is a very bad idea! You should not do this on your own (as you said, you don't understand regex), as it would be a can of XSS worms. Better to consider some kind of markup language such as Markdown or BBCode.
Also take a look at: http://htmlpurifier.org/

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
1) Starts with http or an anchor tag. I believe that the URLs in [CODE] tags could be processed the same as the plain text URLs, and it's fine if the replacement ends up inside the [CODE] tag afterward.
2) Could contain any number of any characters before the domain/word
3) Has the domain somewhere in the middle
4) Could contain any number of any characters after the domain
5) Ends with one of a number of extensions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and look into the associative array it returns. That allows you to filter for domains or apply even finer-grained control.
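A minimal sketch of the parse_url approach, assuming the blocked entries are full host names (the function name isBlockedUrl and the example domains are hypothetical):

```php
<?php
// Hypothetical list of blocked host names.
$filterthese = ['domain1.example', 'domain2.example'];

// Returns true when the URL's host is a blocked domain or one of its subdomains.
function isBlockedUrl($url, array $blocked)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host component, or the URL could not be parsed
    }
    $host = strtolower($host);
    foreach ($blocked as $domain) {
        $suffix = '.' . $domain;
        if ($host === $domain || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

var_dump(isBlockedUrl('http://www.domain1.example/file.rar', $filterthese));
var_dump(isBlockedUrl('http://example.org/page.html', $filterthese));
```

Comparing against the parsed host avoids false positives on URLs that merely contain the blocked name somewhere in the path or in a longer, unrelated host name.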
I think you can avoid that overhead by using the built-in filter_var function.
It has been available since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: you put $filterthese directly inside a single-quoted string, and single quotes don't allow variable substitution. Also, $filterthese is an array, which should first be joined:
$filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but those points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )([a-z0-9-]+\.)*(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';
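For comparison, here is a simplified, runnable sketch that handles plain-text URLs only. It assumes each entry in $filterthese is a bare name followed by a top-level domain, and the sample message is invented; anchor tags and [CODE] blocks would need the extra optional parts shown above:

```php
<?php
$filterthese = ['domain1', 'domain2', 'domain3'];
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';

// Escape each name and join the list into one alternation.
$domains = implode('|', array_map(function ($d) {
    return preg_quote($d, '!');
}, $filterthese));

// scheme, optional subdomains, blocked name, TLD (assumption), optional path
$regex = '!https?://(?:[a-z0-9-]+\.)*(?:' . $domains . ')\.[a-z]{2,}(?:/[^\s"\'<>\[\]]*)?!i';

// Hypothetical post body.
$message = 'Get it at http://www.domain1.com/files/a.rar now, or see http://example.org/ok.html';

$filtered = preg_replace($regex, $replacement, $message);

echo $filtered, "\n";
```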
