I have a URL in one of the following formats:
/fixed1/term1/fixed2/term2
/fixed1/term1/term2/fixed2/term3
/fixed1/term1/term2/...termN.../fixed2/termN+1
In all cases I need the regex to return all the terms (not including the fixed parts).
A term can be anything (as long as it's a valid URL).
I managed to get as far as
fixed1\/([^\/]+)\/fixed2\/(.*)
which works fine for
/fixed1/term1/fixed2/term2
but it does not work properly in the other cases (when I have multiple terms between the two fixed words).
Any suggestions?
Your regex
[^\/]+
will stop as soon as it hits the first forward slash, which is why it finds term1 just fine. As a lazy first stab at this, I would try
fixed1\/(.+?)\/fixed2\/(.*)
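For example, a quick test of that pattern might look like this (a minimal sketch, untested against your real URLs, using ~ as the delimiter so the slashes don't need escaping):
$url = '/fixed1/term1/term2/fixed2/term3';
if (preg_match('~fixed1/(.+?)/fixed2/(.*)~', $url, $m)) {
    $middleTerms = explode('/', $m[1]); // array('term1', 'term2')
    $lastTerm    = $m[2];               // 'term3'
}
The lazy .+? lets the first group span several path segments, and exploding it on / gives you the individual terms.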
I know there are other posts with a similar name but I've looked through them and they haven't helped me resolve this.
I'm trying to get my head around regex and preg_match. I am going through a body of text and each time a link exists I want it to be extracted. I'm currently using the following:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
which works fine until it finds one that has <br after it. Then I get the URL plus the <br, which means it doesn't work correctly. How can I have it stop at the < without including it?
Also, I have been looking everywhere for a clear explanation of regex and I'm still confused by it. Does anyone have any good guides on it for future reference?
\S* is too broad. In particular, I could inject into your code with a URL like:
http://hax.hax/"><script>alert('HAAAAAAAX!');</script>
You should only allow characters that are allowed in URLs:
[-A-Za-z0-9._~:/?#[]@!$&'()*+,;=]*
Some of these characters are only allowed in specific places (such as ?), so if you want stricter validation you will need more cleverness.
Instead of \S, exclude the opening-tag character from the class:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^<]*)?/";
You might even want to be more restrictive by only allowing characters valid in URLs:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[a-zA-Z_\-\.%\?&]*)?/";
(or some more characters)
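For example, a rough test of the [^<]* variant (a minimal sketch; the sample text and variable names are made up):
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^<]*)?/";
$text = 'Visit http://example.com/page<br>or http://example.org/other<br>';
preg_match_all($reg_exUrl, $text, $matches);
// $matches[0] => array('http://example.com/page', 'http://example.org/other')
// The match now stops at the < instead of swallowing the <br.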
You could use this one, as presented here:
http://regex101.com/r/zV1uI7
At the bottom of the page it is explained step by step.
I have a custom markup parsing function that has been working very well for many years. I recently discovered a bug that I hadn't noticed before and I haven't been able to fix it. If anyone can help me with this, that'd be awesome. I have a custom-built forum and text-based MMORPG, and every input is sanitized and parsed for BBCode-like markup. It also parses out URLs and turns them into legit links that go to an exit page with a disclaimer that you're leaving the site... The issue I'm having is that when a user posts multiple URLs in a text box (let's say \n delimited), it only converts every other URL into a link. Here's the parser for URLs:
$markup = preg_replace("/(^|[^=\"\/])\b((\w+:\/\/|www\.)[^\s<]+)" . "((\W+|\b)([\s<]|$))/ei", '"$1".shortURL("$2")."$4"', $markup);
As you can see it calls a PHP function, but that's not the issue here. The entire text block is passed into this preg_replace at once, rather than line by line or by any other means.
If there's a simpler way of writing this preg_replace, please let me know
If you can figure out why this is only parsing every other URL, that's my ultimate goal here
Example INPUT:
http://skylnk.co/tRRTnb
http://skylnk.co/hkIJBT
http://skylnk.co/vUMGQo
http://skylnk.co/USOLfW
http://skylnk.co/BPlaJl
http://skylnk.co/tqcPbL
http://skylnk.co/jJTjRs
http://skylnk.co/itmhJs
http://skylnk.co/llUBAR
http://skylnk.co/XDJZxD
Example OUTPUT:
http://skylnk.co/tRRTnb
<br>http://skylnk.co/hkIJBT
<br>http://skylnk.co/vUMGQo
<br>http://skylnk.co/USOLfW
<br>http://skylnk.co/BPlaJl
<br>http://skylnk.co/tqcPbL
<br>http://skylnk.co/jJTjRs
<br>http://skylnk.co/itmhJs
<br>http://skylnk.co/llUBAR
<br>http://skylnk.co/XDJZxD
<br>
The e flag in preg_replace is deprecated. You can use preg_replace_callback to get the same functionality.
The i flag is useless here, since \w already matches both upper case and lower case, and there is no backreference in your pattern.
I set the m flag, which makes ^ and $ match the beginning and end of a line, rather than the beginning and end of the entire string. This should fix your weird problem of matching every other line.
I also made some of the groups non-capturing with (?:pattern), since the bigger capturing groups have captured the text already.
The code below is not tested; I only tested the regex in a regex tester.
$markup = preg_replace_callback(
    "/(^|[^=\"\/])\b((?:\w+:\/\/|www\.)[^\s<]+)((?:\W+|\b)(?:[\s<]|$))/m",
    function ($m) {
        // Same replacement as the old /e version, without evaluating code.
        return $m[1] . shortURL($m[2]) . $m[3];
    },
    $markup
);
I'm a newbie here. I'm facing a weird problem using regex in PHP.
$result = "some very long long string with different kind of links";
$regex='/<.*?href.*?="(.*?net.*?)"/'; //this is the regex rule
preg_match_all($regex,$result,$parts);
In this code I'm trying to get the links from the result string, but it only gives me those links which contain .net. I also want to get those links which have .com. For this I tried this code:
$regex='/<.*?href.*?="(.*?net|com.*?)"/';
But it shows nothing.
Sorry for my bad English.
Thanks in advance.
Update 1:
Now I'm using this:
$regex='/<.*?href.*?="(.*?)"/';
This rule grabs all the links from the string, but it is not perfect, because it also grabs other substrings like "javascript".
The | character applies to everything within the capturing group, so (.*?net|com.*?) will match either .*?net or com.*?. I think what you want is (.*?(net|com).*?).
If you do not want the extra capturing group, you can use (.*?(?:net|com).*?).
You could also use (.*?net.*?|.*?com.*?), but this is not recommended because of the unnecessary repetition.
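For example (a minimal sketch; the sample markup is made up, and regex remains a fragile way to pull links out of HTML):
$result = '<a href="http://example.net/a">x</a> <a href="http://example.com/b">y</a>';
$regex = '/<.*?href.*?="(.*?(?:net|com).*?)"/';
preg_match_all($regex, $result, $parts);
// $parts[1] => array('http://example.net/a', 'http://example.com/b')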
Your regex gets interpreted as .*?net or com.*?. You'll want (.*?(net|com).*?).
Try this:
$regex='/<.*?href.*?="(.*?\.(?:net|com)\b.*?)"/i';
or better:
$regex='/<a .*?href\s*+=\s*+"\K.*?\.(?:net|com)\b[^"]*+/i';
<.*?href
is a problem. This will match from the first < on the current line to the first href, regardless of whether they belong to the same tag.
Generally, it's unwise to try and parse HTML with regexes; if you absolutely insist on doing that, at least be a bit more specific (but still not perfect):
$regex='/<[^<>]*href[^<>=]*="(?:[^"]*(net|com)[^"]*)"/';
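To see the difference (a minimal sketch; the <p> tag and URL are made up for illustration):
$html = '<p class="intro">see <a href="http://example.net/x">link</a></p>';
preg_match('/<.*?href.*?="(.*?)"/', $html, $loose);
// $loose[0] => '<p class="intro">see <a href="http://example.net/x"' (starts at the wrong tag)
preg_match('/<[^<>]*href[^<>=]*="(?:[^"]*(net|com)[^"]*)"/', $html, $strict);
// $strict[0] => '<a href="http://example.net/x"'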
I'm trying to extract one or more URLs from a plain text string in PHP. Here are some examples:
"mydomain.com has hit the headlines again"
extract " http://www.mydomain.com"
"this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net"
extract "http://www.domain.com" , "http://www.anotherdomain.co.uk" , "http://www.thirddomain.net"
There are two special cases I need to handle - I'm thinking regex, but I don't fully understand them:
1) all symbols like '(' or ')' and spaces (excluding hyphens) need to be removed
2) the word dot needs to be replaced with the symbol . , so dot com would be .com
P.S. I'm aware of PHP validation/regex for URL but can't work out how I would use this to achieve the end goal.
Thanks
In this case it will be hard to get 100% correct results.
Depending on the input, you may try to force matching just the most popular top-level domains (add more to it):
(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b
You may need to remove the word boundary (\b) to get different results.
You can test it here:
http://bit.ly/dlrgzQ
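For example, against your second sample sentence (a minimal sketch; ~ is used as the delimiter so the slashes don't need escaping):
$text = 'this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net';
$regex = '~(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b~';
preg_match_all($regex, $text, $matches);
// $matches[0] => array('domain.com', 'anotherdomain.co.uk', 'http://thirddomain.net')
// Prefixing the bare matches with "http://www." would be a separate post-processing step.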
EDIT: about your cases
1) remove from what?
2) this could be done in PHP like:
$result = preg_replace('/\s+dot\s+(?=(com|org|net|biz|edu|and_ect))/', '.', $input);
But I have a few important notes:
These regexes are more like guidance, not actual production code.
Working with this kind of loose rules on text is wacky to say the least - and adding more special cases will make it even loonier. Consider this - even Stack Overflow doesn't do that:
http://example.org
but not!
example.org
It would be easier if you'd said what you are trying to achieve. Because if you want to process some kind of text that goes somewhere on the WWW later, then it is a very bad idea! You should not do this on your own (as you said - you don't understand regex!), as this would just be a can of XSS worms. Better to think about some kind of Markdown language or BBCode instead.
Also worth a look: http://htmlpurifier.org/
I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to PHP, let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for the regex, I would need it to find URLs with the following, I think:
Starts with http or an anchor tag. I believe that the URLs in [CODE] tags could be processed the same as the plain-text URLs, and it's fine if the replacement ends up inside the [CODE] tag afterward.
Could contain any number of any characters before the domain/word
Has the domain somewhere in the middle
Could contain any number of any characters after the domain
Ends with one of a number of extensions such as (html|htm|rar|zip|001) or with a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand PHP, let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and look into the associative array it returns. That allows you to filter by domain, or with even finer-grained control.
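A minimal sketch of that idea (the blocklist and URL below are made up; in your plug-in the URLs would come from the post text):
$blocked = array('domain1.com', 'domain2.com', 'domain3.com');
$url = 'http://sub.domain1.com/files/archive.rar';
$parts = parse_url($url);   // array('scheme' => 'http', 'host' => 'sub.domain1.com', ...)
$host = isset($parts['host']) ? $parts['host'] : '';
$isBlocked = false;
foreach ($blocked as $domain) {
    // Block the domain itself and any subdomain of it.
    if ($host === $domain || substr($host, -strlen('.' . $domain)) === '.' . $domain) {
        $isBlocked = true;
        break;
    }
}
// $isBlocked => true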
I think you can avoid the overhead of this by using the built-in filter_var function.
This feature is available since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: you put $filterthese directly inside a single-quoted string. Single quotes don't allow for variable substitution. Also, $filterthese is an array, which should first be joined:
$filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but those points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(?:[a-z0-9-]+\.)*(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';