PHP regexp - find all URLs except special folders in URI - php

I need to match all URLs like:
http://domain.name/novostroyki/novyy_petergof/
http://domain.name/novostroyki/novyy_petergof/?var1=value1&val2=value2=...
but not the following ones:
http://domain.name/novostroyki/novyy_petergof/flats/
http://domain.name/novostroyki/novyy_petergof/flats/?var1=value1&val2=value2=...
I tried something like this, but it doesn't work the way I want:
/novostroyki/((?!flats)[a-z_0-9A-Z\.])*/?\??(.*)/

Try this regex:
/novostroyki/((?!flats)[\w.]*/?)*(\?.*)?
I'm not sure it will hold up in all cases, but it should work for the ones listed above, provided the match is required to reach the end of the URL.
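A minimal PHP sketch of the same idea, assuming every path segment ends with a slash and that a segment named exactly flats should be rejected. The $ anchor is an addition here: without it, the pattern can still match the /novostroyki/novyy_petergof/ prefix of a flats URL. The function name isAllowedUrl is made up for illustration.

```php
<?php
// Hypothetical helper: true for /novostroyki/... URLs that contain no "flats" segment.
function isAllowedUrl(string $url): bool
{
    // (?!flats/)  rejects a segment named exactly "flats"
    // [\w.]+/     consumes one whole path segment including its slash
    // (?:\?.*)?   allows an optional query string
    // $           forces the match to reach the end of the URL
    $pattern = '~/novostroyki/(?:(?!flats/)[\w.]+/)*(?:\?.*)?$~';

    return preg_match($pattern, $url) === 1;
}

var_dump(isAllowedUrl('http://domain.name/novostroyki/novyy_petergof/'));
var_dump(isAllowedUrl('http://domain.name/novostroyki/novyy_petergof/flats/'));
```

Because each segment must end in a slash, the repeated group has only one way to split the path, which avoids the heavy backtracking that nested optional quantifiers can cause.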

Check whether the following holds for your router: once a URL matches one regex in the list, the remaining items in the list are not tried.
If so, you can avoid the lookahead entirely: put the regex for novostroyki/flats/ first in the list, and the regex for everything under novostroyki after it.
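The ordering idea can be sketched as follows, assuming a router that stops at the first matching pattern. The route names and the matchRoute helper are hypothetical, not part of any framework.

```php
<?php
// Hypothetical route table: order matters, the first matching pattern wins.
$routes = [
    'flats'   => '~^/novostroyki/[\w.]+/flats/~', // specific rule first
    'complex' => '~^/novostroyki/~',              // general rule afterwards
];

function matchRoute(array $routes, string $uri): ?string
{
    foreach ($routes as $name => $pattern) {
        if (preg_match($pattern, $uri) === 1) {
            return $name; // stop at the first hit, later rules are never tried
        }
    }
    return null; // no rule matched
}

var_dump(matchRoute($routes, '/novostroyki/novyy_petergof/flats/?var1=value1'));
var_dump(matchRoute($routes, '/novostroyki/novyy_petergof/'));
```

If the two entries were swapped, the general rule would capture the flats URLs as well, so this approach depends entirely on the list order.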

Related

Search for the match in the middle of the string as well

My regex:
(?<=span class="ope">)[a-z0-9]+?\.(pl|com|net\.pl|tk|org|org\.pl|eu)|$(?=<\/span>)$
At the moment it matches only if the domain is at the beginning of the text; when it's in the middle, it fails.
E.g.:
Something example.com - fail
example.com Something - success (example.com found).
Is there any solution for this one?
(?<=span class="ope">).*?([a-zA-Z0-9]*\.(pl|com|net\.pl|tk|org|org\.pl|eu)).*(?=<\/span>)
Test: http://www.regex101.com/r/wK0aA2
You're going to have to pull out group 1 rather than group 0 if you use this.
Here's a tested solution:
(?<=span class="ope">).*?(?P<domain>\w+\.(?:pl|com|net\.pl|tk|org|org\.pl|eu)).*?(?=<\/span>)
It returns the domain you want under the key domain. Try it here: http://www.regex101.com/r/mK1fP0
The problem was the two lookbehinds: the second one must be a lookahead instead. I also inserted .*? twice to match the text around the domain.
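A short PHP sketch of reading the named group, assuming the markup looks like the sample string below. One change from the pattern above is my own: the longer suffixes net\.pl and org\.pl are listed before pl and org, because otherwise example.org.pl would be captured as example.org.

```php
<?php
$html = '<span class="ope">Something example.com</span>';

// Longer suffixes come first so that "example.org.pl" is not cut short at "example.org".
$pattern = '~(?<=span class="ope">).*?(?P<domain>\w+\.(?:net\.pl|org\.pl|pl|com|tk|org|eu)).*?(?=</span>)~';

// The named group is available as $m['domain'] after a successful match.
$domain = preg_match($pattern, $html, $m) === 1 ? $m['domain'] : null;

var_dump($domain);
```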
Try this
(?<=span class="ope">)[a-zA-Z0-9\s]*[a-z0-9]+?\.(pl|com|net\.pl|tk|org|org\.pl|eu)|$(?=<\/span>)$

URL routing regex

I'm trying to create a snippet of regex that will match a URL route.
Basically, if I have this route /users/:id I want /users/100 to match, but /users/100/edit not to match.
This is what I'm using now: users/(.*)/ but because of the greedy match it's matching regardless of what's after the user ID. I need some way of "breaking" the match if there's an /edit or something else on the end of the route.
I've looked into the Regex NOT operator but with no luck.
Any advice?
Are you just trying to collect digits?
You could use users/(\d*)/
If you want to collect everything up to the next /, use a negated character class (a NOT): ^/users/[^/]*$
You can use negative lookahead:
users/(.*)/(?!edit)
This will always require a trailing slash however. Maybe a better solution would be:
users/(\d+)(?!/edit)
See this post for more information.
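A minimal PHP sketch of the anchored approach, assuming the route is matched against the path only. It avoids lookahead altogether, which matters because an unanchored users/(\d+)(?!/edit) can still match /users/100/edit by backtracking to the shorter id 10. The helper name matchUserRoute is made up for illustration.

```php
<?php
// Hypothetical helper: returns the id for /users/:id, or null for anything else.
function matchUserRoute(string $path): ?string
{
    // ^ and $ pin the whole path; [^/]+ cannot cross into a further segment,
    // so /users/100/edit has nowhere to hide the extra "/edit".
    if (preg_match('#^/users/([^/]+)/?$#', $path, $m) === 1) {
        return $m[1];
    }
    return null;
}

var_dump(matchUserRoute('/users/100'));
var_dump(matchUserRoute('/users/100/edit'));
```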

help with a regex code

i have this regex code
/^(https?:\/\/+[\w\-]+\.[\w\-]+)/i
It works, but there is a problem:
You NEED http:// in the URL for it to validate, but in what I am making, users will not want to type http://; they just want to enter example.com. If possible, I need it to work whether the URL has http:// or not.
I don't know how to write my own regex, and I've searched but cannot find one that does what I need, unless I'm just not looking in the right place. (Google :P)
Don't bother with regex. Use parse_url function.
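A hedged sketch of the parse_url route. Note that parse_url treats a bare example.com as a path rather than a host, so this sketch prepends http:// when no scheme is present. The helper name normalizeHost and the "must contain a dot" rule are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns the lower-cased host, or null if none can be found.
function normalizeHost(string $input): ?string
{
    $input = trim($input);

    // Without a scheme, parse_url() reports "example.com" as a path, not a host.
    if (!preg_match('#^https?://#i', $input)) {
        $input = 'http://' . $input;
    }

    $host = parse_url($input, PHP_URL_HOST); // null or false when no host is found

    // Assumed rule: require at least one dot so that bare words are rejected.
    if (!is_string($host) || strpos($host, '.') === false) {
        return null;
    }
    return strtolower($host);
}

var_dump(normalizeHost('example.com'));
var_dump(normalizeHost('https://Example.com/path'));
```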
You can just make it optional
/^((?:https?:\/\/+)?[\w\-]+\.[\w\-]+)/i
The (?:) around the part you want to make optional is a non-capturing group; the ? after it makes the group optional.
I'm not sure what the + after the second slash is good for. It means "one or more of the preceding character", so the regex also allows things like http://////////.
I hope you are aware that this regex is far from matching only valid URLs.
For example it will match stuff like
http://////////------------.-
or at least
http://N.O
After the end of that match you can write whatever you want and the input will still be accepted as valid.
Here on Regexr you can see what your regex is matching.
See Purple Coder's answer for a probably better solution.
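The points above can be checked with a small PHP script. The $strict pattern is an assumed tightening, not a full URL validator: it allows exactly two slashes, requires each label to start with a letter or digit, and adds a $ anchor so nothing may follow the host.

```php
<?php
// The pattern from the answer above, with the optional scheme.
$loose = '/^((?:https?:\/\/+)?[\w\-]+\.[\w\-]+)/i';

// Assumed stricter variant (illustrative only).
$strict = '#^(?:https?://)?[a-z0-9][a-z0-9-]*(?:\.[a-z0-9][a-z0-9-]*)+$#i';

$samples = [
    'example.com',
    'http://example.com',
    'http://////////------------.-',
    'http://N.O trailing junk',
];

foreach ($samples as $s) {
    // preg_match() returns 1 on a match and 0 otherwise.
    printf("%-32s loose=%d strict=%d\n", $s, preg_match($loose, $s), preg_match($strict, $s));
}
```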
/^((https?:\/\/+)?[\w-]+\.[\w-]+)/i
I'm using this :
// Validate that the string contains at least a dot .
var filterWebsite = /^([a-zA-Z0-9:_\.\-/])+\.([a-zA-Z0-9_\.\-/])+$/;

Simple PHP Regex

I am setting up a Zend_Route (but it is still just a regex) and I wish to match a url like
/en/experience/this-is-my-name-and-the-last-is-1-of-id-123456.html
So I want to grab the
this-is-my-name-and-the-last-is-1-of
and the
123456
I tried
\w{2}/experience/(.+)?-(\d+)\.html
but that doesn't seem to work.
It would be easy if it were the other way around, e.g. if the id came before the name:
/en/experience/123456-this-is-my-name-and-the-last-is-1-of-id.html
I could use
\w{2}/experience/(\d+)-(.+)\.html
But that is a cop out - so any advice on how to match original format?
Try this one:
/\w{2}/experience/(.+?)-(\d+)\.html
try this:
/\w{2}/experience/(.+)?-(\d+)\.html
Zend route internally does this:
preg_match('#^/\w{2}/experience/(.+)?-(\d+)\.html$#i', '/en/experience/this-is-my-name-and-the-last-is-1-of-id-123456.html', $matches);
So your pattern only matches with a slash at the beginning.
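A small PHP sketch of the capture, assuming the literal -id- is a fixed separator between the name and the number (the question wants the name without the trailing -id). The anchors mirror the preg_match call shown above.

```php
<?php
$path = '/en/experience/this-is-my-name-and-the-last-is-1-of-id-123456.html';

// Assumption: "-id-" always sits between the name and the numeric id.
// The greedy (.+) backtracks until "-id-<digits>.html" fits at the end.
$pattern = '#^/\w{2}/experience/(.+)-id-(\d+)\.html$#i';

$name = null;
$id = null;
if (preg_match($pattern, $path, $matches) === 1) {
    list(, $name, $id) = $matches;
}

var_dump($name, $id);
```

Without the leading slash the same anchored pattern does not match this path at all, which is the failure described above.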

Need a good regex to convert URLs to links but leave existing links alone

I have a load of user-submitted content. It is HTML, and may contain URLs. Some of them will be <a>'s already (if the user is good) but sometimes users are lazy and just type www.something.com or at best http://www.something.com.
I can't find a decent regex to capture URLs but ignore ones that are immediately to the right of either a double quote or '>'. Anyone got one?
Jan Goyvaerts, creator of RegexBuddy, has written a response to Jeff Atwood's blog that addresses the issues Jeff had and provides a nice solution.
\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
In order to ignore matches that occur right next to a " or >, you could add (?<![">]) to the start of the regex, so you get
(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
This will match full addresses (http://...) and addresses that start with www. or ftp. - you're out of luck with addresses like ars.userfriendly.org...
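A hedged PHP sketch of applying that pattern, assuming the input is already HTML and that scheme-less addresses should be linked over http. The helper name linkifyBareUrls is hypothetical. Braces are used as the delimiter because /, # and ~ all occur inside the character classes.

```php
<?php
// Hypothetical helper: wraps bare URLs in <a> tags, skipping any URL that
// directly follows a double quote or ">".
function linkifyBareUrls(string $html): string
{
    $pattern = '{(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)'
             . '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]}i';

    $result = preg_replace_callback($pattern, function (array $m): string {
        $url = $m[0];
        // Assumption: addresses without a scheme are reachable over http.
        $href = preg_match('{^(?:https?|ftp|file)://}i', $url) ? $url : 'http://' . $url;
        return '<a href="' . $href . '">' . $url . '</a>';
    }, $html);

    return $result ?? $html; // fall back to the input if the regex engine reports an error
}

echo linkifyBareUrls('Go to www.example.com, or <a href="http://example.org/x">this page</a>.'), PHP_EOL;
```

The final character class excludes punctuation such as commas and periods, so a URL at the end of a clause does not swallow the punctuation that follows it.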
This thread is old as the hills, but I came across it while working on my own problem: That is, convert any urls into links, but leave alone any that are already within anchor tags. After a while, this is what has popped out:
(?!(?!.*?<a)[^<]*<\/a>)(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]
With the following input:
http://www.google.com
http://google.com
www.google.com
<p>http://www.google.com<p>
this is a normal sentence. let's hope it's ok.
www.google.com
This is the output of a preg_replace:
http://www.google.com
http://google.com
www.google.com
<p>http://www.google.com<p>
this is a normal sentence. let's hope it's ok.
www.google.com
Just wanted to contribute back to save somebody some time.
I made a slight modification to the Regex contained in the original answer:
(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&@/%=~_|$?!:,.]*[A-Z0-9+&@/%=~_|$]
which allows for more subdomains, and also runs a fuller check on tags. To apply this with PHP's preg_replace, you can use:
$convertedText = preg_replace( '#(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&@/%=~_|$?!:,.]*[A-Z0-9+&@/%=~_|$]#i', '<a href="\0" target="_blank">\0</a>', $originalText );
Note, I removed # from the regex, in order to use it as a delimiter for preg_replace. It's pretty rare that # would be used in a URL anyway.
Obviously, you can modify the replacement text, and remove target="_blank", or add rel="nofollow" etc.
Hope that helps.
To skip existing ones just use a look-behind - add (?<!href=") to the beginning of your regular expression, so it would look something like this:
/(?<!href=")http://\S*/
Obviously this isn't a complete solution for finding all types of URLs, but this should solve your problem of messing with existing ones.
if (preg_match('/\b(?<!=")(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[A-Z0-9+&@#\/%=~_|](?!.*".*>)(?!.*<\/a>)/i', $subject)) {
    # Successful match
} else {
    # Match attempt failed
}
Shameless plug: You can look here (regular expression replace a word by a link) for inspiration.
The question asked to replace some word with a certain link, unless there already was a link. So the problem you have is more or less the same thing.
All you need is a regex that matches a URL (in place of the word). The simplest assumption would be this: a URL (optionally) starts with "http://", "ftp://" or "mailto:" and lasts as long as there are no white-space characters, line breaks, tag brackets or quotes.
Beware, long regex ahead. Apply case-insensitively.
(href\s*=\s*['"]?)?((?:http://|ftp://|mailto:)?[^.,<>"'\s\r\n\t]+(?:\.(?![.<>"'\s\r\n])[^.,!<>"'\s\r\n\t]+)+)
Be warned - this will also match URLs that are technically invalid, and it will recognize things.formatted.like.this as a URL. Whether that is too permissive depends on your data. I can fine-tune the regex if you have examples where it returns false positives.
The regex will produce two match groups. Group 2 will contain the matched thing, which is most likely a URL. Group 1 will contain either an empty string or an 'href="'. You can use it as an indicator that this match occurred inside the href attribute of an existing link, so you don't have to touch that one.
Once you confirm that this does the right thing for you most of the time (with user supplied data, you can never be sure), you can do the rest in two steps, as I proposed it in the other question:
1. Make a link around every URL there is (unless there is something in match group 1!). This will produce double nested <a> tags for things that have a link already.
2. Scan for incorrectly nested <a> tags, removing the innermost one.
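The first of those two steps can be sketched in PHP as follows, using the two-group regex from above. The helper name linkUnlessInHref and the choice to prefix scheme-less matches with http:// are assumptions. The cleanup of nested <a> tags is not covered here.

```php
<?php
// Hypothetical helper: links every match whose group 1 (the href prefix) is empty.
function linkUnlessInHref(string $html): string
{
    $pattern = '~(href\s*=\s*[\'"]?)?((?:http://|ftp://|mailto:)?[^.,<>"\'\s\r\n\t]+'
             . '(?:\.(?![.<>"\'\s\r\n])[^.,!<>"\'\s\r\n\t]+)+)~i';

    $result = preg_replace_callback($pattern, function (array $m): string {
        if ($m[1] !== '') {
            return $m[0]; // matched inside an href attribute: leave it untouched
        }
        $url = $m[2];
        // Assumption: a match without a scheme is a web address.
        $href = preg_match('~^(?:http://|ftp://|mailto:)~i', $url) ? $url : 'http://' . $url;
        return '<a href="' . $href . '">' . $url . '</a>';
    }, $html);

    return $result ?? $html; // fall back to the input if the regex engine reports an error
}

echo linkUnlessInHref('See www.example.com and <a href="http://example.org">here</a>'), PHP_EOL;
```

A URL that appears as the visible text of an existing link has an empty group 1, so it is wrapped again here; that is the double nesting the second step is meant to clean up.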
