PHP Regex that matches URLs without invalid characters - php

I'm looking for a regex that will find URLs in a string but ignore preceding or following characters that are not part of the URL.
For example, from the string:
example.co.uk (main site: example.com),
the regex should find:
example.co.uk and example.com.
In order to find URLs within a given string, I use the Regex '#(www\.|https?://)?[a-z0-9]+\.[a-z0-9]{2,4}\S*#i'.
The problem is that with the string above, it finds example.co.uk and example.com) with the closing bracket at the end.
Is there a regex that can find URLs in a string, no matter what characters surround them?
Thanks!

You may have to use a word boundary (\b) at the end of the pattern:
(?:www\.|https?:\/\/)?[a-z0-9]+\.[a-z0-9]{2,4}\S*\b
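As a quick check of the word-boundary suggestion, here is a minimal sketch that runs the pattern against the string from the question. The # delimiters and the i flag are taken from the question's own pattern; the variable names are only illustrative.

```php
<?php
// Pattern from the answer, with the question's # delimiters and i flag.
// The trailing \b forces the match to end right after a word character,
// so \S* backtracks past ")" and ",".
$pattern = '#(?:www\.|https?://)?[a-z0-9]+\.[a-z0-9]{2,4}\S*\b#i';
$text = 'example.co.uk (main site: example.com),';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]); // the two URLs, without the closing bracket
```

One side effect to be aware of: because the match must end on a word character, a URL that legitimately ends in a non-word character, such as a trailing slash, loses that character.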

Related

Regex: retrieve from URL everything between www. and .com

I am trying to use PHP's preg_match() to retrieve everything between the www. and .com of a URL.
e.g.:
www.example.com will return example
www.example-website.com will return example-website
I'm lucky in that the URLs I'm working with always start www. and always end .com, so it doesn't need to be particularly complex, accounting for many use cases.
However, my Regex knowledge is minimal to none.
My try:
preg_match("/.([^.]*)./", $string, $matches);
According to RegExr, the second match ($matches[1]?) should contain what I need, but it doesn't seem to be working.
Thanks.
(?<=www\.)(.+?)(?=\.com)
Try this. Grab the capture. See demo.
http://regex101.com/r/iZ9sO5/10
You need to escape the dots in the regex.
preg_match("/www\.([^.]*)\.com/", $string, $matches);
. in a regex can match (nearly) any character, whereas \. matches only the literal dot within the URL.
Using www and com to delimit the part you want gives extra safety.
Example : http://regex101.com/r/aA5eC5/2
The first capture group (\1) will contain
example
example-website
EDIT
If the regex is to match strings with other . in it, something like www.example.somesite.com, then the regex can be modified as
preg_match("/www\.(.+)\.com/", $string, $matches);
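The two patterns above can be compared side by side. This is a minimal sketch; the helper name extractName and its $allowDots flag are hypothetical, introduced only to switch between the two patterns.

```php
<?php
// Hypothetical helper: returns the part between "www." and ".com", or null.
// $allowDots selects the (.+) variant, which accepts dots inside the name.
function extractName(string $url, bool $allowDots = false): ?string
{
    $pattern = $allowDots ? '/www\.(.+)\.com/' : '/www\.([^.]*)\.com/';

    return preg_match($pattern, $url, $matches) === 1 ? $matches[1] : null;
}

var_dump(extractName('www.example.com'));
var_dump(extractName('www.example-website.com'));
var_dump(extractName('www.example.somesite.com'));       // no match with [^.]*
var_dump(extractName('www.example.somesite.com', true)); // matches with .+
```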

How to exclude a word or string from an URL - Regex

I'm using the following Regex to match all types of URL in PHP (It works very well):
$reg_exUrl = "%\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%s";
But now, I want to exclude Youtube, youtu.be and Vimeo URLs:
I'm doing something like this after researching, but it is not working:
$reg_exUrl = "%\b(([\w-]+://?|www[.])(?!youtube|youtu|vimeo)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%s";
I want to do this because I have another regex that matches YouTube URLs and returns an iframe, and the two regexes conflict with each other.
Any help would be gratefully appreciated, thanks.
socodLib, to exclude something from a string, place yourself at the beginning of the string by anchoring with a ^ (or use another anchor) and use a negative lookahead to assert that the string doesn't contain a word, like so:
^(?!.*?(?:youtube|some other bad word|some\.string\.with\.dots))
Before we make the regex look too complex by concatenating it with yours, let's see what we would do if you wanted to match some word characters \w+ but not youtube or google. You would write:
^(?!.*?(?:youtube|google))\w+
As you can see, after the assertion (where we say what we don't want), we say what we do want by using the \w+
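To see the simple recipe in action, here is a small sketch using preg_match. The function name isAllowed and the sample words are assumptions made for illustration.

```php
<?php
// Hypothetical helper: true when $word starts with word characters and
// contains neither "youtube" nor "google" anywhere after the start.
function isAllowed(string $word): bool
{
    return preg_match('/^(?!.*?(?:youtube|google))\w+/', $word) === 1;
}

var_dump(isAllowed('example'));    // passes the lookahead
var_dump(isAllowed('myyoutube'));  // rejected: contains "youtube"
var_dump(isAllowed('googleplex')); // rejected: contains "google"
```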
In your case, let's add a negative lookahead to your initial regex (which I have not tuned):
$reg_exUrl = "%(?i)\b(?!.*?(?:youtu\.?be|vimeo))(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%s";
I took the liberty of making the regex case insensitive with (?i). You could also have added i to your s modifier at the end. The youtu\.?be expression allows for an optional dot.
I am certain you can apply this recipe to your expression and other regexes in the future.
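One thing to watch for when applying the recipe to text that contains several URLs: with .*? and the s modifier, the lookahead scans the rest of the whole subject, so a single video link later in the text also rejects unrelated URLs before it. The sketch below is a variation, not the answer's exact pattern: it swaps .*? for \S* so the lookahead only inspects the current whitespace-delimited token. The function name and sample text are assumptions.

```php
<?php
// Variation on the answer's pattern: the lookahead uses \S* instead of .*?,
// so only the URL starting at the current position is checked.
function findNonVideoUrls(string $text): array
{
    $pattern = '%\b(?!\S*(?:youtu\.?be|vimeo))'
        . '(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%i';

    preg_match_all($pattern, $text, $matches);

    return $matches[0];
}

$text = 'Docs: https://example.com/page, video: https://www.youtube.com/watch?v=abc '
    . 'and http://youtu.be/xyz or http://vimeo.com/123';

print_r(findNonVideoUrls($text)); // only the example.com URL remains
```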
Reference
Regex lookarounds
StackOverflow regex FAQ

PHP url routing regex

I am using a regular expression to route my application, this is the regex I am using:
#/users/([a-zA-Z0-9\-\_]+)/posts#
But unfortunately it also matches these URLs:
/users/:uid/posts/:pid
/users/:uid/posts/:pid/comment/:cid
But it shouldn't; it should match only this exact URL:
/users/:uid/posts
What should I change in the regex to make it match the exact same string?
Thanks for help
You should include anchors for the beginning (^) and end ($) of the string:
#^/users/([a-zA-Z0-9\-\_]+)/posts/?$#
I also allowed for an optional / at the end of the URL.
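A minimal sketch of how the anchored pattern behaves on the routes from the question; the function name matchUserPosts and the sample paths are hypothetical.

```php
<?php
// Hypothetical helper: returns the captured user id when the path is exactly
// /users/<id>/posts (with an optional trailing slash), otherwise null.
function matchUserPosts(string $path): ?string
{
    $pattern = '#^/users/([a-zA-Z0-9\-\_]+)/posts/?$#';

    return preg_match($pattern, $path, $matches) === 1 ? $matches[1] : null;
}

var_dump(matchUserPosts('/users/42/posts'));             // matches
var_dump(matchUserPosts('/users/42/posts/'));            // matches (trailing slash)
var_dump(matchUserPosts('/users/42/posts/7'));           // no match
var_dump(matchUserPosts('/users/42/posts/7/comment/3')); // no match
```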

Pull out second url from string via regex

I am looking for a way within PHP to pull the second link http://secure.hello.com out of the following string via regex.
http://hello.com/http://secure.hello.com
The string may or may not contain https in either of the spots.
Try this regular expression:
://\S*/(\w+://\S*)
It searches for the first ://, then for non-space characters up to a slash that is followed by some word characters and another ://, then for anything apart from spaces. The text you want is in the first capturing group.
In a PHP string literal it can be written as:
'#://\S*/(\w+://\S*)#'
See it working online: ideone
If you want to restrict to http or https, change the \w+ to https?.
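Applied with preg_match, the pattern might be used like this. The function name secondUrl is an assumption; the input strings follow the example in the question.

```php
<?php
// Hypothetical helper: returns the second URL embedded in the string, or null.
function secondUrl(string $input): ?string
{
    return preg_match('#://\S*/(\w+://\S*)#', $input, $matches) === 1
        ? $matches[1]
        : null;
}

var_dump(secondUrl('http://hello.com/http://secure.hello.com'));
var_dump(secondUrl('https://hello.com/https://secure.hello.com'));
var_dump(secondUrl('http://hello.com')); // no second URL
```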

php regular expression help finding multiple filenames only not full URL

I am trying to fix a regular expression I have been using in PHP. It finds all filenames within a sentence or paragraph. The file names always look like this: /this-a-valid-page.php
With help I received on Stack Overflow, my old pattern was modified to the one below, which avoids full URLs (the issue I was having), but it only finds one occurrence at the beginning of a string and nothing inside the string.
/^\/(.*?).php/
I have a live example here: http://vzio.com/upload/reg_pattern.php
Remove the ^ - the caret signifies the beginning of a string/line, which is why it's not matching elsewhere.
If you need to avoid full URLs, you might want to change the ^ to something like (?:^|\s) which will match either the beginning of the string or a whitespace character - just remember to strip whitespace from the beginning of your match later on.
The last dot in your expression could still cause problems, since it'll match "one anything". You could match, for example, /somefilename#php with that pattern. Backslash it to make it a literal period:
/\/(.*?)\.php/
Also note that the ? that makes .* non-greedy is necessary, and Arda Xi's pattern won't work. .* would race to the end of the string and then back up one character at a time until it can match the .php, which certainly isn't what you'd want.
To find all the occurrences, you'll have to remove the start anchor and use the preg_match_all function instead of preg_match :
if (preg_match_all('/\/(.*?)\.php/', $input, $matches)) {
    var_dump($matches[1]); // will print all filenames (after / and before .php)
}
Also . is a meta char. You'll have to escape it as \. to match a literal period.
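Putting the suggestions together, here is a minimal sketch that combines the (?:^|\s) prefix, the escaped dot, and preg_match_all. The function name findFilenames and the sample sentence are assumptions. Because the filename is read from the capture group, the leading whitespace matched by \s does not need to be stripped afterwards.

```php
<?php
// Hypothetical helper: collects names that appear as /name.php at the start
// of the string or right after whitespace, which skips paths inside full URLs.
function findFilenames(string $input): array
{
    preg_match_all('/(?:^|\s)\/(.*?)\.php/', $input, $matches);

    return $matches[1]; // text between the leading / and .php
}

$input = 'See /this-a-valid-page.php, not http://example.com/other.php, and also /second.php';
print_r(findFilenames($input));
```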
