Validate URL with or without protocol - php

Hi, I would like to validate the following URLs so that they all pass with or without the http/www part, as long as a TLD such as .com, .net or .org is present.
Valid URLs Should Be:
http://www.domain.com
http://domain.com
https://www.domain.com
https://domain.com
www.domain.com
domain.com
To support long tlds:
http://www.domain.com.uk
http://domain.com.uk
https://www.domain.com.uk
https://domain.com.uk
www.domain.com.uk
domain.com.uk
To support dashes (-):
http://www.domain-here.com
http://domain-here.com
https://www.domain-here.com
https://domain-here.com
www.domain-here.com
domain-here.com
Also to support numbers in domains:
http://www.domain1-test-here.com
http://domain1-test-here.com
https://www.domain1-test-here.com
https://domain1-test-here.com
www.domain1-test-here.com
domain1-test-here.com
Also maybe allow even IPs:
127.127.127.127
(but this is extra!)
Also allow dashes (-), forgot to mention that =)
I've found many functions that validate one or the other, but not both at the same time.
If anyone knows a good regex for it, please share. Thank you for your help.

This works well for URL validation.
The answer above is correct, but it does not work for all domains, such as .me, .it or .in,
so please use the pattern below to match URLs:
$pattern = '/(?:https?:\/\/)?(?:[a-zA-Z0-9.-]+?\.(?:[a-zA-Z])|\d+\.\d+\.\d+\.\d+)/';
if (preg_match($pattern, "http://website.in")) {
    echo "valid";
} else {
    echo "invalid";
}

When you ignore the path part and look for the domain part only, a simple rule would be
(?:https?://)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+)
If you want to support country-code TLDs as well, you must either supply a complete (current) list or append |.. (any two characters) to the TLD alternation.
With preg_match you must wrap the pattern in delimiters:
$pattern = ';(?:https?://)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+);';
$index = preg_match($pattern, $url);
Usually you use / as the delimiter. In this case slashes are part of the pattern, so I chose a different delimiter. Otherwise I would have to escape the slashes with \:
$pattern = '/(?:https?:\/\/)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+)/';
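To illustrate the point about delimiters, here is a minimal sketch that builds the same pattern once with ; and once with /. The helper name looks_like_domain and the added ^ and $ anchors are assumptions for illustration, not part of the answer. Without anchors, preg_match succeeds as soon as the pattern matches anywhere inside the string.

```php
<?php
// Pattern body taken from the answer above.
$body = '(?:https?://)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+)';

// With ';' as delimiter the slashes need no escaping.
$semicolon = ';^' . $body . '$;';
// With '/' as delimiter every slash in the body must be escaped.
$slash = '/^' . str_replace('/', '\/', $body) . '$/';

// Hypothetical helper: true only when the whole string matches the pattern.
function looks_like_domain($pattern, $url)
{
    return preg_match($pattern, $url) === 1;
}

foreach (array('domain.com', 'https://www.domain-here.com', 'domain.com.uk', 'foo.bar') as $url) {
    // Both delimiters describe the same pattern, so the two results always agree.
    echo $url, ': ', var_export(looks_like_domain($semicolon, $url), true),
        ' / ', var_export(looks_like_domain($slash, $url), true), "\n";
}
```

Note that domain.com.uk is rejected here because uk is not in the TLD list, which is the limitation the answer mentions.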

Don't use a regular expression. Not every problem that involves strings needs to use regexes.
Don't write your own URL validator. URL validation is a solved problem, and there is existing code that has already been written, debugged and tested. In fact, it comes standard with PHP.
Look at PHP's built-in filtering functionality: http://us2.php.net/manual/en/book.filter.php
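As a rough sketch of how the filter extension could cover the URLs in the question: FILTER_VALIDATE_URL insists on a scheme, so one option is to prepend http:// when it is missing. The helper name is_valid_url_optional_scheme and the "host must contain a dot" rule are assumptions derived from the question, not something the filter extension provides.

```php
<?php
// Hypothetical helper (not from the answer): accept URLs with or without a scheme.
function is_valid_url_optional_scheme($input)
{
    $candidate = trim($input);
    // FILTER_VALIDATE_URL requires a scheme, so add one when it is missing.
    if (!preg_match('#^https?://#i', $candidate)) {
        $candidate = 'http://' . $candidate;
    }
    if (filter_var($candidate, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Assumption taken from the question: the host must contain a dot.
    $host = parse_url($candidate, PHP_URL_HOST);
    return is_string($host) && strpos($host, '.') !== false;
}

foreach (array('http://www.domain.com', 'www.domain-here.com', 'domain.com.uk', 'domain', 'not a url') as $url) {
    echo $url, ': ', is_valid_url_optional_scheme($url) ? 'valid' : 'invalid', "\n";
}
```

This only checks the syntax. It does not verify that the TLD exists or that the host resolves.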

I think you can use flags with filter_var.
For FILTER_VALIDATE_URL there are several flags available:
FILTER_FLAG_SCHEME_REQUIRED Requires the URL to contain a scheme part.
FILTER_FLAG_HOST_REQUIRED Requires the URL to contain a host part.
FILTER_FLAG_PATH_REQUIRED Requires the URL to contain a path part.
FILTER_FLAG_QUERY_REQUIRED Requires the URL to contain a query string.
FILTER_FLAG_SCHEME_REQUIRED and FILTER_FLAG_HOST_REQUIRED are always implied by FILTER_VALIDATE_URL (both flags were deprecated in PHP 7.3 and removed in PHP 8.0), so the scheme check cannot be switched off with a bitmask.
Let's say you also want to require the path part. You can do something like this (the flag argument is a bitmask, so flags can be combined with |):
filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED)
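A short sketch of how the path and query flags behave, assuming the filter extension is available. The wrapper name url_passes and the sample URLs are made up for illustration. A URL without a scheme is rejected by FILTER_VALIDATE_URL whatever flags are passed.

```php
<?php
// Hypothetical wrapper: true when $url passes FILTER_VALIDATE_URL with the given flags.
function url_passes($url, $flags = 0)
{
    return filter_var($url, FILTER_VALIDATE_URL, $flags) !== false;
}

// Scheme and host only, no extra flags.
var_dump(url_passes('http://example.com'));
// Path required, but the URL has no path part.
var_dump(url_passes('http://example.com', FILTER_FLAG_PATH_REQUIRED));
// Path required and present.
var_dump(url_passes('http://example.com/index.php', FILTER_FLAG_PATH_REQUIRED));
// Two flags combined with | (bitmask).
var_dump(url_passes('http://example.com/?q=1', FILTER_FLAG_PATH_REQUIRED | FILTER_FLAG_QUERY_REQUIRED));
// No scheme: fails regardless of the flags.
var_dump(url_passes('example.com/index.php', FILTER_FLAG_PATH_REQUIRED));
```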

Related

PHP / RegEx - Convert URLs to links by detecting .com/.net/.org/.edu etc

I know there have been many questions asking for help converting URLs to clickable links in strings, but I haven't found quite what I'm looking for.
I want to be able to match any of the following examples and turn them into clickable links:
http://www.domain.com
https://www.domain.net
http://subdomain.domain.org
www.domain.com/folder
subdomain.domain.net
subdomain.domain.edu/folder/subfolder
domain.net
domain.com/folder
I do not want to match random.stuff.separated.with.periods.
EDIT: Please keep in mind that these URLs need to be found within larger strings of 'normal' text. For example, I want to match 'domain.net' in "Hello! Come check out domain.net!".
I think this could be accomplished with a regex that can determine whether the matching url contains .com, .net, .org, or .edu followed by either a forward slash or whitespace. Other than a user typo, I can't imagine any other case in which a valid URL would have one of those followed by anything else.
I realize there are many valid domain extensions out there, but I don't need to support them all. I can just choose which to support with something like (com|net|org|edu) in the regex. Unfortunately, I'm not skilled enough with regex yet to know how to properly implement this.
I'm hoping someone can help me find a regular expression (for use with PHP's preg_replace) that can match URLs based on just about any text connected by one or more dots and either ending with one of the specified extensions followed by whitespace OR containing one of the specified extensions followed by a slash and possibly folders.
I did several searches and so far have not found what I'm looking for. If there already exists a SO post that answers this, I apologize.
Thanks in advance.
--- EDIT 3 ---
After days of trial and error and some help from SO, here's what works:
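preg_replace_callback('#(\s|^)((https?://)?(\w|-)+(\.(\w+|-)*)+(?<=\.net|\.org|\.edu|\.com|\.cc|\.br|\.jp|\.dk|\.gs|\.de)(\:[0-9]+)?(?:/[^\s]*)?)(?=\s|\b)#is',
function ($m) { // a closure replaces create_function(), which was removed in PHP 8
// Prepend http:// to the href when the matched URL has no scheme.
$href = preg_match('#^https?://#i', $m[2]) ? $m[2] : 'http://' . $m[2];
return $m[1] . '<a href="' . $href . '">' . $m[2] . '</a>'; },
$event_desc);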
This is a modified version of anubhava's code below and so far seems to do exactly what I want. Thanks!
You can use this regex:
#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is
Code:
$arr = array(
'http://www.domain.com/?foo=bar',
'http://www.that"sallfolks.com',
'This is really cool site: https://www.domain.net/ isn\'t it?',
'http://subdomain.domain.org',
'www.domain.com/folder',
'Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself',
'subdomain.domain.net',
'subdomain.domain.edu/folder/subfolder',
'Hello! Check out my site at domain.net!',
'welcome.to.computers',
'Hello.Come visit oursite.com!',
'foo.bar',
'domain.com/folder',
);
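foreach ($arr as $url) {
    $link = preg_replace_callback('#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is',
        // A closure replaces create_function(), which was removed in PHP 8.
        function ($m) {
            // Prepend http:// to the href when the matched URL has no scheme.
            $href = preg_match('#^https?://#i', $m[2]) ? $m[2] : 'http://' . $m[2];
            return $m[1] . '<a href="' . $href . '">' . $m[2] . '</a>';
        },
        $url);
    echo $link . "\n";
}
OUTPUT:
<a href="http://www.domain.com/?foo=bar">http://www.domain.com/?foo=bar</a>
http://www.that"sallfolks.com
This is really cool site: <a href="https://www.domain.net/">https://www.domain.net/</a> isn't it?
<a href="http://subdomain.domain.org">http://subdomain.domain.org</a>
<a href="http://www.domain.com/folder">www.domain.com/folder</a>
Hello! You can visit <a href="http://vertigofx.com/mysite/rocks">vertigofx.com/mysite/rocks</a> for some awesome pictures, or just go to <a href="http://vertigofx.com">vertigofx.com</a> by itself
<a href="http://subdomain.domain.net">subdomain.domain.net</a>
<a href="http://subdomain.domain.edu/folder/subfolder">subdomain.domain.edu/folder/subfolder</a>
Hello! Check out my site at <a href="http://domain.net">domain.net</a>!
welcome.to.computers
Hello.Come visit <a href="http://oursite.com">oursite.com</a>!
foo.bar
<a href="http://domain.com/folder">domain.com/folder</a>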
PS: This regex only supports the http and https schemes. If you want to support ftp as well, for example, you need to modify the regex a little.
'/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/]*/'
That works for your examples. You might want to add extra characters support for "-", "&", "?", ":", etc in the last bracket.
'/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/\?=&-;]*/'
This will support parameters and port numbers,
e.g.: www.foo.com:8888/test?param1=val1&param2=val2
Thanks a ton. I modified his final solution to allow all domains (.ca, .co.uk), not just the specified ones.
$html = preg_replace_callback('#(\s|^)((https?://)?(\w|-)+(\.[a-z]{2,3})+(\:[0-9]+)?(?:/[^\s]*)?)(?=\s|\b)#is',
function ($m) { // a closure replaces create_function(), which was removed in PHP 8
$href = preg_match('#^https?://#i', $m[2]) ? $m[2] : 'http://' . $m[2];
return $m[1] . '<a href="' . $href . '">' . $m[2] . '</a>'; },
$url);

Regexing URLs with and without a protocol in PHP

So I've got this URL regex:
/(?:((?:[^-/"':!=a-z0-9_#]|^|\:))((https?://)((?:[^\p{P}\p{Lo}\s].-|[^\p{P}\p{Lo}\s])+.[a-z]{2,}(?::[0-9]+)?)(/(?:(?:([a-z0-9!*';:=+\$/%#[]-_,~]+))|#[a-z0-9!*';:=+\$/%#[]-_,~]+/|[.\,]?(?:[a-z0-9!*';:=+\$/%#[]-_~]|,(?!\s)))*[a-z0-9=#/]?)?(\?[a-z0-9!*'();:&=+\$/%#[]-_.,~]*[a-z0-9_&=#/])?))/iux
What it's currently matching:
http://www.google.com
http://google.com
I need it to also match:
www.google.com
google.com
I tried making the protocol part of the regex optional by slapping a ? at the end "(https?:\/\/)?" but that didn't do anything.
Ideas?
I'd look for something in the language that you are using to do this. URLs are tough to match with a regex. If you insist, I changed yours to make the (https?://) optional. I did not check it though.
/(?:((?:[^-/"':!=a-z0-9_#]|^|\:))((https?://)?((?:[^\p{P}\p{Lo}\s].-|[^\p{P}\p{Lo}\s])+.[a-z]{2,}(?::[0-9]+)?)(/(?:(?:([a-z0-9!*';:=+\$/%#[]-_,~]+))|#[a-z0-9!*';:=+\$/%#[]-_,~]+/|[.\,]?(?:[a-z0-9!*';:=+\$/%#[]-_~]|,(?!\s)))*[a-z0-9=#/]?)?(\?[a-z0-9!*'();:&=+\$/%#[]-_.,~]*[a-z0-9_&=#/])?))/iux
I got this example from the RFC 3986 and was directed there by this comment. Although, I'd still recommend using something from whatever language you are using rather than a regex.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
Since you are using PHP, did you consider using parse_url? It returns false on seriously malformed URLs, although it is not meant to validate the URL otherwise.
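A small sketch of the parse_url route for the www.google.com and google.com cases. Without a scheme, parse_url treats google.com as a path rather than a host, so this sketch prepends http:// first. The helper name host_of is hypothetical.

```php
<?php
// Hypothetical helper: return the host of a URL that may lack a scheme, or null.
function host_of($input)
{
    $candidate = trim($input);
    // Without "scheme://" parse_url() reports "google.com" as a path, not a host.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $candidate)) {
        $candidate = 'http://' . $candidate;
    }
    $host = parse_url($candidate, PHP_URL_HOST);
    // parse_url() returns null for a missing host and false for malformed input.
    return is_string($host) ? $host : null;
}

foreach (array('http://www.google.com', 'www.google.com/search?q=php', 'google.com') as $url) {
    echo $url, ' => ', var_export(host_of($url), true), "\n";
}
```

The returned host still needs its own check (for example, that it contains a dot), since parse_url only splits the string.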

Checking for valid web address, regular expressions in PHP

I have this text input, and I need to check whether the string is a valid web address, like http://www.example.com. How can this be done with regular expressions in PHP?
Use the filter extension:
filter_var($url, FILTER_VALIDATE_URL);
This will be far more robust than any regex you can write.
Found this:
(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
From Here:
A regex that validates a web address and matches an empty string?
You need to first understand a web address before you can begin to parse it effectively. Yes, http://www.example.com is a valid address. So is www.example.com. Or example.com. Or http://example.com. Or prefix.example.com.
Have a look at the specifications for a URI, especially the Syntax components.
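To make the syntax components concrete, here is a sketch that splits a string with the component regex from RFC 3986, Appendix B. The helper name split_uri is hypothetical. It shows why www.example.com on its own is only a relative reference as far as the specification is concerned: the whole string lands in the path component.

```php
<?php
// Hypothetical helper: split a URI reference with the RFC 3986 Appendix B regex.
function split_uri($uri)
{
    // '~' is used as delimiter because the pattern contains '/' and '#'.
    $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';
    preg_match($pattern, $uri, $m);
    // Trailing groups that did not take part in the match are absent from $m.
    return array(
        'scheme'    => isset($m[2]) ? $m[2] : '',
        'authority' => isset($m[4]) ? $m[4] : '',
        'path'      => isset($m[5]) ? $m[5] : '',
        'query'     => isset($m[7]) ? $m[7] : '',
        'fragment'  => isset($m[9]) ? $m[9] : '',
    );
}

print_r(split_uri('http://www.example.com/a/b?x=1#top'));
print_r(split_uri('www.example.com'));
```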
I found the below from http://www.roscripts.com/PHP_regular_expressions_examples-136.html
//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into backreferences 1 through 4
'\b((?#protocol)https?|ftp)://((?#domain)[-A-Z0-9.]+)((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?((?#parameters)\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'
//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into named capturing groups.
//Works as it is with .NET, and after conversion by RegexBuddy on the Use page with Python, PHP/preg and PCRE.
'\b(?<protocol>https?|ftp)://(?<domain>[-A-Z0-9.]+)(?<file>/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(?<parameters>\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'
//URL: Find in full text
//The final character class makes sure that if an URL is part of some text, punctuation such as a
//comma or full stop after the URL is not interpreted as part of the URL.
'\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]'
//URL: Replace URLs with HTML links
//preg_replace needs delimiters, so / is used here with the slashes escaped; i makes the match case-insensitive.
preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]/i', '<a href="\0">\0</a>', $text);
In most cases you don't have to check if a string is a valid address.
Either it is, and a web site will be available or it won't be and the user will simply go back.
You should only escape illegal characters to avoid XSS. If your user doesn't want to give a valid website, it should be his problem.
(In most cases).
PS: If you still want to check URLs, look at nikic's answer.
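A minimal sketch of the escaping step mentioned above, assuming the address is echoed back as a link. The helper name render_link is hypothetical. Escaping alone does not block schemes such as javascript:, so this is only the "escape illegal characters" part.

```php
<?php
// Hypothetical helper: render a user-supplied address as a link.
function render_link($url)
{
    // ENT_QUOTES escapes both quote styles so the value cannot break out of the attribute.
    $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    return '<a href="' . $safe . '">' . $safe . '</a>';
}

echo render_link('http://example.com/?a=1&b=2'), "\n";
echo render_link('http://example.com/"><script>alert(1)</script>'), "\n";
```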
To match more protocols, you can do:
((https?|s?ftp|gopher|telnet|file|notes|ms-help)://)?[\w:#@%/;$()~=\.&-]+

Detecting .com / .co.uk etc etc

I currently have a preg_match to detect http:// and www. etc..... but I want to detect domain.com or domain.co.uk from a string
Example string: "Hey hows it going, check out domain.com". I want to detect domain.com.
What I want is to detect any major domains from this string, i.e. .com, .co.uk, .eu etc., in the form example.com or example2.co.uk, and then return true or false to handle it. In this case it would find domain.com.
However I do NOT want it to detect something like:
"hey.i love this site"
This is obviously a typo where the space after the full stop is missing!
Any ideas? I need to brush up on my regex!
Thanks,
Stefan
Since non-Latin URLs were introduced, it is close to impossible to get a completely working filter with a regex, so I'd say it's not even worth trying to use a regex for this anymore. I doubt parse_url() supports them yet either, but using it means someone else has to work out the problems with non-Latin URLs, which is always a bonus :) So use that:
http://au.php.net/parse_url
http://thenextweb.com/me/2010/05/06/monumental-day-internet-nonlatin-domain-names-live/
Edit:
Ok, from a string, split it into words like this
$array = explode(" ", $string);
$url = array();
for ($i = 0; $i < count($array); $i++)
{
    // parse_url() returns false only for seriously malformed input.
    if (parse_url($array[$i]) !== false)
    {
        $url[] = $array[$i];
    }
}
Ok, parse_url() isn't supposed to be used like this, but there is no other function built into php to do url filtering as far as I can see.
Here is regexp that would match a provided list of domain zones:
[a-z0-9\-\.]+\.(com|co\.uk|net|org)
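Here is a sketch of how that pattern could be used on the example strings from the question. The function name find_domains is hypothetical, and the \b anchors and the i flag are additions that keep the match from stopping in the middle of a word.

```php
<?php
// Hypothetical helper: return every domain in $text that ends in a listed zone.
function find_domains($text)
{
    // Core pattern from the answer, plus word boundaries and case-insensitivity.
    preg_match_all('/\b[a-z0-9\-\.]+\.(?:com|co\.uk|net|org)\b/i', $text, $m);
    return $m[0];
}

print_r(find_domains('Hey hows it going, check out domain.com'));
print_r(find_domains('hey.i love this site'));
print_r(find_domains('example2.co.uk and example.com'));
```

An empty result can be treated as "false" and a non-empty one as "true" for the handling described in the question.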

PHP regex for validating a URL

I'm looking for a decent regex to match a URL (a full URL with scheme, domain, path etc.)
I would normally use filter_var but I can't in this case as I have to support PHP<5.2!
I've searched the web but can't find anything that I'm confident will be fool-proof, and all I can find on SO is people saying to use filter_var.
Does anybody have a regex that they use for this?
My code (just so you can see what I'm trying to achieve):
function validate_url($url){
    if (function_exists('filter_var')){
        return filter_var($url, FILTER_VALIDATE_URL);
    }
    return preg_match(REGEX_HERE, $url);
}
I have created a solution for validating the domain. While it does not specifically cover the entire URL, it is very detailed and specific. The question you need to ask yourself is, "Why am I validating a domain?" If it is to see if the domain actually could exist, then you need to confirm the domain (including valid TLDs). The problem is, too many developers take the shortcut of ([a-z]{2,4}) and call it good. If you think along these lines, then why call it URL validation? It's not. It's just passing the URL through a regex.
I have an open source class that will allow you to validate the domain not only using the single source for TLD management (iana.org), but it will also validate the domain via DNS records to make sure it actually exists. The DNS validating is optional, but the domain will be specifically valid based on TLD.
For example: example.ay is NOT a valid domain as the .ay TLD is invalid. But using the regex posted here ([a-z]{2,4}), it would pass. I have an affinity for quality. I try to express that in the code I write. Others may not really care. So if you want to simply "check" the URL, you can use the examples listed in these responses. If you actually want to validate the domain in the URL, you can have at the class I created to do just that. It can be downloaded at:
http://code.google.com/p/blogchuck/source/browse/trunk/domains.php
It validates based on the RFCs that "govern" (using the term loosely) what determines a valid domain. In a nutshell, here is what the domains class will do:
Basic rules of the domain validation
must be at least one character long
must start with a letter or number
contains letters, numbers, and hyphens
must end in a letter or number
may contain multiple nodes (i.e. node1.node2.node3)
each node can only be 63 characters long max
total domain name can only be 255 characters long max
must end in a valid TLD
can be an IPv4 address
It will also download a copy of the master TLD file from iana.org, but only after checking your local copy. If your local copy is more than 30 days old, it will download a new copy. The TLDs in the file will be used in the REGEX to validate the TLD in the domain you are validating. This prevents the .ay (and other invalid TLDs) from passing validation.
This is a lengthy bit of code, but very compact considering what it does. And it is the most accurate. That's why I asked the question earlier. Do you want to do "validation" or simple "checking"?
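The basic rules listed in this answer can be sketched in a few lines. This is not the class the answer links to. The function name is_valid_domain_name is hypothetical, the TLD list is a tiny placeholder instead of the IANA file, and requiring at least two nodes is an extra assumption.

```php
<?php
// Hypothetical sketch of the basic rules; $tlds stands in for the IANA TLD list.
function is_valid_domain_name($domain, array $tlds = array('com', 'net', 'org', 'uk'))
{
    // Rule: the value can be an IPv4 address.
    if (filter_var($domain, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
        return true;
    }
    // Rule: at least one character, at most 255 characters in total.
    if (strlen($domain) < 1 || strlen($domain) > 255) {
        return false;
    }
    $nodes = explode('.', strtolower($domain));
    foreach ($nodes as $node) {
        // Rules: 1 to 63 characters of letters, digits and hyphens,
        // starting and ending with a letter or digit.
        if (!preg_match('/^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$/D', $node)) {
            return false;
        }
    }
    // Rule: must end in a valid TLD (assumption: at least two nodes).
    return count($nodes) > 1 && in_array(end($nodes), $tlds, true);
}

foreach (array('example.com', 'example.ay', 'node1.node2.example.org', '-bad.com') as $d) {
    echo $d, ': ', is_valid_domain_name($d) ? 'valid' : 'invalid', "\n";
}
```

DNS lookups and the automatic TLD file refresh described above are left out of this sketch.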
You could try this one. I haven't tried it myself but it's surely the biggest regexp I've ever seen, haha.
^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$
!(https?://)?([-_a-z0-9]+\.)*([-_a-z0-9]+)\.([a-z]{2,4})(/?)(.*)!i
I use this regular expression for validating URLs. So far it didn't fail me a single time :)
I've seen a regex that could actually validate any kind of valid URL but it was two pages long...
You're probably better off parsing the url with parse_url and then checking if all of your required bits are in order.
Addition:
This is a snip of my URL class:
public static function IsUrl($test)
{
    // A URL cannot contain spaces.
    if (strpos($test, ' ') !== false)
    {
        return false;
    }
    if (strpos($test, '.') > 1)
    {
        // @ suppresses warnings from parse_url on malformed input.
        $check = @parse_url($test);
        return is_array($check)
            && isset($check['scheme'])
            && isset($check['host']) && count(explode('.', $check['host'])) > 1;
    }
    return false;
}
It tests the given string and requires some basics in the url, namely that the scheme is set and the hostname has a dot in it.
