URL Validation Regex - php

As far as I know there are many other questions similar to this title, but my main reason for asking is that I want my validation to be exactly as strict as I need. Here is which URLs should be valid:
http:// (if given then match otherwise skip),
domain.com (should match & return validate)
subdomain.domain.com (should match & return validate)
www.com (should return false)
http://www.com (should return false)
I searched a lot for a regex pattern that fits my needs but didn't succeed, so I wrote my own and am posting it here to find out whether it would reject any other valid URL besides http://localhost.
If yes then please correct me.
Pattern:
((?:http|https|ftp)://)?(?:www.)?((?!www)[A-Z0-9][A-Z0-9_-]*(?:.[A-Z0-9][A-Z0-9_-]*)+):?(\d+)?/?

I know this doesn't answer your question directly, but regexes aside, you can also use filter_var() with the FILTER_VALIDATE_URL filter, which returns the URL if it is valid, or FALSE otherwise:
var_dump(filter_var('http://example.com', FILTER_VALIDATE_URL));
// string(18) "http://example.com"
The PHP manual lists the filters used by this function; see especially the last row, on the flags accepted by the FILTER_VALIDATE_URL filter.
I don't actually know how it's implemented internally, but I suppose it works better than many of the regexes you can find out in the wild.
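Since filter_var() requires a scheme, one way to get closer to the behaviour asked for in the question is to add a scheme when it is missing and then inspect the host yourself. The following is only a sketch: the helper name looks_like_url() is hypothetical, and the "strip www. and require a dot" rule is my own reading of the www.com requirement, not something filter_var() does.

```php
<?php
// Hypothetical helper, not from the thread: the scheme is optional, the host
// must contain a dot, and a bare "www.com" is rejected as in the question.
function looks_like_url(string $input): bool
{
    // filter_var() only accepts URLs with a scheme, so add one if it is missing.
    $candidate = preg_match('#^[a-z][a-z0-9+.-]*://#i', $input)
        ? $input
        : 'http://' . $input;

    if (filter_var($candidate, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    $host = strtolower((string) parse_url($candidate, PHP_URL_HOST));

    // Drop a leading "www." and require at least one dot in what remains.
    $bare = preg_replace('/^www\./', '', $host);

    return strpos($bare, '.') !== false;
}
```

With this sketch, domain.com and subdomain.domain.com are accepted, while www.com, http://www.com and http://localhost are rejected.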


How to use PHP to recognise (not validate) a URL

I'm aware of filter_var() and its FILTER_VALIDATE_URL filter. The point is that some URLs exist but do not count as valid, and I need to handle them too. For example, URLs with spaces.
At the moment I am checking only the protocols that the application is interested in (http, https and ftp) using strpos().
But I was wondering if there is a more generic method in PHP that I could employ?
It might help if I explain that I need to tell whether the target source is a URL or a local path.
Use the parse_url() function to split the URL into components, then do some basic analysis on the pieces it returns (or just check whether the returned value is an array or FALSE).
As the documentation says:
This function is not meant to validate the given URL, it only breaks it up into the above listed parts. Partial URLs are also accepted, parse_url() tries its best to parse them correctly.
And also:
On seriously malformed URLs, parse_url() may return FALSE.
It looks like it matches your request pretty well.
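A rough sketch of that idea, for telling a URL apart from a local path. The function name is_remote_source() and the default scheme list are assumptions made for illustration; requiring both a scheme and a host is one possible rule, not the only one.

```php
<?php
// Hypothetical helper: treat the source as a URL only when parse_url()
// finds both a scheme we care about and a host.
function is_remote_source(string $source, array $schemes = ['http', 'https', 'ftp']): bool
{
    $parts = parse_url($source);

    // parse_url() may return false on seriously malformed input.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }

    return in_array(strtolower($parts['scheme']), $schemes, true);
}
```

A plain path such as /var/www/file.txt has no scheme or host, so it falls through as a local path.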

PHP Validate URL

I know this question has been asked many times before, but none of the options seem to work for me (including the regexes and PHP's filter_var). I'm looking for a way to make sure a URL is a valid possibility before continuing. Below are examples of valid URLs:
google.com
www.google.com
http://www.google.com
https://www.google.com
http://google.com
https://google.com
domain-site.com
google.net
google.ca
google.email
IPs (ex. 198.XX.XXX.XXX)
Here are invalid URLs
test
http://test
https://test
Below are links/things I've tried to get the result I want above:
PHP regex for url validation, filter_var is too permisive
filter_var($url_new, FILTER_VALIDATE_URL)
preg_match('#^((https?)://(?:([a-z0-9-.]+:[a-z0-9-.]+)@)?([a-z0-9-.]+)(?::([0-9]+))?)(?:/|$)((?:[^?/]*/)*)([^?]*)(?:\?([^\#]*))?(?:\#.*)?$#i', $toLoad, $tmp)
etc
I know I can use cURL to check whether a domain exists, and this may be the easiest option, but it's also very slow. I would prefer a regex-type solution that gives me a preliminary check of the domain's validity.
Any help is appreciated!

PHP Find a URL in any form within a string

I'm looking for some form of input security on a project I'm working on.
Basically, I wish to flag text if the user has entered any form of URL.
E.g. 'For more of my pic visit myhotpic.net'
It would detect the URL, and then I can flag the string for validation by staff.
So I would need to check for any form of a URL.
There is a similar question here
Finding urls from text string via php and regex?
with an answer. But I have tried this with various strings and I do not get the expected results.
For example
$pattern = '#(www\.|https?:\/\/){?}[a-zA-Z0-9]{2,254}\.[a-zA-Z0-9]{2,4}(\S*)#i';
$count = preg_match_all($pattern, 'http://www.Imaurl.com', $matches, PREG_PATTERN_ORDER);
returns matches as
array(3) {
[0]=>
array(0) {
}
[1]=>
array(0) {
}
[2]=>
array(0) {
}
}
and no error is returned by preg_last_error()
Why is this not working? Is there an error in the regex? I would assume it to be fine, as other users have had success with it.
I cannot seem to find a suitable answer for my problem anywhere else.
In the regex, change {?} to just ?. Then it will work. No idea what {?} is supposed to mean (I've never seen anything like that).
Your regex will work fine for some URLs, but you should be aware that URLs can be much more complicated than you might assume, and a regex that can match every URL is VERY complex. You might want to look up a better regex—you only need one complicated enough to handle the sorts of URLs you're expecting to match.
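For what it's worth, my understanding is that PCRE reads {?} as an optional literal { followed by a literal }, so the original pattern demanded a } right after the prefix and never matched. Here is the pattern from the question with only that change applied, as a quick check rather than a recommended URL matcher:

```php
<?php
// The pattern from the question with {?} replaced by a plain ? quantifier.
$pattern = '#(www\.|https?:\/\/)?[a-zA-Z0-9]{2,254}\.[a-zA-Z0-9]{2,4}(\S*)#i';

// $matches[0] holds the full matches, $matches[1] and $matches[2] the groups.
$count = preg_match_all($pattern, 'http://www.Imaurl.com', $matches, PREG_PATTERN_ORDER);

// The same pattern can flag free text that contains something URL-like.
$flagged = preg_match($pattern, 'For more of my pic visit myhotpic.net') === 1;
```

After the change, $count is 1 and $matches[0][0] is the whole input string.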
Just to add a little work on this specific question:
I took the original regex as given by the OP and made some tweaks to it.
This is NOT perfect but does improve upon the original.
Added a negative lookbehind to avoid domains preceded by @ (such as email addresses)
Removed the incorrect {?}
Made the http or www a requirement rather than optional.
Added _ and - characters to the accepted URL character set (I know this concept overall can be greatly expanded upon).
so:
#(?<!@)(www\.|https?:\/\/)[a-z0-9-_]{2,254}\.[a-z0-9]{2,4}(\S*)#i
(PHP has no g modifier; use preg_match_all() to find every match.)
Example:
check out my facebook www.prop-ERty-bg.ru/11be check out my facebook
www.property-bg.ru/11be horsae@microsoft.com
This catches both www URLs but avoids the email address.
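To try the tweaked pattern from PHP, something like the following should do. It assumes the lookbehind is meant to exclude a preceding @, and it writes the character class as [a-z0-9_-] with the hyphen last, which accepts the same characters.

```php
<?php
// Assumed PHP adaptation of the tweaked pattern: no "g" modifier, and the
// negative lookbehind rejects a match that directly follows "@".
$pattern = '#(?<!@)(www\.|https?://)[a-z0-9_-]{2,254}\.[a-z0-9]{2,4}(\S*)#i';

$text = 'check out my facebook www.prop-ERty-bg.ru/11be check out my facebook '
      . 'www.property-bg.ru/11be horsae@microsoft.com';

// preg_match_all() returns the number of full matches and fills $matches.
$count = preg_match_all($pattern, $text, $matches);
$urls = $matches[0];
```

Here $urls ends up holding the two www addresses, and the email address is left alone because it has no www. or http prefix.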

PHP / RegEx - Convert URLs to links by detecting .com/.net/.org/.edu etc

I know there have been many questions asking for help converting URLs to clickable links in strings, but I haven't found quite what I'm looking for.
I want to be able to match any of the following examples and turn them into clickable links:
http://www.domain.com
https://www.domain.net
http://subdomain.domain.org
www.domain.com/folder
subdomain.domain.net
subdomain.domain.edu/folder/subfolder
domain.net
domain.com/folder
I do not want to match random.stuff.separated.with.periods.
EDIT: Please keep in mind that these URLs need to be found within larger strings of 'normal' text. For example, I want to match 'domain.net' in "Hello! Come check out domain.net!".
I think this could be accomplished with a regex that can determine whether the matching url contains .com, .net, .org, or .edu followed by either a forward slash or whitespace. Other than a user typo, I can't imagine any other case in which a valid URL would have one of those followed by anything else.
I realize there are many valid domain extensions out there, but I don't need to support them all. I can just choose which to support with something like (com|net|org|edu) in the regex. Unfortunately, I'm not skilled enough with regex yet to know how to properly implement this.
I'm hoping someone can help me find a regular expression (for use with PHP's preg_replace) that can match URLs based on just about any text connected by one or more dots and either ending with one of the specified extensions followed by whitespace OR containing one of the specified extensions followed by a slash and possibly folders.
I did several searches and so far have not found what I'm looking for. If there already exists a SO post that answers this, I apologize.
Thanks in advance.
--- EDIT 3 ---
After days of trial and error and some help from SO, here's what works:
preg_replace_callback('#(\s|^)((https?://)?(\w|-)+(\.(\w+|-)*)+(?<=\.net|org|edu|com|cc|br|jp|dk|gs|de)(\:[0-9]+)?(?:/[^\s]*)?)(?=\s|\b)#is',
    function ($m) {
        // Add http:// to the href when the matched text has no scheme.
        $href = preg_match('#^https?://#', $m[2]) ? $m[2] : 'http://' . $m[2];
        return $m[1] . '<a href="' . $href . '">' . $m[2] . '</a>';
    },
    $event_desc);
This is a modified version of anubhava's code below and so far seems to do exactly what I want. Thanks!
You can use this regex:
#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is
Code:
$arr = array(
'http://www.domain.com/?foo=bar',
'http://www.that"sallfolks.com',
'This is really cool site: https://www.domain.net/ isn\'t it?',
'http://subdomain.domain.org',
'www.domain.com/folder',
'Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself',
'subdomain.domain.net',
'subdomain.domain.edu/folder/subfolder',
'Hello! Check out my site at domain.net!',
'welcome.to.computers',
'Hello.Come visit oursite.com!',
'foo.bar',
'domain.com/folder',
);
foreach ($arr as $url) {
    $link = preg_replace_callback('#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is',
        function ($m) {
            // Add http:// to the href when the matched text has no scheme.
            $href = preg_match('#^https?://#', $m[2]) ? $m[2] : 'http://' . $m[2];
            return $m[1] . '<a href="' . $href . '">' . $m[2] . '</a>';
        },
        $url);
    echo $link . "\n";
}
OUTPUT (shown as rendered text):
http://www.domain.com/?foo=bar
http://www.that"sallfolks.com
This is really cool site: https://www.domain.net/ isn't it?
http://subdomain.domain.org
www.domain.com/folder
Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself
subdomain.domain.net
subdomain.domain.edu/folder/subfolder
Hello! Check out my site at domain.net!
welcome.to.computers
Hello.Come visit oursite.com!
foo.bar
domain.com/folder
PS: This regex only supports the http and https schemes. If you want to support ftp as well, for example, you need to modify the regex a little.
'/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/]*/'
That works for your examples. You might want to add extra characters support for "-", "&", "?", ":", etc in the last bracket.
'/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/\?=&-;]*/'
This will support parameters and port numbers.
eg.: www.foo.ca:8888/test?param1=val1&param2=val2
Thanks a ton. I modified his final solution to allow all domains (.ca, .co.uk), not just the specified ones.
$html = preg_replace_callback('#(\s|^)((https?://)?(\w|-)+(\.[a-z]{2,3})+(\:[0-9]+)?(?:/[^\s]*)?)(?=\s|\b)#is',
    function ($m) {
        $href = preg_match('#^https?://#', $m[2]) ? $m[2] : 'http://' . $m[2];
        return $m[1] . '<a href="' . $href . '">' . $m[2] . '</a>';
    },
    $url);

PHP regex for validating a URL

I'm looking for a decent regex to match a URL (a full URL with scheme, domain, path etc.)
I would normally use filter_var but I can't in this case as I have to support PHP<5.2!
I've searched the web but can't find anything that I'm confident will be fool-proof, and all I can find on SO is people saying to use filter_var.
Does anybody have a regex that they use for this?
My code (just so you can see what I'm trying to achieve):
function validate_url($url){
if (function_exists('filter_var')){
return filter_var($url, FILTER_VALIDATE_URL);
}
return preg_match(REGEX_HERE, $url);
}
I have created a solution for validating the domain. While it does not specifically cover the entire URL, it is very detailed and specific. The question you need to ask yourself is, "Why am I validating a domain?" If it is to see if the domain actually could exist, then you need to confirm the domain (including valid TLDs). The problem is, too many developers take the shortcut of ([a-z]{2,4}) and call it good. If you think along these lines, then why call it URL validation? It's not. It's just passing the URL through a regex.
I have an open source class that will allow you to validate the domain not only using the single source for TLD management (iana.org), but it will also validate the domain via DNS records to make sure it actually exists. The DNS validating is optional, but the domain will be specifically valid based on TLD.
For example: example.ay is NOT a valid domain as the .ay TLD is invalid. But using the regex posted here ([a-z]{2,4}), it would pass. I have an affinity for quality. I try to express that in the code I write. Others may not really care. So if you want to simply "check" the URL, you can use the examples listed in these responses. If you actually want to validate the domain in the URL, you can have at the class I created to do just that. It can be downloaded at:
http://code.google.com/p/blogchuck/source/browse/trunk/domains.php
It validates based on the RFCs that "govern" (using the term loosely) what determines a valid domain. In a nutshell, here is what the domains class will do:
Basic rules of the domain validation
must be at least one character long
must start with a letter or number
contains letters, numbers, and hyphens
must end in a letter or number
may contain multiple nodes (i.e. node1.node2.node3)
each node can only be 63 characters long max
total domain name can only be 255 characters long max
must end in a valid TLD
can be an IPv4 address
It will also download a copy of the master TLD file from iana.org, but only after checking your local copy. If your local copy is more than 30 days old, it will download a new one. The TLDs in the file will be used in the regex to validate the TLD in the domain you are validating. This prevents .ay (and other invalid TLDs) from passing validation.
This is a lengthy bit of code, but very compact considering what it does. And it is the most accurate. That's why I asked the question earlier. Do you want to do "validation" or simple "checking"?
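To give a feel for the basic rules listed above, here is a much smaller sketch. It is not the class linked above: is_valid_domain() is a hypothetical name, the default $validTlds array is a tiny made-up sample standing in for the IANA list, and requiring at least two nodes is my own assumption. The DNS check is left out entirely.

```php
<?php
// Sketch only: $validTlds is a tiny hypothetical sample, not the IANA list.
function is_valid_domain(string $domain, array $validTlds = ['com', 'net', 'org']): bool
{
    // An IPv4 address is accepted as-is.
    if (filter_var($domain, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
        return true;
    }

    // At least one character, at most 255 in total.
    if ($domain === '' || strlen($domain) > 255) {
        return false;
    }

    $nodes = explode('.', strtolower($domain));
    foreach ($nodes as $node) {
        // 1 to 63 characters: letters, digits and hyphens,
        // starting and ending with a letter or digit.
        if (!preg_match('/^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$/D', $node)) {
            return false;
        }
    }

    // Assumption: at least two nodes, and the last one must be a known TLD.
    return count($nodes) > 1 && in_array(end($nodes), $validTlds, true);
}
```

With the sample list, example.com passes and example.ay does not, which is the difference from the [a-z]{2,4} shortcut.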
You could try this one. I haven't tried it myself but it's surely the biggest regexp I've ever seen, haha.
^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$
!(https?://)?([-_a-z0-9]+\.)*([-_a-z0-9]+)\.([a-z]{2,4})(/?)(.*)!i
I use this regular expression for validating URLs. So far it didn't fail me a single time :)
I've seen a regex that could actually validate any kind of valid URL but it was two pages long...
You're probably better off parsing the url with parse_url and then checking if all of your required bits are in order.
Addition:
This is a snip of my URL class:
public static function IsUrl($test)
{
    // A URL must not contain spaces.
    if (strpos($test, ' ') !== false)
    {
        return false;
    }
    if (strpos($test, '.') > 1)
    {
        $check = @parse_url($test);
        return is_array($check)
            && isset($check['scheme'])
            && isset($check['host']) && count(explode('.', $check['host'])) > 1;
    }
    return false;
}
It tests the given string and requires some basics in the url, namely that the scheme is set and the hostname has a dot in it.
