PHP url validation false positives - php

For some odd reason, my if statement that checks URLs using FILTER_VALIDATE_URL is returning unexpected results.
Simple stuff like https://www.google.nl/ is being blocked, but www.google.nl/ isn't? It's not like it blocks every single URL with http or https in front of it either. Some are allowed and others are not. I know there are a bunch of topics on this, but most of them use regex to filter URLs. Is this better than using FILTER_VALIDATE_URL? Or am I doing something wrong?
The code I use to check the URLs is this:
if (!filter_var($linkinput, FILTER_VALIDATE_URL) === FALSE) {
//error code
}
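One possible explanation, offered as a reading of the snippet above rather than a confirmed diagnosis: in PHP, `!` binds more tightly than `===`, so the condition is evaluated as `(!filter_var(...)) === FALSE`, which is true exactly when the URL is valid. That would match the symptom of https://www.google.nl/ landing in the error branch while www.google.nl/ does not. The helper name `is_valid_url` is made up for this sketch.

```php
<?php
// Hypothetical helper (not from the original post):
// true when $url passes FILTER_VALIDATE_URL.
function is_valid_url(string $url): bool
{
    // filter_var() returns the URL string on success and false on failure.
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

$linkinput = 'https://www.google.nl/';

// Original condition: "!" is applied first, so this is true for VALID input.
$original = (!filter_var($linkinput, FILTER_VALIDATE_URL) === false);

// Intended condition: enter the error branch only when validation fails.
$intended = !is_valid_url($linkinput);

var_dump($original); // bool(true)
var_dump($intended); // bool(false)
```

Comparing the filter result directly with `=== false` avoids the precedence trap.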

You should sanitize it like this first (just for good measure).
$url = filter_var($url, FILTER_SANITIZE_URL);
FILTER_VALIDATE_URL only accepts ASCII URLs (i.e., other characters need to be encoded). If the above function does not work, see PHP urlencode() to encode the URL.
If that doesn't work either, you could manually strip the http:// from the beginning like this, but note that the stripped string no longer has a scheme, so it will not pass FILTER_VALIDATE_URL itself:
$url = strpos($url, 'http://') === 0 ? substr($url, 7) : $url;
Here are some flags that might help. If all of your URLs will have http://, you can use FILTER_FLAG_SCHEME_REQUIRED.
The FILTER_VALIDATE_URL filter validates a URL.
Possible flags:
FILTER_FLAG_SCHEME_REQUIRED - URL must be RFC compliant (like http://example)
FILTER_FLAG_HOST_REQUIRED - URL must include host name (like http://www.example.com)
FILTER_FLAG_PATH_REQUIRED - URL must have a path after the domain name (like http://www.example.com/example1/)
FILTER_FLAG_QUERY_REQUIRED - URL must have a query string (like "http://www.example.com/example.php?name=Peter&age=37")
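As a rough illustration of the path and query flags listed above, the sketch below runs three made-up example URLs through both. It deliberately leaves out FILTER_FLAG_SCHEME_REQUIRED and FILTER_FLAG_HOST_REQUIRED, since those two constants were deprecated in PHP 7.3 and removed in PHP 8.0.

```php
<?php
// Example URLs chosen for illustration; none come from the original question.
$urls = [
    'http://www.example.com',
    'http://www.example.com/example1/',
    'http://www.example.com/example.php?name=Peter&age=37',
];

$results = [];
foreach ($urls as $url) {
    $results[$url] = [
        // filter_var() returns the URL on success and false on failure.
        'path'  => filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED) !== false,
        'query' => filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_QUERY_REQUIRED) !== false,
    ];
}

// Show which URL satisfies which requirement.
var_export($results);
```

Only the last URL satisfies both requirements; the first one has neither a path nor a query string.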
The default behavior of FILTER_VALIDATE_URL:
Validates value as URL (according to http://www.faqs.org/rfcs/rfc2396), optionally with required components. Beware a valid URL may not specify the HTTP protocol http:// so further validation may be required to determine the URL uses an expected protocol, e.g. ssh:// or mailto:. Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail.
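The manual's warning about unexpected protocols can be addressed with a scheme allow-list on top of the filter. The following is a minimal sketch under that assumption; the function name `is_allowed_url` and the default list of schemes are invented for the example.

```php
<?php
// Hypothetical helper: accept only syntactically valid URLs
// whose scheme appears on an allow-list.
function is_allowed_url(string $url, array $schemes = ['http', 'https']): bool
{
    // Step 1: syntactic check by the built-in filter.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // Step 2: extract the scheme and compare it with the allow-list.
    $scheme = parse_url($url, PHP_URL_SCHEME);

    return is_string($scheme) && in_array(strtolower($scheme), $schemes, true);
}

var_dump(is_allowed_url('https://www.google.nl/')); // bool(true)
var_dump(is_allowed_url('ssh://example.com'));      // bool(false)
```

Keeping the allow-list as a parameter lets a caller permit, say, ftp for one form field without changing the helper.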

Related

PHP filter_var URL

To validate a URL path from user input, I'm using the PHP filter_var function.
The input only contains the path (/path/path/script.php).
When validating the path, I add the host. I was playing around a bit, testing the input validation, and noticed a strange behavior of the URL filter.
Code:
$url = "http://www.domain.nl/http://www.google.nl/modules/authorize/test/normal.php";
var_dump(filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED)); //valid
Can someone explain why this is a valid URL? Thanks!
The short answer is, PHP FILTER_VALIDATE_URL checks the URL only against RFC 2396 and your URL, although weird, is valid according to said standard.
Long answer:
The filter you are using is declared to be compliant with RFC, so let's check that standard (RFC 2396).
The regular expression used for parsing a URL and listed there is:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
Where:
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
As we can see, the ":" character is reserved only in the context of scheme and from that point onwards ":" is fair game (this is supported by the text of the standard). For example, it is used freely in the http: scheme to denote a port. A slash can also appear in any place and nothing prohibits the URL from having a "//" somewhere in the middle. So "http://" in the middle should be valid.
Let's look at your URL and try to match it to this regexp:
$url = "http://www.domain.nl/http://www.google.nl/modules/authorize/test/normal.php";
//Escaped a couple slashes to make things work, still the same regexp
$result_rfc = preg_match('/^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/',$url);
echo '<p>'.$result_rfc.'</p>';
The test returns '1', so this URL is valid. This is to be expected: as we have seen, the rules don't declare URLs that have something like 'http://' in the middle to be invalid. PHP simply mirrors this behaviour with FILTER_VALIDATE_URL.
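Because every group in that expression is optional and there is no end anchor, preg_match returns 1 for practically any string, so the component split is more telling than the return value. The sketch below captures the groups for the URL in question; the `~` delimiter and the `$parts` array are choices made for this example only.

```php
<?php
$url = 'http://www.domain.nl/http://www.google.nl/modules/authorize/test/normal.php';

// RFC 2396 Appendix B expression; "~" is used as the delimiter so that
// neither the slashes nor the "#" characters need escaping.
$rfc = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';

preg_match($rfc, $url, $m);

// Trailing groups that did not take part in the match are missing from $m,
// hence the "??" fallbacks.
$parts = [
    'scheme'    => $m[2] ?? '',
    'authority' => $m[4] ?? '',
    'path'      => $m[5] ?? '',
    'query'     => $m[7] ?? '',
    'fragment'  => $m[9] ?? '',
];

print_r($parts);
```

The second "http://" ends up inside the path component, which is why the URL as a whole is still well formed.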
If you want a more rigorous test, you will need to write the required code yourself. For example, you can prevent "://" from appearing more than once:
$url = "http://www.domain.nl/http://www.google.nl/modules/authorize/test/normal.php";
$result_rfc = preg_match('/^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/',$url);
if (substr_count($url, '://') != 1) {
    $result_non_rfc = false;
} else {
    $result_non_rfc = $result_rfc;
}
You can also try and adjust the regular expression itself.

php FILTER_VALIDATE_URL giving unexpected result

My Worketc account URL is akhilesh.worketc.com.
But PHP's FILTER_VALIDATE_URL reports this URL as invalid.
Is there an alternative way to solve this problem?
Add a protocol to it, i.e. prepend http://, https://, ftp://, etc. to your URL before testing.
var_dump(filter_var('http://akhilesh.worketc.com', FILTER_VALIDATE_URL));
FILTER_VALIDATE_URL
Validates value as URL (according to http://www.faqs.org/rfcs/rfc2396), optionally with required components. Beware a valid URL may not specify the HTTP protocol http:// so further validation may be required to determine the URL uses an expected protocol, e.g. ssh:// or mailto:. Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail.
Source
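The "add a protocol before testing" advice could be wrapped in a small helper along these lines. This is only a sketch: the name `normalize_url`, the default of http, and the scheme-detection pattern are all assumptions made for the example.

```php
<?php
// Hypothetical helper: prepend a default scheme when the input has none,
// then validate. Returns the (possibly prefixed) URL, or false if invalid.
function normalize_url(string $input, string $defaultScheme = 'http')
{
    $input = trim($input);

    // A scheme is a letter followed by letters, digits, "+", "." or "-",
    // and here it must be followed by "://".
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $input)) {
        $input = $defaultScheme . '://' . $input;
    }

    return filter_var($input, FILTER_VALIDATE_URL);
}

var_dump(normalize_url('akhilesh.worketc.com'));         // string "http://akhilesh.worketc.com"
var_dump(normalize_url('https://akhilesh.worketc.com')); // returned unchanged
```

Returning the normalized string rather than a boolean means the caller can store the value that was actually validated.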

Validate URL with or without protocol

Hi, I would like to validate the following URLs so that they all pass with or without the http/www part, as long as a TLD such as .com, .net, or .org is present.
Valid URLs Should Be:
http://www.domain.com
http://domain.com
https://www.domain.com
https://domain.com
www.domain.com
domain.com
To support long TLDs:
http://www.domain.com.uk
http://domain.com.uk
https://www.domain.com.uk
https://domain.com.uk
www.domain.com.uk
domain.com.uk
To support dashes (-):
http://www.domain-here.com
http://domain-here.com
https://www.domain-here.com
https://domain-here.com
www.domain-here.com
domain-here.com
Also to support numbers in domains:
http://www.domain1-test-here.com
http://domain1-test-here.com
https://www.domain1-test-here.com
https://domain1-test-here.com
www.domain1-test-here.com
domain-here.com
Also maybe allow even IPs:
127.127.127.127
(but this is extra!)
Also allow dashes (-), forgot to mention that =)
I've found many functions that validate one or the other, but not both at the same time.
If anyone knows a good regex for it, please share. Thank you for your help.
A solution for URL validation.
The above answer is right, but it does not work for all domains, such as .me, .it, or .in,
so please use the pattern below for URL matching:
$pattern = '/(?:https?:\/\/)?(?:[a-zA-Z0-9.-]+?\.(?:[a-zA-Z])|\d+\.\d+\.\d+\.\d+)/';
if (preg_match($pattern, "http://website.in")) {
    echo "valid";
} else {
    echo "invalid";
}
When you ignore the path part and look for the domain part only, a simple rule would be
(?:https?://)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+)
If you want to support country TLDs as well you must either supply a complete (current) list or append |.. to the TLD part.
With preg_match you must wrap it between some delimiters
$pattern = ';(?:https?://)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+);';
$index = preg_match($pattern, $url);
Usually, you use /. But in this case, slashes are part of the pattern, so I have chosen some other delimiter. Otherwise I must escape the slashes with \
$pattern = '/(?:https?:\/\/)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+)/';
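One caveat worth illustrating: the patterns above are not anchored, so preg_match succeeds whenever a URL-like substring appears anywhere in the input. The sketch below compares the pattern as given with an anchored variant; the anchors (`^` and `\z`) and the sample inputs are additions for this example, not part of the original answer.

```php
<?php
// Pattern from the answer above, using ";" as the delimiter.
$loose = ';(?:https?://)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+);';

// Same pattern, anchored at both ends (an assumption about what is wanted).
$anchored = ';^(?:https?://)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+)\z;';

$inputs = [
    'http://www.domain-here.com',
    '127.127.127.127',
    'see domain.com now', // not a URL, but contains one
    'domain.com.uk',      // TLD not in the list
];

$report = [];
foreach ($inputs as $input) {
    $report[$input] = [
        'loose'    => preg_match($loose, $input),
        'anchored' => preg_match($anchored, $input),
    ];
}

print_r($report);
```

The last two inputs pass the loose pattern only because "domain.com" occurs inside them. With anchors, domain.com.uk is rejected until uk is added to the TLD list, as the answer notes for country TLDs.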
Don't use a regular expression. Not every problem that involves strings needs to use regexes.
Don't write your own URL validator. URL validation is a solved problem, and there is existing code that has already been written, debugged and tested. In fact, it comes standard with PHP.
Look at PHP's built-in filtering functionality: http://us2.php.net/manual/en/book.filter.php
I think you can use flags with filter_var.
For FILTER_VALIDATE_URL there are several flags available:
FILTER_FLAG_SCHEME_REQUIRED Requires the URL to contain a scheme part.
FILTER_FLAG_HOST_REQUIRED Requires the URL to contain a host part.
FILTER_FLAG_PATH_REQUIRED Requires the URL to contain a path part.
FILTER_FLAG_QUERY_REQUIRED Requires the URL to contain a query string.
FILTER_FLAG_SCHEME_REQUIRED and FILTER_FLAG_HOST_REQUIRED are used by default.
Let's say you also want to require the path part. The flags argument is a bitmask, so several flags can be combined with |. Note that the defaults cannot be switched off with ~FILTER_FLAG_SCHEME_REQUIRED: the scheme is always required, and ~ would set every other flag bit as well, including FILTER_FLAG_QUERY_REQUIRED.
filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED)

Is anything that starts with http:// validated by FILTER_VALIDATE_URL?

I tested with every string and int I could imagine: as long as it starts with http://, it is a valid URL according to FILTER_VALIDATE_URL. So why do we need FILTER_VALIDATE_URL? Why not just add http:// to an input whenever we want to make it valid?
var_dump(filter_var('http://example', FILTER_VALIDATE_URL));
Well technically, any URI that starts with a scheme (like http://) and contains valid URI characters after that is valid as per the official URI specification in RFC 3986:
Each URI begins with a scheme name, as defined in Section 3.1, that refers to a specification for assigning identifiers within that scheme. As such, the URI syntax is a federated and extensible naming system wherein each scheme's specification may further restrict the syntax and semantics of identifiers using that scheme.
So there's nothing strange about the return you're getting -- that's what's supposed to happen. As to why you should use filter_var with the FILTER_VALIDATE_URL filter ... it's way more semantically appropriate than doing something like the following for every possible URL scheme, wouldn't you agree?
if (strpos($url, 'http://') === 0
|| strpos($url, 'ftp://') === 0
|| strpos($url, 'telnet://') === 0
) {
// it's a valid URL!
}
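Beyond readability, the two approaches also differ in what they accept: the prefix check above passes anything that follows the scheme, whereas the filter still rejects some inputs that start with http://. A small sketch with made-up sample inputs:

```php
<?php
// Sample inputs invented for this illustration.
$candidates = [
    'http://example',            // scheme plus a plain host name
    'http://',                   // scheme only, no host
    'http://exa mple.com',       // contains a space
    "http://ex\xC3\xA4mple.com", // non-ASCII host (a-umlaut as UTF-8 bytes)
];

$valid = [];
foreach ($candidates as $candidate) {
    // filter_var() returns the URL on success and false on failure.
    $valid[$candidate] = filter_var($candidate, FILTER_VALIDATE_URL) !== false;
}

var_export($valid);
```

Only the first candidate passes. The others all begin with http://, so a bare strpos check would have let them through.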

Is FILTER_VALIDATE_URL being too strict?

In PHP, filter_var('www.example.com', FILTER_VALIDATE_URL) returns false. Is this correct? Isn't www.example.com a valid URL, or do protocols (http://, ftp://, etc.) need to be explicitly stated in the URL for it to be formally correct?
It's not a valid URL. Prefixing things with http:// was never a very user-friendly thing, so modern browsers assume you mean http if you just enter a domain name. Software libraries are, rightly, a little bit more picky!
One approach you could take is passing the string through parse_url, and then adding any elements which are missing, e.g.
if ($parts = parse_url($url)) {
    if (!isset($parts["scheme"])) {
        $url = "http://$url";
    }
}
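To see why the isset($parts["scheme"]) test works, it helps to look at what parse_url returns with and without a scheme. The variable names below are chosen for this illustration only.

```php
<?php
// Without a scheme, the whole string is treated as a path.
$bare = parse_url('www.example.com');

// With a scheme, the scheme and host are separated out.
$full = parse_url('http://www.example.com');

var_dump($bare); // array with a single "path" key
var_dump($full); // array with "scheme" and "host" keys
```

So a missing "scheme" key is a reasonable signal that the user typed a bare host name.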
Interestingly, when you use FILTER_VALIDATE_URL, it actually uses parse_url internally to figure out what the scheme is. Thanks to salathe for spotting this in the comments below.
The URL has to conform to the rules set out in RFC 2396, and according to that spec the protocol is necessary.
In addition to Paul Dixon's answer, I want to say that you can use flags for FILTER_VALIDATE_URL to specify which parts of the URL must be present.
FILTER_FLAG_SCHEME_REQUIRED
FILTER_FLAG_HOST_REQUIRED
FILTER_FLAG_PATH_REQUIRED
FILTER_FLAG_QUERY_REQUIRED
Since PHP 5.2.1, the FILTER_FLAG_SCHEME_REQUIRED and FILTER_FLAG_HOST_REQUIRED flags are used by default and, unfortunately, there is no way to disable them (you can't do something like filter_var($url, FILTER_VALIDATE_URL, ~FILTER_FLAG_SCHEME_REQUIRED); when the scheme part is optional for you). It seems like a bug to me. There is a related bug report.
The scheme ("protocol") part is required for FILTER_VALIDATE_URL.
