Weird behavior with parse_url and different protocols - php

I am trying to use parse_url to decode a DSN and found a weird behavior.
Here are the sample DSNs:
parse_url('redis://localhost'); // parses correctly
parse_url('file:///var/sessions'); // parses correctly
parse_url('redis:///var/run/redis.sock'); // fails: returns false
parse_url('file:///var/run/redis.sock'); // parses correctly
It seems that it fails to parse URLs without a host, but makes an exception for the file scheme.
Am I missing something?
Is there a way to disable this behavior?

The manual for parse_url() mentions that you cannot use this function for URIs. Specifically, when you have triple slashes after the scheme, it is defined as "invalid" and this function returns false:
Note:
This function is intended specifically for the purpose of parsing URLs and not URIs. However, to comply with PHP's backwards compatibility requirements it makes an exception for the file:// scheme where triple slashes (file:///...) are allowed. For any other scheme this is invalid.
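
If you need to parse such DSNs anyway, one workaround is to relabel the scheme as file:// (which parse_url() does accept with triple slashes), parse, and then restore the original scheme. A minimal sketch; parseDsn() is a made-up helper name:

function parseDsn($dsn)
{
    // Extract the scheme manually, since parse_url() rejects the whole DSN.
    if (!preg_match('~^([a-z][a-z0-9+.-]*)://~i', $dsn, $m)) {
        return false;
    }
    $scheme = strtolower($m[1]);

    // Relabel as file:// so the triple-slash form is accepted, then parse.
    $parts = parse_url('file://' . substr($dsn, strlen($m[0])));
    if ($parts === false) {
        return false;
    }

    $parts['scheme'] = $scheme; // restore the original scheme
    return $parts;
}

var_dump(parseDsn('redis:///var/run/redis.sock'));
// ['scheme' => 'redis', 'path' => '/var/run/redis.sock']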

Related

How to use PHP to recognise (and not validate) a URL

I'm aware of filter_var() and its FILTER_VALIDATE_URL filter. The point is that there are some URLs which exist but don't count as valid, and I need to verify them. For example, URLs with spaces.
At the moment I am checking only those protocols the application is interested in (http, https and ftp) using strpos().
But I was wondering if there is a more generic method in PHP that I could employ?
It might help if I explain that I need to differentiate whether the target source is a URL or a local path.
Use the parse_url() function to split the URL into components, then do some basic analysis on the pieces it returns (or just check whether the returned value is an array or FALSE).
As the documentation says:
This function is not meant to validate the given URL, it only breaks it up into the above listed parts. Partial URLs are also accepted, parse_url() tries its best to parse them correctly.
And also:
On seriously malformed URLs, parse_url() may return FALSE.
It looks like it matches your request pretty well.
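
For the URL-versus-local-path case, a minimal sketch under that approach (isRemoteUrl() is an invented name; note that a Windows path like c:/tmp can be mistaken for a scheme):

function isRemoteUrl($target)
{
    // parse_url() returns FALSE on seriously malformed input;
    // a 'scheme' component suggests a URL rather than a local path.
    $parts = parse_url($target);
    return $parts !== false && isset($parts['scheme']);
}

var_dump(isRemoteUrl('http://example.com/page')); // true
var_dump(isRemoteUrl('/var/www/html/index.php')); // false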

PHP url validation false positives

For some odd reason my if statement to check the URLs using FILTER_VALIDATE_URL is returning unexpected results.
Simple stuff like https://www.google.nl/ is being blocked but www.google.nl/ isn't? It's not like it blocks every single URL with http or https in front of it either; some are allowed and others are not. I know there are a bunch of topics on this, but most of them use regex to filter URLs. Is that better than using FILTER_VALIDATE_URL? Or am I doing something wrong?
The code I use to check the URLs is this:
if (!filter_var($linkinput, FILTER_VALIDATE_URL) === FALSE) {
    // error code
}
You should filter it like this first. (Just for good measure).
$url = filter_var($url, FILTER_SANITIZE_URL);
FILTER_VALIDATE_URL only accepts ASCII URLs (i.e., they need to be encoded). If the above function does not work, see PHP's urlencode() to encode the URL.
If THAT doesn't work, then you should manually strip the http:// from the beginning, like this ...
$url = strpos($url, 'http://') === 0 ? substr($url, 7) : $url;
Here are some flags that might help. If all of your URLs will have http://, you can use FILTER_FLAG_SCHEME_REQUIRED.
The FILTER_VALIDATE_URL filter validates a URL.
Possible flags:
FILTER_FLAG_SCHEME_REQUIRED - URL must be RFC compliant (like http://example)
FILTER_FLAG_HOST_REQUIRED - URL must include host name (like http://www.example.com)
FILTER_FLAG_PATH_REQUIRED - URL must have a path after the domain name (like www.example.com/example1/)
FILTER_FLAG_QUERY_REQUIRED - URL must have a query string (like "example.php?name=Peter&age=37")
The default behavior of FILTER_VALIDATE_URL:
Validates value as URL (according to » http://www.faqs.org/rfcs/rfc2396), optionally with required components. Beware a valid URL may not specify the HTTP protocol http:// so further validation may be required to determine the URL uses an expected protocol, e.g. ssh:// or mailto:. Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail.
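
For illustration, a few calls showing the default behavior and the flags (a sketch; exact results can vary slightly between PHP versions):

var_dump(filter_var('https://www.google.nl/', FILTER_VALIDATE_URL));
// string: scheme and host are present, so it validates
var_dump(filter_var('www.google.nl/', FILTER_VALIDATE_URL));
// FALSE: no scheme
var_dump(filter_var('https://example.com', FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED));
// FALSE: no path after the host
var_dump(filter_var('https://example.com/a?b=1', FILTER_VALIDATE_URL, FILTER_FLAG_QUERY_REQUIRED));
// string: query string present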

php FILTER_VALIDATE_URL giving unexpected result

My Worketc account URL is akhilesh.worketc.com.
But the PHP function FILTER_VALIDATE_URL reports this URL as invalid.
So is there any alternate way to solve this problem?
Add a protocol to it, i.e. prepend http://, https://, ftp://, etc. to your URL before testing:
var_dump(filter_var('http://akhilesh.worketc.com', FILTER_VALIDATE_URL));
FILTER_VALIDATE_URL
Validates value as URL (according to » http://www.faqs.org/rfcs/rfc2396), optionally with required components. Beware a valid URL may not specify the HTTP protocol http:// so further validation may be required to determine the URL uses an expected protocol, e.g. ssh:// or mailto:. Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail.
Source
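
That advice as a small helper, for illustration (normalizeUrl() is an invented name): prepend http:// only when no scheme is present yet, then validate:

function normalizeUrl($url)
{
    // Leave URLs that already carry a scheme untouched.
    return preg_match('~^[a-z][a-z0-9+.-]*://~i', $url) ? $url : 'http://' . $url;
}

var_dump(filter_var(normalizeUrl('akhilesh.worketc.com'), FILTER_VALIDATE_URL));
// "http://akhilesh.worketc.com"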

Is FILTER_VALIDATE_URL being too strict?

In PHP, filter_var('www.example.com', FILTER_VALIDATE_URL) returns false. Is this correct? Isn't www.example.com a valid URL, or do protocols (http://, ftp://, etc.) need to be explicitly stated for a URL to be formally correct?
It's not a valid URL. Prefixing things with http:// was never a very user-friendly thing, so modern browsers assume you mean http if you just enter a domain name. Software libraries are, rightly, a little bit more picky!
One approach you could take is passing the string through parse_url, and then adding any elements which are missing, e.g.
if ($parts = parse_url($url)) {
    if (!isset($parts["scheme"])) {
        $url = "http://$url";
    }
}
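
With the missing scheme filled in this way, filter_var() then accepts the URL:

// assuming $url was 'www.example.com' before the fix above
var_dump(filter_var($url, FILTER_VALIDATE_URL));
// "http://www.example.com"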
Interestingly, when you use FILTER_VALIDATE_URL, it actually uses parse_url internally to figure out what the scheme is (view source). Thanks to salathe for spotting this in the comments below.
The URL has to conform to the rules set forward in RFC 2396, and according to that spec the protocol is required.
In addition to Paul Dixon's answer, I want to point out that you can use flags with FILTER_VALIDATE_URL to specify which parts of the URL must be present.
FILTER_FLAG_SCHEME_REQUIRED
FILTER_FLAG_HOST_REQUIRED
FILTER_FLAG_PATH_REQUIRED
FILTER_FLAG_QUERY_REQUIRED
Since PHP 5.2.1, the FILTER_FLAG_SCHEME_REQUIRED and FILTER_FLAG_HOST_REQUIRED flags are applied by default and, unfortunately, there is no way to disable them (you can't do something like filter_var($url, FILTER_VALIDATE_URL, ~FILTER_FLAG_SCHEME_REQUIRED) when the scheme part is optional for you). This seems like a bug to me. There is a related bug report.
The scheme ("protocol") part is required for FILTER_VALIDATE_URL.

PHP Url Validation Error: http://https://example.com (aka https://https://example.com)

I had this URL regex pattern in place:
$pattern = "#\b(https?://[^\s()<>\[\]\{\}]{1,".$max_length_allowed_for_each_url."}(?:\([\w\d]+\)|([^[:punct:]\s]|/)))#";
It seemed to work pretty well at validating any URL I threw at it, until I realized that it accepted https://http://google.com (apparently even Stack Overflow considers that a valid URL; it made the URL clickable, not me, although it did remove one of the colons, so perhaps I am out of luck?), when it certainly is not one.
I did a little research... and learnt that I should be using filter_var instead of a regex for PHP URL validation anyway... and was disappointed to realize that it too is susceptible to this very same validation problem.
I could easily conquer it with:
str_replace(array("https://http://","http://https://"), array("http://","https://"), $url);
But... that just seems so wrong.
Well, it is a valid URI. Technically. Look at the RFC for URIs if you don't believe me.
The path component of a URI can contain //.
http is a valid host name.
The port is allowed to be missing even if the : is present (it's specified as *digit, not 1*digit). (This is why Stack Overflow removed the colon -- it thought you were using the default port, so it removed it from the URI.)
I suggest writing a special case for this. In a separate step, check to see if the URI starts with https?://https?://, and fix it.
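
A minimal sketch of that special case (assuming you only care about http and https, as in the pattern above):

// Keep only the inner scheme when the URL starts with a doubled one.
if (preg_match('~^https?://(https?://.*)$~i', $url, $m)) {
    $url = $m[1];
}
// 'https://http://google.com' becomes 'http://google.com'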
