I'm aware of filter_var() and its FILTER_VALIDATE_URL filter. The point is that some URLs exist but don't count as valid, and I still need to verify them. For example, URLs with spaces.
At the moment I am checking only those protocols that the application is interested in (http, https and ftp) using strpos().
But I was wondering if there is a more generic method in PHP that I could employ?
It might help if I explain that I need to differentiate whether the target source is a URL or a local path.
Use the parse_url() function to split the URL into components, then do some basic analysis on the pieces it returns (or just check whether the returned value is an array or FALSE).
As the documentation says:
This function is not meant to validate the given URL, it only breaks it up into the above listed parts. Partial URLs are also accepted, parse_url() tries its best to parse them correctly.
And also:
On seriously malformed URLs, parse_url() may return FALSE.
It looks like it matches your request pretty well.
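If, as in the question, the real goal is to tell a URL apart from a local path, a minimal sketch along those lines might look like this (the helper name and the scheme whitelist are my own assumptions, not anything parse_url() mandates):
// Minimal sketch: treat the string as a URL only if parse_url()
// yields both a scheme we care about and a host. A plain local path
// produces neither. The scheme whitelist below is an assumption.
function is_remote_url($target)
{
    $parts = parse_url($target);
    if ($parts === false) {
        return false; // seriously malformed
    }
    return isset($parts['scheme'], $parts['host'])
        && in_array($parts['scheme'], array('http', 'https', 'ftp'), true);
}

var_dump(is_remote_url('http://example.com/a b/c')); // true -- spaces survive parse_url()
var_dump(is_remote_url('/usr/local/etc/config'));    // false -- no scheme or host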
Related
I am trying to use parse_url to decode a DSN and found a weird behavior.
Here are the sample DSNs:
parse_url('redis://localhost'); //Correctly parses
parse_url('file:///var/sessions'); //Correctly parses
parse_url('redis:///var/run/redis.sock'); //Parse error
parse_url('file:///var/run/redis.sock'); //Correctly parses
It seems that it fails to parse URLs without a host, but makes an exception for the file scheme.
Am I missing something?
Is there a way to disable this behavior?
The manual of parse_url() mentions that you cannot use this function for URIs. Specifically, when you have triple slashes after the scheme, the URL is defined as "invalid" and this function returns FALSE:
Note:
This function is intended specifically for the purpose of parsing URLs and not URIs. However, to comply with PHP's backwards compatibility requirements it makes an exception for the file:// scheme where triple slashes (file:///...) are allowed. For any other scheme this is invalid.
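One possible workaround (a sketch of my own, not an official API) is to lend the host-less DSN a throw-away host so the triple-slash form parses, then strip it back out:
// Inject a placeholder host into "scheme:///path" so parse_url()
// accepts it, then remove the placeholder again. Sketch only.
function parse_dsn($dsn)
{
    $patched = preg_replace('#^(\w+):///#', '$1://null/', $dsn, 1, $count);
    $parts = parse_url($patched);
    if ($parts !== false && $count === 1) {
        unset($parts['host']); // drop the placeholder we injected
    }
    return $parts;
}

print_r(parse_dsn('redis:///var/run/redis.sock'));
// Array ( [scheme] => redis [path] => /var/run/redis.sock )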
In PHP, filter_var('www.example.com', FILTER_VALIDATE_URL) returns false. Is this correct? Isn't www.example.com a valid URL, or do protocols (http://, ftp://, etc.) need to be explicitly stated in the URL for it to be formally correct?
It's not a valid URL. Prefixing things with http:// was never a very user-friendly thing, so modern browsers assume you mean http if you just enter a domain name. Software libraries are, rightly, a little bit more picky!
One approach you could take is passing the string through parse_url, and then adding any elements which are missing, e.g.
if ( $parts = parse_url($url) ) {
    if ( !isset($parts["scheme"]) )
    {
        $url = "http://$url";
    }
}
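Putting the two steps together gives a normalize-then-validate helper, roughly (the function name is mine):
// Sketch: prepend a default scheme if none is present, then validate.
function normalize_and_validate($url)
{
    if ($parts = parse_url($url)) {
        if (!isset($parts['scheme'])) {
            $url = "http://$url";
        }
    }
    return filter_var($url, FILTER_VALIDATE_URL);
}

var_dump(normalize_and_validate('www.example.com')); // "http://www.example.com"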
Interestingly, when you use FILTER_VALIDATE_URL, it actually uses parse_url internally to figure out what the scheme is (view source). Thanks to salathe for spotting this in the comments below.
The URL has to correspond with the rules set forth in RFC 2396, and according to that spec the protocol is required.
In addition to Paul Dixon's answer, I want to point out that you can use flags with FILTER_VALIDATE_URL to specify which parts of the URL must be present:
FILTER_FLAG_SCHEME_REQUIRED
FILTER_FLAG_HOST_REQUIRED
FILTER_FLAG_PATH_REQUIRED
FILTER_FLAG_QUERY_REQUIRED
Since PHP 5.2.1 the FILTER_FLAG_SCHEME_REQUIRED and FILTER_FLAG_HOST_REQUIRED flags are applied by default and, unfortunately, there is no way to disable them (you can't do something like filter_var($url, FILTER_VALIDATE_URL, ~FILTER_FLAG_SCHEME_REQUIRED) if the scheme part is optional in your case). This seems like a bug to me. There is a related bug report.
The scheme ("protocol") part is required for FILTER_VALIDATE_URL.
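A quick illustration (on success filter_var() returns the URL itself, on failure FALSE):
var_dump(filter_var('http://example.com', FILTER_VALIDATE_URL));
// string(18) "http://example.com"

var_dump(filter_var('example.com', FILTER_VALIDATE_URL));
// bool(false) -- the scheme is required and cannot be made optional

var_dump(filter_var('http://example.com', FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED));
// bool(false) -- no path component present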
I had this URL regex pattern in place:
$pattern = "#\b(https?://[^\s()<>\[\]\{\}]{1,".$max_length_allowed_for_each_url."}(?:\([\w\d]+\)|([^[:punct:]\s]|/)))#";
It seemed to work pretty well at validating any URL I threw at it, until I realized that https://http://google.com passed, when it certainly is not a valid URL. (Apparently even Stack Overflow considers it valid: it made that URL clickable, not me, although it did remove one of the colons. So perhaps I am out of luck?)
I did a little research... and learnt that I should be using filter_var instead of a regex for PHP URL validation anyway... and was disappointed to realize that it too is susceptible to this very same validation problem.
I could easily conquer it with:
str_replace(array("https://http://","http://https://"), array("http://","https://"), $url);
But... that just seems so wrong.
Well, it is a valid URI. Technically. Look at the RFC for URIs if you don't believe me.
The path component of a URI can contain //.
http is a valid host name.
The port is allowed to be missing even if the : is present (it's specified as *digit, not 1*digit). (This is why Stack Overflow removed the colon -- it thought you were using the default port, so it removed it from the URI.)
I suggest writing a special case for this. In a separate step, check to see if the URI starts with https?://https?://, and fix it.
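Something along these lines should do (the pattern is my own sketch; extend the scheme list as needed):
// Collapse a doubled-up scheme such as "https://http://..." down to
// the inner one before validating.
$url = preg_replace('#^https?://(?=https?://)#i', '', $url);

if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
    // proceed with a sane, single-scheme URL
}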
I have this text input, and I need to check whether the string is a valid web address, like http://www.example.com. How can this be done with regular expressions in PHP?
Use the filter extension:
filter_var($url, FILTER_VALIDATE_URL);
This will be far more robust than any regex you can write.
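Note that on success it returns the filtered URL rather than TRUE, so compare against FALSE explicitly:
if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
    // looks like a valid URL
}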
Found this:
(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?
From Here:
A regex that validates a web address and matches an empty string?
You need to first understand a web address before you can begin to parse it effectively. Yes, http://www.example.com is a valid address. So is www.example.com. Or example.com. Or http://example.com. Or prefix.example.com.
Have a look at the specifications for a URI, especially the Syntax components.
I found the examples below at http://www.roscripts.com/PHP_regular_expressions_examples-136.html
//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into backreferenes 1 through 4
'\b((?#protocol)https?|ftp)://((?#domain)[-A-Z0-9.]+)((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?((?#parameters)\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'
//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into named capturing groups.
//Works as it is with .NET, and after conversion by RegexBuddy on the Use page with Python, PHP/preg and PCRE.
'\b(?<protocol>https?|ftp)://(?<domain>[-A-Z0-9.]+)(?<file>/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(?<parameters>\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'
//URL: Find in full text
//The final character class makes sure that if an URL is part of some text, punctuation such as a
//comma or full stop after the URL is not interpreted as part of the URL.
'\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]'
//URL: Replace URLs with HTML links
preg_replace('{\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]}i', '<a href="\0">\0</a>', $text);
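For instance, with my own sample input:
$text = 'Docs at http://www.example.com/manual now.';
echo preg_replace(
    '{\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]}i',
    '<a href="\0">\0</a>',
    $text
);
// Docs at <a href="http://www.example.com/manual">http://www.example.com/manual</a> now.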
In most cases you don't have to check whether a string is a valid address.
Either it is, and the web site will be reachable, or it isn't, and the user will simply go back.
You should only escape illegal characters to avoid XSS; if your user doesn't want to give a valid website, that's their problem.
(In most cases.)
PS: If you still want to check URLs, look at nikic's answer.
To match more protocols, you can do:
((https?|s?ftp|gopher|telnet|file|notes|ms-help)://)?[\w:##%/;$()~=\.&-]+
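To actually use it with preg_match() it still needs delimiters; for example (a sketch, braces chosen as delimiters because they don't appear in the pattern):
$pattern = '{((https?|s?ftp|gopher|telnet|file|notes|ms-help)://)?[\w:##%/;$()~=\.&-]+}i';
if (preg_match($pattern, $subject, $match)) {
    echo $match[0]; // the first URL-like token found
}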
I'm looking for a decent regex to match a URL (a full URL with scheme, domain, path etc.)
I would normally use filter_var but I can't in this case as I have to support PHP<5.2!
I've searched the web but can't find anything that I'm confident will be fool-proof, and all I can find on SO is people saying to use filter_var.
Does anybody have a regex that they use for this?
My code (just so you can see what I'm trying to achieve):
function validate_url($url){
    if (function_exists('filter_var')){
        return filter_var($url, FILTER_VALIDATE_URL);
    }
    return preg_match(REGEX_HERE, $url);
}
I have created a solution for validating the domain. While it does not specifically cover the entire URL, it is very detailed and specific. The question you need to ask yourself is, "Why am I validating a domain?" If it is to see if the domain actually could exist, then you need to confirm the domain (including valid TLDs). The problem is, too many developers take the shortcut of ([a-z]{2,4}) and call it good. If you think along these lines, then why call it URL validation? It's not. It's just passing the URL through a regex.
I have an open source class that will allow you to validate the domain not only using the single source for TLD management (iana.org), but it will also validate the domain via DNS records to make sure it actually exists. The DNS validating is optional, but the domain will be specifically valid based on TLD.
For example: example.ay is NOT a valid domain as the .ay TLD is invalid. But using the regex posted here ([a-z]{2,4}), it would pass. I have an affinity for quality. I try to express that in the code I write. Others may not really care. So if you want to simply "check" the URL, you can use the examples listed in these responses. If you actually want to validate the domain in the URL, you can have a look at the class I created to do just that. It can be downloaded at:
http://code.google.com/p/blogchuck/source/browse/trunk/domains.php
It validates based on the RFCs that "govern" (using the term loosely) what determines a valid domain. In a nutshell, here is what the domains class will do:
Basic rules of the domain validation
must be at least one character long
must start with a letter or number
contains letters, numbers, and hyphens
must end in a letter or number
may contain multiple nodes (i.e. node1.node2.node3)
each node can only be 63 characters long max
total domain name can only be 255 characters long max
must end in a valid TLD
can be an IP4 address
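A condensed sketch of those label-by-label checks (the real class also verifies the TLD against the iana.org list and can check DNS, both omitted here; the function name is mine):
function is_plausible_domain($domain)
{
    if (strlen($domain) > 255) {
        return false; // total length cap
    }
    if (filter_var($domain, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4)) {
        return true; // bare IPv4 addresses are acceptable
    }
    $nodes = explode('.', $domain);
    if (count($nodes) < 2) {
        return false; // needs at least a name plus a TLD
    }
    foreach ($nodes as $node) {
        // 1-63 chars, starts/ends alphanumeric, hyphens only inside
        if (!preg_match('/^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$/i', $node)) {
            return false;
        }
    }
    return true; // a real TLD-list check would go here
}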
It will also download a copy of the master TLD file from iana.org, but only after checking your local copy. If your local copy is more than 30 days old, a fresh one is downloaded. The TLDs in the file are used in the regex to validate the TLD of the domain you are checking. This prevents .ay (and other invalid TLDs) from passing validation.
This is a lengthy bit of code, but very compact considering what it does. And it is the most accurate. That's why I asked the question earlier. Do you want to do "validation" or simple "checking"?
You could try this one. I haven't tried it myself but it's surely the biggest regexp I've ever seen, haha.
^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+#)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$
!(https?://)?([-_a-z0-9]+\.)*([-_a-z0-9]+)\.([a-z]{2,4})(/?)(.*)!i
I use this regular expression for validating URLs. So far it hasn't failed me a single time :)
I've seen a regex that could actually validate any kind of valid URL but it was two pages long...
You're probably better off parsing the url with parse_url and then checking if all of your required bits are in order.
Addition:
This is a snip of my URL class:
public static function IsUrl($test)
{
    // Reject anything containing a space outright
    if (strpos($test, ' ') !== false)
    {
        return false;
    }
    if (strpos($test, '.') > 1)
    {
        $check = @parse_url($test); // suppress warnings on malformed input
        return is_array($check)
            && isset($check['scheme'])
            && isset($check['host']) && count(explode('.', $check['host'])) > 1;
    }
    return false;
}
It tests the given string and requires some basics in the url, namely that the scheme is set and the hostname has a dot in it.
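For example (assuming the method lives on a class called Url, which is my naming):
var_dump(Url::IsUrl('http://example.com')); // bool(true)
var_dump(Url::IsUrl('/var/www/index.php')); // bool(false) -- no scheme or dotted host
var_dump(Url::IsUrl('not a url'));          // bool(false) -- contains a space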