PHP filter_var URL

For validating a URL path from user input, I'm using PHP's filter_var function.
The input only contains the path (/path/path/script.php).
When validating the path, I prepend the host. I was playing around a bit, testing the input validation, and doing so I noticed some strange(?) behaviour of the URL filter.
Code:
$url = "http://www.domain.nl/http://www.google.nl/modules/authorize/test/normal.php";
var_dump(filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED)); //valid
Can someone explain why this is a valid URL? Thanks!

The short answer is: PHP's FILTER_VALIDATE_URL checks the URL only against RFC 2396, and your URL, although weird, is valid according to that standard.
Long answer:
The filter you are using is documented to be RFC-compliant, so let's check that standard (RFC 2396).
The regular expression used for parsing a URL and listed there is:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
(in the RFC, the digits 1-9 are printed aligned under the opening parentheses to number the capture groups)
Where:
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
As we can see, the ":" character is reserved only in the context of the scheme; from that point onwards ":" is fair game (this is supported by the text of the standard). For example, it is used freely in http URLs to separate the host from the port. A slash can also appear anywhere, and nothing prohibits the URL from having "//" somewhere in the middle. So "http://" in the middle should be valid.
Let's look at your URL and try to match it to this regexp:
$url = "http://www.domain.nl/http://www.google.nl/modules/authorize/test/normal.php";
//Escaped a couple slashes to make things work, still the same regexp
$result_rfc = preg_match('/^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/',$url);
echo '<p>'.$result_rfc.'</p>';
The test returns 1, so this URL is valid. This is to be expected: as we have seen, the rules don't declare URLs with something like "http://" in the middle to be invalid. PHP simply mirrors this behaviour with FILTER_VALIDATE_URL.
If you want a more rigorous test, you will need to write the required code yourself. For example, you can reject URLs in which "://" appears more than once:
$url = "http://www.domain.nl/http://www.google.nl/modules/authorize/test/normal.php";
$result_rfc = preg_match('/^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/',$url);
if (substr_count($url, '://') != 1) {
    $result_non_rfc = false;
} else {
    $result_non_rfc = $result_rfc;
}
You can also try to adjust the regular expression itself.
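One way to adjust it, as a sketch: anchor the expression and use a negative lookahead so that "://" may only appear once. This tightened pattern is my own illustration, not part of RFC 2396:

```php
<?php
// Anchored pattern: a scheme, then "://", then any run of characters
// in which "://" never appears again.
$pattern = '~^([^:/?#]+)://(?:(?!://).)*$~';

var_dump(preg_match($pattern, 'http://www.domain.nl/modules/test.php'));           // int(1)
var_dump(preg_match($pattern, 'http://www.domain.nl/http://www.google.nl/x.php')); // int(0)
```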

Related

PHP url validation false positives

For some odd reason, my if statement that checks URLs using FILTER_VALIDATE_URL is returning unexpected results.
Simple stuff like https://www.google.nl/ is being blocked, but www.google.nl/ isn't? It's not like it blocks every single URL with http or https in front of it either: some are allowed and others are not. I know there are a bunch of topics on this, but most of them use a regex to filter URLs. Is that better than using FILTER_VALIDATE_URL? Or am I doing something wrong?
The code I use to check the URLs is this:
if (!filter_var($linkinput, FILTER_VALIDATE_URL) === FALSE) {
    //error code
}
You should filter it like this first (just for good measure):
$url = filter_var($url, FILTER_SANITIZE_URL);
FILTER_VALIDATE_URL only accepts ASCII URLs (i.e. they need to be encoded). If the function above does not work, see PHP's urlencode() to encode the URL.
If THAT doesn't work, then you should manually strip the http:// from the beginning, like this:
$url = strpos($url, 'http://') === 0 ? substr($url, 7) : $url;
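Putting the sanitize and validate steps together, a minimal sketch (check_url is a made-up helper name, not part of the original answer):

```php
<?php
// Sanitize first, then validate; filter_var() returns the URL on
// success and false on failure, hence the !== false comparison.
function check_url(string $input): bool
{
    $url = filter_var($input, FILTER_SANITIZE_URL);
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

var_dump(check_url('https://www.google.nl/')); // bool(true)
var_dump(check_url('www.google.nl/'));         // bool(false): no scheme
```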
Here are some flags that might help. If all of your URLs will have http://, you can use FILTER_FLAG_SCHEME_REQUIRED.
The FILTER_VALIDATE_URL filter validates a URL.
Possible flags:
FILTER_FLAG_SCHEME_REQUIRED - URL must be RFC compliant (like http://example)
FILTER_FLAG_HOST_REQUIRED - URL must include host name (like http://www.example.com)
FILTER_FLAG_PATH_REQUIRED - URL must have a path after the domain name (like www.example.com/example1/)
FILTER_FLAG_QUERY_REQUIRED - URL must have a query string (like "example.php?name=Peter&age=37")
The default behavior of FILTER_VALIDATE_URL:
Validates value as URL (according to http://www.faqs.org/rfcs/rfc2396), optionally with required components.
Beware: a valid URL may not specify the HTTP protocol (http://), so further validation may be required to determine whether the URL uses an expected protocol, e.g. ssh:// or mailto:.
Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail.
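The ASCII-only behaviour quoted above is easy to confirm:

```php
<?php
// An ASCII URL validates; the same URL with a non-ASCII character
// in the host does not.
var_dump(filter_var('http://example.com/path', FILTER_VALIDATE_URL) !== false); // bool(true)
var_dump(filter_var('http://exämple.com/path', FILTER_VALIDATE_URL) !== false); // bool(false)
```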

URL array codeigniter

I'm using CodeIgniter (newbie at CodeIgniter).
I have a function getproducts($p1, $p2, $p3) in a controller.
When I call getproducts/0/0/ from my jQuery script (AJAX) it works, but when I want to call a URL like this:
getproducts/0/0/{"0":"13","1":"24"}
it doesn't work. (I end up in Google search results instead of staying on my local web server.)
I basically want to pass an array to a function in the URL somehow when using CodeIgniter. How should I solve that? Please help :-)
I think you should at least adjust CodeIgniter's config for allowed characters in the URL to include curly braces, comma and double quotes:
$config['permitted_uri_chars'] = ',{}"a-z 0-9~%.:_()#\-';
The reason why you end up on Google might however be something else (it does not seem to be CodeIgniter related).
Your browser doesn't think that it is a URL and navigates to Google (assuming you are searching for something), I think.
The main parts of URLs
A full BNF description of the URL syntax is given in Section 5.
In general, URLs are written as follows:
<scheme>:<scheme-specific-part>
A URL contains the name of the scheme being used (<scheme>) followed by a colon and then a string (the <scheme-specific-part>) whose interpretation depends on the scheme.
Scheme names consist of a sequence of characters. The lower case
letters "a"--"z", digits, and the characters plus ("+"), period
("."), and hyphen ("-") are allowed. For resiliency, programs
interpreting URLs should treat upper case letters as equivalent to
lower case in scheme names (e.g., allow "HTTP" as well as "http").
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
schemepart = *xchar | ip-schemepart
See http://www.faqs.org/rfcs/rfc1738.html please.
{"0":"13","1":"24"} should be URL-encoded.
http://php.net/manual/en/function.urlencode.php
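As a sketch, the JSON segment from the question could be built and encoded like this (rawurlencode is the path-segment variant of urlencode; JSON_FORCE_OBJECT keeps the {"0":...} object shape from the question):

```php
<?php
// Encode the array as a JSON object, then percent-encode it so the
// braces, quotes and colons survive as a URL path segment.
$params  = ['0' => '13', '1' => '24'];
$segment = rawurlencode(json_encode($params, JSON_FORCE_OBJECT));

echo 'getproducts/0/0/' . $segment;
// getproducts/0/0/%7B%220%22%3A%2213%22%2C%221%22%3A%2224%22%7D
```

The receiving controller would then urldecode() and json_decode() the segment.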
I think a better answer for this question would be to use the built-in URI-to-associative-array handler; see http://www.codeigniter.com/user_guide/libraries/uri.html?highlight=uri
This avoids all that nasty mucking about with the permitted_uri_chars config.
Your URI would be: getproducts/p1/0/p2/0/p3/0/p5/13/p6/1/p6/24
and the handler would be something like:
function get_product()
{
    $object = $this->uri->uri_to_assoc(4);
}
You need to use the URI class's
$this->uri->assoc_to_uri()
The manual says:
Takes an associative array as input and generates a URI string from it.
The array keys will be included in the string. Example:
$array = array('product' => 'shoes', 'size' => 'large', 'color' => 'red');
$str = $this->uri->assoc_to_uri($array);
// Produces: product/shoes/size/large/color/red

Using ampersand in pretty URL breaks URL

I have seen plenty of people having this problem, and it seems the only way to stop Apache treating the encoded ampersand as a URL ampersand is to use the mod_rewrite B flag: RewriteRule ^(.*)$ index.php?path=$1 [L,QSA,B].
However, this flag isn't available in earlier versions of Apache, which would have to be upgraded, and that is not supported by some hosting companies.
I have found a solution that works well for us. We have a URL of /search/results/Takeaway+Foods/Inverchorachan,+Argyll+&+Bute+
This obviously breaks the URL at the &, giving us /search/results/Takeaway+Foods/Inverchorachan,+Argyll, which then gives a 404 error as there is no such page.
The URL is held in $_GET['url']. If the URL contains an &, the query string is split at each ampersand.
The following code pieces the URL back together by traversing the $_GET array for each piece.
I would like to know if this has any hidden problems that I may not be aware of.
The code:
$newurl = "";
foreach ($_GET as $key => $pcs) {
    if ($newurl == "")
        $newurl = $pcs;
    else
        $newurl .= "& " . rtrim($key, "_");
}
//echo $newurl;exit;
if($newurl!='') $url=$newurl;
I am trimming the underscore from each piece, as it gets added automatically (PHP replaces certain characters in parameter names with underscores). Not sure why; any help on this would be great.
You said in a comment:
We want the URL to show the ampersand so substituting with other characters is not an option.
Short answer: Don't do it.
Seriously, don't use ampersands this way in URLs, even if it looks pretty. Ampersands have a special meaning in a URL, and trying to override that meaning because it looks nice is a very bad idea.
Most web-based software (including Apache, PHP and all browsers) makes assumptions about what an ampersand means in a URL, which you will find very hard to work around.
In particular, you will utterly confuse Google and other search engines if you've got arbitrary ampersands in the URL, which will completely destroy your SEO ranking.
If you must have an ampersand in the string, use urlencoding to turn it into a URL-friendly %26. This won't look good in the user's URL string, but it will work as intended.
If that's not acceptable, then substitute something different for ampersands; maybe the word "and", or a character like an underscore, or perhaps just remove it from the string without a replacement.
All of these are common practice. Trying to force the URL to have an actual ampersand character in it is not common practice, and for very good reason.
Take a look at urlencode.
You can also replace the "&" char with something that doesn't break the URI and won't be interpreted by Apache, like the "|" char.
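For illustration, urlencode() turns the ampersand into %26, and the encoding round-trips cleanly:

```php
<?php
// The space becomes "+" and the ampersand becomes "%26".
echo urlencode('Argyll & Bute'), "\n";   // Argyll+%26+Bute
echo urldecode('Argyll+%26+Bute'), "\n"; // Argyll & Bute
```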
We have had this fix in place for two weeks now, so I believe it has solved the issue. I hope this helps someone with a similar problem, as I searched for weeks for a solution short of an Apache upgrade to get the B flag. Our users can now type in Bed & Breakfast and we can serve the appropriate page.
Here is the fix in PHP.
$newurl = "";
foreach ($_GET as $key => $pcs) {
    if ($newurl == "")
        $newurl = $pcs;
    else
        $newurl .= "& " . rtrim($key, "_");
}
if($newurl!='') $url=$newurl;
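A runnable sketch of what this loop does, with a simulated $_GET (the exact keys PHP produces depend on the original URL; the shape assumed here is that the text after the & arrives as a key with an empty value):

```php
<?php
// Simulated: "/search/results/Bed+&+Breakfast" split at the "&".
$_GET = ['url' => 'search/results/Bed', 'Breakfast' => ''];

$newurl = '';
foreach ($_GET as $key => $pcs) {
    if ($newurl == '') {
        $newurl = $pcs;                     // first piece: keep the value
    } else {
        $newurl .= '& ' . rtrim($key, '_'); // later pieces: use the key
    }
}
echo $newurl; // search/results/Bed& Breakfast
```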

Validate URL with or without protocol

Hi, I would like to validate the following URLs, so that they would all pass with or without the http/www part, as long as a TLD is present (.com, .net, .org, etc.):
Valid URLs Should Be:
http://www.domain.com
http://domain.com
https://www.domain.com
https://domain.com
www.domain.com
domain.com
To support long tlds:
http://www.domain.com.uk
http://domain.com.uk
https://www.domain.com.uk
https://domain.com.uk
www.domain.com.uk
domain.com.uk
To support dashes (-):
http://www.domain-here.com
http://domain-here.com
https://www.domain-here.com
https://domain-here.com
www.domain-here.com
domain-here.com
Also to support numbers in domains:
http://www.domain1-test-here.com
http://domain1-test-here.com
https://www.domain1-test-here.com
https://domain1-test-here.com
www.domain1-test-here.com
domain1-test-here.com
Also maybe allow even IPs:
127.127.127.127
(but this is extra!)
Also allow dashes (-), forgot to mention that =)
I've found many functions that validate one or the other, but not both at the same time.
If anyone knows a good regex for it, please share. Thank you for your help.
A perfect solution for URL validation:
The answer above is right, but it does not work on all domains, like .me, .it or .in, so please use the pattern below for URL matching:
$pattern = '/(?:https?:\/\/)?(?:[a-zA-Z0-9.-]+?\.[a-zA-Z]{2,}|\d+\.\d+\.\d+\.\d+)/';
if (preg_match($pattern, "http://website.in")) {
    echo "valid";
} else {
    echo "invalid";
}
If you ignore the path part and look at the domain part only, a simple rule would be:
(?:https?://)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+)
If you want to support country TLDs as well, you must either supply a complete (current) list or append |.. to the TLD part.
With preg_match you must wrap the pattern between delimiters:
$pattern = ';(?:https?://)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+);';
$index = preg_match($pattern, $url);
Usually you would use /, but in this case slashes are part of the pattern, so I have chosen another delimiter. Otherwise I would have to escape the slashes with \:
$pattern = '/(?:https?:\/\/)?(?:[a-zA-Z0-9.-]+?\.(?:com|net|org|gov|edu|mil)|\d+\.\d+\.\d+\.\d+)/';
Don't use a regular expression. Not every problem that involves strings needs regexes.
Don't write your own URL validator. URL validation is a solved problem, and there is existing code that has already been written, debugged and tested. In fact, it comes standard with PHP.
Look at PHP's built-in filtering functionality: http://us2.php.net/manual/en/book.filter.php
I think you can use flags for filter_var.
For FILTER_VALIDATE_URL there are several flags available:
FILTER_FLAG_SCHEME_REQUIRED - Requires the URL to contain a scheme part.
FILTER_FLAG_HOST_REQUIRED - Requires the URL to contain a host part.
FILTER_FLAG_PATH_REQUIRED - Requires the URL to contain a path part.
FILTER_FLAG_QUERY_REQUIRED - Requires the URL to contain a query string.
FILTER_FLAG_SCHEME_REQUIRED and FILTER_FLAG_HOST_REQUIRED are used by default.
Let's say you want to check for the path part and do not want to check for the scheme part; you can do something like this (the flag argument is a bitmask):
filter_var($url, FILTER_VALIDATE_URL, ~FILTER_FLAG_SCHEME_REQUIRED | FILTER_FLAG_PATH_REQUIRED)
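For example, FILTER_FLAG_PATH_REQUIRED on its own behaves like this:

```php
<?php
// With FILTER_FLAG_PATH_REQUIRED, a URL without a path component fails.
var_dump(filter_var('http://example.com/some/path', FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED) !== false); // bool(true)
var_dump(filter_var('http://example.com', FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED) !== false);           // bool(false)
```

Note that in current PHP versions the scheme and host checks are always on, so the ~FILTER_FLAG_SCHEME_REQUIRED part mainly matters on older releases.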

PHP regex for validating a URL

I'm looking for a decent regex to match a URL (a full URL with scheme, domain, path etc.)
I would normally use filter_var, but I can't in this case as I have to support PHP < 5.2!
I've searched the web but can't find anything that I'm confident is fool-proof, and all I can find on SO is people saying to use filter_var.
Does anybody have a regex that they use for this?
My code (just so you can see what I'm trying to achieve):
function validate_url($url) {
    if (function_exists('filter_var')) {
        return filter_var($url, FILTER_VALIDATE_URL);
    }
    return preg_match(REGEX_HERE, $url);
}
I have created a solution for validating the domain. While it does not specifically cover the entire URL, it is very detailed and specific. The question you need to ask yourself is, "Why am I validating a domain?" If it is to see if the domain actually could exist, then you need to confirm the domain (including valid TLDs). The problem is, too many developers take the shortcut of ([a-z]{2,4}) and call it good. If you think along these lines, then why call it URL validation? It's not. It's just passing the URL through a regex.
I have an open source class that will allow you to validate the domain not only using the single source for TLD management (iana.org), but it will also validate the domain via DNS records to make sure it actually exists. The DNS validating is optional, but the domain will be specifically valid based on TLD.
For example: example.ay is NOT a valid domain, as the .ay TLD is invalid. But using the regex posted here ([a-z]{2,4}), it would pass. I have an affinity for quality. I try to express that in the code I write. Others may not really care. So if you want to simply "check" the URL, you can use the examples listed in these responses. If you actually want to validate the domain in the URL, you can have a look at the class I created to do just that. It can be downloaded at:
http://code.google.com/p/blogchuck/source/browse/trunk/domains.php
It validates based on the RFCs that "govern" (using the term loosely) what determines a valid domain. In a nutshell, here is what the domains class will do:
Basic rules of the domain validation
must be at least one character long
must start with a letter or number
contains letters, numbers, and hyphens
must end in a letter or number
may contain multiple nodes (i.e. node1.node2.node3)
each node can only be 63 characters long max
total domain name can only be 255 characters long max
must end in a valid TLD
can be an IPv4 address
It will also download a copy of the master TLD file from iana.org, but only after checking your local copy. If your local copy is more than 30 days old, it will download a new one. The TLDs in the file are used in the regex to validate the TLD of the domain you are validating. This prevents .ay (and other invalid TLDs) from passing validation.
This is a lengthy bit of code, but very compact considering what it does. And it is the most accurate. That's why I asked the question earlier. Do you want to do "validation" or simple "checking"?
You could try this one. I haven't tried it myself, but it's surely the biggest regexp I've ever seen, haha.
^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+#)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$
!(https?://)?([-_a-z0-9]+\.)*([-_a-z0-9]+)\.([a-z]{2,4})(/?)(.*)!i
I use this regular expression for validating URLs. So far it hasn't failed me a single time :)
I've seen a regex that could actually validate any kind of valid URL but it was two pages long...
You're probably better off parsing the URL with parse_url and then checking that all of your required bits are in order.
Addition:
This is a snip of my URL class:
public static function IsUrl($test)
{
    if (strpos($test, ' ') > -1)
    {
        return false;
    }
    if (strpos($test, '.') > 1)
    {
        $check = @parse_url($test);
        return is_array($check)
            && isset($check['scheme'])
            && isset($check['host'])
            && count(explode('.', $check['host'])) > 1;
    }
    return false;
}
It tests the given string and requires some basics of a URL, namely that the scheme is set and the hostname contains a dot.
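A standalone sketch of the same checks as a plain function (is_url is a renamed, self-contained version for illustration):

```php
<?php
// Same idea as the class method: no spaces, a dot past position 1,
// and parse_url() must yield a scheme and a dotted host.
function is_url(string $test): bool
{
    if (strpos($test, ' ') !== false) {
        return false;
    }
    if (strpos($test, '.') > 1) {
        $check = @parse_url($test);
        return is_array($check)
            && isset($check['scheme'])
            && isset($check['host'])
            && count(explode('.', $check['host'])) > 1;
    }
    return false;
}

var_dump(is_url('http://example.com/page')); // bool(true)
var_dump(is_url('not a url'));               // bool(false)
```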
