Check if URL entered is from a certain domain

I have a field where a user can enter a URL. I need to check if that URL is from a certain domain, in this case google.com.
I've tried this, however, it doesn't work for all cases (which I list below):
if (strstr(parse_url($link, PHP_URL_HOST), 'google.com')) { /* continue */ }
http://www.google.com/blah - works
https://www.google.com/blah - works
google.com/blah - doesn't work
www.google.com/blah - doesn't work
Is there a way to do this without regex? If not, how would it be done?
Thanks.

parse_url requires a valid URL, and google.com/blah isn't one (as of PHP 5.3.3), so it won't work. As a workaround, you can prepend http:// if it isn't there already, and then check the domain.
Use the following function:
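function checkRootDomain($url)
{
    // Prepend a scheme unless the URL already starts with http(s):// or ftp(s)://
    if (!preg_match("~^(?:f|ht)tps?://~i", $url)) {
        $url = "http://" . $url;
    }
    // Keep only the last two labels of the host, e.g. www.google.com -> google.com
    $domain = implode('.', array_slice(explode('.', parse_url($url, PHP_URL_HOST)), -2));
    if ($domain == 'google.com') {
        return True;
    } else {
        return False;
    }
}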
Test cases:
var_dump(checkRootDomain('http://www.google.com/blah'));
var_dump(checkRootDomain('https://www.google.com/blah '));
var_dump(checkRootDomain('google.com/blah'));
var_dump(checkRootDomain('www.google.com/blah '));
Result:
bool(true)
bool(true)
bool(true)
bool(true)
It is a modified version of my own answer here.
Hope this helps!

This was an issue in PHP prior to version 5.4.7: http://php.net/manual/en/function.parse-url.php
Changelog for version 5.4.7: "Fixed host recognition when scheme is omitted and a leading component separator is present."
<?php
$url = '//www.example.com/path?googleguy=googley';
// Prior to 5.4.7 this would show the path as "//www.example.com/path"
var_dump(parse_url($url));
?>
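Your options are to upgrade to PHP >= 5.4.7, or to detect a missing http:// and add it. Note that the 5.4.7 fix only covers URLs that start with //; a bare google.com/blah is still returned as a path, so prepending the scheme is the option that handles all of the cases in the question.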
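To see which inputs parse_url() treats as having a host, a quick check along these lines can help. This is only a sketch that assumes PHP 5.4.7 or later; the variable names are illustrative.

```php
<?php
// A bare host with a path: parse_url() reports the whole string as the path.
$bare = parse_url('google.com/blah');
var_dump(isset($bare['host'])); // bool(false)
echo $bare['path'], "\n";       // google.com/blah

// A scheme-relative URL: the host is recognised on PHP >= 5.4.7.
$relative = parse_url('//www.google.com/blah');
echo $relative['host'], "\n";   // www.google.com
echo $relative['path'], "\n";   // /blah
```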

parse_url() requires a valid URL. If it doesn't have the scheme (i.e. the http:// bit) then it's not a valid URL.
You can add a scheme very easily, though. Just check whether the string contains a colon (:), and if not, add http:// to the front of it:
if(strpos($link, ':')===false) {$link = "http://".$link;}
Now your parse_url() call should work a bit better.
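Putting the scheme fix and the host check together might look like the sketch below. The helper name isFromDomain is hypothetical, and it tests for a scheme at the start of the string rather than for any colon, on the assumption that input such as google.com:8080/blah should also get a scheme prepended.

```php
<?php
// Hypothetical helper: true when $link points at $domain or one of its subdomains.
function isFromDomain($link, $domain)
{
    $link = trim($link);
    // Prepend a default scheme when none is present, so parse_url() sees a host.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $link)) {
        $link = 'http://' . $link;
    }
    $host = parse_url($link, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }
    $host = strtolower($host);
    $domain = strtolower($domain);
    $suffix = '.' . $domain;
    // Accept an exact match, or a host that ends in ".<domain>".
    return $host === $domain || substr($host, -strlen($suffix)) === $suffix;
}

var_dump(isFromDomain('google.com/blah', 'google.com'));           // bool(true)
var_dump(isFromDomain('www.google.com/blah', 'google.com'));       // bool(true)
var_dump(isFromDomain('http://notgoogle.com/', 'google.com'));     // bool(false)
var_dump(isFromDomain('example.com/?q=google.com', 'google.com')); // bool(false)
```

Comparing against the host suffix, rather than searching the host for the string, avoids matching look-alike hosts such as notgoogle.com.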

Related

PHP Auto-correcting URLs

I don't want to reinvent the wheel, but I couldn't find any library that does this perfectly.
In my script users can save URLs. When they give me a list like:
google.com
www.msn.com
http://bing.com/
and so on...
I want to be able to save them in the database in the "correct format".
What I do now is check whether a protocol is present, add one if it is not, and then validate the URL against a regular expression.
For PHP's parse_url, any URL that contains a protocol is valid, so it didn't help much.
How do you handle this? Do you have any ideas you would like to share?
Edit:
I want to filter out invalid URLs from user input (a list of URLs) and, more importantly, try to auto-correct URLs that are invalid (e.g. ones that don't contain a protocol). Once the user enters the list, it should be validated immediately (there is no time to open the URLs to check whether they really exist).
It would be great to extract parts from the URL, as parse_url does, but the problem with parse_url is that it doesn't work well with invalid URLs. I tried parsing the URL with it and adding defaults for required parts that are missing (e.g. no protocol, add http). But for "google.com", parse_url won't return "google.com" as the hostname; it returns it as the path.
This looks like a really common problem to me, but I could not find a ready-made solution on the internet (I found some libraries that standardize URLs, but they won't fix a URL if it is invalid).
Is there some "smart" solution to this, or should I stick with my current approach:
1. Find the first occurrence of :// and check that the text before it is a valid protocol; add a protocol if it is missing.
2. Find the next occurrence of / and check that the hostname is in a valid format.
3. For good measure, validate the whole URL once more with a regular expression.
I just have a feeling I will reject some valid URLs this way, and for me it is better to have a false positive than a false negative.

I had the same problem with parse_url as the OP. This is my quick and dirty solution to auto-correct URLs (keep in mind that the code is in no way perfect and does not cover all cases):
Results:
http:/wwww.example.com/lorum.html => http://www.example.com/lorum.html
gopher:/ww.example.com => gopher://www.example.com
http:/www3.example.com/?q=asd&f=#asd => http://www3.example.com/?q=asd&f=#asd
asd://.example.com/folder/folder/ => http://example.com/folder/folder/
.example.com/ => http://example.com/
example.com => http://example.com
subdomain.example.com => http://subdomain.example.com
function url_parser($url) {
    // multiple /// messes up parse_url, replace 2+ with 2
    $url = preg_replace('/(\/{2,})/','//',$url);
    $parse_url = parse_url($url);
    if(empty($parse_url["scheme"])) {
        $parse_url["scheme"] = "http";
    }
    if(empty($parse_url["host"]) && !empty($parse_url["path"])) {
        // Strip slash from the beginning of path
        $parse_url["host"] = ltrim($parse_url["path"], '\/');
        $parse_url["path"] = "";
    }
    $return_url = "";
    // Check if scheme is correct
    if(!in_array($parse_url["scheme"], array("http", "https", "gopher"))) {
        $return_url .= 'http'.'://';
    } else {
        $return_url .= $parse_url["scheme"].'://';
    }
    // Check if the right amount of "www" is set.
    $explode_host = explode(".", $parse_url["host"]);
    // Remove empty entries
    $explode_host = array_filter($explode_host);
    // And reassign indexes
    $explode_host = array_values($explode_host);
    // Contains subdomain
    if(count($explode_host) > 2) {
        // Check if subdomain only contains the letter w(then not any other subdomain).
        if(substr_count($explode_host[0], 'w') == strlen($explode_host[0])) {
            // Replace with "www" to avoid "ww" or "wwww", etc.
            $explode_host[0] = "www";
        }
    }
    $return_url .= implode(".",$explode_host);
    if(!empty($parse_url["port"])) {
        $return_url .= ":".$parse_url["port"];
    }
    if(!empty($parse_url["path"])) {
        $return_url .= $parse_url["path"];
    }
    if(!empty($parse_url["query"])) {
        $return_url .= '?'.$parse_url["query"];
    }
    if(!empty($parse_url["fragment"])) {
        $return_url .= '#'.$parse_url["fragment"];
    }
    return $return_url;
}
echo url_parser('http:/wwww.example.com/lorum.html'); // http://www.example.com/lorum.html
echo url_parser('gopher:/ww.example.com'); // gopher://www.example.com
echo url_parser('http:/www3.example.com/?q=asd&f=#asd'); // http://www3.example.com/?q=asd&f=#asd
echo url_parser('asd://.example.com/folder/folder/'); // http://example.com/folder/folder/
echo url_parser('.example.com/'); // http://example.com/
echo url_parser('example.com'); // http://example.com
echo url_parser('subdomain.example.com'); // http://subdomain.example.com

It's not 100% foolproof, but it's a one-liner.
$URL = (((strpos($URL,'https://') === false) && (strpos($URL,'http://') === false))?'http://':'' ).$URL;
EDIT
There was apparently a problem with my initial version if the hostname contains http.
Thanks, Trent.
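One remaining gap in a strpos() check is that it finds http:// anywhere in the string, including inside a query parameter. A variant that anchors the test to the start of the string could look like this sketch; the function name ensureScheme and its $default parameter are made up for illustration.

```php
<?php
// Hypothetical helper: prepend a scheme only when the string does not start with one.
function ensureScheme($url, $default = 'http')
{
    $url = trim($url);
    // Anchored, case-insensitive test: only http:// or https:// at the very start counts.
    if (preg_match('~^https?://~i', $url)) {
        return $url;
    }
    return $default . '://' . $url;
}

echo ensureScheme('example.com'), "\n";                         // http://example.com
echo ensureScheme('HTTPS://example.com'), "\n";                 // HTTPS://example.com
echo ensureScheme('example.com/?next=http://other.test'), "\n"; // http://example.com/?next=http://other.test
```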

Check if URL is from a certain website

Problem
The user can submit a form containing a link to sitea.com. What I want to do is check whether the user actually submitted a URL from sitea.com.
What I've tried
I've tried to check whether the URL is well-formed (using regex) and contains sitea.com. But that leaves gaps, as anyone can add ?haha=sitea.com to a URL and still get a match. And because I'm no master of regex, my "solution" ends here.
My question
Is it possible to check if $_POST['url'] is actually a link to sitea.com?

I think it's best to use parse_url() here. Regex may work, but it's best to avoid regex when a built-in function is available.
I'd do something like:
$url = '...';
$domain = implode('.', array_slice(explode('.', parse_url($url, PHP_URL_HOST)), -2));
if ($domain == 'sitea.com') {
    # code...
}
As a function:
function getDomain($url)
{
    $domain = implode('.', array_slice(explode('.', parse_url($url, PHP_URL_HOST)), -2));
    if ($domain == 'sitea.com') {
        return True;
    } else {
        return False;
    }
}
Test cases:
var_dump(getDomain('http://sitea.com/'));
var_dump(getDomain('http://sitea.com/directory'));
var_dump(getDomain('http://subdomain.sitea.com/'));
var_dump(getDomain('http://sub.subdomain.sitea.com/#test'));
var_dump(getDomain('http://subdomain.notsitea.com/#dsdf'));
var_dump(getDomain('http://sitea.somesite.com'));
var_dump(getDomain('http://example.com/sitea.com'));
var_dump(getDomain('http://sitea.example.com/test.php?haha=sitea.com'));
Output:
bool(true)
bool(true)
bool(true)
bool(true)
bool(false)
bool(false)
bool(false)
bool(false)
Demo!
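Taking the last two labels assumes the site's domain has exactly two labels, which would not hold for something like sitea.co.uk. A possible generalisation, sketched here with a hypothetical function name hostMatchesDomain, compares as many trailing labels as the expected domain has.

```php
<?php
// Hypothetical helper: compare the trailing labels of the URL's host with $domain.
function hostMatchesDomain($url, $domain)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host, e.g. the URL has no scheme
    }
    $want = explode('.', strtolower($domain));
    // Take as many labels from the end of the host as the expected domain has.
    $have = array_slice(explode('.', strtolower($host)), -count($want));
    return $have === $want;
}

var_dump(hostMatchesDomain('http://sitea.com/', 'sitea.com'));               // bool(true)
var_dump(hostMatchesDomain('http://shop.sitea.co.uk/cart', 'sitea.co.uk'));  // bool(true)
var_dump(hostMatchesDomain('http://other.co.uk/', 'sitea.co.uk'));           // bool(false)
var_dump(hostMatchesDomain('http://example.com/sitea.com', 'sitea.com'));    // bool(false)
```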

This might not be a job for regexes, but for existing tools in your language of choice. Regexes are not a magic wand you wave at every problem that happens to involve strings. You probably want to use existing code that has already been written, tested, and debugged.
In PHP, use the parse_url function.
Perl: URI module.
Ruby: URI module.
.NET: 'Uri' class

Get subdomain if any

Is there any predefined method in PHP to get sub-domain from url if any?
url pattern may be:
http://www.sd.domain.com
http://domain.com
http://sd.domain.com
http://domain.com
where sd stands for sub-domain.
Now the method must return a different value for each case:
case 1 -> return sd
case 2 -> return false or empty
case 3 -> return sd
case 4 -> return false or empty
I found some good links
PHP function to get the subdomain of a URL
Get subdomain from url?
but they don't specifically cover my cases.
Any help would be much appreciated.
Thanks

Okay, here is a script I wrote :)
$url = $_SERVER['HTTP_HOST'];
$host = explode('.', $url);
if (!empty($host[0]) && $host[0] != 'www' && $host[0] != 'localhost') {
    $domain = $host[0];
} else {
    $domain = 'home';
}

So, there are several possibilities...
First, regular expressions of course:
(http://)?(www\.)?([^\.]*?)\.?([^\.]+)\.([^\.]+)
The entry in the third parenthesis will be your subdomain. Of course, if your url would be https:// or www2 (seen it all...) the regex would break. So this is just a first draft to start working with.
My second idea is, just like yours, exploding the URL. I thought of something like this:
function getSubdomain($url) {
    $parts = explode('.', str_replace('http://', '', $url));
    if (count($parts) >= 3) {
        return $parts[count($parts) - 3];
    }
    return null;
}
The idea behind this function is that if a URL is split on ".", the subdomain will almost always be the third-to-last entry in the resulting array. The protocol has to be stripped first (see case 3). Of course, this can certainly be done more elegantly.
I hope I could give you some ideas.
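For the four cases in the question, a variant built on parse_url() might look like the following sketch. The name extractSubdomain and the $baseLabels parameter are hypothetical, and it assumes the registered domain has exactly two labels (domain.com), which is not true for suffixes such as co.uk.

```php
<?php
// Hypothetical helper: return the subdomain part of a URL's host, or '' if there is none.
function extractSubdomain($url, $baseLabels = 2)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return '';
    }
    $labels = explode('.', strtolower($host));
    // Treat a leading "www" as decoration rather than as a subdomain.
    if ($labels[0] === 'www') {
        array_shift($labels);
    }
    // Everything before the last $baseLabels labels is the subdomain.
    return implode('.', array_slice($labels, 0, -$baseLabels));
}

var_dump(extractSubdomain('http://www.sd.domain.com')); // string(2) "sd"
var_dump(extractSubdomain('http://domain.com'));        // string(0) ""
var_dump(extractSubdomain('http://sd.domain.com'));     // string(2) "sd"
```

Returning an empty string for "no subdomain" keeps the return type consistent; callers that prefer false can test for '' themselves.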

Try this.
[update] We have a constant _SITE_ADDRESS defined, such as www.mysite.com; you could use a literal for this.
It works well in our system for what seems like that exact purpose.
public static function getSubDomain()
{
    if($_SERVER["SERVER_NAME"] == str_ireplace('http://','',_SITE_ADDRESS)) return ''; //base domain
    $host = str_ireplace(array("www.", _SITE_ADDRESS), "", strtolower(trim($_SERVER["HTTP_HOST"])));
    $sub = preg_replace('/\..*/', '', $host);
    if($sub == $host) return ''; //this is likely an ip address
    return $sub;
}
There is an external note on that function but no link, so apologies to any original developer whose code this is based on.

regex to create link from url and strip www

I have a PHP function which takes a passed url and creates a clean link. It puts the full link in the anchor tags and presents just "www.domain.com" from the url. It works well but I would like to modify it so it strips out the "www." part as well.
<?php
// pass a url like: http://www.yelp.com/biz/my-business-name
// should return: yelp.com
function formatURL($url, $target=FALSE) {
    if ($target) { $anchor_tag = "\\4"; }
    else { $anchor_tag = "\\4"; }
    $return_link = preg_replace("`(http|ftp)+(s)?:(//)((\w|\.|\-|_)+)(/)?(\S+)?`i", $anchor_tag, $url);
    return $return_link;
}
?>
My regex skills are not that strong so any help greatly appreciated.

Take a look at parse_url: http://us2.php.net/manual/en/function.parse-url.php
This will simplify your logic quite a bit and can make removing the www. a simple string operation.
$link = 'http://www.yelp.com/biz/my-business-name';
$hostname = parse_url($link, PHP_URL_HOST);
if(strpos($hostname, 'www.') === 0)
{
    $hostname = substr($hostname, 4);
}
I have modified my original answer to account for the issue in the comments. The preg_replace in the post below will also work and is a bit more concise, I will leave this here to show an alternative solution that does not require invoking the regex engine if you desire.

This will get you the domain name minus the www:
$url = preg_replace('/^www\./', '', parse_url($url, PHP_URL_HOST));
^ in the regex means only remove www from the start of the string
Working example : http://codepad.org/FTNikw8g
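Putting the pieces together, a link formatter built on parse_url() might look like the sketch below. The function name formatLink, the generated markup, and the escaping choices are assumptions for illustration, not the asker's original code.

```php
<?php
// Hypothetical helper: build an anchor whose text is the host without a leading "www.".
function formatLink($url, $target = '')
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        // No recognisable host: return the escaped input unchanged.
        return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    }
    // Strip "www." only at the start of the host; the dot is escaped to match literally.
    $label = preg_replace('/^www\./i', '', $host);
    $attr = $target !== ''
        ? ' target="' . htmlspecialchars($target, ENT_QUOTES, 'UTF-8') . '"'
        : '';
    return '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '"' . $attr . '>'
        . htmlspecialchars($label, ENT_QUOTES, 'UTF-8') . '</a>';
}

echo formatLink('http://www.yelp.com/biz/my-business-name'), "\n";
// <a href="http://www.yelp.com/biz/my-business-name">yelp.com</a>
```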

URL parse function

Given this variable:
$variable = 'foo.com/bar/foo';
What function would trim $variable to foo.com ?
Edit: I would like the function to be able to trim anything on a URL that could possibly come after the domain name.
Thanks in advance,
John

Working for OP:
$host = parse_url($url, PHP_URL_HOST);
The version of PHP I have to work with doesn't accept two parameters (Zend Engine 1.3.0). Whatever. Here's the working code for me - you do have to have the full URL including the scheme (http://). If you can safely assume that the scheme is http:// (and not https:// or something else), you could just prepend that to get what you need.
Working for me:
$url = 'http://foo.com/bar/foo';
$parts = parse_url($url);
$host = $parts['host'];
echo "The host is $host\n";

I'm using http://www.google.com/asdf in my example
If you're fine with getting the subdomain as well, you could split by "//" and take the element at index 1 to effectively remove the protocol and get www.google.com/asdf
You can then split by "/" and get the 0th element.
That seems ugly. Just brainstorming here =)
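A string-only version of that idea could be sketched as follows. It swaps the two splits for an anchored scheme strip plus strcspn(), so that scheme-less input such as foo.com/bar/foo also works; the function name hostWithoutParseUrl is hypothetical, and any port or user info stays part of the result.

```php
<?php
// Hypothetical helper: cut the host out of a URL using plain string functions.
function hostWithoutParseUrl($url)
{
    // Drop a leading "scheme://" if there is one.
    $rest = preg_replace('~^[a-z][a-z0-9+.-]*://~i', '', trim($url));
    // Keep everything up to the first "/", "?" or "#".
    return substr($rest, 0, strcspn($rest, '/?#'));
}

echo hostWithoutParseUrl('http://www.google.com/asdf'), "\n"; // www.google.com
echo hostWithoutParseUrl('foo.com/bar/foo'), "\n";            // foo.com
```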

Try this:
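function getDomain($url)
{
    // FILTER_FLAG_HOST_REQUIRED was deprecated as redundant in PHP 7.3 and
    // removed in PHP 8.0, so FILTER_VALIDATE_URL is used on its own here.
    if(filter_var($url, FILTER_VALIDATE_URL) === FALSE)
    {
        return false;
    }
    /*** get the url parts ***/
    $parts = parse_url($url);
    /*** return the host domain ***/
    return $parts['scheme'].'://'.$parts['host'];
}
// Note: this input has no scheme, so validation fails and getDomain() returns false
$variable = 'foo.com/bar/foo';
echo getDomain($variable);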

You can use php's parse_url function and then access the value of the key "host" to get the hostname
