Check if URL is from a certain website - php

Problem
Users can submit a form containing a link to sitea.com. I want to check that the submitted URL actually points to sitea.com.
What I've tried
I've tried checking (with a regex) that the URL is well formed and contains sitea.com. But that leaves gaps: anyone can append ?haha=sitea.com to a URL and still get a match. And since I'm no master of regex, my "solution" ends here.
My question
Is it possible to check if $_POST['url'] is actually a link to sitea.com?

I think parse_url() is the best fit here. A regex may work, but it's best to avoid one when a built-in function is available.
I'd do something like:
$url = '...';
$domain = implode('.', array_slice(explode('.', parse_url($url, PHP_URL_HOST)), -2));
if ($domain == 'sitea.com') {
# code...
}
As a function:
function getDomain($url)
{
$domain = implode('.', array_slice(explode('.', parse_url($url, PHP_URL_HOST)), -2));
if ($domain == 'sitea.com') {
return True;
} else {
return False;
}
}
Test cases:
var_dump(getDomain('http://sitea.com/'));
var_dump(getDomain('http://sitea.com/directory'));
var_dump(getDomain('http://subdomain.sitea.com/'));
var_dump(getDomain('http://sub.subdomain.sitea.com/#test'));
var_dump(getDomain('http://subdomain.notsitea.com/#dsdf'));
var_dump(getDomain('http://sitea.somesite.com'));
var_dump(getDomain('http://example.com/sitea.com'));
var_dump(getDomain('http://sitea.example.com/test.php?haha=sitea.com'));
Output:
bool(true)
bool(true)
bool(true)
bool(true)
bool(false)
bool(false)
bool(false)
bool(false)
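Taking the last two labels of the host has two limits worth knowing: it breaks on suffixes such as .co.uk (the result would be co.uk), and the == comparison is case-sensitive while parse_url() does not lowercase the host. A minimal sketch of a variation, assuming PHP 7+ and a hypothetical helper name isFromDomain (not part of the answer above), compares the lowercased host against the whole expected domain instead:

```php
<?php
// Hypothetical helper: true when the host of $url is $domain itself
// or any subdomain of it. Only the host is inspected, never the path or query.
function isFromDomain(string $url, string $domain): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // malformed URL, or no host (e.g. missing scheme)
    }
    $host   = strtolower($host);
    $domain = strtolower($domain);
    $suffix = '.' . $domain;

    // Exact match, or the host ends with ".<domain>"
    return $host === $domain
        || substr($host, -strlen($suffix)) === $suffix;
}
```

With this, isFromDomain($_POST['url'], 'sitea.com') accepts sitea.com and its subdomains, while notsitea.com and ?haha=sitea.com are rejected because neither puts sitea.com at the end of the host.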

This might not be a job for regexes, but for existing tools in your language of choice. Regexes are not a magic wand you wave at every problem that happens to involve strings. You probably want to use existing code that has already been written, tested, and debugged.
In PHP, use the parse_url function.
Perl: URI module.
Ruby: URI module.
.NET: 'Uri' class

Related

How to convert random domain names into lowercase consistent urls?

I have this function in a class:
protected $supportedWebsitesUrls = ['www.youtube.com', 'www.vimeo.com', 'www.dailymotion.com'];
protected function isValid($videoUrl)
{
$urlDetails = parse_url($videoUrl);
if (in_array($urlDetails['host'], $this->supportedWebsitesUrls))
{
return true;
} else {
throw new \Exception('This website is not supported yet!');
return false;
}
}
It basically extracts the host name from any random url and then checks if it is in the $supportedWebsitesUrls array to ensure that it is from a supported website. But if I add say: dailymotion.com instead of www.dailymotion.com it won't detect that url. Also if I try to do WWW.DAILYMOTION.COM it still won't work. What can be done? Please help me.
You can use the preg_grep function for this. preg_grep matches a regex against each element of a given array.
Sample use:
$supportedWebsitesUrls = array('www.dailymotion.com', 'www.youtube.com', 'www.vimeo.com');
$s = 'DAILYMOTION.COM';
if ( empty(preg_grep('/' . preg_quote($s, '/') . '/i', $supportedWebsitesUrls)) )
    echo "This website is not supported yet!\n";
else
    echo "found a match\n";
Output:
found a match
You can run a few checks on it.
For lower case vs. upper case, PHP's strtolower() function will sort you out.
As for matching with or without the leading www., you can add an extra check to your if clause:
if (in_array($urlDetails['host'], $this->supportedWebsitesUrls) || in_array('www.'.$urlDetails['host'], $this->supportedWebsitesUrls))
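Both suggestions can be combined by normalizing the host once and storing the whitelist in the same normalized form. This is only a sketch, assuming PHP 7.1+; the names normalizeHost and isSupported are hypothetical, and the whitelist is written without www. on purpose:

```php
<?php
// Hypothetical helper: lowercase the host and drop one leading "www."
function normalizeHost(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // no usable host in the input
    }
    $host = strtolower($host);
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    return $host;
}

// Whitelist kept in normalized form: lowercase, no "www."
$supported = ['youtube.com', 'vimeo.com', 'dailymotion.com'];

// Exact comparison against the whitelist, so lookalike hosts do not pass.
function isSupported(string $videoUrl, array $supported): bool
{
    $host = normalizeHost($videoUrl);
    return $host !== null && in_array($host, $supported, true);
}
```

Unlike a substring or regex match, the exact comparison does not accept hosts that merely contain a supported name.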

Check if URL entered is from a certain domain

I have a field where a user can enter a URL. I need to check if that URL is from a certain domain, in this case google.com.
I've tried this, however, it doesn't work for all cases (which I list below):
if(strstr(parse_url($link, PHP_URL_HOST), 'google.com')) { // continue }
http://www.google.com/blah - works
https://www.google.com/blah - works
google.com/blah - doesn't work
www.google.com/blah - doesn't work
Is there a way to do this without regex? If not, how would it be done?
Thanks.
parse_url requires a valid URL, and google.com/blah isn't one because it has no scheme (as of PHP 5.3.3), so it won't work. As a workaround, you can prepend http:// if it isn't there already, and then check the domain.
Use the following function:
function checkRootDomain($url)
{
if (!preg_match("~^(?:f|ht)tps?://~i", $url)) {
$url = "http://" . $url;
}
$domain = implode('.', array_slice(explode('.', parse_url($url, PHP_URL_HOST)), -2));
if ($domain == 'google.com') {
return True;
} else {
return False;
}
}
Test cases:
var_dump(checkRootDomain('http://www.google.com/blah'));
var_dump(checkRootDomain('https://www.google.com/blah '));
var_dump(checkRootDomain('google.com/blah'));
var_dump(checkRootDomain('www.google.com/blah '));
Result:
bool(true)
bool(true)
bool(true)
bool(true)
It is a modified version of my own answer here.
Hope this helps!
This was an issue in php prior to version 5.4.7: http://php.net/manual/en/function.parse-url.php
Version 5.4.7 Fixed host recognition when scheme is omitted and a leading component separator is present.
<?php
$url = '//www.example.com/path?googleguy=googley';
// Prior to 5.4.7 this would show the path as "//www.example.com/path"
var_dump(parse_url($url));
?>
Your options are to upgrade to PHP >= 5.4.7, or to detect a missing http: and add it.
parse_url() requires a valid URL. If it doesn't have the scheme (ie the http:// bit) then it's not a valid URL.
You can add a scheme very easily, though. Just check whether the string contains a colon (:), and if not, add http:// to the front of it:
if(strpos($link, ':')===false) {$link = "http://".$link;}
Now your parse_url() call should work a bit better.
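One caveat with the colon test: an input such as www.google.com:8080/blah contains a colon but still has no scheme. A minimal sketch that looks for an actual scheme prefix instead, assuming a hypothetical helper name withScheme and that http is an acceptable default:

```php
<?php
// Hypothetical helper: make sure the string starts with a scheme
// so that parse_url() can find the host.
function withScheme(string $link): string
{
    $link = trim($link);
    if (strpos($link, '//') === 0) {
        return 'http:' . $link;      // protocol-relative input: //host/path
    }
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $link)) {
        return 'http://' . $link;    // bare host: google.com/blah
    }
    return $link;                    // already has a scheme
}
```

After this step, parse_url(withScheme($link), PHP_URL_HOST) returns the host for all four inputs listed in the question.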

Get subdomain if any

Is there a predefined method in PHP to get the subdomain from a URL, if there is one?
The URL pattern may be:
http://www.sd.domain.com
http://domain.com
http://sd.domain.com
http://domain.com
where sd stands for the subdomain.
The method must return a different value for each case:
case 1 -> return sd
case 2 -> return false or empty
case 3 -> return sd
case 4 -> return false or empty
I found some good links
PHP function to get the subdomain of a URL
Get subdomain from url?
but they don't specifically apply to my cases.
Any help would be much appreciated.
Thanks
Okay, here's a script I wrote :)
$url = $_SERVER['HTTP_HOST'];
$host = explode('.', $url);
if( !empty($host[0]) && $host[0] != 'www' && $host[0] != 'localhost' ){
$domain = $host[0];
}else{
$domain = 'home';
}
So, there are several possibilities...
First, regular expressions of course:
(http://)?(www\.)?([^\.]*?)\.?([^\.]+)\.([^\.]+)
The third capture group will be your subdomain. Of course, if the URL starts with https:// or www2 (seen it all...), the regex breaks. So this is just a first draft to start working with.
My second idea is, like yours, exploding the URL. I thought of something like this:
function getSubdomain($url) {
$parts = explode('.', str_replace('http://', '', $url));
if(count($parts) >= 3) {
return $parts[count($parts) - 3];
}
return null;
}
My idea behind this function was that if a URL is split on ., the subdomain will almost always be the third-to-last entry in the resulting array. The protocol has to be stripped first (see case 3). Of course, this can certainly be done more elegantly.
I hope this gives you some ideas.
Try this.
[update] We have a constant _SITE_ADDRESS defined, such as www.mysite.com; you could use a literal for this.
It works well in our system for what seems like that exact purpose.
public static function getSubDomain()
{
if($_SERVER["SERVER_NAME"] == str_ireplace('http://','',_SITE_ADDRESS)) return ''; //base domain
$host = str_ireplace(array("www.", _SITE_ADDRESS), "", strtolower(trim($_SERVER["HTTP_HOST"])));
$sub = preg_replace('/\..*/', '', $host);
if($sub == $host) return ''; //this is likely an ip address
return $sub;
}
There is an external note on that function but no link, so apologies to any original developer whose code this is based on.
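If the base domain is known in advance, the four cases from the question can also be handled with parse_url() and a suffix comparison. This is a sketch only, assuming PHP 7.1+; the helper name subdomainOf and the convention of returning '' for "no subdomain" and null for "different domain" are my own choices:

```php
<?php
// Hypothetical helper: returns the subdomain of $url relative to $baseDomain,
// '' when there is none, or null when the host belongs to another domain.
function subdomainOf(string $url, string $baseDomain): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // no host found (malformed URL or missing scheme)
    }
    $host       = strtolower($host);
    $baseDomain = strtolower($baseDomain);

    // A leading "www." is not treated as a subdomain.
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    if ($host === $baseDomain) {
        return '';
    }
    $suffix = '.' . $baseDomain;
    if (substr($host, -strlen($suffix)) !== $suffix) {
        return null;
    }
    // Everything in front of ".<baseDomain>"
    return substr($host, 0, -strlen($suffix));
}
```

For the question's inputs with 'domain.com' as the base, cases 1 and 3 give 'sd' and cases 2 and 4 give an empty string.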

Most efficient way to check a URL

I'm trying to check whether a user-submitted URL is valid; it goes directly into the database when the user hits submit.
So far, I have:
$string = $_POST['url'];
if (strpos($string, 'www.') && (strpos($string, '/')))
{
echo 'Good';
}
The submitted URL should point to a page in a directory, not the main site, e.g. http://www.address.com/page
How can I make it check for the / after the domain, rather than matching a / from http://, and without counting .com as the page?
Sample input:
Valid:
http://www.facebook.com/pageName
http://www.facebook.com/pageName/page.html
http://www.facebook.com/pageName/page.*
Invalid:
http://www.facebook.com
facebook.com/pageName
facebook.com
if(!parse_url('http://www.address.com/page', PHP_URL_PATH)) {
echo 'no path found';
}
See parse_url reference.
See the parse_url() function. This will give you the "/page" part of the URL in a separate string, which you can then analyze as desired.
filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED)
More information here :
http://ca.php.net/filter_var
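A minimal sketch of how that validation could be combined with a path check, assuming PHP 7+ and a hypothetical helper name hasPagePath. It also treats a bare trailing slash as "no page", which is my reading of the sample input rather than something the question states:

```php
<?php
// Hypothetical helper: true only for absolute URLs whose path names a page.
function hasPagePath(string $url): bool
{
    // Rejects inputs without a scheme, such as "facebook.com/pageName"
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // null when the URL has no path component at all
    $path = parse_url($url, PHP_URL_PATH);

    // A path of just "/" does not count as a page
    return is_string($path) && trim($path, '/') !== '';
}
```

Validation should still be followed by a prepared statement or proper escaping before the value reaches the database.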
Maybe strrpos will help you. It locates the last occurrence of a string within a string.
To check the format of the URL you could use a regular expression:
preg_match [ http://php.net/manual/en/function.preg-match.php ] is a good start, but a knowledge of regular expressions is needed to make it work.
Additionally, if you actually want to check that it's a valid URL, you could check the URL value to see if it actually resolves to a web page:
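function check_404($url) {
    // @ suppresses the warning get_headers() raises when the request fails
    $return = @get_headers($url);
    if ($return === false) {
        return false; // the server could not be reached at all
    }
    // true unless the status line reports a 404
    return strpos($return[0], ' 404 ') === false;
}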
Try using a regular expression to see that the URL has the correct structure. Here's more reading on this. You need to learn how PCRE works.
A simple example for what you want (disclaimer: not tested, incomplete).
function isValidUrl($url) {
return preg_match('#http://[^/]+/.+#', $url);
}
From here: http://www.blog.highub.com/regular-expression/php-regex-regular-expression/php-regex-validating-a-url/
<?php
/**
 * Validate URL
 * Allows for port, path and query string validations
 * @param string $url string containing url user input
 * @return boolean Returns TRUE/FALSE
 */
function validateURL($url)
{
    $pattern = '/^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-F\d]{2,2})+(:([\d\w]|%[a-fA-F\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2,2})*)?$/';
    return preg_match($pattern, $url) === 1;
}
$result = validateURL('http://www.google.com');
print $result;
?>

URL parse function

Given this variable:
$variable = 'foo.com/bar/foo';
What function would trim $variable to foo.com?
Edit: I would like the function to be able to trim anything on a URL that could possibly come after the domain name.
Thanks in advance,
John
Working for OP:
$host = parse_url($url, PHP_URL_HOST);
The version of PHP I have to work with doesn't accept two parameters (Zend Engine 1.3.0). Whatever. Here's the working code for me - you do have to have the full URL including the scheme (http://). If you can safely assume that the scheme is http:// (and not https:// or something else), you could just prepend that to get what you need.
Working for me:
$url = 'http://foo.com/bar/foo';
$parts = parse_url($url);
$host = $parts['host'];
echo "The host is $host\n";
I'm using http://www.google.com/asdf in my example
If you're fine with getting the subdomain as well, you could split by "//" and take the element at index 1 to effectively remove the protocol and get www.google.com/asdf
You can then split by "/" and get the 0th element.
That seems ugly. Just brainstorming here =)
Try this:
function getDomain($url)
{
    // FILTER_FLAG_HOST_REQUIRED was deprecated in PHP 7.3 and removed in
    // PHP 8.0; FILTER_VALIDATE_URL already implies it.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    /*** get the url parts ***/
    $parts = parse_url($url);
    /*** return the scheme and host ***/
    return $parts['scheme'].'://'.$parts['host'];
}
$variable = 'http://foo.com/bar/foo'; // a scheme is required, or validation fails
echo getDomain($variable);
You can use PHP's parse_url function and then read the "host" key of the result to get the hostname.
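Since the value in the question has no scheme, parse_url() would put the whole string into the path and leave the host unset. A minimal sketch that works around this, assuming PHP 7.1+, a hypothetical helper name hostOf, and that http is a safe default when no scheme is given:

```php
<?php
// Hypothetical helper: returns the host of $url, or null if none is found.
function hostOf(string $url): ?string
{
    // Prepend a scheme when the input does not start with one
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'http://' . ltrim($url, '/');
    }
    $host = parse_url($url, PHP_URL_HOST);
    return (is_string($host) && $host !== '') ? $host : null;
}
```

Called as hostOf('foo.com/bar/foo'), this returns foo.com; a full URL such as http://foo.com/bar/foo gives the same result.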
