How to detect if user input is a domain or an IP? - PHP

I am writing some code for my website and need a way to examine a user's input and determine whether it is a domain or an IP address. If it is a domain, the code should parse it down from e.g. http://www.google.com to google.com, then resolve it and echo the result back into the user input field. If it is an IP, it should skip all of these steps and stay in the IP user input field.
This is what I have so far:
<?php
if ($_POST['submit'])
{
    $ip_address = $_POST["host"];
    if (ctype_alpha($ip_address))
    {
        // get host name from URL
        preg_match('#^(?:http://)?([^/]+)#i', $ip_address, $matches);
        $host = $matches[1];
        // get last two segments of host name
        preg_match('/[^.]+\.[^.]+$/', $host, $matches);
        $new_ip_address = $matches[0];
        // resolve parsed host name
        $presolved = gethostbyname($new_ip_address);
    }
    else
    {
        echo "Do stuff.\n";
    }
}
?>

Have you considered looking at it from the other direction? Detecting an IP address literal is much easier than parsing and validating a DNS name.
Parse the URL with parse_url
Split $url['host'] by '.'. If the result is an array of four integer elements, then it is probably an IP address.
Bonus points for checking that each integer is in the appropriate range (e.g., 1-255 for the first octet and 0-255 for the remaining octets).
In any case, use parse_url instead of trying to crack this with the regex hammer.
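As a sketch of that approach (the function name and the non-zero first-octet rule follow the suggestion above; the exact shape is my own, not from the original answer):

```php
<?php
// Rough check from the steps above: split the host on '.' and require
// exactly four in-range integers. Assumes the host part has already
// been extracted, e.g. with parse_url.
function is_ipv4_literal($host) {
    $octets = explode('.', $host);
    if (count($octets) !== 4) {
        return false;
    }
    foreach ($octets as $i => $octet) {
        if ($octet === '' || !ctype_digit($octet)) {
            return false;
        }
        $n = (int)$octet;
        // First octet 1-255, the rest 0-255, as suggested above
        if ($n > 255 || ($i === 0 && $n < 1)) {
            return false;
        }
    }
    return true;
}
```

For real code, `filter_var($host, FILTER_VALIDATE_IP)` does the same check (and handles IPv6) in one call.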

Try using http://php.net/filter_var
var_dump(filter_var('http://example.com', FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED));

To check whether the input is an IP, you can use PHP's ip2long (http://php.net/manual/en/function.ip2long.php).
See the example below, which returns an IP address or false.
<?php
function checkOrMakeIp($address) {
    $long = ip2long($address);
    if ($long != -1 && $long !== false) {
        // It is an IP, just return the address
        return $address;
    } elseif ($ip = gethostbynamel($address)) {
        // It is a hostname, but now we have the IP, return it
        return $ip[0];
    }
    return false;
}
?>
Note gethostbynamel: it returns FALSE if the hostname could not be resolved, otherwise an array of addresses. (http://www.php.net/manual/en/function.gethostbynamel.php)
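A variant sketch (my own, not from the answer) that uses filter_var instead of ip2long, sidestepping the historical ip2long quirk where 255.255.255.255 maps to -1 on 32-bit builds, and accepting IPv6 literals too:

```php
<?php
// Assumed helper name; mirrors checkOrMakeIp but uses FILTER_VALIDATE_IP.
function resolveToIp($address) {
    if (filter_var($address, FILTER_VALIDATE_IP) !== false) {
        // Already an IP literal (IPv4 or IPv6): return it unchanged
        return $address;
    }
    // gethostbynamel returns an array of IPv4 addresses, or false
    $ips = gethostbynamel($address);
    return $ips ? $ips[0] : false;
}
```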

Related

Can I normalise IP4/6 addresses to IP6 in PHP without string-hacking?

There are lots of posts here and elsewhere about how to handle IP4 and IP6 addresses in PHP.
I want to write code which will normalise all IP addresses, regardless of their representation, to IP6 format.
I can do that by piecing together various bits of advice and doing some string hacking, but has best practice moved on from that?
Are there built-in functions in modern PHP which will do this correctly for all cases, without any conditional logic in my application code?
No, you can't do this without any conditional logic. On the other hand, the necessary code isn't very difficult:
function normalise_ip($address, $force_ipv6 = false) {
    # Parse textual representation
    $binary = inet_pton($address);
    if ($binary === false) {
        return false;
    }
    # Convert back to a normalised string
    $normalised = inet_ntop($binary);
    # Add IPv4-Mapped IPv6 Address prefix if requested
    if ($force_ipv6 && strlen($binary) == 4) {
        $normalised = '::ffff:' . $normalised;
    }
    return $normalised;
}
You can test it like this:
$addresses = array(
    '192.000.002.001',
    '2001:0db8:0000:0000:0000:0000:0000:0001',
);
foreach ($addresses as $original) {
    $normalised = normalise_ip($original);
    $normalised_ipv6 = normalise_ip($original, true);
    echo "Original: $original; ";
    echo "Normalised: $normalised; ";
    echo "As IPv6: $normalised_ipv6\n";
}
If you force the result to IPv6 you get IPv4-Mapped IPv6 Addresses instead of IPv4 addresses.
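Concretely, the normalisation can be checked like this (the function is restated so the snippet stands alone):

```php
<?php
// Same normalise_ip as above, repeated here so this snippet is self-contained.
function normalise_ip($address, $force_ipv6 = false) {
    $binary = inet_pton($address);      // binary form, or false on bad input
    if ($binary === false) {
        return false;
    }
    $normalised = inet_ntop($binary);   // canonical textual form
    if ($force_ipv6 && strlen($binary) == 4) {
        $normalised = '::ffff:' . $normalised;  // IPv4-mapped IPv6 prefix
    }
    return $normalised;
}

echo normalise_ip('2001:0db8:0000:0000:0000:0000:0000:0001'), "\n"; // 2001:db8::1
echo normalise_ip('192.0.2.1', true), "\n";                         // ::ffff:192.0.2.1
```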

Display different content depending on Referrer

Hey, I am trying to display a different phone number for visitors to my website who arrive from my Google AdWords campaign.
The code below works without the else statement (so if I click through to the page from Google it displays a message, and if I visit the site directly it does not). When I add the else statement it outputs both numbers. Thank you
<?php
// The domain list.
$domains = array('googleadservices.com', 'google.com');
$url_info = parse_url($_SERVER['HTTP_REFERER']);
if (isset($url_info['host'])) {
    foreach ($domains as $domain) {
        if (substr($url_info['host'], -strlen($domain)) == $domain) {
            // GOOGLE NUMBER HERE
            echo ('1234');
        } else {
            // REGULAR NUMBER HERE
            echo ('12345');
        }
    }
}
?>
Your logic is slightly skewed; you're checking to see if the URL from parse_url matches the domains in your array; but you're running through the whole array each time. So you get both a match and a non-match, because google.com matches one entry but not the other.
I'd suggest making your domains array into an associative array:
$domains = array('googleadservices.com' => '1234',
                 'google.com'           => '12345');
Then you just need to check once:
if (isset($url_info['host'])) {
    if (isset($domains[$url_info['host']])) {
        echo $domains[$url_info['host']];
    }
}
I've not tested this, but it should be enough for you to see the logic.
(I've also removed the substr check - you may need to put that back in, to ensure that you're getting the exact string that you need to look for)
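If you do need the suffix matching back, here is a sketch combining the two ideas (the function name and default value are mine; the numbers are just the question's placeholders):

```php
<?php
// Return the number for the first domain whose suffix matches the
// referring host, or a default if none match.
function phoneForReferrer($host, array $domains, $default) {
    foreach ($domains as $domain => $number) {
        $suffix = '.' . $domain;  // leading dot so "evilgoogle.com" can't match
        if ($host === $domain || substr($host, -strlen($suffix)) === $suffix) {
            return $number;
        }
    }
    return $default;
}

$domains = array('googleadservices.com' => '1234', 'google.com' => '12345');
echo phoneForReferrer('www.google.com', $domains, 'default'); // prints "12345"
```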

PHP Auto-correcting URLs

I don't want to reinvent the wheel, but I couldn't find any library that does this perfectly.
In my script users can save URLs. When they give me a list like:
google.com
www.msn.com
http://bing.com/
and so on...
I want to be able to save them in the database in a "correct format".
What I do is check whether the protocol is there, and if it's not present I add it, then validate the URL against a RegExp.
For PHP's parse_url, any URL that contains a protocol is valid, so it didn't help a lot.
How are you guys doing this? Do you have some idea you would like to share with me?
Edit:
I want to filter out invalid URLs from user input (a list of URLs). More importantly, I want to try to auto-correct URLs that are invalid (e.g. missing the protocol). Once the user enters the list, it should be validated immediately (there's no time to open the URLs to check that they really exist).
It would be great to extract parts from the URL like parse_url does, but the problem with parse_url is that it doesn't work well with invalid URLs. I tried to parse the URL with it and, for parts that are missing (and required), add defaults (e.g. no protocol, add http). But parse_url for "google.com" won't return "google.com" as the hostname but as the path.
This looks like a really common problem to me, but I could not find an available solution on the internet (I found some libraries that will standardize a URL, but they won't fix a URL that is invalid).
Is there some "smart" solution to this, or should I stick with my current approach:
Find the first occurrence of :// and validate that the text before it is a valid protocol, adding a protocol if missing
Find the next occurrence of / and validate that the hostname is in a valid format
For good measure, validate the whole URL once more via RegExp
I just have the feeling I will reject some valid URLs with this, and for me it is better to have a false positive than a false negative.
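The first of those steps can be sketched like this (the regex and the function name are my own; it simply prepends a scheme when none is present so that parse_url fills in 'host' instead of 'path'):

```php
<?php
// Prepend http:// when no scheme is present, then let parse_url split it.
function coerceUrl($url) {
    // RFC 3986 scheme: a letter followed by letters/digits/+/-/.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $parts = parse_url($url);
    return ($parts !== false && isset($parts['host'])) ? $parts : false;
}
```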
I had the same problem with parse_url as the OP. This is my quick and dirty solution to auto-correct URLs (keep in mind that the code is in no way perfect and doesn't cover all cases):
Results:
http:/wwww.example.com/lorum.html => http://www.example.com/lorum.html
gopher:/ww.example.com => gopher://www.example.com
http:/www3.example.com/?q=asd&f=#asd => http://www3.example.com/?q=asd&f=#asd
asd://.example.com/folder/folder/ => http://example.com/folder/folder/
.example.com/ => http://example.com/
example.com => http://example.com
subdomain.example.com => http://subdomain.example.com
function url_parser($url) {
    // Multiple slashes (///) mess up parse_url; replace runs of 2+ with 2
    $url = preg_replace('/(\/{2,})/', '//', $url);
    $parse_url = parse_url($url);
    if (empty($parse_url["scheme"])) {
        $parse_url["scheme"] = "http";
    }
    if (empty($parse_url["host"]) && !empty($parse_url["path"])) {
        // Strip slash from the beginning of path
        $parse_url["host"] = ltrim($parse_url["path"], '\/');
        $parse_url["path"] = "";
    }
    $return_url = "";
    // Check if scheme is correct
    if (!in_array($parse_url["scheme"], array("http", "https", "gopher"))) {
        $return_url .= 'http' . '://';
    } else {
        $return_url .= $parse_url["scheme"] . '://';
    }
    // Check if the right amount of "www" is set
    $explode_host = explode(".", $parse_url["host"]);
    // Remove empty entries
    $explode_host = array_filter($explode_host);
    // And reassign indexes
    $explode_host = array_values($explode_host);
    // Contains subdomain
    if (count($explode_host) > 2) {
        // Check if the subdomain only contains the letter w (and is not some other subdomain)
        if (substr_count($explode_host[0], 'w') == strlen($explode_host[0])) {
            // Replace with "www" to avoid "ww" or "wwww", etc.
            $explode_host[0] = "www";
        }
    }
    $return_url .= implode(".", $explode_host);
    if (!empty($parse_url["port"])) {
        $return_url .= ":" . $parse_url["port"];
    }
    if (!empty($parse_url["path"])) {
        $return_url .= $parse_url["path"];
    }
    if (!empty($parse_url["query"])) {
        $return_url .= '?' . $parse_url["query"];
    }
    if (!empty($parse_url["fragment"])) {
        $return_url .= '#' . $parse_url["fragment"];
    }
    return $return_url;
}
echo url_parser('http:/wwww.example.com/lorum.html'); // http://www.example.com/lorum.html
echo url_parser('gopher:/ww.example.com'); // gopher://www.example.com
echo url_parser('http:/www3.example.com/?q=asd&f=#asd'); // http://www3.example.com/?q=asd&f=#asd
echo url_parser('asd://.example.com/folder/folder/'); // http://example.com/folder/folder/
echo url_parser('.example.com/'); // http://example.com/
echo url_parser('example.com'); // http://example.com
echo url_parser('subdomain.example.com'); // http://subdomain.example.com
It's not 100% foolproof, but it's a one-liner.
$URL = (((strpos($URL,'https://') === false) && (strpos($URL,'http://') === false))?'http://':'' ).$URL;
EDIT
There was apparently a problem with my initial version if the hostname contained "http".
Thanks, Trent

Most efficient way to check a URL

I'm trying to check if a user-submitted URL is valid; it goes directly to the database when the user hits submit.
So far, I have:
$string = $_POST['url'];
if (strpos($string, 'www.') && (strpos($string, '/')))
{
    echo 'Good';
}
The submitted page should be a page in a directory, not the main site, so http://www.address.com/page
How can I have it check for the second /, without mistaking it for the one in http://, and make sure the URL doesn't end at the .com?
Sample input:
Valid:
http://www.facebook.com/pageName
http://www.facebook.com/pageName/page.html
http://www.facebook.com/pageName/page.*
Invalid:
http://www.facebook.com
facebook.com/pageName
facebook.com
if (!parse_url('http://www.address.com/page', PHP_URL_PATH)) {
    echo 'no path found';
}
See parse_url reference.
See the parse_url() function. This will give you the "/page" part of the URL in a separate string, which you can then analyze as desired.
filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED)
More information here :
http://ca.php.net/filter_var
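Applied to the question's samples (on the assumption that FILTER_FLAG_PATH_REQUIRED is what distinguishes them), filter_var returns the URL string on success and false otherwise:

```php
<?php
// filter_var returns the URL string when valid, false when not.
var_dump(filter_var('http://www.facebook.com/pageName', FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED)); // the URL
var_dump(filter_var('http://www.facebook.com', FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED));          // false: no path
var_dump(filter_var('facebook.com/pageName', FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED));            // false: no scheme
```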
Maybe strrpos will help you. It locates the last occurrence of a string within a string.
To check the format of the URL you could use a regular expression:
preg_match [ http://php.net/manual/en/function.preg-match.php ] is a good start, but knowledge of regular expressions is needed to make it work.
Additionally, if you actually want to check that it's a valid URL, you could check the URL value to see if it actually resolves to a web page:
function check_404($url) {
    $return = @get_headers($url);
    if (strpos($return[0], ' 404 ') === false) {
        return true;
    } else {
        return false;
    }
}
Try using a regular expression to see that the URL has the correct structure. Here's more reading on this. You need to learn how PCRE works.
A simple example for what you want (disclaimer: not tested, incomplete).
function isValidUrl($url) {
    return preg_match('#http://[^/]+/.+#', $url);
}
From here: http://www.blog.highub.com/regular-expression/php-regex-regular-expression/php-regex-validating-a-url/
<?php
/**
 * Validate URL
 * Allows for port, path and query string validations
 * @param string $url string containing url user input
 * @return boolean Returns TRUE/FALSE
 */
function validateURL($url)
{
    $pattern = '/^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?$/';
    return preg_match($pattern, $url);
}
$result = validateURL('http://www.google.com');
print $result;
?>

clean the url in php

I am trying to make a user-submitted link box. I've been trying all day and can't seem to get it working.
The goal is to turn all of these into example.com... (i.e. remove all the stuff before the top-level domain)
Input is $url =
There are 4 types of URL:
www.example.com...
example.com...
http://www.example.com...
http://example.com...
Everything I make works on 1 or 2 types, but not all 4.
How can one do this?
You can use parse_url for that. For example:
function parse($url) {
    $parts = parse_url($url);
    if ($parts === false) {
        return false;
    }
    return isset($parts['scheme'])
        ? $parts['host']
        : substr($parts['path'], 0, strcspn($parts['path'], '/'));
}
This will leave the "www." part if it exists, but it's trivial to cut that out with e.g. str_replace. If the URL you give it is seriously malformed, it will return false.
Update (an improved solution):
I realized that the above would not work correctly if you try to trick it hard enough. So instead of whipping myself trying to compensate if it does not have a scheme, I realized that this would be better:
function parse($url) {
    $parts = parse_url($url);
    if ($parts === false) {
        return false;
    }
    if (!isset($parts['scheme'])) {
        $parts = parse_url('http://' . $url);
    }
    if ($parts === false) {
        return false;
    }
    return $parts['host'];
}
Your input can be
www.example.com
example.com
http://www.example.com
http://example.com
$url_arr = parse_url($url);
echo $url_arr['host'];
output is example.com (note: for the inputs without a scheme, parse_url puts the name in 'path' rather than 'host', so prepend http:// first)
There are a few steps you can take to get a clean URL.
Firstly, you need to make sure there is a protocol so that parse_url works correctly:
// Make sure it has a protocol
if (substr($url, 0, 7) != 'http://' && substr($url, 0, 8) != 'https://')
{
    $url = 'http://' . $url;
}
Now we run it through parse_url():
$segments = parse_url($url);
But this is where it gets complicated: domain names can have 1, 2, 3, 4, 5, 6, ... levels, so you cannot detect the registrable domain name from the URL alone. You need a precompiled list of TLDs to check against the last portion of the domain, so you can strip that off, leaving the website's domain.
There is a list available here: http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
But you would be better off parsing this list into MySQL and then selecting the row where the TLD matches the end of the domain string.
Then you order by length and limit to 1. If a match is found you can do something like:
$db_found_tld = 'co.uk';
$domain = 'a.b.c.domain.co.uk';
// Strip the TLD plus its leading dot from the end
$domain_name = substr($domain, 0, -(strlen($db_found_tld) + 1));
This leaves a.b.c.domain, so you have removed the TLD. Now the domain name can be extracted like so:
$parts = explode('.', $domain_name);
$base_domain = $parts[count($parts) - 1];
Now you have domain.
This seems very lengthy, but I hope it shows that it's not easy to get just the domain name without the TLD or subdomains.
