clean the url in php

I am trying to build a box where users can submit links. I've been trying all day and can't get it working.
The goal is to reduce all of these to example.com... (i.e. remove everything before the domain name)
Input is $url =
There are 4 types of URL:
www.example.com...
example.com...
http://www.example.com...
http://example.com...
Everything I write works on 1 or 2 types, but not on all 4.
How can I do this?

You can use parse_url for that. For example:
function parse($url) {
$parts = parse_url($url);
if ($parts === false) {
return false;
}
return isset($parts['scheme'])
? $parts['host']
: substr($parts['path'], 0, strcspn($parts['path'], '/'));
}
This will leave the "www." part if it is present, but it's trivial to cut that out with e.g. str_replace. If the URL you give it is seriously malformed, it will return false.
Update (an improved solution):
I realized that the above would not work correctly if you try hard enough to trick it. So instead of trying to compensate by hand when the URL has no scheme, it is better to add a scheme and parse again:
function parse($url) {
$parts = parse_url($url);
if ($parts === false) {
return false;
}
if (!isset($parts['scheme'])) {
$parts = parse_url('http://'.$url);
}
if ($parts === false) {
return false;
}
return $parts['host'];
}
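As a quick check, the sketch below runs the same idea over the four input shapes from the question and strips a leading "www." afterwards. The name clean_host is made up for this illustration, and the preg_replace step is just one possible way to drop "www."; neither is part of the answer above.

```php
<?php
// Hypothetical helper (name is an assumption): host without a leading "www."
function clean_host($url) {
    $parts = parse_url($url);
    if ($parts === false) {
        return false;
    }
    // Without a scheme, parse_url() puts "example.com" into 'path', so re-parse with one.
    if (!isset($parts['scheme'])) {
        $parts = parse_url('http://' . $url);
    }
    if ($parts === false || !isset($parts['host'])) {
        return false;
    }
    // Remove "www." only at the start of the host, unlike a blanket str_replace.
    return preg_replace('/^www\./i', '', $parts['host']);
}

foreach (['www.example.com', 'example.com', 'http://www.example.com', 'http://example.com'] as $input) {
    echo $input, ' => ', clean_host($input), "\n";
}
```

Anchoring the pattern with ^ keeps a host such as awww.example.com intact, which a plain str_replace('www.', '', ...) would not.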

Your input can be
www.example.com
example.com
http://www.example.com
http://example.com
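// parse_url() only fills in 'host' when the URL has a scheme, so add one if it is missing
if (!preg_match('#^https?://#i', $url)) $url = 'http://' . $url;
$url_arr = parse_url($url);
echo preg_replace('/^www\./i', '', $url_arr['host']);
The output is example.com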

There are a few steps you can take to get a clean URL.
First, you need to make sure there is a protocol so that parse_url works correctly:
//Make sure it has a protocol
if(substr($url,0,7) != 'http://' && substr($url,0,8) != 'https://')
{
$url = 'http://' . $url;
}
Now we run it through parse_url()
$segments = parse_url($url);
But this is where it gets complicated. A domain name can have any number of levels (1, 2, 3, 4, 5, 6, ...), so you cannot detect the registered domain from the URL alone. You need a precompiled list of TLDs to check against the end of the host; once the TLD is identified you can strip it, leaving the website's domain.
There is a list available here : http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
But you would be better off parsing this list into MySQL and then selecting the rows where the TLD matches the right-hand end of the domain string.
Then you order by length, longest first, and limit to 1. If a match is found, you can do something like:
$db_found_tld = 'co.uk';
$domain = 'a.b.c.domain.co.uk';
$domain_name = substr($domain, 0, -strlen($db_found_tld) - 1);
This would leave a.b.c.domain, so you have removed the TLD. Now the domain name can be extracted like so:
$parts = explode('.', $domain_name);
$base_domain = $parts[count($parts) - 1];
Now you have domain.
This seems very lengthy, but I hope you now see that it's not easy to get just the domain name without the TLD or subdomains.
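The lookup described above can be sketched without a database. This is only an illustration under stated assumptions: the $suffixes array is a tiny hand-picked stand-in for the real list, the function name base_domain is made up, and the wildcard and exception rules that the real list contains are ignored.

```php
<?php
// Tiny stand-in for the suffix list; the real list has thousands of entries.
$suffixes = ['com', 'uk', 'co.uk', 'fr'];

// Hypothetical helper: returns the label directly left of the longest matching suffix.
function base_domain($host, array $suffixes) {
    $host = strtolower($host);
    $best = '';
    foreach ($suffixes as $suffix) {
        $tail = '.' . $suffix;
        // The suffix must match the right-hand end of the host on a label boundary.
        if (substr($host, -strlen($tail)) === $tail && strlen($suffix) > strlen($best)) {
            $best = $suffix;
        }
    }
    if ($best === '') {
        return false; // no known suffix
    }
    // Cut ".suffix" off the end, then keep the last remaining label.
    $rest = substr($host, 0, -strlen($best) - 1);
    $labels = explode('.', $rest);
    return $labels[count($labels) - 1];
}

echo base_domain('a.b.c.domain.co.uk', $suffixes), "\n"; // domain
echo base_domain('www.example.com', $suffixes), "\n";    // example
```

Picking the longest match plays the same role as "order by length and limit to 1" in the SQL version: co.uk wins over uk.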

Related

Modify function getdomain, would need it without subdomain in PHP

I need help with my code.
I have a function which only replaces the www. with an empty string.
For example:
If I add the URL www.testek.com
the user will see testek.com
But if I add the URL s.dada.testek.com
the user will see s.dada.testek.com
So for the domain s.dada.testek.com I would like the end user to see only testek.com.
In other words, I want only the main domain without any subdomains.
Code:
function getdomain($url){
$parsed = parse_url($url);
return str_replace('www.','', strtolower($parsed['host']));
}
I saw a post but it won't work for me.
Thanks for the help!
Now I've changed the code to:
function getdomain($url){
$parsed = parse_url($url);
$bits = explode(".",$parsed["host"]);
$mainDomain = array_filter($bits, function ($i) use ($bits) {
return $i >= count($bits)-2;
}, array(
'www.rover.ebay.com' => 'ebay.com',
's.click.aliexpress.com' => 'aliexpress.com', );
return implode(".", $mainDomain);
}
Am I thinking the right way?
Because now the end user sees like this:
http://i.stack.imgur.com/JddKB.jpg

If you simply want the last 2 segments of the URL's host name, then you can do the following:
function getdomain($url){
$parsed = parse_url($url);
$bits = explode(".",$parsed["host"]);
$mainDomain = array_filter($bits, function ($i) use ($bits) {
return $i >= count($bits)-2;
}, ARRAY_FILTER_USE_KEY );
return implode(".", $mainDomain);
}
See how it works at https://eval.in/636860
Unfortunately, most of the time there's no "catch-all" solution and you have to hard-code a lot of things. E.g. the UK has .co.uk but France just .fr, so depending on that you may need the last 3 or even 4 segments.
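To make the .co.uk caveat concrete, here is a minimal variant that takes the number of segments as a parameter. The function name last_labels and the $labels parameter are assumptions for this sketch, and array_slice is swapped in for the array_filter call above.

```php
<?php
// Hypothetical variant of getdomain(): keep the last $labels dot-separated parts of the host.
function last_labels($url, $labels = 2) {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // no host found, e.g. the scheme was missing
    }
    $bits = explode('.', strtolower($host));
    // A negative offset makes array_slice take items from the end of the array.
    return implode('.', array_slice($bits, -$labels));
}

echo last_labels('http://s.dada.testek.com/page'), "\n"; // testek.com
echo last_labels('http://www.example.co.uk/'), "\n";     // co.uk (2 segments are not enough here)
echo last_labels('http://www.example.co.uk/', 3), "\n";  // example.co.uk
```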

I've fixed it like this:
function getdomain($url){
$parsed = parse_url($url);
$replace = array ("rover.", "www.", "s.click.");
return str_replace($replace,'', strtolower($parsed['host']));
}
I've created an array with the "subdomains" which I don't want to be shown.
And now it works ok.
apokryfos thanks for your support and for opening my mind :)

PHP Auto-correcting URLs

I don't want to reinvent the wheel, but I couldn't find any library that does this perfectly.
In my script users can save URLs. When they give me a list like:
google.com
www.msn.com
http://bing.com/
and so on...
I want to be able to save them in the database in the "correct format".
What I do now is check whether a protocol is present, add it if it is not, and then validate the URL against a RegExp.
For PHP's parse_url, any URL that contains a protocol is valid, so it didn't help much.
How are you doing this? Do you have any ideas you would like to share with me?
Edit:
I want to filter out invalid URLs from user input (a list of URLs) and, more importantly, try to auto-correct URLs that are invalid (e.g. ones that don't contain a protocol). Once the user enters the list, it should be validated immediately (there is no time to open the URLs to check that they really exist).
It would be great to extract parts from the URL, like parse_url does, but the problem with parse_url is that it doesn't work well with invalid URLs. I tried to parse the URL with it and add default values for required parts that are missing (e.g. no protocol, add http). But parse_url for "google.com" won't return "google.com" as the hostname but as the path.
This looks like a really common problem to me, but I could not find an available solution on the internet (I found some libraries that will standardize a URL, but they won't fix a URL if it is invalid).
Is there some "smart" solution to this, or should I stick with my current approach:
Find the first occurrence of ://, check that the text before it is a valid protocol, and add a protocol if it is missing
Find the next occurrence of / and check that the hostname is in a valid format
For good measure, validate the whole URL once more via RegExp
I just have a feeling I will reject some valid URLs with this, and for me it is better to have a false positive than a false negative.

I had the same problem with parse_url as the OP. This is my quick and dirty solution to auto-correct URLs (keep in mind that the code is in no way perfect and does not cover all cases):
Results:
http:/wwww.example.com/lorum.html => http://www.example.com/lorum.html
gopher:/ww.example.com => gopher://www.example.com
http:/www3.example.com/?q=asd&f=#asd => http://www3.example.com/?q=asd&f=#asd
asd://.example.com/folder/folder/ => http://example.com/folder/folder/
.example.com/ => http://example.com/
example.com => http://example.com
subdomain.example.com => http://subdomain.example.com
function url_parser($url) {
// multiple /// messes up parse_url, replace 2+ with 2
$url = preg_replace('/(\/{2,})/','//',$url);
$parse_url = parse_url($url);
if(empty($parse_url["scheme"])) {
$parse_url["scheme"] = "http";
}
if(empty($parse_url["host"]) && !empty($parse_url["path"])) {
// Strip slash from the beginning of path
$parse_url["host"] = ltrim($parse_url["path"], '\/');
$parse_url["path"] = "";
}
$return_url = "";
// Check if scheme is correct
if(!in_array($parse_url["scheme"], array("http", "https", "gopher"))) {
$return_url .= 'http'.'://';
} else {
$return_url .= $parse_url["scheme"].'://';
}
// Check if the right amount of "www" is set.
$explode_host = explode(".", $parse_url["host"]);
// Remove empty entries
$explode_host = array_filter($explode_host);
// And reassign indexes
$explode_host = array_values($explode_host);
// Contains subdomain
if(count($explode_host) > 2) {
// Check if subdomain only contains the letter w(then not any other subdomain).
if(substr_count($explode_host[0], 'w') == strlen($explode_host[0])) {
// Replace with "www" to avoid "ww" or "wwww", etc.
$explode_host[0] = "www";
}
}
$return_url .= implode(".",$explode_host);
if(!empty($parse_url["port"])) {
$return_url .= ":".$parse_url["port"];
}
if(!empty($parse_url["path"])) {
$return_url .= $parse_url["path"];
}
if(!empty($parse_url["query"])) {
$return_url .= '?'.$parse_url["query"];
}
if(!empty($parse_url["fragment"])) {
$return_url .= '#'.$parse_url["fragment"];
}
return $return_url;
}
echo url_parser('http:/wwww.example.com/lorum.html'); // http://www.example.com/lorum.html
echo url_parser('gopher:/ww.example.com'); // gopher://www.example.com
echo url_parser('http:/www3.example.com/?q=asd&f=#asd'); // http://www3.example.com/?q=asd&f=#asd
echo url_parser('asd://.example.com/folder/folder/'); // http://example.com/folder/folder/
echo url_parser('.example.com/'); // http://example.com/
echo url_parser('example.com'); // http://example.com
echo url_parser('subdomain.example.com'); // http://subdomain.example.com

It's not 100% foolproof, but it's a one-liner.
$URL = (((strpos($URL,'https://') === false) && (strpos($URL,'http://') === false))?'http://':'' ).$URL;
EDIT
There was apparently a problem with my initial version if the hostname contained http.
Thanks, Trent.
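One case the strpos check still misses is a scheme-less URL that contains "http://" further along, for example in a query string. A possible alternative, sketched here with a made-up function name (ensure_scheme), is to test only the start of the string:

```php
<?php
// Hypothetical helper: prepend "http://" unless the string already starts with http:// or https://.
function ensure_scheme($url) {
    $url = trim($url);
    // The ^ anchor checks only the beginning, so "http://" inside a query string is ignored.
    if (!preg_match('#^https?://#i', $url)) {
        $url = 'http://' . $url;
    }
    return $url;
}

echo ensure_scheme('google.com'), "\n";                          // http://google.com
echo ensure_scheme('https://www.msn.com'), "\n";                 // https://www.msn.com
echo ensure_scheme('example.com/?next=http://other.test'), "\n"; // http://example.com/?next=http://other.test
```

This only handles http and https; other schemes such as ftp:// would get "http://" put in front of them, so treat it as a starting point.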

Get subdomain if any

Is there any predefined method in PHP to get the subdomain from a URL, if there is one?
The URL pattern may be:
http://www.sd.domain.com
http://domain.com
http://sd.domain.com
http://domain.com
where sd stands for subdomain.
The method must return a different value for each case:
case 1 -> return sd
case 2 -> return false or empty
case 3 -> return sd
case 4 -> return false or empty
I found some good links:
PHP function to get the subdomain of a URL
Get subdomain from url?
but they do not specifically apply to my cases.
Any help would be much appreciated.
Thanks

Okay, here is a script I wrote :)
$url = $_SERVER['HTTP_HOST'];
$host = explode('.', $url);
if( !empty($host[0]) && $host[0] != 'www' && $host[0] != 'localhost' ){
$domain = $host[0];
}else{
$domain = 'home';
}

So, there are several possibilities...
First, regular expressions of course:
(http://)?(www\.)?([^\.]*?)\.?([^\.]+)\.([^\.]+)
The match for the third pair of parentheses will be your subdomain. Of course, if your URL uses https:// or www2 (seen it all...), the regex would break. So this is just a first draft to start working with.
My second idea is, just like yours, exploding the URL. I thought of something like this:
function getSubdomain($url) {
$parts = explode('.', str_replace('http://', '', $url));
if(count($parts) >= 3) {
return $parts[count($parts) - 3];
}
return null;
}
My idea behind this function was that if a URL is split by ., the subdomain will almost always be the third-to-last entry in the resulting array. The protocol has to be stripped first (see case 3). Of course, this can certainly be done more elegantly.
I hope this gives you some ideas.
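One way the explode idea could be tightened up is to let parse_url remove the protocol and to skip a leading "www". The sketch below uses a made-up name (subdomain_of) and assumes the registered domain always has exactly two labels, as in domain.com; it would misreport hosts under suffixes like .co.uk.

```php
<?php
// Hypothetical helper: returns the subdomain label(s), or false when there is none.
// Assumes a two-label registered domain such as domain.com (not domain.co.uk).
function subdomain_of($url) {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false;
    }
    $labels = explode('.', strtolower($host));
    // A leading "www" is not treated as a subdomain here.
    if ($labels[0] === 'www') {
        array_shift($labels);
    }
    // Drop the last two labels (the registered domain); what remains is the subdomain part.
    $sub = array_slice($labels, 0, -2);
    return $sub ? implode('.', $sub) : false;
}

var_dump(subdomain_of('http://www.sd.domain.com')); // string(2) "sd"
var_dump(subdomain_of('http://domain.com'));        // bool(false)
var_dump(subdomain_of('http://sd.domain.com'));     // string(2) "sd"
```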

Try this.
[update] We have a constant _SITE_ADDRESS defined, such as www.mysite.com; you could use a literal for this.
It works well in our system for what seems to be that exact purpose.
public static function getSubDomain()
{
if($_SERVER["SERVER_NAME"] == str_ireplace('http://','',_SITE_ADDRESS)) return ''; //base domain
$host = str_ireplace(array("www.", _SITE_ADDRESS), "", strtolower(trim($_SERVER["HTTP_HOST"])));
$sub = preg_replace('/\..*/', '', $host);
if($sub == $host) return ''; //this is likely an ip address
return $sub;
}
There is an external note on that function but no link, so apologies to any original developer whose code this is based on.

PHP router without regular expressions

I have been working on a fancy router/dispatcher class for weeks now, trying to decide how I wanted it. I got it perfect IMO, except that performance is not what I want from it. It uses a route map array = /forums/viewthread/:id/:page => 'forums/viewthread/(?\d+)' and loops through my map array with regex to get a match. I am trying to get something better for a high-traffic site; here is a start...
$uri = "forum/viewforum/id-522/page-3";
$parts = explode("/", $uri);
$controller = $parts['0'];
$method = $parts['1'];
if($parts['2'] != ''){
$idNumber = $parts['2'];
}
if($parts['3'] != ''){
$pageNumber = $parts['3'];
}
Where I need help is that sometimes an id and a page will not be present, sometimes only one or the other, and sometimes both. Obviously my code above does not cover that: it assumes array item 2 is always the id and item 3 is always the page. Could someone show me a practical way of matching the page and id to a variable only if they exist in the URI, without using regular expressions?
You can see what I have so far for my regular-expression version in this question: Is this a good way to match URI to class/method in PHP for MVC

This seems more extendable:
$parts = explode("/", $uri);
$parts_count=count($parts);
//set default values
$page_info=array('id'=>0,'page'=>0);
for($i=2;$i<$parts_count;$i++) {
if(strpos($parts[$i],'-')!==FALSE) {
list($info_type,$info_val)=explode('-',$parts[$i],2);
if(isset($page_info[$info_type])) {
$page_info[$info_type]=(int)$info_val;
}
}
}
Then just use the $page_info values. You can easily add other values this way, and more levels of '/'.
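For reference, the same loop can be wrapped into a function and tried on URIs where id or page is missing. The function name parse_route and the shape of the returned array are assumptions made for this sketch, not something from the question.

```php
<?php
// Hypothetical wrapper around the loop above; name and return shape are assumptions.
function parse_route($uri) {
    $parts = explode('/', trim($uri, '/'));
    $route = [
        'controller' => isset($parts[0]) ? $parts[0] : '',
        'method'     => isset($parts[1]) ? $parts[1] : '',
        'id'         => 0,
        'page'       => 0,
    ];
    // Segments after controller/method are "name-value" pairs in any order.
    for ($i = 2; $i < count($parts); $i++) {
        if (strpos($parts[$i], '-') === false) {
            continue;
        }
        list($name, $value) = explode('-', $parts[$i], 2);
        // Only the known numeric parameters are accepted.
        if ($name === 'id' || $name === 'page') {
            $route[$name] = (int) $value;
        }
    }
    return $route;
}

print_r(parse_route('forum/viewforum/id-522/page-3'));
print_r(parse_route('forum/viewforum/page-3'));
```

Because each segment carries its own name, the order of id and page no longer matters, and a missing one simply keeps its default of 0.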

if ( ! empty($parts['2']))
{
if (strpos($parts['2'], 'id-') !== FALSE)
{
$idNumber = str_replace('id-', '', $parts['2']);
}
elseif (strpos($parts['2'], 'page-') !== FALSE)
{
$pageNumber = str_replace('page-', '', $parts['2']);
}
}
And do the same for $parts['3'].

URL parse function

Given this variable:
$variable = foo.com/bar/foo
What function would trim $variable to foo.com ?
Edit: I would like the function to be able to trim anything on a URL that could possibly come after the domain name.
Thanks in advance,
John

Working for OP:
$host = parse_url($url, PHP_URL_HOST);
The version of PHP I have to work with doesn't accept two parameters (Zend Engine 1.3.0). Whatever. Here's the working code for me - you do have to have the full URL including the scheme (http://). If you can safely assume that the scheme is http:// (and not https:// or something else), you could just prepend that to get what you need.
Working for me:
$url = 'http://foo.com/bar/foo';
$parts = parse_url($url);
$host = $parts['host'];
echo "The host is $host\n";

I'm using http://www.google.com/asdf in my example.
If you're fine with getting the subdomain as well, you could split by "//" and take the element at index 1 to effectively remove the protocol and get www.google.com/asdf
You can then split by "/" and take the element at index 0.
That seems ugly. Just brainstorming here =)
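Spelled out, the two splits might look like the sketch below. The function name host_by_split is made up, and the code assumes "//" appears only right after the protocol.

```php
<?php
// Hypothetical helper illustrating the split idea; assumes "//" appears only right after the scheme.
function host_by_split($url) {
    // Split off the protocol: 'http://www.google.com/asdf' -> ['http:', 'www.google.com/asdf']
    $pieces = explode('//', $url, 2);
    // If there was no "//", the whole string is already scheme-less.
    $rest = count($pieces) === 2 ? $pieces[1] : $pieces[0];
    // Everything before the first "/" is the host (a port, if present, stays attached).
    $segments = explode('/', $rest, 2);
    return $segments[0];
}

echo host_by_split('http://www.google.com/asdf'), "\n"; // www.google.com
echo host_by_split('foo.com/bar/foo'), "\n";            // foo.com
```

Unlike parse_url, this also handles input without a scheme, such as the foo.com/bar/foo value from the question.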

Try this:
function getDomain($url)
{
if(filter_var($url, FILTER_VALIDATE_URL) === FALSE) // FILTER_FLAG_HOST_REQUIRED was removed in PHP 8; FILTER_VALIDATE_URL already implies it
{
return false;
}
/*** get the url parts ***/
$parts = parse_url($url);
/*** return the host domain ***/
return $parts['scheme'].'://'.$parts['host'];
}
$variable = 'http://foo.com/bar/foo'; // FILTER_VALIDATE_URL rejects input without a scheme
echo getDomain($variable); // http://foo.com

You can use PHP's parse_url function and then access the value of the key "host" to get the hostname.
