Regex for Domains having three dots ex:- "gov.ac.in" - php

We have a list of URLs in this format (http://www.xyz.gov.ac.in). Not all of them look like this; some of them have normal domains. I am confused about how to get the domain name from a three-dotted URL. The code we have works fine for two-dotted domain names.
Here is the code we have:
function get_domain($url)
{
$pieces = parse_url($url);
$domain = isset($pieces['host']) ? $pieces['host'] : '';
if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
return $regs['domain'];
}
return false;
}
echo get_domain($url) ;
How can we modify the above code to accommodate for 3 dotted domains as well as the other types?
The echo results should be in this format xyz.gov.ac.in

Basically, you can't. At least not without a lookup table that has all "TLDs".
For example, in my country (The Netherlands) we have .nl and .co.nl. But www.gov.nl is a normal website (I'm trying to illustrate that you can't automatically say that gov. isn't a domain). And www.edu.nl doesn't exist.
Any standard regex that would try to parse them would tell you that the domain is www.gov.nl, while the domain is actually gov.nl. Same for edu.nl.
The only way you can accomplish what you want is by getting a list of all TLDs (and sub-TLDs) and using that to parse them.
I believe that Firefox and Chrome have such a list implemented (for coloring the domain name in the URL) and constantly keep it up-to-date. Maybe look in those sources?
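The lookup-table idea can be sketched roughly as follows. This is only a minimal illustration: the $suffixes array is a hypothetical, hand-written table that is nowhere near complete, and registrable_domain() is a made-up helper name. Treating gov.ac.in as a suffix is an assumption taken from the question, not from any official list.

```php
<?php
// Hypothetical, deliberately tiny suffix table; a real one would be loaded
// from a maintained list of public suffixes. "gov.ac.in" is included only
// because the question treats it as a suffix.
$suffixes = array('com', 'nl', 'in', 'ac.in', 'gov.ac.in', 'uk', 'co.uk');

// Hypothetical helper: returns "<label>.<suffix>" for the longest suffix
// found in the table, or false when no suffix matches.
function registrable_domain($url, array $suffixes)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }
    $labels = explode('.', strtolower($host));
    $count = count($labels);
    // $i is the index where the candidate suffix starts. Starting at 1
    // tries the longest suffix first and always leaves one label for the name.
    for ($i = 1; $i < $count; $i++) {
        $suffix = implode('.', array_slice($labels, $i));
        if (in_array($suffix, $suffixes, true)) {
            return $labels[$i - 1] . '.' . $suffix;
        }
    }
    return false;
}

echo registrable_domain('http://www.xyz.gov.ac.in', $suffixes), "\n";
echo registrable_domain('http://www.gov.nl', $suffixes), "\n";
```

With this table the first call gives xyz.gov.ac.in and the second gives gov.nl, which a pure regex cannot tell apart from a host that merely has the same number of dots.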

Try this:
/(^[\w-]+\.)(?P<domain>([\w-]+\.)+(\w+))/i
Hope this will help..

You should be able to use this Regex instead
/(?P<domain>([a-z0-9][a-z0-9\-]{1,63}\.)+[a-z\.]{2,6})$/i

Related

URL availability from MySQL Database using Php [duplicate]

This question already has answers here:
How do I apply URL normalization rules in PHP?
(1 answer)
Extract hostname name from string
(29 answers)
Closed 8 years ago.
I am trying to create a registration system that allows users to submit their website URLs. Now when a user enters a URL, it checks against the database to see if it already exists, and rejects it if it does.
However, my problem is this:
If http://www.example.com is in the database and I enter http://example.com, this counts as a different URL as far as the check is concerned, and it allows the submission.
Is there a proper way to handle this, apart from retrieving all records, removing the www if present, and then comparing? (Which is a terribly inefficient way to do so!)
Note : Adding Laravel tag in case it has any helper functions for this (I am using a laravel-4 installation).
EDIT : This is my current logic for the check :
$exists_url = DB::table("user_urls")
->where('site_url', 'like', $siteurl)
->get();
if($exists_url)
{
return Redirect::to('submiturl')->withErrors('Site already exists!');
}
EDIT 2 : One way is to take the given URL http://www.example.com, and then search the database for http://www.example.com, www.example.com, http://example.com and example.com. However I'm trying to find a more efficient way to do this!
I think that before you implement a solution you should flesh out your policy more thoroughly in the abstract. There are many parts of a URL that may or may not be equivalent. Do you want to treat protocols as equivalent (https://foo.com vs http://foo.com)? Some subdomains might be aliases, some might not: http://www.foo.com vs http://foo.com, or http://site1.foo.com vs http://foo.com. What about the path of the URL: http://foo.com vs http://foo.com/index.php? I wouldn't spend time writing a comparison function until you've completely thought through your policy. Good luck!
UPDATE:
Something like this perhaps:
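$ignore_subdomains = array('www','web','site');
// Work on the host only, so a leading "http://" does not end up in the first label
$host = parse_url($siteurl, PHP_URL_HOST) ?: $siteurl;
$domain_parts = explode('.', strtolower($host));
$subdomain = array_shift($domain_parts);
$siteurl = (in_array($subdomain,$ignore_subdomains)) ? implode('.',$domain_parts) : $host;
//now run your DB comparison query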
You can check before sending data to database using PHP. Following is a small example. Obviously you can make it more advanced as per your liking.
<?php
$test = "http://www.google.com";
$testa = "http://google.com";
if (str_replace("www.","",str_replace("http://","",$testa)) == str_replace("www.","",str_replace("http://","",$test))) {
echo "same";
}
else {
echo "different";
}
?>
Following is MySQL Replace example. In this example 'url' is database field.
SELECT * FROM `urls` WHERE REPLACE(REPLACE(url, "http://", ""), "www.", "") = "google.com"
Once again this is very basic just for your better understanding.
In case you need any further assistance kindly do let me know
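To avoid rewriting every row at query time, one option is to compute a normalized key once, when the URL is submitted, and compare on that. The sketch below is only one possible policy (it ignores the scheme, a leading "www.", the case of the host, and a trailing slash); url_key() is a hypothetical helper name, not a Laravel or PHP built-in.

```php
<?php
// Hypothetical helper: reduce a submitted URL to a comparison key.
// The policy is an assumption; adjust it to whatever you decide counts
// as "the same site".
function url_key($url)
{
    // parse_url() only recognises the host when a scheme is present,
    // so add one if the user typed a bare host name.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return false;
    }
    $host = strtolower($parts['host']);
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4); // drop the leading "www."
    }
    $path = isset($parts['path']) ? rtrim($parts['path'], '/') : '';
    return $host . $path;
}

var_dump(url_key('http://www.example.com') === url_key('example.com'));
```

If the key is stored in its own indexed column (say a hypothetical site_key), the duplicate check becomes a plain equality lookup on that column instead of a scan with REPLACE().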

Reduce link (URL) size

Is it possible to reduce the size of a link (in text form) by PHP or JS?
E.g. I might have links like these:
http://www.example.com/index.html <- Redirects to the root
http://www.example.com/folder1/page.html?start=true <- Redirects to page.html
http://www.example.com/folder1/page.html?start=false <- Redirects to page.html?start=false
The purpose is to find out whether the link can be shortened and still point to the same location. In these examples the first two links can be reduced, because the first points to the root, and the second has parameters that can be omitted.
The third link is then the case, where the parameters can't be omitted, meaning that it can't be reduced further than to remove the http://.
So the above links would be reduced like this:
Before: http://www.example.com/index.html
After: www.example.com
Before: http://www.example.com/folder1/page.html?start=true
After: www.example.com/folder1/page.html
Before: http://www.example.com/folder1/page.html?start=false
After: www.example.com/folder1/page.html?start=false
Is this possible by PHP or JS?
Note:
www.example.com is not a domain I own or have access to besides through the URL. The links are potentially unknown, and I'm looking for something like an automatic link shortener that can work by getting the URL and nothing else.
Actually I was thinking of something like a linkchecker that could check if the link works before and after the automatic trim, and if it doesn't then the check will be done again at a less trimmed version of the link. But that seemed like overkill...
Since you want to do this automatically, and you don't know how the parameters change the behaviour, you will have to do this by trial and error: try to remove parts from a URL, and see if the server responds with a different page.
In the simplest case this could work somehow like this:
<?php
$originalUrl = "http://stackoverflow.com/questions/14135342/reduce-link-url-size";
$originalContent = file_get_contents($originalUrl);
$trimmedUrl = $originalUrl;
while($trimmedUrl) {
$trialUrl = dirname($trimmedUrl);
$trialContent = file_get_contents($trialUrl);
if ($trialContent == $originalContent) {
$trimmedUrl = $trialUrl;
} else {
break;
}
}
echo "Shortest equivalent URL: " . $trimmedUrl;
// output: Shortest equivalent URL: http://stackoverflow.com/questions/14135342
?>
For your usage scenario, your code would be a bit more complicated, as you would have to test for each parameter in turn to see if it is necessary. For a starting point, see the parse_url() and parse_str() functions.
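A rough sketch of the per-parameter test could look like this. It is an illustration under assumptions, not a finished tool: drop_unneeded_params() is a hypothetical name, and $fetch is a hypothetical callable that returns the page body for a URL (in real use it might wrap file_get_contents()). Any #fragment in the URL is dropped.

```php
<?php
// Hypothetical helper: remove each query parameter in turn and keep it
// removed when the fetched content does not change.
function drop_unneeded_params($url, $fetch)
{
    $parts = parse_url($url);
    if (!isset($parts['query'])) {
        return $url; // nothing to test
    }
    parse_str($parts['query'], $params);
    $base = substr($url, 0, strpos($url, '?'));
    $original = call_user_func($fetch, $url);

    foreach (array_keys($params) as $name) {
        $trial = $params;
        unset($trial[$name]);
        $trialUrl = $trial ? $base . '?' . http_build_query($trial) : $base;
        if (call_user_func($fetch, $trialUrl) === $original) {
            $params = $trial; // the parameter made no difference
        }
    }
    return $params ? $base . '?' . http_build_query($params) : $base;
}
```

Passing the fetcher in as a callable keeps the network access in one place, which also makes it possible to try the logic against a fake server before pointing it at real URLs.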
A word of caution: this code is very slow, as it will perform lots of requests for every URL you want to shorten. Also, it will likely fail to shorten many URLs because the server might include stuff like timestamps in the response. This makes the problem very hard, and that's the reason why companies like Google have many engineers who think about stuff like this :).
Yea, that's possible:
JS:
var url = 'http://www.example.com/folder1/page.html?start=true';
url = url.replace('http://','').replace('?start=true','').replace('/index.html','');
php:
$url = 'http://www.example.com/folder1/page.html?start=true';
$url = str_replace(array('http://', '?start=true', '/index.html'), "", $url);
(Each item in the array() will be replaced with "")
Here is a JS for you.
function trimURL(url, trimToRoot, trimParam){
var myRegexp = /(http:\/\/|https:\/\/)(.*)/g;
var match = myRegexp.exec(url);
url = match[2];
//alert(url); // www.google.com
if(trimParam===true){
url = url.split('?')[0];
}
if(trimToRoot === true){
url = url.split('/')[0];
}
return url
}
alert(trimURL('https://www.google.com/one/two.php?f=1'));
alert(trimURL('https://www.google.com/one/two.php?f=1', true));
alert(trimURL('https://www.google.com/one/two.php?f=1', false, true));
Fiddle: http://jsfiddle.net/5aRpQ/

Extracting part of a domain name

I have a url, members.exampledomain.com, and I would like to display only exampledomain onto my page.
For example http://members.exampledomain.com's index page has something like
<img src="members/images/logoexampledomain.png" />
Try using:
print_r($_SERVER);
and see if there is something you like there :)
Otherwise, if the URL always has that shape, you can use $pieces = explode('.', $url); and $pieces[1] will contain exampledomain
If that doesn't work for you, I have used JavaScript before to retrieve information out of URLs on pages where PHP would not run. I believe I retrieved it using window.location and had a function figure out which characters to look at, so it would know where to split the URL or how to pull out the information you want.
If you wish to do this with the domain as suggested above, the following code should work for you.
$full_domain = 'members.example.com';
$parts = explode('.', $full_domain);
$parts = array_slice($parts, -2, 2); // array_slice() returns a new array, so keep the result
$domain = implode('.', $parts);
echo $domain;
This is a basic string parser for the domain. Things get much more complex when you are using multiple domains on the same script (aka Virtual hosting), and those domains have varying extensions such as .com.au .
If you wish to get the 'member' portion of the domain, you can add the following code after the above code:
$member = rtrim(str_replace($domain, '', $full_domain), '.');
echo $member;

PHP - allow domains, not subdomains

I would appreciate any help that can be provided with this matter.
I am creating a registration form. One field is for the user's domain, which I will verify is valid with FILTER_VALIDATE_URL and that it exists with dns_check_record.
However a problem I'm having is that using these two methods will also allow subdomains to be submitted to the form which I don't want.
Does anyone know a way to allow domains but not subdomains?
I've tested the following function, from http://syntax.cwarn23.net/PHP/Strip_URL_to_Domain:
function domain($domainb)
{
$bits = explode('/', $domainb);
if ($bits[0]=='http:' || $bits[0]=='https:')
{
$domainb= $bits[2];
} else {
$domainb= $bits[0];
}
unset($bits);
$bits = explode('.', $domainb);
$idz=count($bits);
$idz-=3;
if (strlen($bits[($idz+2)])==2) {
$url=$bits[$idz].'.'.$bits[($idz+1)].'.'.$bits[($idz+2)];
} else if (strlen($bits[($idz+2)])==0) {
$url=$bits[($idz)].'.'.$bits[($idz+1)];
} else {
$url=$bits[($idz+1)].'.'.$bits[($idz+2)];
}
return $url;
}
However this isn't perfect as any domains such as www.domain.uk.com will appear as uk.com (I know not a common domain extension).
Does anyone know a method better than the above function?
As pointed out by Micheal Mior, you have to check for .co.uk, .com.br and many others.
Some browser vendors maintain a list of such suffixes that are not TLDs but effectively behave like TLDs: http://publicsuffix.org/. The list is quite large.
There is a library here that uses this effective TLD list to implement the function you are looking for (downloads are here). (Found via https://wiki.mozilla.org/Gecko:Effective_TLD_Service.)
Combine them.
dns_check_record will fail on '.co.uk', so you can split your string on the dots, check the domain you get when you combine the last two parts, and if that fails, use a third part too, if any.
You will do a double check for invalid domains, but I assume that won't be an issue.
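The "combine them" idea might be sketched like this. It relies on the assumption stated above that a bare suffix such as co.uk fails the DNS check. shortest_resolving_domain() is a hypothetical name, and $exists is a hypothetical callable that returns true when a name resolves (in real use it could wrap checkdnsrr()).

```php
<?php
// Hypothetical helper: find the shortest right-hand part of $host that
// passes the DNS check, starting with the last two labels.
function shortest_resolving_domain($host, $exists)
{
    $labels = explode('.', strtolower($host));
    $count = count($labels);
    // Add one more label on each failed check.
    for ($take = 2; $take <= $count; $take++) {
        $candidate = implode('.', array_slice($labels, -$take));
        if (call_user_func($exists, $candidate)) {
            return $candidate;
        }
    }
    return false; // nothing resolved
}
```

A submitted host could then be accepted only when the function returns the host itself; anything longer than the returned value would be a subdomain.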
First, you could use parse_url() to get only the host name: http://www.stackoverflow.com -> $url['host'] = 'www.stackoverflow.com'
Second, you could count the number of dots in the host name: explode() --> count(), or substr_count().
If the host has more than one dot, it could contain a subdomain.
Now you could use the solution mentioned by GolezTrol or arnaud576875.

How do I detect subdomain and filter it?

I have a domain search function. The search box accepts any kind of domain name. What I am looking for is a way to filter subdomains out of the search, or else trim the subdomain and keep only the main domain.
For example, if a user enters mail.yahoo.com, it should be converted to yahoo.com, or it can be omitted from the search.
Here's a more concise way to grab the domain and a likely subdomain from a URL.
function find_subdomain($url) {
$parts = parse_url($url);
$domain_parts = explode('.', $parts['host']);
while(count($domain_parts) > 4)
array_shift($domain_parts);
return join('.', $domain_parts);
}
Keep in mind that not everything that looks like a subdomain is really a subdomain. Some countries have their own country-specific domains that everyone uses, like .co.uk and .com.au. You can not rely on the number of dots in the URL to tell you what is and is not a subdomain. In fact, you might need the opposite approach - first remove the top-level domain, then see what's left. Unfortunately then you're left with the second-level domain problem.
Can you tell us more about what exactly you are trying to accomplish? Why are you trying to detect subdomains? You mentioned a search box. What is being searched?
Edit: I have updated the function to keep up to four of the right-most parts of the domain. Given "http://one.two.three.four.five.six.com" it will return 'four.five.six.com'
I customized a utility function that I'm using. It's close to perfect (or at least as close as you can get without hard-coding the whole list of possible domain extensions).
Here's the catch: it assumes that the main domain contains at least 4 characters, i.e. for sub.mail.com it returns mail.com, but for sub.aol.com it returns sub.aol.com
function get_main_domain($host='') {
if(empty($host))$host=$_SERVER['HTTP_HOST'];
$domain_parts = explode('.',$host);
$count=count($domain_parts);
if($count<=2)return $host;
$permit=0;
for($i=$count-1;$i>=0;$i--){
$permit++;
if(strlen($domain_parts[$i])>3)break;
}
while(count($domain_parts) >$permit)array_shift($domain_parts);
return join('.', $domain_parts);
}
Well, that doesn't work for every domain if you forget to list it in the array...
Here is my solution, but I need to compress it to a few lines. Is that possible?
function subdomain($domainb)
{
    // Strip the scheme and path, keeping only the host
    $bits = explode('/', $domainb);
    if ($bits[0] == 'http:' || $bits[0] == 'https:') {
        $domainb = $bits[2];
    } else {
        $domainb = $bits[0];
    }
    // Keep at most the four right-most labels
    $part = array_slice(explode('.', $domainb), -4);
    // Drop the first label when the second one is longer than four characters
    if (isset($part[1]) && strlen($part[1]) > 4) {
        array_shift($part);
    }
    return implode('.', $part);
}
