I have this code right here:
// get host name from URL
preg_match('#^(?:http://)?([^/]+)#i',
"http://www.joomla.subdomain.php.net/index.html", $matches);
$host = $matches[1];
// get last two segments of host name
preg_match('/[^.]+\.[^.]+$/', $host, $matches);
echo "domain name is: {$matches[0]}\n";
The output will be php.net
I need just php without .net
Although regexes are fine here, I'd recommend parse_url
$host = parse_url('http://www.joomla.subdomain.php.net/index.html', PHP_URL_HOST);
$domains = explode('.', $host);
echo $domains[count($domains)-2];
This will work for TLDs like .com, .org, .net, etc., but not for .co.uk or .com.mx. You'd need some more logic (most likely an array of TLDs) to parse those out.
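That extra logic could look something like the sketch below. It assumes a small hand-maintained list of two-part suffixes; the function name `second_level_label` and the contents of `$multiPartTlds` are made up for illustration, and a real list would be much longer.

```php
<?php
// Hypothetical helper: returns the label just left of the suffix.
// $multiPartTlds is an illustrative, incomplete list, not an authoritative one.
function second_level_label(string $url): ?string
{
    $multiPartTlds = ['co.uk', 'com.mx', 'co.in', 'org.in'];

    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // no host component (e.g. the URL had no scheme)
    }

    $labels = explode('.', strtolower($host));
    $count = count($labels);
    if ($count < 2) {
        return $labels[0]; // e.g. "localhost"
    }

    // If the last two labels form a known two-part suffix, skip both.
    $lastTwo = $labels[$count - 2] . '.' . $labels[$count - 1];
    if ($count >= 3 && in_array($lastTwo, $multiPartTlds, true)) {
        return $labels[$count - 3];
    }
    return $labels[$count - 2];
}

echo second_level_label('http://www.joomla.subdomain.php.net/index.html'), "\n"; // php
echo second_level_label('http://www.example.co.uk/'), "\n";                      // example
```

Any suffix missing from the list falls back to the plain "second-to-last label" behaviour.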
Group the first part of your 2nd regex into /([^.]+)\.[^.]+$/ and $matches[1] will be php
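As a runnable check of that suggestion (the host string is taken from the question; nothing else is assumed):

```php
<?php
$host = 'www.joomla.subdomain.php.net';

// The capture group holds the label before the final ".tld".
if (preg_match('/([^.]+)\.[^.]+$/', $host, $matches)) {
    echo "domain name is: {$matches[1]}\n"; // domain name is: php
}
```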
Late answer and it doesn't work with subdomains, but it does work with any tld (co.uk, com.de, etc):
$domain = "somesite.co.uk";
$domain_solo = explode(".", $domain)[0];
print($domain_solo);
It's really easy:
function get_tld($domain) {
    $domain = str_replace("http://", "", $domain); // remove http://
    $domain = str_replace("www.", "", $domain);    // remove www. (including the dot)
    $nd = explode(".", $domain);
    $domain_name = $nd[0];                         // first label
    $tld = str_replace($domain_name . ".", "", $domain);
    return $tld;
}
To get the domain name instead, simply return $domain_name. This only works when the host has no subdomain; if a subdomain is present, you will get the subdomain name.
I'm working with some code used to try and find all the website URLs within a block of text. Right now we've already got checks that work fine for URLs formatted such as http://www.google.com or www.google.com but we're trying to find a regex that can locate a URL in a format such as just google.com
Right now our regex is set to search for every domain that we could find registered which is around 1400 in total, so it looks like this:
/(\S+\.(COM|NET|ORG|CA|EDU|UK|AU|FR|PR)\S+)/i
Except with ALL 1400 domains to check in the group (the full thing is around 8400 characters long). Naturally it's running quite slowly. We've already had the idea to simply check for the 10 or so most commonly used domains, but I wanted to check here first to see if there is a more efficient way to detect this URL format than listing every single TLD.
You could use a double pass search.
Search for every url-like string, e.g.:
((http|https):\/\/)?([\w-]+\.)+[\S]{2,5}
On every result do some non-regex checks, like, is the length enough, is the text after the last dot part of your tld list, etc.
function isUrl($urlMatch) {
    $tldList = ['com', 'net'];
    $urlParts = explode(".", $urlMatch);
    $lastPart = end($urlParts);
    return in_array($lastPart, $tldList);
}
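Putting the two passes together might look like the sketch below. The function name `find_urls` and the two-entry TLD list are placeholders, and the candidate pattern is a tightened variant of the one above (letters only in the last label, so trailing punctuation is not swallowed), not the exact original.

```php
<?php
// Second pass: keep a candidate only if its last label is a known TLD.
function isUrl(string $urlMatch): bool
{
    $tldList = ['com', 'net']; // illustrative subset of the real list
    $urlParts = explode('.', strtolower($urlMatch));
    return in_array(end($urlParts), $tldList, true);
}

// First pass: collect anything URL-shaped, then filter with isUrl().
function find_urls(string $text): array
{
    $pattern = '/(?:https?:\/\/)?(?:[\w-]+\.)+[a-z]{2,}/i';
    preg_match_all($pattern, $text, $matches);
    return array_values(array_filter($matches[0], 'isUrl'));
}

print_r(find_urls('Visit google.com or www.example.net today, not foo.bar.'));
// keeps google.com and www.example.net, drops foo.bar
```

The TLD lookup is a plain array search here; with around 1400 entries, flipping the list into array keys and using isset() would likely be cheaper.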
Example
function get_host($url) {
    $host = parse_url($url, PHP_URL_HOST);
    $names = explode(".", $host);
    if (count($names) == 1) {
        return $names[0];
    }
    $names = array_reverse($names);
    return $names[1] . '.' . $names[0];
}
Usage
echo get_host('https://google.com'); // google.com
echo "\n";
echo get_host('https://www.google.com'); // google.com
echo "\n";
echo get_host('https://sub1.sub2.google.com'); // google.com
echo "\n";
echo get_host('http://localhost'); // localhost
I want to extract only the domain name from a given URL, for any kind of TLD. The TLD can be any number of characters long, or may consist of two parts.
<?php $domain_name = 'http://www.subdomain.domain.co.in' ;
$ParsedURL = parse_url($domain_name);
$domain_name = preg_replace("/^([a-zA-Z0-9].*\.)?([a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z.]{2,})$/", '$2', $ParsedURL['host']);
$domain_name = current(explode('.', $domain_name));
print_r($domain_name);
The above solution only works for a limited set of URLs; it fails when the TLD parts are three or more characters long, e.g. org.in.
I also get different results when there is no subdomain in the URL.
Kindly help me out, as I only want "domain" from the given URL.
Try this regex /^\.{1}.*\.[a-zA-Z0-9]{2,3}(\.[a-zA-Z0-9]{2,3})?$/ the only thing would be to strip the first '.' from the matched string: <?php substr($domain_name, 1);?>
Imagine you have these cases:
$d='http://www.example.com/';
$d1='http://example.com/';
$d2='http://www.example.com';
$d3='www.example.com/';
$d4='http://www.example.com/';
$d5='http://www.example.com/blabla/blabla.php';
I need to get only example.com and nothing else.
I've tried using parse_url to no avail.
For example, parse_url($d3, PHP_URL_HOST); returns nothing, because $d3 has no scheme.
Can any of you provide a regex to match this?
Thank you very much in advance!
There is no path_url function, but you can use the parse_url function to get the host (domain name) out of a URL string:
if (!preg_match('#^https?://#', $str))
{
    // parse_url() needs a scheme to recognise the host
    $str = 'http://' . $str;
}
$domain = parse_url($str, PHP_URL_HOST);
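Applied to the sample inputs from the question, the same idea could be wrapped up as below. The helper name `bare_domain` is invented here, and stripping a leading `www.` is an extra step that the snippet above does not perform.

```php
<?php
// Hypothetical helper: host of a URL, with or without scheme, minus "www.".
function bare_domain(string $str): ?string
{
    // parse_url() only fills in the host when a scheme is present.
    if (!preg_match('#^https?://#i', $str)) {
        $str = 'http://' . $str;
    }
    $host = parse_url($str, PHP_URL_HOST);
    if (!is_string($host)) {
        return null;
    }
    // Drop a leading "www." only; other subdomains are left alone.
    return preg_replace('/^www\./i', '', $host);
}

$samples = [
    'http://www.example.com/',
    'http://example.com/',
    'http://www.example.com',
    'www.example.com/',
    'http://www.example.com/blabla/blabla.php',
];
foreach ($samples as $sample) {
    echo bare_domain($sample), "\n"; // each line prints example.com
}
```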
I'm trying to get a user's ID from a string such as:
http://www.abcxyz.com/123456789/
The result should be 123456789, i.e. everything up to the ID is stripped and the trailing / is removed as well. I had a look around on the net, and there are many solutions, but none that handles both the start and the end.
Thanks :)
Update 1
The link can take two forms: mod_rewrite as above and also "http://www.abcxyz.com/profile?user_id=123456789"
I would use parse_url() to cleanly extract the path component from the URL:
$path = parse_url("http://www.example.com/123456789/", PHP_URL_PATH);
and then split the path into its elements using explode():
$path = trim($path, "/"); // Remove starting and trailing slashes
$path_exploded = explode("/", $path);
and then output the first component of the path:
echo $path_exploded[0]; // Will output 123456789
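This method will work in edge cases like
http://www.example.com/123456789?test
http://www.example.com//123456789
and even
/123456789/abcdef
Note that a URL without a scheme, such as www.example.com/123456789/abcdef, is parsed entirely as a path, so its first component would be www.example.com. Prepend http:// before parsing in that case.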
$string = 'http://www.abcxyz.com/123456789/';
$parts = array_filter(explode('/', $string));
$id = array_pop($parts);
If the ID always is the last member of the URL
$url="http://www.abcxyz.com/123456789/";
$id=preg_replace(",.*/([0-9]+)/$,","\\1",$url);
echo $id;
If there are no other numbers in the URL, you can also do
echo filter_var('http://www.abcxyz.com/123456789/', FILTER_SANITIZE_NUMBER_INT);
to strip out everything that is not a digit (plus and minus signs are kept as well).
That might be somewhat quicker than using the parse_url+parse_str combination.
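For comparison, the parse_url + parse_str combination mentioned above might be sketched as follows, covering both URL forms from the question. The function name `extract_user_id` is made up, and treating the first path segment as the ID is an assumption based on the examples given.

```php
<?php
// Hypothetical helper: returns the numeric ID as a string, or null if absent.
function extract_user_id(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false) {
        return null; // seriously malformed URL
    }

    // Form 2: http://www.abcxyz.com/profile?user_id=123456789
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
        $candidate = $query['user_id'] ?? null;
        if (is_string($candidate) && preg_match('/^\d+$/', $candidate)) {
            return $candidate;
        }
    }

    // Form 1: http://www.abcxyz.com/123456789/
    $segments = explode('/', trim($parts['path'] ?? '', '/'));
    return preg_match('/^\d+$/', $segments[0]) ? $segments[0] : null;
}

echo extract_user_id('http://www.abcxyz.com/123456789/'), "\n";                // 123456789
echo extract_user_id('http://www.abcxyz.com/profile?user_id=123456789'), "\n"; // 123456789
```

Unlike the digit-stripping approaches, this does not break when the domain itself contains digits.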
If your domain does not contain any numbers, you can handle both situations (with or without user_id) using:
<?php
$string1 = 'http://www.abcxyz.com/123456789/';
$string2 = 'http://www.abcxyz.com/profile?user_id=123456789';
preg_match('/[0-9]+/',$string1,$matches);
print_r($matches[0]);
preg_match('/[0-9]+/',$string2,$matches);
print_r($matches[0]);
?>
I have a URL like this:
http://www.w3schools.com/PHP/func_string_str_split.asp
I want to split that url to get the host part only. For that I am using
parse_url($url,PHP_URL_HOST);
it returns www.w3schools.com.
I want to get only 'w3schools.com'.
Is there any function for that, or do I have to do it manually?
There are many ways you could do this. A simple replace is the fastest if you know you always want to strip off 'www.'
$stripped=str_replace('www.', '', $domain);
A regex replace lets you bind that match to the start of the string:
$stripped=preg_replace('/^www\./', '', $domain);
If it's always the first part of the domain, regardless of whether it's www, you could use explode/implode. Though it's easy to read, it's the least efficient method:
$parts=explode('.', $domain);
array_shift($parts); //eat first element
$stripped=implode('.', $parts);
A regex achieves the same goal more efficiently:
$stripped=preg_replace('/^\w+\./', '', $domain);
Now you might imagine that the following would be more efficient than the above regex:
$period=strpos($domain, '.');
if ($period!==false)
{
$stripped=substr($domain,$period+1);
}
else
{
$stripped=$domain; //there was no period
}
But I benchmarked it and found that over a million iterations, the preg_replace version consistently beat it. Typical results, normalized to the fastest (so it has a unitless time of 1):
Simple str_replace: 1
preg_replace with /^\w+\./: 1.494
strpos/substr: 1.982
explode/implode: 2.472
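The benchmark script itself isn't shown, so the following is only a guess at how such a comparison could be set up. The `time_it` helper, the iteration count, and the test input are all assumptions, and absolute timings will vary with PHP version and hardware.

```php
<?php
// Rough timing harness (not the original benchmark).
function time_it(callable $fn, int $iterations): float
{
    $start = hrtime(true); // monotonic clock in nanoseconds (PHP 7.3+)
    for ($i = 0; $i < $iterations; $i++) {
        $fn('www.example.com');
    }
    return (hrtime(true) - $start) / 1e9; // seconds
}

// The four approaches from above, wrapped as closures.
$candidates = [
    'str_replace'     => fn($d) => str_replace('www.', '', $d),
    'preg_replace'    => fn($d) => preg_replace('/^\w+\./', '', $d),
    'strpos/substr'   => function ($d) {
        $period = strpos($d, '.');
        return $period !== false ? substr($d, $period + 1) : $d;
    },
    'explode/implode' => function ($d) {
        $parts = explode('.', $d);
        array_shift($parts); // eat first element
        return implode('.', $parts);
    },
];

foreach ($candidates as $name => $fn) {
    printf("%-16s %.4f s\n", $name, time_it($fn, 100000));
}
```

Dividing each time by the smallest one gives normalized figures comparable to the list above.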
The above code samples always strip the first domain component, so will work just fine on domains like "www.example.com" and "www.example.co.uk" but not "example.com" or "www.department.example.com". If you need to handle domains that may already be the main domain, or have multiple subdomains (such as "foo.bar.baz.example.com") and want to reduce them to just the main domain ("example.com"), try the following. The first sample in each approach returns only the last two domain components, so won't work with "co.uk"-like domains.
explode:
$parts = explode('.', $domain);
$parts = array_slice($parts, -2);
$stripped = implode('.', $parts);
Since explode is consistently the slowest approach, there's little point in writing a version that handles "co.uk".
regex:
$stripped=preg_replace('/^.*?([^.]+\.[^.]*)$/', '$1', $domain);
This captures the final two parts from the domain and replaces the full string value with the captured part. With multiple subdomains, all the leading parts get stripped.
To work with ".co.uk"-like domains as well as a variable number of subdomains, try:
$stripped=preg_replace('/^.*?([^.]+\.(?:[^.]*|[^.]{2}\.[^.]{2}))$/', '$1', $domain);
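A quick way to see how the two patterns differ is to run both over a few hosts. The sample hosts are arbitrary, and the variable names `$twoPart` and `$withCoUk` are just labels for the two regexes above.

```php
<?php
$twoPart  = '/^.*?([^.]+\.[^.]*)$/';                         // last two labels
$withCoUk = '/^.*?([^.]+\.(?:[^.]*|[^.]{2}\.[^.]{2}))$/';    // allows xx.yy suffixes

$hosts = ['example.com', 'foo.bar.baz.example.com', 'www.example.co.uk'];
foreach ($hosts as $domain) {
    printf(
        "%-26s %-14s %s\n",
        $domain,
        preg_replace($twoPart, '$1', $domain),
        preg_replace($withCoUk, '$1', $domain)
    );
}
// example.com                example.com    example.com
// foo.bar.baz.example.com    example.com    example.com
// www.example.co.uk          co.uk          example.co.uk
```

Note that the second pattern treats any "two letters, dot, two letters" ending as a suffix, so a host like www.fb.me would be returned unchanged.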
str:
$end = strrpos($domain, '.') - strlen($domain) - 1;
$period = strrpos($domain, '.', $end);
if ($period !== false) {
$stripped = substr($domain,$period+1);
} else {
$stripped = $domain;
}
Allowing for co.uk domains:
$len = strlen($domain);
if ($len < 7) {
    $stripped = $domain;
} else {
    if ($domain[$len-3] === '.' && $domain[$len-6] === '.') {
        $offset = -7;
    } else {
        $offset = -5;
    }
    $period = strrpos($domain, '.', $offset);
    if ($period !== FALSE) {
        $stripped = substr($domain,$period+1);
    } else {
        $stripped = $domain;
    }
}
The regex and str-based implementations can be made ever-so-slightly faster by sacrificing edge cases (where the primary domain component is a single letter, e.g. "a.com"):
regex:
$stripped=preg_replace('/^.*?([^.]{3,}\.(?:[^.]+|[^.]{2}\.[^.]{2}))$/', '$1', $domain);
str:
$period = strrpos($domain, '.', -7);
if ($period !== FALSE) {
$stripped = substr($domain,$period+1);
} else {
$stripped = $domain;
}
Though the behavior is changed, the rankings aren't (most of the time). Here they are, with times normalized to the quickest.
multiple subdomain regex: 1
.co.uk regex (fast): 1.01
.co.uk str (fast): 1.056
.co.uk regex (correct): 1.1
.co.uk str (correct): 1.127
multiple subdomain str: 1.282
multiple subdomain explode: 1.305
Here, the difference between times is so small that it wasn't unusual for the rankings to change from run to run. The fast .co.uk regex, for example, often beat the basic multiple subdomain regex. Thus, the exact implementation shouldn't have a noticeable impact on speed. Instead, pick one based on simplicity and clarity. As long as you don't need to handle .co.uk domains, that would be the multiple subdomain regex approach.
You have to strip off the subdomain part by yourself - there is no built-in function for this.
// $domain being www.w3schools.com
$domain = implode('.', array_slice(explode('.', $domain), -2));
The above example also works for subdomains of unlimited depth, as it'll always return the last two domain parts (domain and top-level domain).
If you only want to strip off www. you can simply do a str_replace(), which will be faster indeed:
$domain = str_replace('www.', '', $domain);
You need to strip off any characters before the first occurrence of the [.] character (along with the [.] itself) if and only if there is more than one occurrence of [.] in the returned string.
For example, if the returned string is www-139.in.ibm.com, then the regular expression should return in.ibm.com, since that would be the domain.
If the returned string is music.domain.com, then the regular expression should return domain.com.
In rare cases you can access the site without the server prefix, i.e. using http://domain.com/pageurl. In that case you get the domain directly as domain.com, and the regex should not strip anything.
IMO this should be the pseudo logic of the regex. If you want, I can form a regex for you that includes these things.
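One possible regex for that pseudo logic is sketched below. The function name `strip_first_label` is invented, and the lookahead is just one way to express "only when more than one dot is present".

```php
<?php
// Hypothetical helper: removes the first label only if at least two dots exist.
function strip_first_label(string $host): string
{
    // ^[^.]+\.      the first label and its dot
    // (?=[^.]+\.)   ...but only if another "label." follows
    return preg_replace('/^[^.]+\.(?=[^.]+\.)/', '', $host);
}

echo strip_first_label('www-139.in.ibm.com'), "\n"; // in.ibm.com
echo strip_first_label('music.domain.com'), "\n";   // domain.com
echo strip_first_label('domain.com'), "\n";         // domain.com
```

As with the other two-label approaches in this thread, a bare host such as example.co.uk would be reduced to co.uk.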