PHP URL Parsing & Dissecting


www.example.com
foo.example.com
foo.example.co.uk
foo.bar.example.com
foo.bar.example.co.uk
I've got these URLs here, and want to always end up with 2 variables:
$domainName = "example"
$domainNameSuffix = ".com" OR ".co.uk"
If someone could get me from $url being one of the URLs, all the way down to $newUrl being close to "example.co.uk", it would be a blessing.
Note that the URLs are going to be completely "random"; we might end up having "foo.bar.example2.com.au" too, so ... you know... ugh. (Asking for the impossible?)
Cheers,

We had a few questions like this before, but I can't find a good one right now either. The crux is, this cannot be done reliably. You would need a long list of special TLDs (like .uk and .au) which have their own .com/.net level.
But as a general approach and simple solution you could use:
preg_match('#([\w-]+)\.(\w+(\.(au|uk))?)\.?$#i', $domain, $m);
list(, $domain, $suffix) = $m;

The "domainNameSuffix" is commonly called a top-level domain (TLD for short), although multi-part suffixes such as .co.uk are more precisely "public suffixes", and there is no easy way to extract it.
Every country has its own TLD, and some countries have opted to further subdivide theirs. And since the number of subdomains (my.own.subdomain.example.com) is also variable, there is no easy "one-regexp-fits-all".
As mentioned, you need a list. Fortunately for you there are lists publicly available: http://publicsuffix.org/
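To illustrate the list-based approach, here is a minimal sketch of a longest-suffix match. The function name splitHost and the tiny $publicSuffixes array are made up for this example; a real list would be loaded from publicsuffix.org, and the wildcard and exception rules of that list are ignored here.

```php
<?php
// Illustrative subset only (an assumption); the real list is far longer.
$publicSuffixes = ['com', 'co.uk', 'uk', 'com.au', 'au'];

// Hypothetical helper: returns ['name' => ..., 'suffix' => ...] or null.
function splitHost(string $host, array $suffixes): ?array
{
    $labels = explode('.', strtolower(rtrim($host, '.')));
    $lookup = array_flip($suffixes); // suffix => index, for isset() lookups

    // Try the longest candidate suffix first. Starting at 1 guarantees
    // that at least one label is left over to serve as the name.
    for ($i = 1; $i < count($labels); $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (isset($lookup[$candidate])) {
            return ['name' => $labels[$i - 1], 'suffix' => '.' . $candidate];
        }
    }
    return null; // no known suffix, or a bare host such as "localhost"
}
```

Because longer candidates are tried first, "foo.bar.example.co.uk" matches "co.uk" before "uk" is ever considered.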

You will need to maintain a list of extensions for most accurate results I believe.
$possibleExtensions = array(
'.com',
'.co.uk',
'.com.au'
);
// parse_url() needs a protocol.
$str = 'http://' . $str;
// Use parse_url() to take into account any paths
// or fragments that may end up being there.
$host = parse_url($str, PHP_URL_HOST);
foreach($possibleExtensions as $ext) {
if (preg_match('/' . preg_quote($ext, '/') . '\Z/', $host)) {
$domainNameSuffix = $ext;
// Strip extension from the host (not from $str, which may
// still contain a path or fragment).
$domainName = substr($host, 0, -strlen($ext));
var_dump($domainName, $domainNameSuffix);
break;
}
}
If you never have any paths or extra stuff, you can of course skip parse_url() and the http:// prefix handling.
It worked for all your tests.

There isn't a builtin function for this.
A quick Google search led me to http://www.wallpaperama.com/forums/php-function-remove-domain-name-get-tld-splitter-split-t5824.html
This leads me to believe you need to maintain a list of valid TLDs to split URLs on.

Alright chaps, here's how I solved it, for now. Implementation of more domain names will be done as well, at some point in the future. Don't know what technique I'll use, yet.
# Setting options, single and dual part domain extensions
$v2_onePart = array(
"com"
);
$v2_twoPart = array(
"co.uk",
"com.au"
);
$v2_url = $_SERVER['SERVER_NAME']; # "example.com" OR "example.com.au"
$v2_bits = explode(".", $v2_url); # "example", "com" OR "example", "com", "au"
$v2_bits = array_reverse($v2_bits); # "com", "example" OR "au", "com", "example" (Reversing to eliminate foo.bar.example.com.au problems.)
# Check two-part extensions first, then fall back to one-part ones.
if (isset($v2_bits[2]) && in_array($v2_bits[1] . "." . $v2_bits[0], $v2_twoPart)) {
    $v2_class = $v2_bits[2] . " " . $v2_bits[1] . "_" . $v2_bits[0]; # "example com_au"
} elseif (isset($v2_bits[1]) && in_array($v2_bits[0], $v2_onePart)) {
    $v2_class = $v2_bits[1] . " " . $v2_bits[0]; # "example com"
}

Related

Regular Expressions - two captured groups, either one or both have to be present

I have a following link structure:
/type1
/type2
/type3
those links correspond to the default language of the site. Unfortunately the client didn't want to add the default language in front of the URL for consistency, therefore only other languages will have URLs like:
/en
/en/type1
/de/type2
/de
/fr/type3
/fr
There are a lot of other variables but only this part is dynamic. My Regex starts as following:
(en|de|fr)?\/?(type1|type2|type3)?\/?
which basically means capture the language if exists, and then capture the page if exists. But it performs a lot more matches than required and also would capture empty string etc.
I'm trying to figure out how to capture all these options:
/en
/en/type1
/type1
in one expression, of course if possible. How can I make one of the two groups to be required, so basically the URL has either both or one of them but never none? I looked at backreferences in conjunction with look-aheads but I think I'm missing some crucial information here...
I would like to preserve the groups so that group1 = language and group2 = page

I can't think of a way to do what you want with a single regex. But, another possibility would be to use a single regex to just match URL patterns which you want. Then, use a short PHP script to extract the language (if it exists) and page:
$path = "/de/type1";
if (preg_match("/^(?:\/(?:en|de|fr))?(?:\/(?:type1|type2|type3))?$/i", $path, $match)) {
$parts = preg_split("/\//", $path);
if (sizeof($parts) == 3) {
echo "language: " . $parts[1] . ", page: " . $parts[2];
}
else {
if (preg_match("/^(?:en|de|fr)$/i", $parts[1], $match)) {
echo "language: " . $parts[1] . ", page:";
}
else {
echo "language: default, page: " . $parts[1];
}
}
}
This is the pattern I used for matching:
^(?:/(?:en|de|fr))?(?:/(?:type1|type2|type3))?$
It allows for /(type1|type2|type3), optionally preceded by a language path.

This one will give you one or the other (whichever comes first), but doesn't require that if you provide both, they match (e.g. you could specify /en/type3, and it would give you /en):
<?php
$pat = '~(/(?:en|de|fr)\b|/type\d\b)~';
$test = ['/en', '/type1', '/en/type1', '/en/type3', '/english/type1'];
foreach ($test as $t) if (preg_match($pat, $t, $match)) echo "'{$t}' = '{$match[1]}'\n";
?>
Which gives you:
'/en' = '/en'
'/type1' = '/type1'
'/en/type1' = '/en'
'/en/type3' = '/en'
'/english/type1' = '/type1'
(the last example is to demonstrate why you need the \b in the pattern)
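For what it's worth, a lookahead may be enough to enforce "at least one of the two groups" while keeping group 1 = language and group 2 = page. The following is only a sketch: the function name parseLocalizedPath is made up, and the language and page alternatives are the ones from the question.

```php
<?php
// Hypothetical helper: returns ['language' => ?string, 'page' => ?string],
// or null when the path contains neither a language nor a page.
function parseLocalizedPath(string $path): ?array
{
    // (?=/[a-z]) rejects "" and "/", so a full match needs at least one group.
    $pattern = '~^(?=/[a-z])(?:/(en|de|fr))?(?:/(type1|type2|type3))?/?$~';
    if (!preg_match($pattern, $path, $m)) {
        return null;
    }
    // Unmatched groups come back as '' or are missing from $m entirely.
    return [
        'language' => ($m[1] ?? '') !== '' ? $m[1] : null,
        'page'     => ($m[2] ?? '') !== '' ? $m[2] : null,
    ];
}
```

Both groups stay optional on their own; the anchors plus the lookahead are what rule out the empty case.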

Regex to find URLs without http or www

I'm working with some code used to try and find all the website URLs within a block of text. Right now we've already got checks that work fine for URLs formatted such as http://www.google.com or www.google.com but we're trying to find a regex that can locate a URL in a format such as just google.com
Right now our regex is set to search for every TLD that we could find registered, which is around 1400 in total, so it looks like this:
/(\S+\.(COM|NET|ORG|CA|EDU|UK|AU|FR|PR)\S+)/i
Except with ALL 1400 TLDs to check in the group (the full thing is around 8400 characters long). Naturally it's running quite slowly, and we've already had the idea to simply check for the 10 or so most commonly used ones, but I wanted to check here first to see if there was a more efficient way to match this specific URL format rather than singling every single one out.

You could use a double pass search.
Search for every url-like string, e.g.:
((http|https):\/\/)?([\w-]+\.)+[\S]{2,5}
On every result do some non-regex checks, like, is the length enough, is the text after the last dot part of your tld list, etc.
function isUrl($urlMatch) {
$tldList = ['com', 'net'];
$urlParts = explode(".", $urlMatch);
$lastPart = end($urlParts);
return in_array($lastPart, $tldList);
}
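Putting the two passes together might look roughly like this. It is a sketch, not a drop-in solution: findUrls is a made-up name, the pattern is a slightly stricter variant of the one above (letters only in the last label), and the TLD list passed in is whatever list you maintain.

```php
<?php
// Hypothetical helper: pass 1 finds URL-like strings with a cheap regex,
// pass 2 keeps only those whose last label is in the TLD list.
function findUrls(string $text, array $tldList): array
{
    // Flip the list so each TLD check is a hash lookup, not a scan.
    $tldLookup = array_flip(array_map('strtolower', $tldList));

    preg_match_all('~(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}~i', $text, $matches);

    $urls = [];
    foreach ($matches[0] as $candidate) {
        // Every match contains a dot, so strrpos() cannot return false here.
        $tld = strtolower(substr($candidate, strrpos($candidate, '.') + 1));
        if (isset($tldLookup[$tld])) {
            $urls[] = $candidate;
        }
    }
    return $urls;
}
```

The regex stays small no matter how many TLDs you support, since the 1400-entry list only lives in the lookup array.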

Example
function get_host($url) {
$host = parse_url($url, PHP_URL_HOST);
$names = explode(".", $host);
if(count($names) == 1) {
return $names[0];
}
$names = array_reverse($names);
return $names[1] . '.' . $names[0];
}
Usage
echo get_host('https://google.com'); // google.com
echo "\n";
echo get_host('https://www.google.com'); // google.com
echo "\n";
echo get_host('https://sub1.sub2.google.com'); // google.com
echo "\n";
echo get_host('http://localhost'); // localhost

urlencode only the directory and file names of a URL

I need to URL encode just the directory path and file name of a URL using PHP.
So I want to encode something like http://example.com/file name and have it result in http://example.com/file%20name.
Of course, if I do urlencode('http://example.com/file name'); then I end up with http%3A%2F%2Fexample.com%2Ffile+name.
The obvious (to me, anyway) solution is to use parse_url() to split the URL into scheme, host, etc. and then just urlencode() the parts that need it like the path. Then, I would reassemble the URL using http_build_url().
Is there a more elegant solution than that? Or is that basically the way to go?

@deceze definitely got me going down the right path, so go upvote his answer. But here is exactly what worked:
$encoded_url = preg_replace_callback('#://([^/]+)/([^?]+)#', function ($match) {
return '://' . $match[1] . '/' . join('/', array_map('rawurlencode', explode('/', $match[2])));
}, $unencoded_url);
There are a few things to note:
http_build_url requires a PECL install, so if you are distributing your code to others (as I am in this case) you might want to avoid it and stick with regex parsing like I did here (stealing heavily from @deceze's answer; again, go upvote that thing).
urlencode() is not the way to go! You need rawurlencode() for the path so that spaces get encoded as %20 and not +. Encoding spaces as + is fine for query strings, but not so hot for paths.
This won't work for URLs that need a username/password encoded. For my use case, I don't think I care about those, so I'm not worried. But if your use case is different in that regard, you'll need to take care of that.
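If you would rather avoid both the regex and the PECL dependency, one option is to reassemble the pieces from parse_url() by hand. The sketch below assumes the path is not already percent-encoded (otherwise "%20" would become "%2520"); encodeUrlPath is a made-up name.

```php
<?php
// Hypothetical helper: percent-encode only the path segments of a URL.
function encodeUrlPath(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['path'])) {
        return $url; // unparseable input, or no path to encode
    }
    // rawurlencode() each segment so "/" separators survive and spaces become %20.
    $path = implode('/', array_map('rawurlencode', explode('/', $parts['path'])));

    $result = isset($parts['scheme']) ? $parts['scheme'] . '://' : '';
    if (isset($parts['user'])) {
        $result .= $parts['user'];
        if (isset($parts['pass'])) {
            $result .= ':' . $parts['pass'];
        }
        $result .= '@';
    }
    $result .= $parts['host'] ?? '';
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }
    $result .= $path;
    if (isset($parts['query'])) {
        $result .= '?' . $parts['query'];
    }
    if (isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }
    return $result;
}
```

The user, password, query and fragment are copied through untouched here, so encode those separately if your URLs need it.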

As you say, something along these lines should do it:
$parts = parse_url($url);
if (!empty($parts['path'])) {
$parts['path'] = join('/', array_map('rawurlencode', explode('/', $parts['path'])));
}
$url = http_build_url($parts);
Or possibly:
$url = preg_replace_callback('#(https?://[^/]+/)([^?]+)#', function ($match) {
return $match[1] . join('/', array_map('rawurlencode', explode('/', $match[2])));
}, $url);
(Regex not fully tested though)

function encode_uri($url){
$exp = "{[^0-9a-z_.!~*'();,/?:@&=+$#%\[\]-]}i";
return preg_replace_callback($exp, function($m){
return sprintf('%%%02X',ord($m[0]));
}, $url);
}

Much simpler:
$encoded = implode("/", array_map("rawurlencode", explode("/", $path)));

I think this function is OK:
function newUrlEncode ($url) {
// Restore ":" and "/" after urlencode() has escaped them.
return str_replace(array('%3A', '%2F'), array(':', '/'), urlencode($url));
}

PHP extract just the main domain not subdomain from URL

I have a set of URLs in an array. Some are just domains (http://google.com) and some are subdomains (http://test.google.com).
I am trying to extract just the domain part from each of them without the subdomain.
parse_url($domain)
still keeps the subdomain.
Is there another way?

If you're only concerned with actual top level domains, the simple answer is to just get whatever's before the last dot in the domain name.
However, if you're looking for "whatever you buy from a registrar", that is much more tricky. IANA delegates authority for each country-specific TLD to the national registrars, which means that allocation policy varies for each TLD. Famous examples include .co.uk, .org.uk, etc., but there are countless others that are less known (for example .priv.no).
If you need a solution that will work correctly for every single TLD in existence, you will have to research policy for each TLD, which is quite an undertaking since many national registrars have horrible websites with unclear policies that, just to make it even more confusing, often are not available in English.
In practice, however, you probably don't need to account for every TLD or for every available subdomain within every TLD. So a practical solution would be to compile a list of known 2-part (and more) TLDs that you need to support. Anything that doesn't match that list, you can treat as a 1-part TLD. Like so:
<?php
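// Known multi-part suffixes; extend this list as needed.
$special_domains = array('co.uk', 'org.uk' /* ... etc */);
function getDomain($domain)
{
    global $special_domains;
    $lastdot = strrpos($domain, '.');
    if ($lastdot === false)
    {
        return $domain; // single label such as "localhost"
    }
    // Default: treat everything after the last dot as a 1-part TLD.
    $suffix_length = strlen($domain) - $lastdot;
    foreach ($special_domains as $special)
    {
        // A known multi-part suffix (e.g. ".co.uk") takes precedence.
        if (substr($domain, -strlen($special) - 1) == '.' . $special)
        {
            $suffix_length = strlen($special) + 1;
            break;
        }
    }
    // Strip the suffix, then keep only the label right before it.
    $domain = substr($domain, 0, -$suffix_length);
    $lastdot = strrpos($domain, '.');
    return ($lastdot === false) ? $domain : substr($domain, $lastdot + 1);
}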
?>
PS: I haven't tested this code so it may need some modification but the basic logic should be ok.

There might be a work-around for the .co.uk problem.
Let's presume that if it is possible to register *.co.uk, *.org.uk, *.mil.ae and similar domains, then it is not possible to resolve DNS of co.uk, org.uk and mil.ae. I've checked some URLs and it seemed to be true.
Then you can use something like this:
$testdomains = array(
'http://google.com',
'http://probablynotexisting.com',
'http://subdomain.bbc.co.uk', // should resolve into bbc.co.uk, because it is not possible to ping co.uk
'http://bbc.co.uk'
);
foreach ($testdomains as $raw_domain) {
$domain = join('.', array_slice(explode('.', parse_url($raw_domain, PHP_URL_HOST)), -2));
$ip = gethostbyname($domain);
if ($ip == $domain) {
// failure, let's include another dot
$domain = join('.', array_slice(explode('.', parse_url($raw_domain, PHP_URL_HOST)), -3));
$ip = gethostbyname($domain);
if ($ip == $domain) {
// another failure, shall we give up and move on!
echo $raw_domain . ": failed<br />\n";
continue;
}
}
echo $raw_domain . ' -> ' . $domain . ": ok [" . $ip . "]<br />\n";
}
The output is like this:
http://google.com -> google.com: ok [72.14.204.147]
http://probablynotexisting.com: failed
http://subdomain.bbc.co.uk -> bbc.co.uk: ok [212.58.241.131]
http://bbc.co.uk -> bbc.co.uk: ok [212.58.241.131]
Note: resolving DNS is a slow process.

Let dig do the hard work for you. Extract the required base domain from the first field in the AUTHORITY section of a dig on any sub-domain (which doesn't need to exist) of the sub-domain/domain in question. Examples (in bash not php sorry)...
dig @8.8.8.8 notexist.google.com|grep -A1 ';; AUTHORITY SECTION:'|tail -n1|sed "s/[[:space:]]\+/~/g"|cut -d'~' -f1
google.com.
or
dig @8.8.8.8 notexist.test.google.com|grep -A1 ';; AUTHORITY SECTION:'|tail -n1|sed "s/[[:space:]]\+/~/g"|cut -d'~' -f1
google.com.
or
dig @8.8.8.8 notexist.www.xn--zgb6acm.xn--mgberp4a5d4ar|grep -A1 ';; AUTHORITY SECTION:'|tail -n1|sed "s/[[:space:]]\+/~/g"|cut -d'~' -f1
xn--zgb6acm.xn--mgberp4a5d4ar.
Where
grep -A1 filters out all lines except the line with the string ;; AUTHORITY SECTION: and 1 line after it.
tail -n1 leaves only the last 1 line of the above 2 lines.
sed "s/[[:space:]]\+/~/g" replaces dig's delimiters (1 or more consecutive spaces or tabs) with some custom delimiter ~. Could be any character which never occurs on the line.
cut -d'~' -f1 extracts the first field where the fields are delimited by the custom delimiter from above.

Separating the extension from its domain name

I want to work around email addresses and I want to explode them using php's explode function.
It's ok to separate the user from the domain or the host doing like this:
list( $user, $domain ) = explode( '@', $email );
but when trying to explode the domain into domain_name and domain_extension I realised that when exploding on "." it will not always be foo.bar; it can sometimes be foo.ba.ar, like fooooo.co.uk
So how do I separate "fooooo.co" from "uk", leaving the co with the fooooo, so that I finally get the TLD separated from the other part?
I know that co.uk is supposed to be treated as the TLD, but it's not official, like fooooo.nat.tn or fooooo.gov.tn
Thank You.

Just use strripos() to find the last occurrence of ".":
$blah = "hello.co.uk";
$i = strripos($blah, ".");
echo "name = " . substr($blah, 0, $i) . "\n";
echo "TLD = " . substr($blah, $i + 1) . "\n";

Better use imap_rfc822_parse_adrlist or mailparse_rfc822_parse_addresses to parse the email address if available. And for removing the “public suffix” from the domain name, see my answer to Remove domain extension.

Expanding on Oli's answer...
substr($address, (strripos($address, '.') + 1));
Will give the TLD without the '.'. Lose the +1 and you get the dot, too.

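$parts = explode('.', $email); end($parts); will give you the TLD (end() takes its argument by reference, so pass a variable rather than the explode() call directly). To get the domain name without that, you can do any number of other string manipulation tricks, such as subtracting off that length.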
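Pulling the answers above together, a rough sketch of the whole split could look like this. The function name splitEmail is made up, and the split is purely on the last "@" and the last ".", so "co.uk" is not treated as one unit (which is what the question asked for).

```php
<?php
// Hypothetical helper: split an address into user, name and TLD.
function splitEmail(string $email): ?array
{
    $at = strrpos($email, '@');
    if ($at === false) {
        return null; // no "@", so not an address we can split
    }
    $user   = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    $dot = strrpos($domain, '.');
    if ($dot === false) {
        // Dotless domain such as "localhost": there is no TLD part.
        return ['user' => $user, 'name' => $domain, 'tld' => ''];
    }
    return [
        'user' => $user,
        'name' => substr($domain, 0, $dot),
        'tld'  => substr($domain, $dot + 1),
    ];
}
```

For anything stricter than a quick split, the address parsers mentioned above are the safer choice.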
