URL splitting in PHP

I have a URL like this:
http://www.w3schools.com/PHP/func_string_str_split.asp
I want to split that URL to get only the host part. For that I am using
parse_url($url,PHP_URL_HOST);
It returns www.w3schools.com.
I want to get only 'w3schools.com'.
Is there a function for that, or do I have to do it manually?

There are many ways you could do this. A simple replace is the fastest if you know you always want to strip off 'www.'
$stripped=str_replace('www.', '', $domain);
A regex replace lets you bind that match to the start of the string:
$stripped=preg_replace('/^www\./', '', $domain);
If it's always the first part of the domain, regardless of whether it's www, you could use explode/implode. Though it's easy to read, it's the least efficient method:
$parts=explode('.', $domain);
array_shift($parts); //eat first element
$stripped=implode('.', $parts);
A regex achieves the same goal more efficiently:
$stripped=preg_replace('/^\w+\./', '', $domain);
Now you might imagine that the following would be more efficient than the above regex:
$period=strpos($domain, '.');
if ($period!==false)
{
$stripped=substr($domain,$period+1);
}
else
{
$stripped=$domain; //there was no period
}
But I benchmarked it and found that over a million iterations, the preg_replace version consistently beat it. Typical results, normalized to the fastest (so it has a unitless time of 1):
Simple str_replace: 1
preg_replace with /^\w+\./: 1.494
strpos/substr: 1.982
explode/implode: 2.472
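For anyone who wants to repeat the comparison, here is a minimal sketch of a timing harness. It is not the benchmark that produced the numbers above: the $candidates array, the iteration count and the use of hrtime() (PHP 7.3+) are assumptions made for illustration, and the closure call overhead will flatten the ratios somewhat.

```php
<?php
// Hypothetical harness; names and the iteration count are illustrative.
$domain = 'www.example.com';
$iterations = 100000;

$candidates = [
    'str_replace' => function (string $d): string {
        return str_replace('www.', '', $d);
    },
    'preg_replace' => function (string $d): string {
        return preg_replace('/^\w+\./', '', $d);
    },
    'strpos/substr' => function (string $d): string {
        $period = strpos($d, '.');
        return $period !== false ? substr($d, $period + 1) : $d;
    },
    'explode/implode' => function (string $d): string {
        $parts = explode('.', $d);
        array_shift($parts); // drop the first label
        return implode('.', $parts);
    },
];

// Time each candidate over the same input, in nanoseconds.
$timings = [];
foreach ($candidates as $name => $fn) {
    $start = hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($domain);
    }
    $timings[$name] = hrtime(true) - $start;
}

// Normalize to the fastest run so it has a unitless time of 1.
$fastest = max(min($timings), 1);
foreach ($timings as $name => $elapsed) {
    printf("%s: %.3f\n", $name, $elapsed / $fastest);
}
```

Expect the ratios to vary between runs and PHP versions, so treat any single run as a rough indication only.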
The above code samples always strip the first domain component, so will work just fine on domains like "www.example.com" and "www.example.co.uk" but not "example.com" or "www.department.example.com". If you need to handle domains that may already be the main domain, or have multiple subdomains (such as "foo.bar.baz.example.com") and want to reduce them to just the main domain ("example.com"), try the following. The first sample in each approach returns only the last two domain components, so won't work with "co.uk"-like domains.
explode:
$parts = explode('.', $domain);
$parts = array_slice($parts, -2);
$stripped = implode('.', $parts);
Since explode is consistently the slowest approach, there's little point in writing a version that handles "co.uk".
regex:
$stripped=preg_replace('/^.*?([^.]+\.[^.]*)$/', '$1', $domain);
This captures the final two parts from the domain and replaces the full string value with the captured part. With multiple subdomains, all the leading parts get stripped.
To work with ".co.uk"-like domains as well as a variable number of subdomains, try:
$stripped=preg_replace('/^.*?([^.]+\.(?:[^.]*|[^.]{2}\.[^.]{2}))$/', '$1', $domain);
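To see how the two patterns above differ, the sketch below wraps each in a function and runs a few hostnames through them. The function names lastTwoLabels and mainDomain are made up for this example; the patterns are unchanged.

```php
<?php
// Hypothetical helper names; the patterns are the ones given above.
function lastTwoLabels(string $domain): string
{
    // Keeps only the final two dot-separated parts.
    return preg_replace('/^.*?([^.]+\.[^.]*)$/', '$1', $domain);
}

function mainDomain(string $domain): string
{
    // Also keeps a third part when the host ends in a two-letter.two-letter suffix.
    return preg_replace('/^.*?([^.]+\.(?:[^.]*|[^.]{2}\.[^.]{2}))$/', '$1', $domain);
}

foreach (['example.com', 'foo.bar.baz.example.com', 'www.example.co.uk'] as $host) {
    echo $host, ' => ', lastTwoLabels($host), ' | ', mainDomain($host), "\n";
}
```

Note that the second pattern only recognizes suffixes shaped like "co.uk" (two characters, a dot, two characters), so something like "com.au" still falls back to the last two parts.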
str:
$end = strrpos($domain, '.') - strlen($domain) - 1;
$period = strrpos($domain, '.', $end);
if ($period !== false) {
$stripped = substr($domain,$period+1);
} else {
$stripped = $domain;
}
Allowing for co.uk domains:
$len = strlen($domain);
if ($len < 7) {
$stripped = $domain;
} else {
if ($domain[$len-3] === '.' && $domain[$len-6] === '.') {
$offset = -7;
} else {
$offset = -5;
}
$period = strrpos($domain, '.', $offset);
if ($period !== FALSE) {
$stripped = substr($domain,$period+1);
} else {
$stripped = $domain;
}
}
The regex and str-based implementations can be made ever-so-slightly faster by sacrificing edge cases (where the primary domain component is a single letter, e.g. "a.com"):
regex:
$stripped=preg_replace('/^.*?([^.]{3,}\.(?:[^.]+|[^.]{2}\.[^.]{2}))$/', '$1', $domain);
str:
$period = strrpos($domain, '.', -7);
if ($period !== FALSE) {
$stripped = substr($domain,$period+1);
} else {
$stripped = $domain;
}
Though the behavior is changed, the rankings aren't (most of the time). Here they are, with times normalized to the quickest.
multiple subdomain regex: 1
.co.uk regex (fast): 1.01
.co.uk str (fast): 1.056
.co.uk regex (correct): 1.1
.co.uk str (correct): 1.127
multiple subdomain str: 1.282
multiple subdomain explode: 1.305
Here, the differences between times are so small that it wasn't unusual for the rankings to change from run to run. The fast .co.uk regex, for example, often beat the basic multiple subdomain regex. Thus, the exact implementation shouldn't have a noticeable impact on speed. Instead, pick one based on simplicity and clarity. As long as you don't need to handle .co.uk domains, that would be the multiple subdomain regex approach.

You have to strip off the subdomain part by yourself - there is no built-in function for this.
// $domain being www.w3schools.com
$domain = implode('.', array_slice(explode('.', $domain), -2));
The above example also works for subdomains of unlimited depth, as it'll always return the last two domain parts (domain and top-level domain).
If you only want to strip off www., you can simply do a str_replace(), which will indeed be faster:
$domain = str_replace('www.', '', $domain);

You need to strip off any characters before the first occurrence of the [.] character (along with the [.] itself) if and only if there is more than one occurrence of [.] in the returned string.
For example, if the returned string is www-139.in.ibm.com, then the regular expression should return in.ibm.com, since that would be the domain.
If the returned string is music.domain.com, then the regular expression should return domain.com.
In rare cases the site can be accessed without the server prefix, that is, using http://domain.com/pageurl. In this case you would get the domain directly as domain.com, and the regex should not strip anything.
IMO this should be the pseudo logic of the regex. If you want, I can form a regex for you that includes these things.
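The pseudo logic above could be sketched in PHP roughly as follows. The function name stripFirstLabel is hypothetical, and the lookahead is just one way to express "only if another [.] follows".

```php
<?php
// Hypothetical helper: remove the first label only when at least
// one more dot remains after it.
function stripFirstLabel(string $host): string
{
    // ^[^.]+\.   the first label and its dot
    // (?=.*\.)   lookahead: the remainder must still contain a dot
    return preg_replace('/^[^.]+\.(?=.*\.)/', '', $host);
}

echo stripFirstLabel('www-139.in.ibm.com'), "\n";
echo stripFirstLabel('music.domain.com'), "\n";
echo stripFirstLabel('domain.com'), "\n";
```

Like the rule it implements, this strips exactly one label, so it will also shorten hosts such as example.co.uk.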

Related

Regex to find URLs without http or www

I'm working with some code used to try and find all the website URLs within a block of text. Right now we've already got checks that work fine for URLs formatted such as http://www.google.com or www.google.com but we're trying to find a regex that can locate a URL in a format such as just google.com
Right now our regex is set to search for every domain that we could find registered which is around 1400 in total, so it looks like this:
/(\S+\.(COM|NET|ORG|CA|EDU|UK|AU|FR|PR)\S+)/i
Except with ALL 1400 domains to check in the group(the full thing is around 8400 characters long). Naturally it's running quite slowly, and we've already had the idea to simply check for the 10 or so most commonly used domains but I wanted to check here first to see if there was a more efficient way to check for this specific formatting of website URLs rather than singling every single one out.
You could use a double pass search.
Search for every url-like string, e.g.:
((http|https):\/\/)?([\w-]+\.)+[\S]{2,5}
On every result do some non-regex checks, like, is the length enough, is the text after the last dot part of your tld list, etc.
function isUrl($urlMatch) {
$tldList = ['com', 'net'];
$urlParts = explode(".", $urlMatch);
$lastPart = end($urlParts);
return in_array($lastPart, $tldList);
}
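Putting the two passes together might look like the sketch below. The function name findUrls, the trailing-punctuation trim and the sample text are assumptions added for illustration; the search pattern is the one suggested above.

```php
<?php
// Hypothetical combination of the two passes described above.
function findUrls(string $text, array $tldList): array
{
    // Pass 1: collect every URL-like string.
    preg_match_all('/((http|https):\/\/)?([\w-]+\.)+[\S]{2,5}/i', $text, $matches);

    $urls = [];
    foreach ($matches[0] as $candidate) {
        // The loose pattern can swallow trailing punctuation, so trim it first.
        $candidate = rtrim($candidate, '.,;:!?)');
        // Pass 2: cheap non-regex check against the TLD list.
        $parts = explode('.', $candidate);
        $lastPart = strtolower(end($parts));
        if (in_array($lastPart, $tldList, true)) {
            $urls[] = $candidate;
        }
    }
    return $urls;
}

print_r(findUrls('See google.com, http://www.example.net and file.txt today', ['com', 'net']));
```

Because the first pass stops a few characters after the last dot, URLs with a path would need a wider pattern; this sketch only covers bare hostnames.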
Example
function get_host($url) {
$host = parse_url($url, PHP_URL_HOST);
$names = explode(".", $host);
if(count($names) == 1) {
return $names[0];
}
$names = array_reverse($names);
return $names[1] . '.' . $names[0];
}
Usage
echo get_host('https://google.com'); // google.com
echo "\n";
echo get_host('https://www.google.com'); // google.com
echo "\n";
echo get_host('https://sub1.sub2.google.com'); // google.com
echo "\n";
echo get_host('http://localhost'); // localhost

regex to trim down subdomain in the url

I have a regexp that match to something like : wiseman.google.com.jp, me.co.uk, paradise.museum, abcd-abc.net, www.google.jp, 12345-daswe-23dswe-dswedsswe-54eddss.info, del.icio.us, jo.ggi.ng, all of this is from a textarea value.
used regexp (in preg_match_all($regex1, $str, $match)) to get the above values: /(?:[a-zA-Z0-9]{2,}\.)?[-a-zA-Z0-9]{2,}\.[a-zA-Z0-9]{2,7}(?:\.[-a-zA-Z0-9]{2,3})?/
Now, my question is: how can I make the regexp trim "wiseman.google.com.jp" down to "google.com.jp" and "www.google.jp" down to "google.jp"?
I am thinking of running a second preg_match($regex2, $str, $match) on each value returned by preg_match_all.
I have tried this regexp in $regex2 : ([-a-zA-Z0-9\x{0080}-\x{00FF}]{2,}+)\.[a-zA-Z0-9\x{0080}-\x{00FF}]{2,7}(?:\.[-a-zA-Z0-9\x{0080}-\x{00FF}]{2,3})? but it doesn't work.
Any inputs? TIA
here is my little solution :
preg_match_all($regex, $str, $matches, PREG_PATTERN_ORDER);
$arrlength=count($matches[0]);
for($x=0;$x<$arrlength;$x++){
$dom = $matches[0][$x];
$newstringcount = substr_count($dom, '.'); // this line is to count how many "." present in the string.
if($newstringcount == 3){ // if there are 3 '.' present in the string = true
$pos = strpos($dom, '.', 0); // this line is to find the first occurence of the '.' in the string
$find = substr($dom, $pos+1); //this line is to get the value after the first occurence of the '.' in the string
echo $find;
}else if($newstringcount == 2){
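if (($pos = strpos($dom, 'www.')) !== false) { // parenthesize the assignment so $pos holds the offset, not a boolean
$find = substr($dom, $pos+4); // skip the four characters of "www."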
echo $find;
}else{
echo $dom;
}
}else if($newstringcount == 1){
echo $dom;
}
echo "<br>";
}
(Caution: this answer will only fit your needs if you HAVE to use regex or you're somewhat... desperate...)
What you want to achieve isn't possible with general rules due to domains like .com.jp or .co.uk.
The only general rule one can find is:
When read from right to left there are one or two TLDs followed by one second level domain
Thus, we have to whitelist all available TLDs. I think i'll call the following the "domain-kraken".
Release the kraken!
([a-z0-9\-]{2,63}(?:\.(?:a(?:cademy|ero|rpa|sia|[cdefgilmnoqrstuwxz])|b(?:ike
|iz|uilders|uzz|[abdefghijlmnoqrstvwyz])|c(?:ab|amera|amp|areers|at|enter|eo
|lothing|odes|offee|om(?:pany|puter)?|onstruction|ontractors|oop|
[acdfghiklmnoruvwxyz])|d(?:iamonds|irectory|omains|[ejkmoz])|e(?:du(?:cation)?
|mail|nterprises|quipment|state|[ceghrstu])|f(?:arm|lorist|[ijkmor])|g(?:allery|
lass|raphics|uru|[abdefghlmnpqrstuwy])|h(?:ol(?:dings|iday)|ouse|[kmnrtu])|
i(?:mmobilien|n(?:fo|stitute|ternational)|[delmnoqrst])|j(?:obs|[emop])|
k(?:aufen|i(?:tchen|wi)|[eghimnprwxyz])|l(?:and|i(?:ghting|mo)|[abcikrstuvy])|
m(?:anagement|enu|il|obi|useum|[acdefghklmnopqrstuvwxyz])|n(?:ame|et|inja|
[acefgilopruz])|o(?:m|nl|rg)|p(?:hoto(?:graphy|s)|lumbing|ost|ro|[aefghklmnrstwy])|
r(?:e(?:cipes|pair)|uhr|[eosuw])|s(?:exy|hoes|ingles|ol(?:ar|utions)|upport|
ystems|[abcdeghijklmnorstuvxyz])|t(?:attoo|echnology|el|ips|oday|
[cdfghjklmnoprtvwz])|u(?:no|[agkmsyz])|v(?:entures|iajes|oyage|[aceginu])|
w(?:ang|ien|[fs])|xxx|y(?:[et])|z(?:[amw]))){1,2})$
Use it together with the i and m flags.
This supposes your data is on multiple lines.
In case your data is separated by a ,, change the last character in the regex ($) to ,? and use the g and i flags.
Demos are available on regex101 and debuggex.
(Both of the demos have an explanation: regex101 describes it with text while debuggex visualizes the beast)
A list of available TLDs can be found at iana.org, the used TLDs in the regex are as of January 2014.
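For use from PHP, the same whitelist idea can be sketched with a much shorter list. The $tlds array below is a tiny illustrative subset, not the full set from the regex above, and the function name trimToDomain is made up for this example.

```php
<?php
// Illustrative subset only; a real list would be built from iana.org.
$tlds = ['com', 'net', 'info', 'museum', 'co', 'jp', 'uk', 'us', 'ng'];

// Same shape as the pattern above: one label followed by one or two whitelisted parts.
$pattern = '/([a-z0-9\-]{2,63}(?:\.(?:' . implode('|', $tlds) . ')){1,2})$/i';

function trimToDomain(string $host, string $pattern): string
{
    // Return the captured domain, or the input unchanged when nothing matches.
    return preg_match($pattern, $host, $m) === 1 ? $m[1] : $host;
}

foreach (['wiseman.google.com.jp', 'www.google.jp', 'me.co.uk', 'paradise.museum'] as $host) {
    echo $host, ' => ', trimToDomain($host, $pattern), "\n";
}
```

Since preg_match() works on one string at a time, the m and g flags are not needed here; each value from preg_match_all() would be passed through the function separately.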

Trim a url to just the domain name using PHP

I have a database table column that stores URLs of a person's website. This column is unique as I don't want people using the same website twice!
However a person could get around this by doing:
domain.com
domain.com/hello123
www.domain.com
So my plan is to make it so that when a person saves their record it will remove everything after the first slash to make sure only the domain is saved into the database.
How would I do this though? I'm presuming this has been done lots of times before, but I'm looking for something VERY VERY simple and not interested in using libraries or other long code snippets. Just something that strips out the rest and keeps just the domain name.
See PHP: parse_url
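// Force URL to begin with "http://" or "https://" so 'parse_url' works.
// The scheme group is optional, so scheme-less input such as domain.com also gets the prefix.
$url = preg_replace('/^(?!https?:\/\/)(?:[a-z][a-z0-9+.\-]*:\/\/)?/i', 'http://', $inputURL);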
$parts = parse_url($url);
// var_dump($parts); // To see the parsed URL parts, uncomment this line
print $parts['host'];
Note, the subdomains are not unique using the code as listed. www.domain.com and domain.com will be separate entries.
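If www.domain.com and domain.com should count as the same entry, one option is to normalize the host before saving it. The sketch below is an assumption about what "the same" should mean (lowercase, leading www. removed); the function name normalizeHost is hypothetical.

```php
<?php
// Hypothetical helper: reduce user input to a comparable host name.
function normalizeHost(string $input): ?string
{
    $input = trim($input);
    // parse_url() only finds a host when a scheme is present, so add one if missing.
    if (!preg_match('/^[a-z][a-z0-9+.\-]*:\/\//i', $input)) {
        $input = 'http://' . $input;
    }
    $host = parse_url($input, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // input could not be parsed as a URL
    }
    // Host names are case-insensitive; drop a leading "www." only.
    return preg_replace('/^www\./', '', strtolower($host));
}

foreach (['domain.com', 'domain.com/hello123', 'www.domain.com'] as $site) {
    echo $site, ' => ', normalizeHost($site), "\n";
}
```

Other subdomains (blog.domain.com, for instance) are left alone, so they would still be stored as separate entries.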
Use parse_url:
$hostname = parse_url($userwebsite,PHP_URL_HOST);
$sDomain = NULL;
foreach (explode('/', $sInput) as $sPart) {
switch ($sPart) {
case 'http:':
case 'https:':
case '':
break;
default:
$sDomain = $sPart;
break 2;
}
}
if ($sDomain !== NULL) {
echo $sDomain;
}
First, all slashes are used as separators. Next, all "known/supported" schemes are ignored, as well as the empty part which happens from "http://". Finally, whatever is next will be stored in $sDomain.
If you do not mind the dependency of PCRE, you can use a regular expression as well:
if (preg_match('/^https?:\/\/([^\/]+)/', $sInput, $aisMatch) === 1) {
echo $aisMatch[1];
}
You could try
int strrpos ( string $haystack , string $needle [, int $offset = 0 ] )
and then put the result of that into
string substr ( string $string , int $start [, int $length ] )
using $needle = "/" and $needle = "."

PHP extract just the main domain not subdomain from URL

I have a set of URL's in an array. Some are just domains (http://google.com) and some are subdomains (http://test.google.com).
I am trying to extract just the domain part from each of them without the subdomain.
parse_url($domain)
still keeps the subdomain.
Is there another way?
If you're only concerned with actual top level domains, the simple answer is to just get whatever's before the last dot in the domain name.
However, if you're looking for "whatever you buy from a registrar", that is much more tricky. IANA delegates authority for each country-specific TLD to the national registrars, which means that allocation policy varies for each TLD. Famous examples include .co.uk, .org.uk, etc, but there are countless others that are less known (for example .priv.no).
If you need a solution that will work correctly for every single TLD in existence, you will have to research policy for each TLD, which is quite an undertaking since many national registrars have horrible websites with unclear policies that, just to make it even more confusing, often are not available in English.
In practice however, you probably don't need to account for every TLD or for every available subdomain within every TLD. So a practical solution would be to compile a list of known 2-part (and more) TLD's that you need to support. Anything that doesn't match that list, you can treat as a 1-part TLD. Like so:
<?php
$special_domains = array('co.uk', 'org.uk' /* ... etc */);
function getDomain($domain)
{
    global $special_domains;
    $lastdot = strrpos($domain, '.');
    if ($lastdot === false) {
        return $domain; // no dot at all, e.g. "localhost"
    }
    // Default to a 1-part TLD; switch to a listed multi-part suffix if one matches.
    $suffix = substr($domain, $lastdot + 1);
    foreach ($special_domains as $special) {
        if (substr($domain, -strlen($special) - 1) === '.' . $special) {
            $suffix = $special;
            break;
        }
    }
    // Drop ".suffix", then keep only the last remaining label.
    $rest = substr($domain, 0, -strlen($suffix) - 1);
    $lastdot = strrpos($rest, '.');
    $name = ($lastdot === false) ? $rest : substr($rest, $lastdot + 1);
    return $name . '.' . $suffix;
}
?>
PS: I haven't tested this code so it may need some modification but the basic logic should be ok.
There might be a work-around for .co.uk problem.
Let's presume that if it is possible to register *.co.uk, *.org.uk, *.mil.ae and similar domains, then it is not possible to resolve DNS of co.uk, org.uk and mil.ae. I've checked some URL's and it seemed to be true.
Then you can use something like this:
$testdomains = array(
'http://google.com',
'http://probablynotexisting.com',
'http://subdomain.bbc.co.uk', // should resolve into bbc.co.uk, because it is not possible to ping co.uk
'http://bbc.co.uk'
);
foreach ($testdomains as $raw_domain) {
$domain = join('.', array_slice(explode('.', parse_url($raw_domain, PHP_URL_HOST)), -2));
$ip = gethostbyname($domain);
if ($ip == $domain) {
// failure, let's include another dot
$domain = join('.', array_slice(explode('.', parse_url($raw_domain, PHP_URL_HOST)), -3));
$ip = gethostbyname($domain);
if ($ip == $domain) {
// another failure, shall we give up and move on!
echo $raw_domain . ": failed<br />\n";
continue;
}
}
echo $raw_domain . ' -> ' . $domain . ": ok [" . $ip . "]<br />\n";
}
The output is like this:
http://google.com -> google.com: ok [72.14.204.147]
http://probablynotexisting.com: failed
http://subdomain.bbc.co.uk -> bbc.co.uk: ok [212.58.241.131]
http://bbc.co.uk -> bbc.co.uk: ok [212.58.241.131]
Note: resolving DNS is a slow process.
Let dig do the hard work for you. Extract the required base domain from the first field in the AUTHORITY section of a dig on any sub-domain (which doesn't need to exist) of the sub-domain/domain in question. Examples (in bash not php sorry)...
dig @8.8.8.8 notexist.google.com|grep -A1 ';; AUTHORITY SECTION:'|tail -n1|sed "s/[[:space:]]\+/~/g"|cut -d'~' -f1
google.com.
or
dig @8.8.8.8 notexist.test.google.com|grep -A1 ';; AUTHORITY SECTION:'|tail -n1|sed "s/[[:space:]]\+/~/g"|cut -d'~' -f1
google.com.
or
dig @8.8.8.8 notexist.www.xn--zgb6acm.xn--mgberp4a5d4ar|grep -A1 ';; AUTHORITY SECTION:'|tail -n1|sed "s/[[:space:]]\+/~/g"|cut -d'~' -f1
xn--zgb6acm.xn--mgberp4a5d4ar.
Where
grep -A1 filters out all lines except the line with the string ;; AUTHORITY SECTION: and 1 line after it.
tail -n1 leaves only the last 1 line of the above 2 lines.
sed "s/[[:space:]]\+/~/g" replaces dig's delimiters (1 or more consecutive spaces or tabs) with some custom delimiter ~. Could be any character which never occurs on the line.
cut -d'~' -f1 extracts the first field where the fields are delimited by the custom delimiter from above.
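Since the question was about PHP, here is a rough PHP equivalent of the grep/tail/sed/cut steps, applied to dig output that has already been captured as a string. The function name authorityOwner is hypothetical, and the $sample text is a made-up fragment shaped like dig output, not a real response.

```php
<?php
// Hypothetical helper: return the first field of the line that follows
// ";; AUTHORITY SECTION:" in captured dig output.
function authorityOwner(string $digOutput): ?string
{
    $lines = preg_split('/\R/', $digOutput);
    foreach ($lines as $i => $line) {
        if (strpos($line, ';; AUTHORITY SECTION:') !== false && isset($lines[$i + 1])) {
            // Fields are separated by runs of spaces or tabs, as in the sed step.
            $fields = preg_split('/\s+/', trim($lines[$i + 1]));
            return $fields[0] !== '' ? $fields[0] : null;
        }
    }
    return null; // no authority section found
}

// Made-up fragment for illustration only.
$sample = ";; AUTHORITY SECTION:\ngoogle.com.\t\t60\tIN\tSOA\tns1.google.com. dns-admin.google.com. 1 900 900 1800 60\n";
echo authorityOwner($sample), "\n";
```

The result keeps the trailing dot, as in the shell examples; rtrim($name, '.') would remove it.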

Determine User Input Contains URL

I have an input form field which collects mixed strings.
I need to determine whether a posted string contains a URL (e.g. http://link.com, link.com, www.link.com, etc.) so it can then be anchored properly as needed.
An example of this would be micro-blogging functionality, where the processing script anchors anything that looks like a link. Another example is this very post, where 'http://link.com' got anchored automatically.
I believe I should approach this on display and not on input. How could I go about it?
You can use regular expressions to call a function on every match in PHP. You can for example use something like this:
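<?php
function makeLink($match) {
    // preg_replace_callback() passes an array of groups; index 0 is the whole match.
    $text = $match[0];
    // Parse link: prepend a scheme when the match does not start with one.
    $substr = substr($text, 0, 6);
    if ($substr != 'http:/' && $substr != 'https:' && $substr != 'ftp://' && $substr != 'news:/' && $substr != 'file:/') {
        $url = 'http://' . $text;
    } else {
        $url = $text;
    }
    // Escape both values so the matched text cannot break out of the attribute or the tag.
    return '<a href="' . htmlspecialchars($url) . '">' . htmlspecialchars($text) . '</a>';
}
function makeHyperlinks($text) {
    // Find links and call the makeLink() function on them.
    // The /e modifier was removed in PHP 7, so a callback is used instead.
    return preg_replace_callback('/((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])/', 'makeLink', $text);
}
?>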
You will want to use a regular expression to match common URL patterns. PHP offers a function called preg_match that allows you to do this.
The regular expression itself could take several forms, but here is something to get you started (you could also just Google 'URL regex'):
'/^((https?|ftp):\/\/)?([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,4}(\/[a-zA-Z0-9\/+=%&_.~?-]*)?$/'
So your code should look something like this:
$matches = array(); // will hold the results of the regular expression match
$string = "http://www.astringwithaurl.com";
$regexUrl = '/^((https?|ftp):\/\/)?([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,4}(\/[a-zA-Z0-9\/+=%&_.~?-]*)?$/';
preg_match($regexUrl, $string, $matches);
print_r($matches); // an array of matched patterns
From here, you just want to wrap those URL patterns in an anchor/href tag and you're done.
Just how accurate do you want to be? Given just how varied URLs can be, you're going to have to draw the line somewhere. For instance, www.ca is a perfectly valid hostname and does bring up a site, but it's not something you'd EXPECT to work.
You should investigate regular expressions for this.
You will build a pattern that will match the part of your string that looks like a URL and format it appropriately.
It will come out something like this (lifted this, haven't tested it);
$pattern = "((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)";
preg_match($pattern, $input_string, $url_matches, PREG_OFFSET_CAPTURE, 3);
$url_matches will contain an array of all of the parts of the input string that matched the url pattern.
You can use $_SERVER['HTTP_HOST'] to get the host information.
<?php
$host = $_SERVER['HTTP_HOST'];
?>