Are two URLs identical? Ignore the param order - php

I have two URLs and am looking for the best way to decide if they are identical.
Example:
$url1 = 'http://example.com/page.php?tab=items&msg=3&sort=title';
$url2 = 'http://example.com/page.php?tab=items&sort=title&msg=3';
In the two URLs only the sort and msg param are switched, so I consider them equal.
However I cannot simply do if ( $url1 == $url2 ) { … }
I have a list of URLs and need to find duplicates, so the code should be fast because it runs inside a loop. (As a side note: the domain/page.php will always be the same; it's only about finding URLs by params.)

Maybe like this?
function compare_url($url1, $url2){
return (parse_url($url1,PHP_URL_QUERY) == parse_url($url2,PHP_URL_QUERY));
}

It's not as easy as it might sound to find out whether two URIs are identical, especially as you take the query parameters into account here.
One common way to do this is to have a function that normalizes the URL and then compare the normalized URIs:
$url1 = 'http://example.com/page.php?tab=items&msg=3&sort=title';
$url2 = 'http://example.com/page.php?tab=items&sort=title&msg=3';
var_dump(url_nornalize($url1) == url_nornalize($url2)); # bool(true)
You put your requirements into such a normalization function. First of all, the URL should be normalized according to the specs:
function url_nornalize($url, $separator = '&')
{
// normalize according to RFC 3986
$url = new Net_URL2($url);
$url->normalize();
And then you can take care of additional normalization steps, for example, sorting the sub-parts of the query:
// normalize query if applicable
$query = $url->getQuery();
if (false !== $query) {
$params = explode($separator, $query);
sort($params);
$query = implode($separator, $params);
$url->setQuery($query);
}
Additional steps can be thought of, like removing default parameters, disallowed ones, or duplicates.
Finally, the normalized URL is returned as a string:
return (string) $url;
}
Using an array/hash-map for the parameters isn't bad either; I just wanted to show an alternative approach. Full example:
<?php
/**
* http://stackoverflow.com/questions/27667182/are-two-urls-identical-ignore-the-param-order
*/
require_once 'Net/URL2.php';
function url_nornalize($url, $separator = '&')
{
// normalize according to RFC 3986
$url = new Net_URL2($url);
$url->normalize();
// normalize query if applicable
$query = $url->getQuery();
if (false !== $query) {
$params = explode($separator, $query);
// remove empty parameters
$params = array_filter($params, 'strlen');
// sort parameters
sort($params);
$query = implode($separator, $params);
$url->setQuery($query);
}
return (string)$url;
}
$url1 = 'http://EXAMPLE.com/p%61ge.php?tab=items&&&msg=3&sort=title';
$url2 = 'http://example.com:80/page.php?tab=items&sort=title&msg=3';
var_dump(url_nornalize($url1) == url_nornalize($url2)); # bool(true)
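The array/hash-map approach mentioned above, applied to the asker's case of finding duplicates in a list, could look roughly like the sketch below. It assumes (as the question states) that scheme, host and path are always the same, so only the query string is compared. The names query_key() and find_duplicate_urls() are made up for this illustration. Note that parse_str() keeps only the last value of a repeated key and URL-decodes names and values, which may or may not be what you want.

```php
<?php
// Hypothetical helper: builds an order-independent key from a URL's query string.
// Assumption: scheme, host and path are identical for all URLs in the list.
function query_key(string $url): string
{
    // parse_url() returns null when there is no query; cast to an empty string.
    $query = (string) parse_url($url, PHP_URL_QUERY);
    parse_str($query, $params);
    // Sort by parameter name so the original order no longer matters.
    ksort($params);
    return http_build_query($params, '', '&');
}

// Hypothetical helper: returns every URL whose parameters were already seen.
function find_duplicate_urls(array $urls): array
{
    $seen = [];
    $duplicates = [];
    foreach ($urls as $url) {
        $key = query_key($url);
        if (isset($seen[$key])) {
            $duplicates[] = $url;
        } else {
            $seen[$key] = $url;
        }
    }
    return $duplicates;
}
```

Because each URL is normalized once and looked up by key, the loop stays linear in the number of URLs instead of comparing every pair.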

To make sure that both URLs are identical, we need to compare at least 4 elements:
1. The scheme (e.g. http, https, ftp)
2. The host, i.e. the domain name of the URL
3. The path, i.e. the "file" that was requested
4. Query parameters of the request.
Some notes:
(1) and (2) are case-insensitive, which means http://example.org is identical to HTTP://EXAMPLE.ORG.
(3) can have leading or trailing slashes, that should be ignored: example.org is identical to example.org/
(4) could include parameters in varying order.
We can safely ignore anchor text, or "fragment" (#anchor after the query parameters), as they are only parsed by the browser.
URLs can also include port-numbers, a username and password - I think we can ignore those elements, as they are used so rarely that they do not need to be checked here.
Solution:
Here's a complete function that checks all those details:
/**
* Check if two urls match while ignoring order of params
*
* @param string $url1
* @param string $url2
* @return bool
*/
function do_urls_match( $url1, $url2 ) {
// Parse urls
$parts1 = parse_url( $url1 );
$parts2 = parse_url( $url2 );
// Scheme and host are case-insensitive.
$scheme1 = strtolower( $parts1[ 'scheme' ] ?? '' );
$scheme2 = strtolower( $parts2[ 'scheme' ] ?? '' );
$host1 = strtolower( $parts1[ 'host' ] ?? '' );
$host2 = strtolower( $parts2[ 'host' ] ?? '' );
if ( $scheme1 !== $scheme2 ) {
// URL scheme mismatch (http <-> https): URLs are not identical.
return false;
}
if ( $host1 !== $host2 ) {
// Different host (domain name): Not identical.
return false;
}
// Remove leading/trailing slashes, url-decode special characters.
$path1 = trim( urldecode( $parts1[ 'path' ] ?? '' ), '/' );
$path2 = trim( urldecode( $parts2[ 'path' ] ?? '' ), '/' );
if ( $path1 !== $path2 ) {
// The request-path is different: Different URLs.
return false;
}
// Convert the query-params into arrays.
parse_str( $parts1['query'] ?? '', $query1 );
parse_str( $parts2['query'] ?? '', $query2 );
if ( count( $query1 ) !== count( $query2 ) ) {
// Both URLs have a different number of params: They cannot match.
return false;
}
// Only compare the query-arrays when params are present.
if (count( $query1 ) > 0 ) {
ksort( $query1 );
ksort( $query2 );
// Compare keys and values; array_diff() would only compare the values.
if ( $query1 !== $query2 ) {
// Query arrays have differences: URLs do not match.
return false;
}
}
// All checks passed, URLs are identical.
return true;
} // End do_urls_match()
Test cases:
$base_urls = [
'https://example.org/',
'https://example.org/index.php?sort=asc&field=id&filter=foo',
'http://EXAMPLE.com/p%61ge.php?tab=items&&&msg=3&sort=title',
];
$compare_urls = [
'https://example.org/',
'https://Example.Org',
'https://example.org/index.php?sort=asc&&field=id&filter=foo',
'http://example.org/index.php?sort=asc&field=id&filter=foo',
'https://company.net/page.php?sort=asc&field=id&filter=foo',
'https://example.org/index.php?sort=asc&&&field=id&filter=foo#anchor',
'https://example.org/index.php?field=id&filter=foo&sort=asc',
'http://example.com:80/page.php?tab=items&sort=title&msg=3',
];
foreach ( $base_urls as $url1 ) {
printf( "\n\n%s", $url1 );
foreach ( $compare_urls as $url2 ) {
if (do_urls_match( $url1, $url2 )) {
printf( "\n [MATCHES] %s", $url2 );
}
}
}
/* Output:
https://example.org/
[MATCHES] https://example.org/
[MATCHES] https://Example.Org
https://example.org/index.php?sort=asc&field=id&filter=foo
[MATCHES] https://example.org/index.php?sort=asc&&field=id&filter=foo
[MATCHES] https://example.org/index.php?sort=asc&&&field=id&filter=foo#anchor
[MATCHES] https://example.org/index.php?field=id&filter=foo&sort=asc
http://EXAMPLE.com/p%61ge.php?tab=items&&&msg=3&sort=title
[MATCHES] http://example.com:80/page.php?tab=items&sort=title&msg=3
*/

Related

turning this trainwreck of a function into a recursive one

I've been trying to build this recursive function for the better part of a day now, but I just can't seem to get it to work the way I want.
First, I have a property which holds some data that the function has to access:
$this->data
And then I have this string which the intention is to turn into a relative path:
$path = 'path.to.%id%-%folder%.containing.%info%';
The parts of the string that look like %value% will load dynamic values found in the $this->data property (like so: $this->data['id']; or $this->data['folder'];)
And to make things really interesting, the property can reference itself again like so: $this->data['folder'] = 'foldername.%subfolder%'; and also have two %values% separated by a - that would have to be left alone.
So to the problem, I've been trying to make a recursive function that will load the dynamic values from the data property, and then again if the new value contains another %value% and so on until no more %value%'s are loaded.
So far, this is what I've been able to come up with:
public function recursiveFolder( $folder, $pathArr = null )
{
$newPathArr = explode( '.', $folder );
if ( count ( $newPathArr ) !== 1 )
{
foreach( $newPathArr as $id => $folder )
{
$value = $this->recursiveFolder( $folder, $newPathArr );
$resultArr = explode( '.', $value );
if ( count ( $resultArr ) !== 1 )
{
foreach ( $resultArr as $nid => $result )
{
$nvalue = $this->recursiveFolder( $result, $newPathArr );
$resultArr[$nid] = $nvalue;
}
}
$resultArr = implode( '.',$resultArr );
$newPathArr[$id] = $resultArr;
}
}
else
{
$pattern = '/%(.*?)%/si';
preg_match_all( $pattern, $folder, $matches );
if ( empty( $matches[0] ) )
{
return $folder;
}
foreach ( $matches[1] as $mid => $match )
{
if ( isset( $this->data[$match] ) && $this->data[$match] != '' )
{
$folder = str_replace( $matches[0][$mid], $this->data[$match], $folder );
return $folder;
}
}
}
return $newPathArr;
}
Unfortunately it is not a recursive function at all as it grinds to a halt when it has multiple layers of %values%, but works with two layers -barely-. (I just coded it so that it would work at a bare minimalistic level this point).
Here's how it should work:
It should turn:
'files.%folder%.blog-%type%.and.%time%'
into:
'files.foldername.blog-post.and.2013.feb-12th.09'
based on this:
$data['folder'] = 'foldername';
$data['type'] = 'post';
$data['time'] = '%year%.%month%-%day%';
$data['year'] = 2013;
$data['month'] = 'feb';
$data['day'] = '12th.%hour%';
$data['hour'] = '09';
Hope you can help!
Jay
I don't see the need for this to be solved recursively:
<?php
function putData($str, $data)
{
// Repeat the replacing process until no more matches are found:
while (preg_match("/%(.*?)%/si", $str, $matches))
{
// Use $matches to make your replaces
}
return $str;
}
?>
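One possible way to fill in the loop body is sketched below. It is not the answerer's code: the function name expand_placeholders() and the $maxPasses guard are assumptions added for this illustration. The guard stops the loop if the data references itself in a cycle, and placeholders without a non-empty value in $data are left untouched.

```php
<?php
// Hypothetical helper: repeatedly expands %name% placeholders from $data.
// $maxPasses is an assumed safety limit against self-referencing data.
function expand_placeholders(string $str, array $data, int $maxPasses = 20): string
{
    for ($pass = 0; $pass < $maxPasses; $pass++) {
        $expanded = preg_replace_callback(
            '/%([^%]+)%/',
            function (array $m) use ($data) {
                // Leave unknown or empty placeholders exactly as they are.
                if (!isset($data[$m[1]]) || $data[$m[1]] === '') {
                    return $m[0];
                }
                return (string) $data[$m[1]];
            },
            $str
        );
        // Stop on a regex error or as soon as a pass changes nothing.
        if ($expanded === null || $expanded === $str) {
            break;
        }
        $str = $expanded;
    }
    return $str;
}
```

With the data from the question, 'files.%folder%.blog-%type%.and.%time%' is resolved in three passes: first %folder%, %type% and %time%, then %year%, %month% and %day%, and finally %hour%.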

Access multiple GET parameters in PHP without bracket notation

PHP automatically creates arrays in $_GET, when the parameter name is followed by [] or [keyname].
However for a public API I'd love to have the same behaviour without explicit brackets in the URL. For example, the query
?foo=bar&foo=baz
should result in a $_GET (or similar) like this:
$_GET["foo"] == array("bar", "baz");
Is there any possibility to get this behaviour in PHP easily? I.e., not parsing $_SERVER['QUERY_STRING'] myself or preg_replacing = with []= in the query string before feeding it to parse_str()?
There's no built in way to support ?foo=bar&foo=baz.
Daniel Morell proposed a solution which manually parses the URL string and iteratively builds up an array when multiple instances of the parameter exist, or returns a string when only one parameter exists (i.e. it matches the default behaviour).
It supports both types of URLs, with and without a bracket:
?foo=bar&foo=baz // works
?foo[]=bar&foo[]=baz // works
/**
* Parses GET and POST form input like $_GET and $_POST, but without requiring multiple select inputs to end the name
* in a pair of brackets.
*
* @param string $method The input method to use 'GET' or 'POST'.
* @param string $querystring A custom form input in the query string format.
* @return array $output Returns an array containing the input keys and values.
*/
function bracketless_input( $method, $querystring=null ) {
// Create empty array to hold the output
$output = array();
// Get query string from function call
if( $querystring !== null ) {
$query = $querystring;
// Get raw POST data
} elseif ($method == 'POST') {
$query = file_get_contents('php://input');
// Get raw GET data
} elseif ($method == 'GET') {
$query = $_SERVER['QUERY_STRING'];
}
// Separate each parameter into key value pairs
foreach( explode( '&', $query ) as $params ) {
// Split on the first "=" only, so values may contain "="; default to an empty value
$parts = array_pad( explode( '=', $params, 2 ), 2, '' );
// Remove any existing brackets and clean up key and value
$parts[0] = trim(preg_replace( '(\%5B|\%5D|[\[\]])', '', $parts[0] ) );
$parts[0] = preg_replace( '([^0-9a-zA-Z])', '_', urldecode($parts[0]) );
$parts[1] = urldecode($parts[1]);
// Create new key in $output array if param does not exist.
if( !key_exists( $parts[0], $output ) ) {
$output[$parts[0]] = $parts[1];
// Add param to array if param key already exists in $output
} elseif( is_array( $output[$parts[0]] ) ) {
array_push( $output[$parts[0]], $parts[1] );
// Otherwise turn $output param into array and append current param
} else {
$output[$parts[0]] = array( $output[$parts[0]], $parts[1] );
}
}
return $output;
}
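For comparison, the core idea (collect repeated keys into an array, keep single keys as strings) can be reduced to a few lines. This is only a sketch under simplifying assumptions: the name parse_repeated_query() is invented here, it only handles a trailing [] and not nested keys like foo[bar], and unlike the function above it does not sanitize key names.

```php
<?php
// Hypothetical compact variant: repeated keys become arrays, single keys stay strings.
// Assumption: only a trailing "[]" is stripped; nested keys such as foo[bar] are not handled.
function parse_repeated_query(string $query): array
{
    $out = [];
    foreach (explode('&', ltrim($query, '?')) as $pair) {
        if ($pair === '') {
            continue; // skip empty segments such as "a=1&&b=2"
        }
        // Split on the first "=" only; a missing value becomes an empty string.
        [$key, $value] = array_pad(explode('=', $pair, 2), 2, '');
        $key = urldecode($key);
        $value = urldecode($value);
        if (substr($key, -2) === '[]') {
            $key = substr($key, 0, -2);
        }
        if (!array_key_exists($key, $out)) {
            $out[$key] = $value;
        } elseif (is_array($out[$key])) {
            $out[$key][] = $value;
        } else {
            $out[$key] = [$out[$key], $value];
        }
    }
    return $out;
}
```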
you can try something like this:
foreach($_GET as $slug => $value) {
#whatever you want to do, for example
print $_GET[$slug];
}

How to combine query strings in PHP

Given a url, and a query string, how can I get the url resulting from the combination of the query string with the url?
I'm looking for functionality similar to .htaccess's qsa. I realize this would be fairly trivial to implement completely by hand, however are there built-in functions that deal with query strings which could either simplify or completely solve this?
Example input/result sets:
Url="http://www.example.com/index.php/page?a=1"
QS ="?b=2"
Result="http://www.example.com/index.php/page?a=1&b=2"
-
Url="page.php"
QS ="?b=2"
Result="page.php?b=2"
How about something that uses no PECL extensions and isn't a huge set of copied-and-pasted functions? It's still a tad complex because you're splicing together two query strings and want to do it in a way that isn't just $old .= $new;
We'll use parse_url to extract the query string from the desired url, parse_str to parse the query strings you wish to join, array_merge to join them together, and http_build_query to create the new, combined string for us.
// Parse the URL into components
$url = 'http://...';
$url_parsed = parse_url($url);
$new_qs_parsed = array();
// Grab our first query string (the URL may not have one)
parse_str($url_parsed['query'] ?? '', $new_qs_parsed);
// Here's the other query string
$other_query_string = 'that=this&those=these';
$other_qs_parsed = array();
parse_str($other_query_string, $other_qs_parsed);
// Stitch the two query strings together
$final_query_string_array = array_merge($new_qs_parsed, $other_qs_parsed);
$final_query_string = http_build_query($final_query_string_array);
// Now, our final URL:
$new_url = $url_parsed['scheme']
. '://'
. $url_parsed['host']
. $url_parsed['path']
. '?'
. $final_query_string;
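The snippet above assumes an absolute URL that already has a query string. A variant that also covers the relative "page.php" example from the question might look like the sketch below. The function name append_query() is made up for this illustration, and it splits the URL by hand instead of relying on parse_url(), so that a missing scheme or host is not a problem. On conflicting keys the appended query string wins.

```php
<?php
// Hypothetical helper: appends a query string to an absolute or relative URL.
// Assumption: on conflicting keys, the value from $qs replaces the one in $url.
function append_query(string $url, string $qs): string
{
    // Keep a fragment (#...) aside so it ends up after the query again.
    $fragment = '';
    $hashPos = strpos($url, '#');
    if ($hashPos !== false) {
        $fragment = substr($url, $hashPos);
        $url = substr($url, 0, $hashPos);
    }
    // Split the base from an existing query string, if any.
    $base = $url;
    $existing = '';
    $qPos = strpos($url, '?');
    if ($qPos !== false) {
        $base = substr($url, 0, $qPos);
        $existing = substr($url, $qPos + 1);
    }
    parse_str($existing, $old);
    parse_str(ltrim($qs, '?'), $new);
    // array_replace() keeps numeric keys intact, unlike array_merge().
    $query = http_build_query(array_replace($old, $new), '', '&');
    return $base . ($query !== '' ? '?' . $query : '') . $fragment;
}
```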
You can get the query string part from url using:
$_SERVER['QUERY_STRING']
and then append it to url normally.
If you want to specify your own custom variables in query string, have a look at:
http_build_query
This is a series of functions taken from the WordPress "framework" that will do it, but this could quite well be too much:
add_query_arg()
/**
* Retrieve a modified URL query string.
*
* You can rebuild the URL and append a new query variable to the URL query by
* using this function. You can also retrieve the full URL with query data.
*
* Adding a single key & value or an associative array. Setting a key value to
* emptystring removes the key. Omitting oldquery_or_uri uses the $_SERVER
* value.
*
* @since 1.0
*
* @param mixed $param1 Either newkey or an associative_array
* @param mixed $param2 Either newvalue or oldquery or uri
* @param mixed $param3 Optional. Old query or uri
* @return string New URL query string.
*/
public function add_query_arg() {
$ret = '';
if ( is_array( func_get_arg(0) ) ) {
$uri = ( @func_num_args() < 2 || false === @func_get_arg( 1 ) ) ? $_SERVER['REQUEST_URI'] : @func_get_arg( 1 );
} else {
$uri = ( @func_num_args() < 3 || false === @func_get_arg( 2 ) ) ? $_SERVER['REQUEST_URI'] : @func_get_arg( 2 );
}
if ( $frag = strstr( $uri, '#' ) ) {
$uri = substr( $uri, 0, -strlen( $frag ) );
} else {
$frag = '';
}
if ( preg_match( '|^https?://|i', $uri, $matches ) ) {
$protocol = $matches[0];
$uri = substr( $uri, strlen( $protocol ) );
} else {
$protocol = '';
}
if ( strpos( $uri, '?' ) !== false ) {
$parts = explode( '?', $uri, 2 );
if ( 1 == count( $parts ) ) {
$base = '?';
$query = $parts[0];
} else {
$base = $parts[0] . '?';
$query = $parts[1];
}
} elseif ( !empty( $protocol ) || strpos( $uri, '=' ) === false ) {
$base = $uri . '?';
$query = '';
} else {
$base = '';
$query = $uri;
}
parse_str( $query, $qs );
if ( get_magic_quotes_gpc() )
$qs = format::stripslashes_deep( $qs );
$qs = format::urlencode_deep( $qs ); // this re-URL-encodes things that were already in the query string
if ( is_array( func_get_arg( 0 ) ) ) {
$kayvees = func_get_arg( 0 );
$qs = array_merge( $qs, $kayvees );
} else {
$qs[func_get_arg( 0 )] = func_get_arg( 1 );
}
foreach ( ( array ) $qs as $k => $v ) {
if ( $v === false )
unset( $qs[$k] );
}
$ret = http_build_query( $qs, '', '&' );
$ret = trim( $ret, '?' );
$ret = preg_replace( '#=(&|$)#', '$1', $ret );
$ret = $protocol . $base . $ret . $frag;
$ret = rtrim( $ret, '?' );
return $ret;
}
stripslashes_deep()
/**
* Navigates through an array and removes slashes from the values.
*
* If an array is passed, the array_map() function causes a callback to pass the
* value back to the function. The slashes from this value will removed.
*
* @since 1.0
*
* @param array|string $value The array or string to be stripped
* @return array|string Stripped array (or string in the callback).
*/
function stripslashes_deep( $value ) {
return is_array( $value ) ? array_map( array('self', 'stripslashes_deep'), $value ) : stripslashes( $value );
}
urlencode_deep()
/**
* Navigates through an array and encodes the values to be used in a URL.
*
* Uses a callback to pass the value of the array back to the function as a
* string.
*
* @since 1.0
*
* @param array|string $value The array or string to be encoded.
* @return array|string $value The encoded array (or string from the callback).
*/
public function urlencode_deep( $value ) {
return is_array($value) ? array_map( array('self', 'urlencode_deep'), $value) : urlencode($value);
}
There is no built-in function to do this. However, you can use this function from the http PECL extension:
http://usphp.com/manual/en/function.http-build-url.php
For example,
$url = http_build_url("http://www.example.com/index.php/page?a=1",
array(
"b" => "2"
)
);
So what happens if the urls conflict? If both urls contain a b= component in the querystring? You'd need to decide which holds sway.
Here's a chunk of code that does what you want, parsing each string as a url, then extracting the query url part and implode() ing them back together.
$url="http://www.example.com/index.php/page?a=1";
$qs ="?b=2";
$url_parsed = parse_url($url);
$qs_parsed = parse_url($qs);
$args = array(
$url_parsed['query'],
$qs_parsed['query'],
);
$new_url = $url_parsed['scheme'];
$new_url .= '://';
$new_url .= $url_parsed['host'];
$new_url .= $url_parsed['path'];
$new_url .= '?';
$new_url .= implode('&', $args);
print $new_url;
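The implode() version above keeps both values when a key appears in both query strings. If one side should win instead, the two parsed arrays can be combined with array_replace() or the + operator, as in this sketch. The function name merge_queries() and its $secondWins flag are invented for the illustration.

```php
<?php
// Hypothetical helper: merges two query strings and resolves conflicting keys.
// $secondWins = true  -> values from $second override those from $first.
// $secondWins = false -> values from $first are kept on conflict.
function merge_queries(string $first, string $second, bool $secondWins = true): string
{
    parse_str(ltrim($first, '?'), $a);
    parse_str(ltrim($second, '?'), $b);
    // array_replace(): later array wins; "+": the left-hand array wins.
    $merged = $secondWins ? array_replace($a, $b) : $a + $b;
    return http_build_query($merged, '', '&');
}
```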

Get domain name (not subdomain) in php

I have a URL which can be any of the following formats:
http://example.com
https://example.com
http://example.com/foo
http://example.com/foo/bar
www.example.com
example.com
foo.example.com
www.foo.example.com
foo.bar.example.com
http://foo.bar.example.com/foo/bar
example.net/foo/bar
Essentially, I need to be able to match any normal URL. How can I extract example.com (or .net, whatever the tld happens to be. I need this to work with any TLD.) from all of these via a single regex?
Well you can use parse_url to get the host:
$info = parse_url($url);
$host = $info['host'];
Then, you can do some fancy stuff to get only the TLD and the Host
$host_names = explode(".", $host);
$bottom_host_name = $host_names[count($host_names)-2] . "." . $host_names[count($host_names)-1];
Not very elegant, but should work.
If you want an explanation, here it goes:
First we grab the host, i.e. everything between the scheme (http://, etc.) and the path, by using parse_url's capabilities to... well.... parse URLs. :)
Then we take the host name, and separate it into an array based on where the periods fall, so test.world.hello.myname would become:
array("test", "world", "hello", "myname");
After that, we take the number of elements in the array (4).
Then, we subtract 2 from it to get the second to last string (the hostname, or example, in your example)
Then, we subtract 1 from it to get the last string (because array keys start at 0), also known as the TLD
Then we combine those two parts with a period, and you have your base host name.
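The steps above can be wrapped into a small function, sketched below. Two things are additions, not part of the answer: the function name last_two_labels(), and the step that prepends http:// when the input has no scheme, since parse_url() does not report a host for inputs like www.example.com. The sketch shares the limitation of the approach: for a host such as www.example.co.uk it returns co.uk.

```php
<?php
// Hypothetical helper: returns the last two dot-separated labels of the host.
// Assumption: inputs without a scheme are treated as http:// URLs.
function last_two_labels(string $url): string
{
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return '';
    }
    // Lower-case the host and drop a trailing dot before splitting it.
    $labels = explode('.', strtolower(rtrim($host, '.')));
    return implode('.', array_slice($labels, -2));
}
```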
It is not possible to get the domain name without using a TLD list to compare with, as there exist many cases with exactly the same structure and length:
nas.db.de (Subdomain)
bbc.co.uk (Top-Level-Domain)
www.uk.com (Subdomain)
big.uk.com (Second-Level-Domain)
Mozilla's public suffix list should be the best option as it is used by all major browsers:
https://publicsuffix.org/list/public_suffix_list.dat
Feel free to use my function:
function tld_list($cache_dir=null) {
// we use "/tmp" if $cache_dir is not set
$cache_dir = isset($cache_dir) ? $cache_dir : sys_get_temp_dir();
$lock_dir = $cache_dir . '/public_suffix_list_lock/';
$list_dir = $cache_dir . '/public_suffix_list/';
// refresh list every 30 days
if (file_exists($list_dir) && @filemtime($list_dir) + 2592000 > time()) {
return $list_dir;
}
// use exclusive lock to avoid race conditions
if (!file_exists($lock_dir) && @mkdir($lock_dir)) {
// read from source
$list = @fopen('https://publicsuffix.org/list/public_suffix_list.dat', 'r');
if ($list) {
// the list is older than 30 days so delete everything first
if (file_exists($list_dir)) {
foreach (glob($list_dir . '*') as $filename) {
unlink($filename);
}
rmdir($list_dir);
}
// now set list directory with new timestamp
mkdir($list_dir);
// read line-by-line to avoid high memory usage
while ($line = fgets($list)) {
// skip comments and empty lines
if ($line[0] == '/' || !$line) {
continue;
}
// remove wildcard
if ($line[0] . $line[1] == '*.') {
$line = substr($line, 2);
}
// remove exclamation mark
if ($line[0] == '!') {
$line = substr($line, 1);
}
// reverse TLD and remove linebreak
$line = implode('.', array_reverse(explode('.', (trim($line)))));
// we split the TLD list to reduce memory usage
touch($list_dir . $line);
}
fclose($list);
}
@rmdir($lock_dir);
}
// repair locks (should never happen)
if (file_exists($lock_dir) && mt_rand(0, 100) == 0 && @filemtime($lock_dir) + 86400 < time()) {
@rmdir($lock_dir);
}
return $list_dir;
}
function get_domain($url=null) {
// obtain location of public suffix list
$tld_dir = tld_list();
// no url = our own host
$url = isset($url) ? $url : $_SERVER['SERVER_NAME'];
// add missing scheme ftp:// http:// ftps:// https://
$url = !isset($url[5]) || ($url[3] != ':' && $url[4] != ':' && $url[5] != ':') ? 'http://' . $url : $url;
// remove "/path/file.html", "/:80", etc.
$url = parse_url($url, PHP_URL_HOST);
// replace absolute domain name by relative (http://www.dns-sd.org/TrailingDotsInDomainNames.html)
$url = trim($url, '.');
// check if TLD exists
$url = explode('.', $url);
$parts = array_reverse($url);
foreach ($parts as $key => $part) {
$tld = implode('.', $parts);
if (file_exists($tld_dir . $tld)) {
return !$key ? '' : implode('.', array_slice($url, $key - 1));
}
// remove last part
array_pop($parts);
}
return '';
}
What makes it special:
it accepts every input like URLs, hostnames or domains with- or without scheme
the list is downloaded row-by-row to avoid high memory usage
it creates a new file per TLD in a cache folder so get_domain() only needs to check through file_exists() if it exists so it does not need to include a huge database on every request like TLDExtract does it.
the list will be automatically updated every 30 days
Test:
$urls = array(
'http://www.example.com',// example.com
'http://subdomain.example.com',// example.com
'http://www.example.uk.com',// example.uk.com
'http://www.example.co.uk',// example.co.uk
'http://www.example.com.ac',// example.com.ac
'http://example.com.ac',// example.com.ac
'http://www.example.accident-prevention.aero',// example.accident-prevention.aero
'http://www.example.sub.ar',// sub.ar
'http://www.congresodelalengua3.ar',// congresodelalengua3.ar
'http://congresodelalengua3.ar',// congresodelalengua3.ar
'http://www.example.pvt.k12.ma.us',// example.pvt.k12.ma.us
'http://www.example.lib.wy.us',// example.lib.wy.us
'com',// empty
'.com',// empty
'http://big.uk.com',// big.uk.com
'uk.com',// empty
'www.uk.com',// www.uk.com
'.uk.com',// empty
'stackoverflow.com',// stackoverflow.com
'.foobarfoo',// empty
'',// empty
false,// empty
' ',// empty
1,// empty
'a',// empty
);
Recent version with explanations (German):
http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm
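The matching step itself (find the longest known suffix, then keep one more label) can also be illustrated in memory, without the file cache. The sketch below is a simplification and not a replacement for the function above: the name registrable_domain() is made up, the suffix array passed in is a tiny hand-picked sample rather than the real Public Suffix List, and wildcard (*.) and exception (!) rules are ignored.

```php
<?php
// Hypothetical helper: finds the registrable domain using a list of known suffixes.
// Assumption: $suffixes is a plain list like ['com', 'co.uk']; wildcard and
// exception rules from the real Public Suffix List are not handled.
function registrable_domain(string $host, array $suffixes): string
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $known = array_flip($suffixes);
    // Try the longest candidate first: the whole host, then drop labels from the left.
    for ($i = 0; $i < count($labels); $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (isset($known[$candidate])) {
            // A host that is itself a suffix has no registrable domain.
            return $i === 0 ? '' : implode('.', array_slice($labels, $i - 1));
        }
    }
    return '';
}
```

With the sample suffixes com, de, uk, co.uk and uk.com, this gives bbc.co.uk for www.bbc.co.uk and db.de for nas.db.de, which shows why two hosts of the same shape can have different answers.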
My solution in https://gist.github.com/pocesar/5366899
and the tests are here http://codepad.viper-7.com/GAh1tP
It works with any TLD, and hideous subdomain patterns (up to 3 subdomains).
There's a test included with many domain names.
Won't paste the function here because of the weird indentation for code in StackOverflow (could have fenced code blocks like github)
echo getDomainOnly("http://example.com/foo/bar");
function getDomainOnly($host){
$host = strtolower(trim($host));
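// ltrim($host, "www.") would strip any leading "w" and "." characters, so remove the prefix with a regex instead
$host = preg_replace('/^www\./', '', str_replace("http://","",str_replace("https://","",$host)));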
$count = substr_count($host, '.');
if($count === 2){
if(strlen(explode('.', $host)[1]) > 3) $host = explode('.', $host, 2)[1];
} else if($count > 2){
$host = getDomainOnly(explode('.', $host, 2)[1]);
}
$host = explode('/',$host);
return $host[0];
}
I recommend using TLDExtract library for all operations with domain name.
I think the best way to handle this problem is:
$second_level_domains_regex = '/\.asn\.au$|\.com\.au$|\.net\.au$|\.id\.au$|\.org\.au$|\.edu\.au$|\.gov\.au$|\.csiro\.au$|\.act\.au$|\.nsw\.au$|\.nt\.au$|\.qld\.au$|\.sa\.au$|\.tas\.au$|\.vic\.au$|\.wa\.au$|\.co\.at$|\.or\.at$|\.priv\.at$|\.ac\.at$|\.avocat\.fr$|\.aeroport\.fr$|\.veterinaire\.fr$|\.co\.hu$|\.film\.hu$|\.lakas\.hu$|\.ingatlan\.hu$|\.sport\.hu$|\.hotel\.hu$|\.ac\.nz$|\.co\.nz$|\.geek\.nz$|\.gen\.nz$|\.kiwi\.nz$|\.maori\.nz$|\.net\.nz$|\.org\.nz$|\.school\.nz$|\.cri\.nz$|\.govt\.nz$|\.health\.nz$|\.iwi\.nz$|\.mil\.nz$|\.parliament\.nz$|\.ac\.za$|\.gov\.za$|\.law\.za$|\.mil\.za$|\.nom\.za$|\.school\.za$|\.net\.za$|\.co\.uk$|\.org\.uk$|\.me\.uk$|\.ltd\.uk$|\.plc\.uk$|\.net\.uk$|\.sch\.uk$|\.ac\.uk$|\.gov\.uk$|\.mod\.uk$|\.mil\.uk$|\.nhs\.uk$|\.police\.uk$/';
$domain = $_SERVER['HTTP_HOST'];
$domain = explode('.', $domain);
$domain = array_reverse($domain);
if (preg_match($second_level_domains_regex, $_SERVER['HTTP_HOST'])) {
$domain = "$domain[2].$domain[1].$domain[0]";
} else {
$domain = "$domain[1].$domain[0]";
}
$onlyHostName = implode('.', array_slice(explode('.', parse_url($link, PHP_URL_HOST)), -2));
Using https://subdomain.domain.com/some/path as example
parse_url($link, PHP_URL_HOST) returns subdomain.domain.com
explode('.', parse_url($link, PHP_URL_HOST)) then breaks subdomain.domain.com into an array:
array(3) {
[0]=>
string(9) "subdomain"
[1]=>
string(6) "domain"
[2]=>
string(3) "com"
}
array_slice then slices the array so only the last 2 values are in the array (signified by the -2):
array(2) {
[0]=>
string(6) "domain"
[1]=>
string(3) "com"
}
implode then combines those two array values back together, ultimately giving you the result of domain.com
Note: this will only work when the end domain you're expecting has only one . in it, like something.domain.com or else.something.domain.net
It will not work for something.domain.co.uk where you would expect domain.co.uk
There are two ways to extract subdomain from a host:
The first method that is more accurate is to use a database of tlds (like public_suffix_list.dat) and match domain with it. This is a little heavy in some cases. There are some PHP classes for using it like php-domain-parser and TLDExtract.
The second way is not as accurate as the first one, but it is very fast and gives the correct answer in many cases. I wrote this function for it:
function get_domaininfo($url) {
// regex can be replaced with parse_url
preg_match("/^(https|http|ftp):\/\/(.*?)\//", "$url/" , $matches);
$parts = explode(".", $matches[2]);
$tld = array_pop($parts);
$host = array_pop($parts);
if ( strlen($tld) == 2 && strlen($host) <= 3 ) {
$tld = "$host.$tld";
$host = array_pop($parts);
}
return array(
'protocol' => $matches[1],
'subdomain' => implode(".", $parts),
'domain' => "$host.$tld",
'host'=>$host,'tld'=>$tld
);
}
Example:
print_r(get_domaininfo('http://mysubdomain.domain.co.uk/index.php'));
Returns:
Array
(
[protocol] => http
[subdomain] => mysubdomain
[domain] => domain.co.uk
[host] => domain
[tld] => co.uk
)
Here's a function I wrote to grab the domain without subdomain(s), regardless of whether the domain is using a ccTLD or a new style long TLD, etc... There is no lookup or huge array of known TLDs, and there's no regex. It can be a lot shorter using the ternary operator and nesting, but I expanded it for readability.
// Per Wikipedia: "All ASCII ccTLD identifiers are two letters long,
// and all two-letter top-level domains are ccTLDs."
function topDomainFromURL($url) {
$url_parts = parse_url($url);
$domain_parts = explode('.', $url_parts['host']);
if (strlen(end($domain_parts)) == 2 ) {
// ccTLD here, get last three parts
$top_domain_parts = array_slice($domain_parts, -3);
} else {
$top_domain_parts = array_slice($domain_parts, -2);
}
$top_domain = implode('.', $top_domain_parts);
return $top_domain;
}
function getDomain($url){
$pieces = parse_url($url);
$domain = isset($pieces['host']) ? $pieces['host'] : '';
if(preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)){
return $regs['domain'];
}
return FALSE;
}
echo getDomain("http://example.com"); // outputs 'example.com'
echo getDomain("http://www.example.com"); // outputs 'example.com'
echo getDomain("http://mail.example.co.uk"); // outputs 'example.co.uk'
I had problems with the solution provided by pocesar.
When I would use for instance subdomain.domain.nl it would not return domain.nl. Instead it would return subdomain.domain.nl
Another problem was that domain.com.br would return com.br
I am not sure but i fixed these issues with the following code (i hope it will help someone, if so I am a happy man):
function get_domain($domain, $debug = false){
$original = $domain = strtolower($domain);
if (filter_var($domain, FILTER_VALIDATE_IP)) {
return $domain;
}
$debug ? print('<strong style="color:green">»</strong> Parsing: '.$original) : false;
$arr = array_slice(array_filter(explode('.', $domain, 4), function($value){
return $value !== 'www';
}), 0); //rebuild array indexes
if (count($arr) > 2){
$count = count($arr);
$_sub = explode('.', $count === 4 ? $arr[3] : $arr[2]);
$debug ? print(" (parts count: {$count})") : false;
if (count($_sub) === 2){ // two level TLD
$removed = array_shift($arr);
if ($count === 4){ // got a subdomain acting as a domain
$removed = array_shift($arr);
}
$debug ? print("<br>\n" . '[*] Two level TLD: <strong>' . join('.', $_sub) . '</strong> ') : false;
}elseif (count($_sub) === 1){ // one level TLD
$removed = array_shift($arr); //remove the subdomain
if (strlen($arr[0]) === 2 && $count === 3){ // TLD domain must be 2 letters
array_unshift($arr, $removed);
}elseif(strlen($arr[0]) === 3 && $count === 3){
array_unshift($arr, $removed);
}else{
// non country TLD according to IANA
$tlds = array(
'aero',
'arpa',
'asia',
'biz',
'cat',
'com',
'coop',
'edu',
'gov',
'info',
'jobs',
'mil',
'mobi',
'museum',
'name',
'net',
'org',
'post',
'pro',
'tel',
'travel',
'xxx',
);
if (count($arr) > 2 && in_array($_sub[0], $tlds) !== false){ //special TLD don't have a country
array_shift($arr);
}
}
$debug ? print("<br>\n" .'[*] One level TLD: <strong>'.join('.', $_sub).'</strong> ') : false;
}else{ // more than 3 levels, something is wrong
for ($i = count($_sub); $i > 1; $i--){
$removed = array_shift($arr);
}
$debug ? print("<br>\n" . '[*] Three level TLD: <strong>' . join('.', $_sub) . '</strong> ') : false;
}
}elseif (count($arr) === 2){
$arr0 = array_shift($arr);
if (strpos(join('.', $arr), '.') === false && in_array($arr[0], array('localhost','test','invalid')) === false){ // not a reserved domain
$debug ? print("<br>\n" .'Seems invalid domain: <strong>'.join('.', $arr).'</strong> re-adding: <strong>'.$arr0.'</strong> ') : false;
// seems invalid domain, restore it
array_unshift($arr, $arr0);
}
}
$debug ? print("<br>\n".'<strong style="color:gray">«</strong> Done parsing: <span style="color:red">' . $original . '</span> as <span style="color:blue">'. join('.', $arr) ."</span><br>\n") : false;
return join('.', $arr);
}
Here's one that works for all domains, including those with second level domains like "co.uk"
function strip_subdomains($url){
# credits to gavingmiller for maintaining this list
$second_level_domains = file_get_contents("https://raw.githubusercontent.com/gavingmiller/second-level-domains/master/SLDs.csv");
# presume sld first ...
$possible_sld = implode('.', array_slice(explode('.', $url), -2));
# and then verify it (strpos() can return 0 for a match at the start, so compare against false)
if (strpos($second_level_domains, $possible_sld) !== false){
return implode('.', array_slice(explode('.', $url), -3));
} else {
return implode('.', array_slice(explode('.', $url), -2));
}
}
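The function above downloads the list on every call and does a substring search, which is slow inside a loop and can match in the middle of a longer entry. A minimal sketch of an exact lookup, assuming the suffix list has already been read into an array (the helper names `build_sld_set` and `registrable_domain` and the three sample entries are hypothetical, not part of the original answer):

```php
<?php
// Hypothetical helper: turn a list of second-level suffixes into a lookup set.
function build_sld_set(array $suffixes)
{
    $set = array();
    foreach ($suffixes as $suffix) {
        // Normalize entries such as ".co.uk" or "CO.UK " to "co.uk".
        $set[strtolower(trim($suffix, " \t\r\n."))] = true;
    }
    return $set;
}

// Hypothetical helper: keep three labels when the last two form a known suffix.
function registrable_domain($host, array $sld_set)
{
    $labels = explode('.', strtolower($host));
    $last_two = implode('.', array_slice($labels, -2));
    $keep = isset($sld_set[$last_two]) ? 3 : 2;
    return implode('.', array_slice($labels, -$keep));
}

// Sample entries only; a real list would be loaded once, e.g. from a local copy of the CSV.
$sld_set = build_sld_set(array('.co.uk', '.com.au', '.co.jp'));

echo registrable_domain('www.example.co.uk', $sld_set), "\n"; // example.co.uk
echo registrable_domain('blog.example.com', $sld_set), "\n";  // example.com
```

Building the set once outside the loop keeps each lookup to a single isset() call.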
Looks like there's a duplicate question here: delete-subdomain-from-url-string-if-subdomain-is-found
Very late, but I see that you tagged the question with regex. This function works well for me; so far I haven't found a URL that fails:
function get_domain_regex($url){
$pieces = parse_url($url);
$domain = isset($pieces['host']) ? $pieces['host'] : '';
if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
return $regs['domain'];
}else{
return false;
}
}
If you want one without regex, I have this one, which I am sure I also took from this post:
function get_domain($url){
$parseUrl = parse_url($url);
$host = $parseUrl['host'];
$host_array = explode(".", $host);
$domain = $host_array[count($host_array)-2] . "." . $host_array[count($host_array)-1];
return $domain;
}
They both work well, BUT it took me a while to realize that they fail if the URL doesn't start with http:// or https://, so make sure the URL string starts with the protocol.
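One way to guard against the missing protocol is to add a placeholder scheme before calling parse_url(). A minimal sketch, assuming any scheme-less input is meant to be a web address (the helper name `host_from_url` is hypothetical):

```php
<?php
// Hypothetical helper: return the lowercased host even when the scheme is missing.
function host_from_url($url)
{
    $url = trim($url);
    // Prepend a placeholder scheme so parse_url() treats the first segment as the host.
    if (!preg_match('#^[a-z][a-z0-9+.\-]*://#i', $url)) {
        $url = 'http://' . ltrim($url, '/');
    }
    $host = parse_url($url, PHP_URL_HOST);
    // parse_url() yields null or false when no host can be found.
    return is_string($host) ? strtolower($host) : false;
}

echo host_from_url('www.example.com/path?x=1'), "\n"; // www.example.com
echo host_from_url('https://Example.com'), "\n";      // example.com
```

The returned host can then be passed to either of the functions above in place of the parse_url() step.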
Simply try this:
// group 2 holds the last two labels, without an optional leading "www."
preg_match('/(www\.)?([^.]+\.[^.]+)$/', $yourHost, $matches);
echo "domain name is: {$matches[2]}\n";
This works for the majority of domains.
This function returns the domain name without its extension for any URL given, even if you pass a URL without http:// or https://.
You can extend this part of the pattern
(?:\.co)?(?:\.com)?(?:\.gov)?(?:\.net)?(?:\.org)?(?:\.id)?
with more extensions if you want to handle more second-level domain names.
function get_domain_name($url){
$pieces = parse_url($url);
$domain = isset($pieces['host']) ? $pieces['host'] : $url;
$domain = strtolower($domain);
$domain = preg_replace('/\.international$/', '.com', $domain);
if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,90}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
if (preg_match('/(.*?)((?:\.co)?(?:\.com)?(?:\.gov)?(?:\.net)?(?:\.org)?(?:\.id)?(?:\.asn)?\.[a-z]{2,6})$/i', $regs['domain'], $matches)) {
return $matches[1];
}else return $regs['domain'];
}else{
return $url;
}
}
I'm using this to achieve the same result. It works for hosts with exactly one subdomain (such as use.fontawesome.com); I hope it will help others.
$url = 'https://use.fontawesome.com/releases/v5.11.2/css/all.css?ver=2.7.5';
$handle = pathinfo( parse_url( $url )['host'] )['filename'];
$final_handle = substr( $handle , strpos( $handle , '.' ) + 1 );
print_r($final_handle); // fontawesome
Simplest solution:
preg_replace('#/.*#', '', preg_replace('#^https?://(www\.)?#', '', $url));
Simply try this:
<?php
$host = $_SERVER['HTTP_HOST'];
preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
echo "domain name is: {$matches[0]}\n";
?>

Finding tags in query string with regular expression

I have to set some routing rules in my PHP application, and they should be in the form
/%var/something/else/%another_var
In other words, I need a regex that returns every URI piece marked by the % character. Strings marked by % represent variable names, so they can be almost any string.
Another example:
from /%lang/module/controller/action/%var_1
I want the regex to extract lang and var_1.
I tried something like
/.*%(.*)[\/$]/
but it doesn't work.
Seeing as it's routing rules, and you may need all the pieces at some point, you could also split the string the classical way:
$path_exploded = explode("/", $path);
foreach ($path_exploded as $fragment) if ($fragment !== '' && $fragment[0] == "%") // skip empty fragments, e.g. from a leading "/"
echo "Found $fragment";
$str='/%var/something/else/%another_var';
$s = explode("/",$str);
$whatiwant = preg_grep("/^%/",$s);
print_r($whatiwant);
I don’t see the need to slow down your script with a regex … trim() and explode() do everything you need:
function extract_url_vars($url)
{
if ( FALSE === strpos($url, '%') )
{
return $url;
}
$found = array();
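// Split on "/" and keep only the segments that start with "%";
// splitting on "/%" would also return a leading non-variable segment.
$parts = explode('/', trim($url, '/') );
foreach ( $parts as $part )
{
if ( '' !== $part && '%' === $part[0] )
{
$found[] = substr($part, 1);
}
}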
return $found;
}
// Test
print_r( extract_url_vars('/%lang/module/controller/action/%var_1') );
// Result:
Array
(
[0] => lang
[1] => var_1
)
You can use:
$str = '/%lang/module/controller/action/%var_1';
if(preg_match('#/%(.*?)/[^%]*%(.*?)$#',$str,$matches)) {
echo "$matches[1] $matches[2]\n"; // prints lang var_1
}
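The pattern above expects exactly two placeholders. If the number of placeholders varies, preg_match_all() can collect all of them. A minimal sketch, assuming placeholder names contain neither "/" nor "%" (the helper name `route_placeholders` is hypothetical):

```php
<?php
// Hypothetical helper: collect every path segment that starts with "%".
function route_placeholders($route)
{
    // Match "/%" followed by one or more characters that are not "/" or "%".
    preg_match_all('#/%([^/%]+)#', $route, $matches);
    // $matches[1] holds the captured names, or an empty array when nothing matched.
    return $matches[1];
}

echo implode(', ', route_placeholders('/%lang/module/controller/action/%var_1')), "\n"; // lang, var_1
echo implode(', ', route_placeholders('/%var/something/else/%another_var')), "\n";      // var, another_var
```

Routes without any placeholder simply yield an empty array.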
