PHP: How to get base URL from HTML page

I'm struggling with figuring out how to do this. I have an absolute URL to an HTML page, and I need to get the base URL for this. So the URLs could be for example:
http://www.example.com/
https://www.example.com/foo/
http://www.example.com/foo/bar.html
https://alice@www.example.com/foo
And so on. So the first problem is to find the base URL for those and other URLs. The second problem is that some HTML pages contain a base tag, whose href could be, for example, http://example.com/ or simply / (although I think some browsers only support the form starting with protocol://?).
Either way, how can I do this correctly in PHP? I have the URL, and I have the HTML loaded into a DOMDocument, so I should be able to grab the base tag fairly easily if it exists. How do browsers solve this, for example?
Clarification on why I need this
I'm trying to create something which takes a URL to a web page and returns the absolute URL to all the images this web page links to. Since some/many/all of these images might have relative URLs, I need to find the base URL to use when I make them absolute. This might be the base URL of the web page, or it might be a base URL specified in the HTML itself.
I have managed to fetch the HTML and find the URLs. I think I've also found a working method of making the URLs absolute when I have the base URL to use. But finding the base URL is what I'm missing, and what I'm asking about here.

See parse_url().
$result=parse_url('http://www.google.com');
print_r($result);
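Pick out whichever elements you are looking for. For a base URL you probably want $result['scheme'] and $result['host'], plus the directory part of $result['path'] when the URL has one (the example URL above has no path, so that key is not set).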
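To make the parse_url() suggestion concrete, here is a minimal sketch of deriving a base URL from the page URL and an optional base tag. The function names url_origin, url_directory and document_base are hypothetical, invented for this illustration. The sketch assumes the page URL is absolute, reads only the first base tag, and handles only absolute and root-relative base hrefs; it is not a full implementation of what browsers do.

```php
<?php
// Hypothetical helper: scheme://[user[:pass]@]host[:port] of an absolute URL,
// or null when the URL has no scheme or host.
function url_origin(string $url): ?string
{
    $p = parse_url($url);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        return null;
    }
    $authority = $p['host'];
    if (isset($p['user'])) {
        $authority = $p['user'] . (isset($p['pass']) ? ':' . $p['pass'] : '') . '@' . $authority;
    }
    if (isset($p['port'])) {
        $authority .= ':' . $p['port'];
    }
    return $p['scheme'] . '://' . $authority;
}

// Hypothetical helper: the URL up to and including the last slash of its path.
function url_directory(string $url): ?string
{
    $origin = url_origin($url);
    if ($origin === null) {
        return null;
    }
    $path = parse_url($url, PHP_URL_PATH) ?? '/';
    return $origin . substr($path, 0, strrpos($path, '/') + 1);
}

// Hypothetical helper: base URL for resolving relative links in $doc.
function document_base(string $pageUrl, DOMDocument $doc): ?string
{
    $base = $doc->getElementsByTagName('base')->item(0);
    $href = $base instanceof DOMElement ? trim($base->getAttribute('href')) : '';
    if ($href === '') {
        return url_directory($pageUrl);          // no usable base tag
    }
    if (url_origin($href) !== null) {
        return url_directory($href);             // absolute base href
    }
    if ($href[0] === '/' && substr($href, 0, 2) !== '//') {
        // root-relative base href: keep the page's origin, take the href's directory
        $path = parse_url($href, PHP_URL_PATH) ?? '/';
        return url_origin($pageUrl) . substr($path, 0, strrpos($path, '/') + 1);
    }
    return url_directory($pageUrl);              // other relative forms: not resolved in this sketch
}
```

With this sketch, url_directory('http://www.example.com/foo/bar.html') gives http://www.example.com/foo/, and a page containing a base tag with href="/img/" gives http://www.example.com/img/.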

Fun with snippets!
if (!function_exists('base_url')) {
    function base_url($atRoot = FALSE, $atCore = FALSE, $parse = FALSE) {
        if (isset($_SERVER['HTTP_HOST'])) {
            $http = isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) !== 'off' ? 'https' : 'http';
            $hostname = $_SERVER['HTTP_HOST'];
            // directory part of the running script's URL path
            $dir = str_replace(basename($_SERVER['SCRIPT_NAME']), '', $_SERVER['SCRIPT_NAME']);
            // first path segment of this file's directory, relative to the document root
            // (limit -1 means "no limit"; passing NULL is deprecated since PHP 8.1)
            $core = preg_split('#/#', str_replace($_SERVER['DOCUMENT_ROOT'], '', realpath(dirname(__FILE__))), -1, PREG_SPLIT_NO_EMPTY);
            $core = isset($core[0]) ? $core[0] : '';
            $tmplt = $atRoot ? ($atCore ? "%s://%s/%s/" : "%s://%s/") : ($atCore ? "%s://%s/%s/" : "%s://%s%s");
            $end = $atRoot ? ($atCore ? $core : $hostname) : ($atCore ? $core : $dir);
            $base_url = sprintf($tmplt, $http, $hostname, $end);
        } else {
            $base_url = 'http://localhost/';
        }
        if ($parse) {
            $base_url = parse_url($base_url);
            if (isset($base_url['path']) && $base_url['path'] == '/') {
                $base_url['path'] = '';
            }
        }
        return $base_url;
    }
}
Use as simple as:
// url like: http://stackoverflow.com/questions/2820723/how-to-get-base-url-with-php
echo base_url(); // will produce something like: http://stackoverflow.com/questions/2820723/
echo base_url(TRUE); // will produce something like: http://stackoverflow.com/
echo base_url(TRUE, TRUE); // or: echo base_url(NULL, TRUE); will produce something like: http://stackoverflow.com/questions/
// and finally
echo base_url(NULL, NULL, TRUE);
// will produce something like:
// array(3) {
// ["scheme"]=>
// string(4) "http"
// ["host"]=>
// string(17) "stackoverflow.com"
// ["path"]=>
// string(19) "/questions/2820723/"
// }

Related

Adding _blank targets to external links

I just want to know how can I add a _blank target type to a link if a link is pointing to an external domain (_self for internals). I was doing this by checking the url but it was really hard coded and not reusable for other sites.
Do you have any idea of how to do it properly with PHP ?
$target_type=(strpos($ref, $_SERVER['HTTP_HOST'])>-1
|| strpos($ref,'/')===0? '_self' : '_blank');
if ($ref<>'#' && substr($ref,0,4)<>'http') $ref='http://'.$ref;
$array['href']=$ref;
if (substr($ref,0,1)<>'#') $array['target']= $target_type;
$array['rel']='nofollow';
if (empty($array['text'])) $array['text']=str_replace('http://','',$ref);
This only works for the main domain; with a link like domain.com/friendlyurl/ it does not work.
Thanks in advance.
NOTE: Links may or may not include the http:// protocol, and they are usually absolute links. Links are added by users of the system.

The easiest way is to use the parse_url function. I would do it like this:
<?php
$link = $_GET['link'];
$urlp = parse_url($link);
$target = '_self';
if (isset($urlp['host']) && $urlp['host'] != $_SERVER['HTTP_HOST'])
{
    // compare again with a leading "www." stripped from either side (the dot is escaped so it matches literally)
    if (preg_replace('#^(www\.)(.*)#D','$2',$urlp['host']) != $_SERVER['HTTP_HOST'] && preg_replace('#^(www\.)(.*)#D','$2',$_SERVER['HTTP_HOST']) != $urlp['host'])
        $target = '_blank';
}
$anchor = 'LINK';
// creating html code
echo $target.'<br>';
echo '<a href="' . htmlspecialchars($link) . '" target="' . $target . '">' . $anchor . '</a>';
In this code I use the $_GET['link'] variable, but you should supply your own link as the value of $link. You can check this script here: http://kolodziej.in/help/link_target.php?link=http://blog.kolodziej.in/2013/06/i-know-jquery-not-javascript/ (returns a _blank link), http://kolodziej.in/help/link_target.php?link=http://kolodziej.in/ (returns a _self link).
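The same idea can be packaged as a reusable function that also copes with the scheme-less links mentioned in the question. This is only a sketch: strip_www and link_target are hypothetical names, $siteHost is assumed to be a bare host name without a port, and links are assumed to be http-like (not mailto: or similar).

```php
<?php
// Hypothetical helper: lower-case a host and drop a leading "www.".
function strip_www(string $host): string
{
    return preg_replace('/^www\./i', '', strtolower($host));
}

// Hypothetical helper: '_self' for internal links, '_blank' for external ones.
function link_target(string $href, string $siteHost): string
{
    $href = trim($href);
    // Empty links, fragments, and root-relative paths stay on the same site.
    if ($href === '' || $href[0] === '#' || ($href[0] === '/' && substr($href, 0, 2) !== '//')) {
        return '_self';
    }
    // Users may omit the scheme ("domain.com/page"); parse_url() then sees no
    // host, so prefix one unless the link already has "scheme://" or "//".
    if (!preg_match('#^([a-z][a-z0-9+.-]*:)?//#i', $href)) {
        $href = 'http://' . $href;
    }
    $host = parse_url($href, PHP_URL_HOST);
    if (!is_string($host)) {
        return '_self'; // unparseable: treated as internal in this sketch
    }
    return strip_www($host) === strip_www($siteHost) ? '_self' : '_blank';
}
```

In real use $siteHost would typically come from configuration rather than from the request, since $_SERVER['HTTP_HOST'] is controlled by the client.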

How to safely get full URL of parent directory of current PHP page

I'm using:
$domain = $_SERVER['HTTP_HOST'];
$path = $_SERVER['SCRIPT_NAME'];
$themeurl = $domain . $path;
But this of course gives the full URL.
Instead I need the full URL minus the current file, up one directory, and without the trailing slash.
So no matter what the browser URL is (localhost, https://, http://, etc.), I get the full real URL path of the parent directory (bypassing any mod_rewrite rules), without a trailing slash.
How is this done?
It should be safe against XSS; from what I have read, using anything but 'SCRIPT_NAME' carries that risk, though I am not sure. I have been reading a ton trying to figure this out.
examples:
if given:
https://stackoverflow.com/questions/somequestions/index.php
need:
https://stackoverflow.com/questions
without the trailing slash.
and should also work for say:
http://localhost/GetSimple/admin/load.php
to get
http://localhost/GetSimple
which is what I'm trying to do.
Thank you.
Edit:
Here's the working solution I used:
$url = isset($_SERVER['HTTPS']) ? 'https://' : 'http://';
$url .= $_SERVER['SERVER_NAME'];
$url .= htmlspecialchars($_SERVER['REQUEST_URI']);
$themeurl = dirname(dirname($url)) . "/theme";
it works perfectly.

That's easy: use the dirname function twice :)
echo dirname(dirname('https://stackoverflow.com/questions/somequestions/index.php'));
Also note @Sid's comment. When you need the full URI to the current script, with protocol and server, then use something like this:
$url = isset($_SERVER['HTTPS']) ? 'https://' : 'http://';
$url .= $_SERVER['SERVER_NAME'];
$url .= $_SERVER['REQUEST_URI'];
echo dirname(dirname($url));

I have a simpler syntax to get the parent address. Try this:
dirname($_SERVER['PHP_SELF'])
This gives you the direct parent of the current address.
To go up two directories, nest the call:
dirname(dirname($_SERVER['PHP_SELF']))
dirname is the function that returns the parent path, and $_SERVER['PHP_SELF'] holds the path of the current script. Note that it contains only the path, not the scheme, host, or port.
Thank you, https://stackoverflow.com/users/171318/hek2mgl

I do not suggest using dirname(), as it is meant for directories and not for URIs. Examples:
dirname("http://example.com/foo/index.php") returns http://example.com/foo
dirname("http://example.com/foo/") returns http://example.com
dirname("http://example.com/") returns http:
dirname("http://example.com") returns http:
So you have to be very careful which $_SERVER variable you use, and of course it only works for this specific problem. A much better general solution would be to use a currentdir() helper (a user-defined function, not a PHP built-in, and not shown here), on top of which you could get the parent directory like this:
function parentdir($url) {
// note: parent of "/" is "/" and parent of "http://example.com" is "http://example.com/"
// remove filename and query
$url = currentdir($url);
// get parent
$len = strlen($url);
return currentdir(substr($url, 0, $len && $url[ $len - 1 ] == '/' ? -1 : $len));
}
Examples:
parentdir("http://example.com/foo/bar/index.php") returns http://example.com/foo/
parentdir("http://example.com/foo/index.php") returns http://example.com/
parentdir("http://example.com/foo/") returns http://example.com/
parentdir("http://example.com/") returns http://example.com/
parentdir("http://example.com") returns http://example.com/
So you would have much more stable results. Maybe you could explain why you wanted to remove the trailing slash. My experience is that it produces more problems as you are not able to differentiate between a file named "/foo" and a folder with the same name without using is_dir(). But if this is important for you, you could remove the last char.
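Since currentdir() is not shown in this answer, here is one possible stand-in, written only so that the parentdir() examples above can be reproduced. It is an assumption about how such a helper might behave, not the answerer's actual code, and parentdir() is restated in a slightly more readable form.

```php
<?php
// Hypothetical stand-in for the currentdir() helper referenced above:
// drops query, fragment, and file name, and always ends with a slash.
function currentdir(string $url): string
{
    // Remove the query string and fragment, if any.
    $url = preg_replace('/[?#].*$/s', '', $url);
    // Split "scheme://authority" from the path so the "//" after the scheme
    // is never mistaken for a directory separator.
    if (preg_match('#^([a-z][a-z0-9+.-]*://[^/]*)(.*)$#i', $url, $m)) {
        $prefix = $m[1];
        $path = $m[2];
    } else {
        $prefix = '';
        $path = $url;
    }
    $pos = strrpos($path, '/');
    $dir = $pos === false ? '/' : substr($path, 0, $pos + 1);
    return $prefix . $dir;
}

// Parent directory: take the current directory, drop its trailing slash,
// then take the directory of what remains.
function parentdir(string $url): string
{
    $dir = currentdir($url);
    if (substr($dir, -1) === '/') {
        $dir = substr($dir, 0, -1);
    }
    return currentdir($dir);
}
```

If the trailing slash really has to go, as the question asks, rtrim(parentdir($url), '/') would remove it from results that have a path, for example turning http://example.com/foo/ into http://example.com/foo.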

This example works with ports
function full_url($s)
{
    $ssl = !empty($s['HTTPS']) && $s['HTTPS'] == 'on';
    $sp = strtolower($s['SERVER_PROTOCOL']);
    $protocol = substr($sp, 0, strpos($sp, '/')) . ($ssl ? 's' : '');
    $port = $s['SERVER_PORT'];
    $port = ((!$ssl && $port == '80') || ($ssl && $port == '443')) ? '' : ':' . $port;
    // HTTP_HOST already carries a non-default port, so only append $port to SERVER_NAME
    $host = isset($s['HTTP_HOST']) ? $s['HTTP_HOST'] : $s['SERVER_NAME'] . $port;
    return $protocol . '://' . $host . $s['REQUEST_URI'];
}
$themeurl = dirname(dirname(full_url($_SERVER))).'/theme';
echo '<a href="' . htmlspecialchars($themeurl) . '">Theme URL</a>';
Source: https://stackoverflow.com/a/8891890/175071

I'm with hek2mgl. However, just in case the script isn't always specifically 2 directories below your target, you could use explode:
$parts = explode("/",ltrim($_SERVER['SCRIPT_NAME'],"/"));
echo $_SERVER['HTTP_HOST'] . "/" . $parts[0];

As hek2mgl mentioned, that is correct; a more dynamic approach would be dirname(dirname(htmlspecialchars($_SERVER['REQUEST_URI'])));.
EDIT:
$_SERVER['REQUEST_URI'] omits the domain name. Following @hek2mgl's post, you can echo dirname(dirname(htmlspecialchars($url)));

Here are useful values for getting the desired path:
(For example, you are executing http:// yoursite.com/folder1/folder2/yourfile.php)
__FILE__ (on L.Hosting) === /home/xfiddlec/http_docs/folder1/folder2/yourfile.php
__FILE__ (on Localhost) === C:\wamp\www\folder1\folder2\yourfile.php
$_SERVER['HTTP_HOST'] === www.yoursite.com (or without WWW)
$_SERVER["PHP_SELF"] === /folder1/folder2/yourfile.php
$_SERVER["REQUEST_URI"] === /folder1/folder2/yourfile.php?var=blabla
$_SERVER["DOCUMENT_ROOT"] === /home/xfiddlec/http_docs
// BASENAME and DIRNAME (let's say __FILE__ is '/folder1/folder2/yourfile.php')
basename(__FILE__) ==== yourfile.php
dirname(__FILE__) ==== /folder1/folder2
Examples:
*HOME url ( yoursite.com )
<?php echo $_SERVER['HTTP_HOST'];?>
*file's BASE url ( yoursite.com/anyfolder/myfile.php )
<?php echo $_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF']; ?>
*COMPLETE current url ( yoursite.com/anyfolder/myfile.php?action=blabla )
<?php echo $_SERVER['HTTP_HOST'].$_SERVER["REQUEST_URI"];?>
*CURRENT FOLDER's URL ( yoursite.com/anyfolder/ )
<?php echo $_SERVER['HTTP_HOST'] . dirname($_SERVER['REQUEST_URI']); ?>
*To get RealPath to the file (even if it is included) (change /var/public_html to your desired root)
<?php
$cur_file=str_replace('\\','/',__FILE__); //Then Remove the root path::
$cur_file=preg_replace('/(.*?)\/var\/public_html/','',$cur_file);
?>
P.S. For WordPress, there are already predefined functions to get the plugin or theme URL,
e.g. to get the plugin folder ( http://yoursite.com/wp-content/plugins/pluginName/ )
<?php echo plugin_dir_url( __FILE__ );?>

working with links : identifying external links and full address of links

I'm trying to create a sitemap for my website,
so basically I scan the homepage for links,
extract the links, and do the same thing recursively for the extracted links.
function get_contents($url = '' ) {
if($url == '' ) { $url = $this->base_url; }
$curl = new cURL;
$content = $curl->get($url);
$this->get_links($content);
}
public function get_links($contents){
$DOM = new DOMDocument();
$DOM->loadHTML($contents);
$a = $DOM->getElementsByTagName('a');
foreach($a as $link){
$h = $link->getAttribute('href');
$l = $this->base.'/'.$h;
$this->links[] = $l ;
$this->get_contents($l);
}
}
It works fine, but there are a couple of problems.
1 -
I get some links like
www.mysite.com/http://www.external.com
I can do something like
if( stripos( $link , 'http') !== false
||
stripos( $link , 'www.') !== false
||
stripos( $link , 'https') !== false
)
{
if(stripos( $link , 'mysite.com') !== false)
{
// ignore this link (yeah, I suck at regex and string mapping)
}
}
but it seems very complicated and slow. Is there any standard, clean way to find out whether a link is an external link?
2 -
Is there any way to deal with relative paths?
I get something like
www.mysite.com/../Domain/List3.html
Obviously this isn't right.
I can remove (../) from the link, but that might not work with all links.
Is there any way to find out the full address of a link?

For relative paths, note that realpath() only works on filesystem paths, so for URLs you have to resolve the ../ segments against the page URL yourself.
Use parse_url() to get the domain, so you can easily check
whether it is equal to your own domain. Notice that parse_url() needs a scheme to recognize the host,
so add http:// if the link has no http[s].
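As a rough sketch of both points, the code below resolves an href against the URL of the page it was found on and then compares hosts. The names remove_dot_segments, resolve_url and is_external are hypothetical. The sketch assumes $base is an absolute http(s) URL and covers the common cases only; it is not a complete RFC 3986 resolver.

```php
<?php
// Hypothetical helper: collapse "." and ".." segments in an absolute path.
function remove_dot_segments(string $path): string
{
    $segments = explode('/', ltrim($path, '/'));
    $lastIndex = count($segments) - 1;
    $out = [];
    foreach ($segments as $i => $seg) {
        if ($seg === '.' || $seg === '..') {
            if ($seg === '..') {
                array_pop($out);          // go up one level (no-op at the root)
            }
            if ($i === $lastIndex) {
                $out[] = '';              // keep a trailing slash, e.g. /a/b/.. -> /a/
            }
        } else {
            $out[] = $seg;
        }
    }
    return '/' . implode('/', $out);
}

// Hypothetical helper: make $rel absolute, relative to the page URL $base.
function resolve_url(string $base, string $rel): string
{
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $rel)) {
        return $rel;                                  // already absolute
    }
    $b = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    if (strpos($rel, '//') === 0) {
        return $b['scheme'] . ':' . $rel;             // protocol-relative
    }
    $basePath = $b['path'] ?? '/';
    if ($rel === '' || $rel[0] === '#') {
        return $origin . $basePath . (isset($b['query']) ? '?' . $b['query'] : '') . $rel;
    }
    if ($rel[0] === '?') {
        return $origin . $basePath . $rel;
    }
    // Separate the path from any query or fragment before touching dot segments.
    $cut = strcspn($rel, '?#');
    $relPath = substr($rel, 0, $cut);
    $suffix = substr($rel, $cut);
    if ($relPath[0] !== '/') {
        $relPath = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relPath;
    }
    return $origin . remove_dot_segments($relPath) . $suffix;
}

// Hypothetical helper: true when an absolute URL points at another host.
function is_external(string $absoluteUrl, string $siteHost): bool
{
    $host = parse_url($absoluteUrl, PHP_URL_HOST);
    return is_string($host) && strcasecmp($host, $siteHost) !== 0;
}
```

In the crawler from the question, each href would be passed through resolve_url() with the URL of the page it came from, and only results for which is_external() is false would be followed.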

PHP - replace http with https in URL

I am trying to figure out how to add s after HTTP once a user checks a box in the html form.
I have in my PHP,
$url = 'http://google.com';
if(!isset($_POST['https'])) {
//something here
}
So basically, when the user checks a box with the name="https" i want to add s to $url's http making it https://google.com.
I have little knowledge on PHP and if someone can explain to me how to go about doing this, this would be really helpful! thanks.

$url = preg_replace("/^http:/i", "https:", $url);

$url = str_replace( 'http://', 'https://', $url );

One way:
$url = '%s//google.com';
$protocol = 'http:';
// switch to https only when the checkbox was ticked
if (isset($_POST['https'])) {
    $protocol = 'https:';
}
$url = sprintf($url, $protocol);

I don't know on how many pages you want this to apply after the user checks the box, but one answer is JavaScript and the base tag.
With the base tag, you can set a different base against which your relative URLs will be resolved.
If you are using it in a form, and the user ticks the checkbox and then submits the form, all other pages will be viewed from the https site, so you can use relative URLs everywhere; just insert a different base tag when the user wants to switch the site from or to http(s).

A solution that doesn't replace urls which contain other urls, for instance http://foo.com/redirect-to/http://newfoo.com
$desiredScheme = "http"; // convert to this scheme;
$parsedRedirectUri = parse_url($myCurrentUrl);
if($parsedRedirectUri['scheme'] !== $desiredScheme) {
$myCurrentUrl= substr_replace($myCurrentUrl, $desiredScheme, 0, strlen( $parsedRedirectUri['scheme'] ));
}

If you need a case-insensitive match, use the str_ireplace function.
Example:
$url = str_ireplace( 'http://', 'https://', $url );
Alternatively, in case you want to replace the URL scheme in CDN files:
Example:
$url = str_ireplace( 'http:', 'https:', $url );
This... (with 'https:')
<link rel="stylesheet" href="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/themes/smoothness/jquery-ui.css">
<script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js"></script>
Becomes... ('https:' has been removed, which takes str_ireplace('https:', '', $url) rather than the call above)
<link rel="stylesheet" href="//ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/themes/smoothness/jquery-ui.css">
<script src="//ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js"></script>

$count = 1;
$url = str_replace("http://", "https://", $url, $count);
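Note: the fourth parameter of str_replace is passed by reference and receives the number of replacements performed; it does not limit them, so the initial value of 1 has no effect. Passing a literal 1 throws a fatal error (Fatal error: Only variables can be passed by reference). To replace only the first occurrence, use preg_replace() with its $limit argument instead.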
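Pulling the answers together, a minimal sketch that upgrades only a leading http:// and leaves any URL embedded later in the string untouched could look like this. The function name force_https is hypothetical.

```php
<?php
// Hypothetical helper: replace a leading "http://" (any letter case) with
// "https://". The ^ anchor restricts the match to the start of the string,
// and the limit of 1 makes the single replacement explicit.
function force_https(string $url): string
{
    return preg_replace('#^http://#i', 'https://', $url, 1);
}

// Usage along the lines of the question: upgrade only when the box is ticked.
$url = 'http://google.com';
if (isset($_POST['https'])) {
    $url = force_https($url);
}
```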

1 bug to kill... Letting PHP Generate The Canonical

I'm building a clean canonical URL that always returns one base URL, and I'm stuck on the following case:
<?php
# every page
$extensions = $_SERVER['REQUEST_URI']; # path like: /en/home.ast?ln=ja
$qsIndex = strpos($extensions, '?'); # removes the ?ln=de part
$pageclean = $qsIndex !== FALSE ? substr($extensions, 0, $qsIndex) : $extensions;
$canonical = "http://website.com" . $pageclean; # basic canonical url
?>
<html><head><link rel="canonical" href="<?=$canonical?>"></head>
when URL : http://website.com/de/home.ext?ln=de
canonical: http://website.com/de/home.ext
BUT I want to remove the file extension as well, whether it's .php, .ext, .inc, or any other two- or three-character extension (.[xx] or .[xxx]), so the base URL becomes: http://website.com/de/home
Aaah, much nicer! But how do I achieve that in the current code?
Any hints are much appreciated!

Think this should do it, just strip off the end if there is an extension, just like you did for the query string:
$pageclean = $qsIndex !== FALSE ? substr($extensions, 0, $qsIndex) : $extensions;
$dotIndex = strrpos($pageclean, '.');
$pagecleanNoExt = $dotIndex !== FALSE ? substr($pageclean, 0, $dotIndex) : $pageclean;
$canonical = "http://website.com" . $pagecleanNoExt; # basic canonical url

try this:
preg_match("/(.*)\.([^\?]{2,3})(\?(.*)){0,1}$/msiU", $_SERVER['REQUEST_URI'], $res);
$canonical = "http://website.com" . $res[1];
where $res[1] is the clean URL,
$res[2] is the extension, and
$res[4] is everything after the "?" (if present and if you need it).
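An alternative sketch combines parse_url() with an anchored pattern, so that a dot inside a directory name (for example /v1.2/home) is left alone and URLs without an extension still work. The function name canonical_url and the $origin default are assumptions for illustration; the two-to-three character rule comes from the question.

```php
<?php
// Hypothetical helper: canonical URL without query string or short extension.
function canonical_url(string $requestUri, string $origin = 'http://website.com'): string
{
    // Keep only the path; this drops "?ln=de" and any fragment.
    $path = (string) parse_url($requestUri, PHP_URL_PATH);
    // Strip a 2-3 character extension, but only at the very end of the path.
    $path = preg_replace('#\.[A-Za-z0-9]{2,3}$#', '', $path);
    return $origin . $path;
}
```

For the example in the question, canonical_url('/de/home.ext?ln=de') gives http://website.com/de/home.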
