How can I transform a relative path into an absolute URL using PHP?
function rel2abs($rel, $base)
{
/* return if already absolute URL */
if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;
/* queries and anchors */
if ($rel[0]=='#' || $rel[0]=='?') return $base.$rel;
/* parse base URL and convert to local variables:
$scheme, $host, $path */
extract(parse_url($base));
/* remove non-directory element from path */
$path = preg_replace('#/[^/]*$#', '', $path);
/* destroy path if relative url points to root */
if ($rel[0] == '/') $path = '';
/* dirty absolute URL */
$abs = "$host$path/$rel";
/* replace '//' or '/./' or '/foo/../' with '/' */
$re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
for($n=1; $n>0; $abs=preg_replace($re, '/', $abs, -1, $n)) {}
/* absolute URL is ready! */
return $scheme.'://'.$abs;
}
I love the code that jordanstephens provided from the link! I voted it up. l0oky inspired me to make sure that the function also handles URLs containing a port, username, and password. I needed it for my project.
function rel2abs( $rel, $base )
{
/* return if already absolute URL */
if( parse_url($rel, PHP_URL_SCHEME) != '' )
return( $rel );
/* queries and anchors */
if( $rel[0]=='#' || $rel[0]=='?' )
return( $base.$rel );
/* parse base URL and convert to local variables:
$scheme, $host, $path */
extract( parse_url($base) );
/* remove non-directory element from path */
$path = preg_replace( '#/[^/]*$#', '', $path );
/* destroy path if relative url points to root */
if( $rel[0] == '/' )
$path = '';
/* dirty absolute URL */
$abs = '';
/* do we have a user in our URL? */
if( isset($user) )
{
$abs.= $user;
/* password too? */
if( isset($pass) )
$abs.= ':'.$pass;
$abs.= '@';
}
$abs.= $host;
/* did somebody sneak in a port? */
if( isset($port) )
$abs.= ':'.$port;
$abs.=$path.'/'.$rel;
/* replace '//' or '/./' or '/foo/../' with '/' */
$re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
for( $n=1; $n>0; $abs=preg_replace( $re, '/', $abs, -1, $n ) ) {}
/* absolute URL is ready! */
return( $scheme.'://'.$abs );
}
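The user, password, and port handling above depends on which keys parse_url() returns for the base URL. A minimal sketch of that array, using a made-up base URL (the URL and the variable name $parts are illustrative only, not from the answer):

```php
<?php
// Illustrative base URL; not taken from the answer above.
$parts = parse_url('https://user:secret@example.com:8080/dir/page.php?x=1#top');

// Only components that are present in the URL appear as keys.
// 'port' comes back as an integer, the other components as strings.
foreach ($parts as $key => $value) {
    echo $key, ' => ', $value, PHP_EOL;
}
```

Because extract() only creates variables for keys that exist, isset($user), isset($pass), and isset($port) are reasonable checks for the optional parts.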
Added support to keep the current query. Helps a lot for ?page=1 and so on...
function rel2abs($rel, $base)
{
/* return if already absolute URL */
if (parse_url($rel, PHP_URL_SCHEME) != '')
return ($rel);
/* queries and anchors */
if ($rel[0] == '#' || $rel[0] == '?')
return ($base . $rel);
/* parse base URL and convert to local variables: $scheme, $host, $path, $query, $port, $user, $pass */
extract(parse_url($base));
/* remove non-directory element from path */
$path = preg_replace('#/[^/]*$#', '', $path);
/* destroy path if relative url points to root */
if ($rel[0] == '/')
$path = '';
/* dirty absolute URL */
$abs = '';
/* do we have a user in our URL? */
if (isset($user)) {
$abs .= $user;
/* password too? */
if (isset($pass))
$abs .= ':' . $pass;
$abs .= '@';
}
$abs .= $host;
/* did somebody sneak in a port? */
if (isset($port))
$abs .= ':' . $port;
$abs .= $path . '/' . $rel . (isset($query) ? '?' . $query : '');
/* replace '//' or '/./' or '/foo/../' with '/' */
$re = ['#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#'];
for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {
}
/* absolute URL is ready! */
return ($scheme . '://' . $abs);
}
I updated the function to handle relative URLs starting with '//' and to improve execution speed.
function getAbsoluteUrl($relativeUrl, $baseUrl){
// if already absolute URL
if (parse_url($relativeUrl, PHP_URL_SCHEME) !== null){
return $relativeUrl;
}
// queries and anchors
if ($relativeUrl[0] === '#' || $relativeUrl[0] === '?'){
return $baseUrl.$relativeUrl;
}
// parse base URL and convert to: $scheme, $host, $path, $query, $port, $user, $pass
extract(parse_url($baseUrl));
// if base URL contains a path remove non-directory elements from $path
if (isset($path) === true){
$path = preg_replace('#/[^/]*$#', '', $path);
}
else {
$path = '';
}
// if relative URL starts with //
if (substr($relativeUrl, 0, 2) === '//'){
return $scheme.':'.$relativeUrl;
}
// if relative URL starts with /
if ($relativeUrl[0] === '/'){
$path = null;
}
$abs = null;
// if base URL contains a user
if (isset($user) === true){
$abs .= $user;
// if base URL contains a password
if (isset($pass) === true){
$abs .= ':'.$pass;
}
$abs .= '@';
}
$abs .= $host;
// if base URL contains a port
if (isset($port) === true){
$abs .= ':'.$port;
}
$abs .= $path.'/'.$relativeUrl.(isset($query) === true ? '?'.$query : null);
// replace // or /./ or /foo/../ with /
$re = ['#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#'];
for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {
}
// return absolute URL
return $scheme.'://'.$abs;
}
A web browser uses the page URL or the base tag to resolve relative URLs.
This script can resolve a URL relative to a base URL.
/** Build a URL
*
* @param array $parts An array that follows the parse_url scheme
* @return string
*/
function build_url($parts)
{
if (empty($parts['user'])) {
$url = $parts['scheme'] . '://' . $parts['host'];
} elseif(empty($parts['pass'])) {
$url = $parts['scheme'] . '://' . $parts['user'] . '@' . $parts['host'];
} else {
$url = $parts['scheme'] . '://' . $parts['user'] . ':' . $parts['pass'] . '@' . $parts['host'];
}
if (!empty($parts['port'])) {
$url .= ':' . $parts['port'];
}
if (!empty($parts['path'])) {
$url .= $parts['path'];
}
if (!empty($parts['query'])) {
$url .= '?' . $parts['query'];
}
if (!empty($parts['fragment'])) {
return $url . '#' . $parts['fragment'];
}
return $url;
}
/** Convert a relative path in to an absolute path
*
* @param string $path
* @return string
*/
function abs_path($path)
{
$path_array = explode('/', $path);
// Solve current and parent folder navigation
$translated_path_array = array();
$i = 0;
foreach ($path_array as $name) {
if ($name === '..') {
unset($translated_path_array[--$i]);
} elseif (!empty($name) && $name !== '.') {
$translated_path_array[$i++] = $name;
}
}
return '/' . implode('/', $translated_path_array);
}
/** Convert a relative URL in to an absolute URL
*
* @param string $url URL or URI
* @param string $base Absolute URL
* @return string
*/
function abs_url($url, $base)
{
$url_parts = parse_url($url);
$base_parts = parse_url($base);
// Handle the path if it is specified
if (!empty($url_parts['path'])) {
// Is the path relative
if (substr($url_parts['path'], 0, 1) !== '/') {
if (substr($base_parts['path'], -1) === '/') {
$url_parts['path'] = $base_parts['path'] . $url_parts['path'];
} else {
$url_parts['path'] = dirname($base_parts['path']) . '/' . $url_parts['path'];
}
}
// Make path absolute
$url_parts['path'] = abs_path($url_parts['path']);
}
// Use the base URL to populate the unfilled components until a component is filled
foreach (['scheme', 'host', 'path', 'query', 'fragment'] as $comp) {
if (!empty($url_parts[$comp])) {
break;
}
$url_parts[$comp] = $base_parts[$comp];
}
return build_url($url_parts);
}
Test
// Base URL
$base_url = 'https://example.com/path1/path2/path3/path4/file.ext?field1=value1&field2=value2#fragment';
// URL and URIs (_ is used to see what is coming from relative URL)
$test_urls = array(
"http://_example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment", // URL
"//_example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment", // URI without scheme
"//_example.com", // URI with host only
"/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment", // URI without scheme and host
"_path1/_path2/_file.ext", // URI with path only
"./../../_path1/../_path2/file.ext#_fragment", // URI with path and fragment
"?_field1=_value1&_field2=_value2#_fragment", // URI with query and fragment
"#_fragment" // URI with fragment only
);
// Expected result
$expected_urls = array(
"http://_example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment",
"https://_example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment",
"https://_example.com",
"https://example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment",
"https://example.com/path1/path2/path3/path4/_path1/_path2/_file.ext",
"https://example.com/path1/path2/_path2/file.ext#_fragment",
"https://example.com/path1/path2/path3/path4/file.ext?_field1=_value1&_field2=_value2#_fragment",
"https://example.com/path1/path2/path3/path4/file.ext?field1=value1&field2=value2#_fragment"
);
foreach ($test_urls as $i => $url) {
$abs_url = abs_url($url, $base_url);
if ( $abs_url == $expected_urls[$i] ) {
echo "[OK] " . $abs_url . PHP_EOL;
} else {
echo "[WRONG] " . $abs_url . PHP_EOL;
}
}
Result
[OK] http://_example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment
[OK] https://_example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment
[OK] https://_example.com
[OK] https://example.com/_path1/_path2/_file.ext?_field1=_value1&_field2=_value2#_fragment
[OK] https://example.com/path1/path2/path3/path4/_path1/_path2/_file.ext
[OK] https://example.com/path1/path2/_path2/file.ext#_fragment
[OK] https://example.com/path1/path2/path3/path4/file.ext?_field1=_value1&_field2=_value2#_fragment
[OK] https://example.com/path1/path2/path3/path4/file.ext?field1=value1&field2=value2#_fragment
Wasn't the question actually about converting a path, not a URL? PHP has a function for this: realpath(). The only thing to be aware of is how it treats symlinks (it resolves them).
Example from PHP manual:
chdir('/var/www/');
echo realpath('./../../etc/passwd') . PHP_EOL;
// Prints: /etc/passwd
echo realpath('/tmp/') . PHP_EOL;
// Prints: /tmp
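One caveat worth checking before relying on realpath(): it works against the filesystem, so it returns false for paths that do not exist, which makes it a poor fit for URLs. A small sketch, where the missing path is an assumed non-existent name:

```php
<?php
// Resolve an existing directory (the system temp directory is assumed to exist).
$existing = realpath(sys_get_temp_dir() . '/.');
var_dump($existing !== false);

// A path assumed not to exist on the machine running this.
$missing = realpath('/no/such/dir/' . uniqid('x', true) . '/file.txt');
var_dump($missing);
```

The first call yields a resolved path; the second yields false because nothing exists at that location.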
An easy way to do this is to use phpUri, a small PHP library for converting relative URLs to absolute ones.
Usage is simple:
require_once 'phpuri.php';
$absolute = phpUri::parse( $base_path )->join( $relative_path );
You don't even need to check that the path passed to join is actually relative. If it is absolute then parse will return it.
If the relative directory already exists this will do the job:
function rel2abs($relPath, $baseDir = './')
{
if ('' == trim($relPath))
{
return $baseDir;
}
$currentDir = getcwd();
chdir($baseDir);
$path = realpath($relPath);
chdir($currentDir);
return $path;
}
I used the same code from: http://nashruddin.com/PHP_Script_for_Converting_Relative_to_Absolute_URL
but I modified it slightly so that if the base URL contains a port number, the returned absolute URL includes that port number.
function rel2abs($rel, $base)
{
/* return if already absolute URL */
if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;
/* queries and anchors */
if ($rel[0]=='#' || $rel[0]=='?') return $base.$rel;
/* parse base URL and convert to local variables:
$scheme, $host, $path */
extract(parse_url($base));
/* remove non-directory element from path */
$path = preg_replace('#/[^/]*$#', '', $path);
/* destroy path if relative url points to root */
if ($rel[0] == '/') $path = '';
/* dirty absolute URL // with port number if exists */
if (parse_url($base, PHP_URL_PORT) != ''){
$abs = "$host:".parse_url($base, PHP_URL_PORT)."$path/$rel";
}else{
$abs = "$host$path/$rel";
}
/* replace '//' or '/./' or '/foo/../' with '/' */
$re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
for($n=1; $n>0; $abs=preg_replace($re, '/', $abs, -1, $n)) {}
/* absolute URL is ready! */
return $scheme.'://'.$abs;
}
Hope this helps someone!
This function resolves relative URLs against a given current page URL in $pgurl, without regex. It successfully resolves:
/home.php?example types,
same-dir nextpage.php types,
../...../.../parentdir types,
full http://example.net urls,
and shorthand //example.net urls
//Current base URL (you can dynamically retrieve from $_SERVER)
$pgurl = 'http://example.com/scripts/php/absurl.php';
function absurl($url) {
global $pgurl;
if(strpos($url,'://')) return $url; //already absolute
if(substr($url,0,2)=='//') return 'http:'.$url; //shorthand scheme
if($url[0]=='/') return parse_url($pgurl,PHP_URL_SCHEME).'://'.parse_url($pgurl,PHP_URL_HOST).$url; //just add domain
if(strpos($pgurl,'/',9)===false) $pgurl .= '/'; //add slash to domain if needed
return substr($pgurl,0,strrpos($pgurl,'/')+1).$url; //for relative links, gets current directory and appends new filename
}
function nodots($path) { //Resolve dot dot slashes, no regex!
$arr1 = explode('/',$path);
$arr2 = array();
foreach($arr1 as $seg) {
switch($seg) {
case '.':
break;
case '..':
array_pop($arr2);
break;
case '...':
array_pop($arr2); array_pop($arr2);
break;
case '....':
array_pop($arr2); array_pop($arr2); array_pop($arr2);
break;
case '.....':
array_pop($arr2); array_pop($arr2); array_pop($arr2); array_pop($arr2);
break;
default:
$arr2[] = $seg;
}
}
return implode('/',$arr2);
}
Usage Example:
echo nodots(absurl('../index.html'));
nodots() must be called after the URL is converted to absolute.
The dots function is kind of redundant, but it is readable, fast, doesn't use regexes, and will resolve 99% of typical URLs (if you want to be 100% sure, just extend the switch block to support 6+ dots, although I've never seen that many dots in a URL).
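If the goal is only to resolve "." and ".." at any depth, a stack over the path segments avoids one switch case per depth. The sketch below is an alternative, not the function above: the name collapse_dots is hypothetical, it treats only "." and ".." as special (a segment such as "..." is kept as an ordinary name), and it assumes the input is already an absolute URL with a scheme and host.

```php
<?php
// Hypothetical helper: collapse "." and ".." in the path of an absolute URL.
// Assumes $url has a scheme and a host; user/password are not preserved.
function collapse_dots(string $url): string
{
    $p = parse_url($url);
    $stack = [];
    foreach (explode('/', $p['path'] ?? '') as $seg) {
        if ($seg === '' || $seg === '.') {
            continue;              // empty or current-directory segment
        }
        if ($seg === '..') {
            array_pop($stack);     // up one level; stays at the root if already there
            continue;
        }
        $stack[] = $seg;
    }
    $out = $p['scheme'] . '://' . $p['host'];
    if (isset($p['port'])) {
        $out .= ':' . $p['port'];
    }
    $out .= '/' . implode('/', $stack);
    if (isset($p['query'])) {
        $out .= '?' . $p['query'];
    }
    if (isset($p['fragment'])) {
        $out .= '#' . $p['fragment'];
    }
    return $out;
}

echo collapse_dots('http://example.com/scripts/php/../../index.html'), PHP_EOL;
// http://example.com/index.html
echo collapse_dots('http://example.com/a/b/../../../../c/./d.html?x=1'), PHP_EOL;
// http://example.com/c/d.html?x=1
```

Note that this sketch also drops empty segments and any trailing slash, which may not be wanted for every URL.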
Hope this helps,
function url_to_absolute($baseURL, $relativeURL) {
$relativeURL_data = parse_url($relativeURL);
if (isset($relativeURL_data['scheme'])) {
return $relativeURL;
}
$baseURL_data = parse_url($baseURL);
if (!isset($baseURL_data['scheme'])) {
return $relativeURL;
}
$absoluteURL_data = $baseURL_data;
if (isset($relativeURL_data['path']) && $relativeURL_data['path']) {
if (substr($relativeURL_data['path'], 0, 1) == '/') {
$absoluteURL_data['path'] = $relativeURL_data['path'];
} else {
$absoluteURL_data['path'] = (isset($absoluteURL_data['path']) ? preg_replace('#[^/]*$#', '', $absoluteURL_data['path']) : '/') . $relativeURL_data['path'];
}
if (isset($relativeURL_data['query'])) {
$absoluteURL_data['query'] = $relativeURL_data['query'];
} else if (isset($absoluteURL_data['query'])) {
unset($absoluteURL_data['query']);
}
} else {
$absoluteURL_data['path'] = isset($absoluteURL_data['path']) ? $absoluteURL_data['path'] : '/';
if (isset($relativeURL_data['query'])) {
$absoluteURL_data['query'] = $relativeURL_data['query'];
} else if (isset($absoluteURL_data['query'])) {
$absoluteURL_data['query'] = $absoluteURL_data['query'];
}
}
if (isset($relativeURL_data['fragment'])) {
$absoluteURL_data['fragment'] = $relativeURL_data['fragment'];
} else if (isset($absoluteURL_data['fragment'])) {
unset($absoluteURL_data['fragment']);
}
$absoluteURL_path = ltrim($absoluteURL_data['path'], '/');
$absoluteURL_path_parts = array();
for ($i = 0, $i2 = 0; $i < strlen($absoluteURL_path); $i++) {
if (isset($absoluteURL_path_parts[$i2])) {
$absoluteURL_path_parts[$i2] .= $absoluteURL_path[$i];
} else {
$absoluteURL_path_parts[$i2] = $absoluteURL_path[$i];
}
if ($absoluteURL_path[$i] == '/') {
$i2++;
}
}
reset($absoluteURL_path_parts);
while (true) {
if (rtrim(current($absoluteURL_path_parts), '/') == '.') {
unset($absoluteURL_path_parts[key($absoluteURL_path_parts)]);
continue;
} else if (rtrim(current($absoluteURL_path_parts), '/') == '..') {
if (prev($absoluteURL_path_parts) !== false) {
unset($absoluteURL_path_parts[key($absoluteURL_path_parts)]);
} else {
reset($absoluteURL_path_parts);
}
unset($absoluteURL_path_parts[key($absoluteURL_path_parts)]);
continue;
}
if (next($absoluteURL_path_parts) === false) {
break;
}
}
$absoluteURL_data['path'] = '/' . implode('', $absoluteURL_path_parts);
$absoluteURL = isset($absoluteURL_data['scheme']) ? $absoluteURL_data['scheme'] . ':' : '';
$absoluteURL .= (isset($absoluteURL_data['user']) || isset($absoluteURL_data['host'])) ? '//' : '';
$absoluteURL .= isset($absoluteURL_data['user']) ? $absoluteURL_data['user'] : '';
$absoluteURL .= isset($absoluteURL_data['pass']) ? ':' . $absoluteURL_data['pass'] : '';
$absoluteURL .= isset($absoluteURL_data['user']) ? '@' : '';
$absoluteURL .= isset($absoluteURL_data['host']) ? $absoluteURL_data['host'] : '';
$absoluteURL .= isset($absoluteURL_data['port']) ? ':' . $absoluteURL_data['port'] : '';
$absoluteURL .= isset($absoluteURL_data['path']) ? $absoluteURL_data['path'] : '';
$absoluteURL .= isset($absoluteURL_data['query']) ? '?' . $absoluteURL_data['query'] : '';
$absoluteURL .= isset($absoluteURL_data['fragment']) ? '#' . $absoluteURL_data['fragment'] : '';
return $absoluteURL;
}
You can use this composer package to do that.
https://packagist.org/packages/wa72/url
composer require wa72/url
Parse URL strings to objects
add and modify query parameters
set and modify any part of the url
test for equality of URLs with query parameters in a PHP-fashioned
way
supports protocol-relative urls
convert absolute, host-relative and protocol-relative urls to
relative and vice versa
This extends @jordanstephens's answer, which doesn't support URLs that begin with '//'.
function rel2abs($rel, $base)
{
/* return if already absolute URL */
if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;
/* Url begins with // */
if($rel[0] == '/' && $rel[1] == '/'){
return 'https:' . $rel;
}
/* queries and anchors */
if ($rel[0]=='#' || $rel[0]=='?') return $base.$rel;
/* parse base URL and convert to local variables:
$scheme, $host, $path */
extract(parse_url($base));
/* remove non-directory element from path */
$path = preg_replace('#/[^/]*$#', '', $path);
/* destroy path if relative url points to root */
if ($rel[0] == '/') $path = '';
/* dirty absolute URL */
$abs = "$host$path/$rel";
/* replace '//' or '/./' or '/foo/../' with '/' */
$re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
for($n=1; $n>0; $abs=preg_replace($re, '/', $abs, -1, $n)) {}
/* absolute URL is ready! */
return $scheme.'://'.$abs;
}
Related
How to get relative filename from absolute one and some given path?
For example:
foo('/a/b/c/1.txt', '/a/d/e'); // '../../b/c/1.txt'
foo('/a/b/../d/1.txt', '/a/d/e'); // '../c/1.txt'
Is there some native function for this?
My thoughts, if there is not:
Normalize both params: this needs some realpath replacement, because the files may not exist.
Cut common parts from both
add rest parts from $basepath as '..'
The manual approach seems too heavy for such a common task.
For now, I assume there is not any native implementation and want to share my implementation of this.
/**
* realpath analog without filesystem check
*/
function str_normalize_path(string $path, string $sep = DIRECTORY_SEPARATOR): string {
$parts = array_filter(explode($sep, $path));
$stack = [];
foreach ($parts as $part) {
switch($part) {
case '.': break;
case '..': array_pop($stack); break; // excess '..' is silently ignored because array_pop([]) does nothing
default: array_push($stack, $part);
}
}
return implode($sep, $stack);
}
function str_relative_path(string $absolute, string $base, string $sep = DIRECTORY_SEPARATOR) {
$absolute = str_normalize_path($absolute);
$base = str_normalize_path($base);
// find common prefix
$prefix_len = 0;
for ($i = 0; $i < min(mb_strlen($absolute), mb_strlen($base)); ++$i) {
if ($absolute[$i] !== $base[$i]) break;
$prefix_len++;
}
// cut common prefix
if ($prefix_len > 0) {
$absolute = mb_substr($absolute, $prefix_len);
$base = mb_substr($base, $prefix_len);
}
// put '..'s for exit to closest common path
$base_length = count(explode($sep, $base));
$relative_parts = explode($sep, $absolute);
while($base_length-->0) array_unshift($relative_parts, '..');
return implode($sep, $relative_parts);
}
$abs = '/a/b/../fk1/fk2/../.././d//proj42/1.txt';
$base = '/a/d/fk/../e/f';
echo str_relative_path($abs, $base); // ../../proj42/1.txt
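One thing to watch in the implementation above: the common prefix is found character by character, so two sibling names that share leading characters (for example "data" and "dir") appear to be cut mid-segment. A hedged sketch that compares whole segments instead; the function names normalize_segments and relative_path are mine, and "/" is assumed as the separator:

```php
<?php
// Hypothetical helper: split a path into normalized segments
// without touching the filesystem.
function normalize_segments(string $path): array
{
    $stack = [];
    foreach (explode('/', $path) as $part) {
        if ($part === '' || $part === '.') {
            continue;
        }
        if ($part === '..') {
            array_pop($stack);   // excess '..' at the root is ignored
            continue;
        }
        $stack[] = $part;
    }
    return $stack;
}

// Hypothetical helper: path of $target relative to directory $base.
function relative_path(string $target, string $base): string
{
    $t = normalize_segments($target);
    $b = normalize_segments($base);

    // Count leading segments that both paths share.
    $common = 0;
    $max = min(count($t), count($b));
    while ($common < $max && $t[$common] === $b[$common]) {
        $common++;
    }

    // One '..' for every remaining segment of the base directory.
    $parts = [];
    for ($i = $common; $i < count($b); $i++) {
        $parts[] = '..';
    }
    return implode('/', array_merge($parts, array_slice($t, $common)));
}

echo relative_path('/a/b/c/1.txt', '/a/d/e'), PHP_EOL;   // ../../b/c/1.txt
echo relative_path('/a/data/1.txt', '/a/dir'), PHP_EOL;  // ../data/1.txt
```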
I have the following code that (1) gets the page / section name from the url (2) cleans up the string and then assigns it to a variable.
I was wondering if there are any suggestions to how I can improve this code to be more efficient, possibly less if / else statements.
Also, any suggestion how I can code this so that it accounts for x amount of sub-directories in the url structure. Right now I check up to 3 in a pretty manual way.
I'd like it to handle any url, for example: www.domain.com/level1/level2/level3/level4/levelx/...
Here is my current code:
<?php
$prefixName = 'www : ';
$getPageName = explode("/", $_SERVER['PHP_SELF']);
$cleanUpArray = array("-", ".php");
for($i = 0; $i < sizeof($getPageName); $i++) {
if ($getPageName[1] == 'index.php')
{
$pageName = $prefixName . 'homepage';
}
else
{
if ($getPageName[1] != 'index.php')
{
$pageName = $prefixName . trim(str_replace($cleanUpArray, ' ', $getPageName[1]));
}
if (isset($getPageName[2]))
{
if ( $getPageName[2] == 'index.php' )
{
$pageName = $prefixName . trim(str_replace($cleanUpArray, ' ', $getPageName[1]));
}
else
{
$pageName = $prefixName . trim(str_replace($cleanUpArray, ' ', $getPageName[2]));
}
}
if (isset($getPageName[3]) )
{
if ( $getPageName[3] == 'index.php' )
{
$pageName = $prefixName . trim(str_replace($cleanUpArray, ' ', $getPageName[2]));
}
else
{
$pageName = $prefixName . trim(str_replace($cleanUpArray, ' ', $getPageName[3]));
}
}
}
}
?>
You are currently using a for-loop, but not using the $i iterator for anything - so to me, you could drop the loop entirely. From what I can see, you just want the directory-name prior to the file to be the $pageName and if there is no prior directory set it as homepage.
You can pass $_SERVER['PHP_SELF'] to basename() to get the exact file-name instead of checking the indexes, and also split on the / as you're currently doing to get the "last directory". To get the last directory, you can skip indexes and directly use array_pop().
<?php
$prefixName = 'www : ';
$cleanUpArray = array("-", ".php");
$script = basename($_SERVER['PHP_SELF']);
$exploded = explode('/', substr($_SERVER['PHP_SELF'], 0, strrpos($_SERVER['PHP_SELF'], '/')));
$count = count($exploded);
if (($count == 1) && ($script == 'index.php')) {
// the current page is "/index.php"
$pageName = $prefixName . 'homepage';
} else if ($count > 1) {
// we are in a sub-directory; use the last directory as the current page
$pageName = $prefixName . trim(str_replace($cleanUpArray, ' ', array_pop($exploded)));
} else {
// there is no sub-directory and the script is not index.php?
}
?>
In the event that you want more of a breadcrumb feel, you may want to keep each individual directory. If this is the case, you can update the middle if else condition to be:
} else if ($count > 1) {
// we are in a sub-directory; "breadcrumb" them all together
$pageName = '';
$separator = ' : ';
foreach ($exploded as $page) {
if ($page == '') continue;
$pageName .= (($pageName != '') ? $separator : '') . trim(str_replace($cleanUpArray, ' ', $page));
}
$pageName = $prefixName . $pageName;
} else {
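For the "any number of sub-directories" part of the question, one option is to look only at the last meaningful segment, whatever the depth. A sketch under a few assumptions: the function name page_name_from_path is hypothetical, and the path is passed in as an argument (for example $_SERVER['PHP_SELF']) rather than read inside the function.

```php
<?php
// Hypothetical helper: derive a page name from a script path of any depth.
function page_name_from_path(string $phpSelf, string $prefix = 'www : '): string
{
    // Split into non-empty segments, e.g. "/a/b/index.php" -> [a, b, index.php].
    $segments = array_values(array_filter(
        explode('/', $phpSelf),
        function ($s) { return $s !== ''; }
    ));

    // A trailing index.php means "use the containing directory instead".
    if ($segments && end($segments) === 'index.php') {
        array_pop($segments);
    }
    if (!$segments) {
        return $prefix . 'homepage';
    }

    // Same clean-up as in the question: dashes and ".php" become spaces.
    return $prefix . trim(str_replace(['-', '.php'], ' ', end($segments)));
}

echo page_name_from_path('/index.php'), PHP_EOL;                 // www : homepage
echo page_name_from_path('/about-us/index.php'), PHP_EOL;        // www : about us
echo page_name_from_path('/a/b/c/d/contact-form.php'), PHP_EOL;  // www : contact form
```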
I found this code very helpful
$protocol = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off')
? 'https' : 'http'; // Get protocol HTTP/HTTPS
$host = $_SERVER['HTTP_HOST']; // Get www.domain.com
$script = $_SERVER['SCRIPT_NAME']; // Get folder/file.php
$params = $_SERVER['QUERY_STRING'];// Get Parameters occupation=odesk&name=ashik
$currentUrl = $protocol . '://' . $host . $script . ($params !== '' ? '?' . $params : ''); // Adding all
echo $currentUrl;
When I spider a website, I get a lot of bad URLs like these:
http://example.com/../../.././././1.htm
http://example.com/test/../test/.././././1.htm
http://example.com/.//1.htm
http://example.com/../test/..//1.htm
All of these should be http://example.com/1.htm.
How can I do this with PHP? Thanks.
PS: I use http://snoopy.sourceforge.net/
I get a lot of repeated links in my database; 'http://example.com/../test/..//1.htm' should be 'http://example.com/1.htm'.
You could do it like this, assuming all the URLs you have provided are expected to be http://example.com/1.htm:
$test = array('http://example.com/../../../././.\./1.htm',
'http://example.com/test/../test/../././.\./1.htm',
'http://example.com/.//1.htm',
'http://example.com/../test/..//1.htm');
foreach ($test as $url){
$u = parse_url($url);
$path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']);
echo $path.'<br />'.PHP_EOL;
}
/* result
http://example.com/1.htm<br />
http://example.com/1.htm<br />
http://example.com/1.htm<br />
http://example.com/1.htm<br />
*/
//or as a function @lpc2138
function getRealUrl($url){
$u = parse_url($url);
$path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']);
$path .= (!empty($u['query'])) ? '?'.$u['query'] : '';
return $path;
}
You seem to be looking for an algorithm to remove the dot segments:
function remove_dot_segments($abspath) {
$ib = $abspath;
$ob = '';
while ($ib !== '') {
if (substr($ib, 0, 3) === '../') {
$ib = substr($ib, 3);
} else if (substr($ib, 0, 2) === './') {
$ib = substr($ib, 2);
} else if (substr($ib, 0, 2) === '/.' && (strlen($ib) === 2 || $ib[2] === '/')) {
$ib = '/'.substr($ib, 3);
} else if (substr($ib, 0, 3) === '/..' && (strlen($ib) === 3 || $ib[3] === '/')) {
$ib = '/'.substr($ib, 4);
$ob = substr($ob, 0, strlen($ob)-strlen(strrchr($ob, '/')));
} else if ($ib === '.' || $ib === '..') {
$ib = '';
} else {
$pos = strpos($ib, '/', 1);
if ($pos === false) {
$ob .= $ib;
$ib = '';
} else {
$ob .= substr($ib, 0, $pos);
$ib = substr($ib, $pos);
}
}
}
return $ob;
}
This removes the . and .. segments. Any removal of any other segment like an empty one (//) or .\. is not as per standard as it changes the semantics of the path.
You could do some fancy regex but this works just fine.
fixUrl('http://example.com/../../../././.\./1.htm');
function fixUrl($str) {
$str = str_replace('../', '', $str);
$str = str_replace('./', '', $str);
$str = str_replace('\.', '', $str);
return $str;
}
I'm grabbing links from a website, but I'm having a problem: the higher I set the recursion depth for the function, the stranger the results become.
For example, when I call the function like this:
crawl_page("http://www.mangastream.com/", 10);
I get results like this for about half the pages:
http://mangastream.com/read/naruto/51619850/1/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2/read/naruto/51619850/2
EDIT
I'm expecting results like this instead:
http://mangastream.com/manga/read/naruto/51619850/1
Here's the function I've been using to get the results:
function crawl_page($url, $depth)
{
static $seen = array();
if (isset($seen[$url]) || $depth === 0) {
return;
}
$seen[$url] = true;
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href');
if (0 !== strpos($href, 'http')) {
$href = rtrim($url, '/') . '/' . ltrim($href, '/');
}
if(shouldScrape($href)==true)
crawl_page($href, $depth - 1);
}
echo $url,"\r";
//,pageStatus($url)
}
Any help with this would be greatly appreciated.
The construction of your new URL is not correct. Replace:
$href = rtrim($url, '/') . '/' . ltrim($href, '/');
with:
if (substr($href, 0, 1)=='/') {
// href relative to root
$info = parse_url($url);
$href = $info['scheme'].'://'.$info['host'].$href;
} else {
// href relative to current path
$href = rtrim(dirname($url), '/') . '/' . $href;
}
I think your problem lies in this line:
$href = rtrim($url, '/') . '/' . ltrim($href, '/');
To all relative urls on any given page this statement will prepend the current page url, which is obviously not what you want. What you need is to prepend only the protocol and host part of the URL.
Something like this should fix your problem (untested):
$url_parts = parse_url($url);
$href = $url_parts['scheme'] . '://' . $url_parts['host'] . $href;
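Putting the ideas from these answers together, the href can be resolved against the page it was found on rather than simply appended to it. This is a sketch, not the code from either answer: the function name resolve_href is hypothetical, the example URLs are made up, and it does not collapse "../" segments.

```php
<?php
// Hypothetical helper: resolve an href found on $pageUrl into an absolute URL.
function resolve_href(string $href, string $pageUrl): string
{
    // Already absolute (has a scheme such as http, https, mailto).
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;
    }

    $p = parse_url($pageUrl);
    $origin = $p['scheme'] . '://' . $p['host']
        . (isset($p['port']) ? ':' . $p['port'] : '');

    // Protocol-relative: reuse the page's scheme.
    if (substr($href, 0, 2) === '//') {
        return $p['scheme'] . ':' . $href;
    }
    // Root-relative: prepend only scheme and host.
    if (substr($href, 0, 1) === '/') {
        return $origin . $href;
    }
    // Relative to the current directory of the page.
    $path = $p['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo resolve_href('/manga/read/1', 'http://example.com/read/naruto/2'), PHP_EOL;
// http://example.com/manga/read/1
echo resolve_href('3', 'http://example.com/read/naruto/2'), PHP_EOL;
// http://example.com/read/naruto/3
```

Inside crawl_page() this would replace the line that builds $href with rtrim() and ltrim().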
I am trying to figure out how to convert an "external relative path" to an absolute one:
I'd really like a function that will do the following:
$path = "/search?q=query";
$host = "http://google.com";
$abspath = reltoabs($host, $path);
And have $abspath equal to "http://google.com/search?q=query"
Another example:
$path = "top.html";
$host = "www.example.com/documentation";
$abspath = reltoabs($host, $path);
And have $abspath equal to "http://www.example.com/documentation/top.html"
The problem is that it is not guaranteed to be in that format, and it could already be absolute, or be pointing to a different host entirely, and I'm not quite sure how to approach this.
Thanks.
You should try the PECL function http_build_url
http://php.net/manual/en/function.http-build-url.php
So there are three cases:
proper URL
no protocol
no protocol and no domain
Example code (untested):
if (preg_match('#^https?://#i', $userurl))
$url = preg_replace('#^http(s)?://#i', 'http$1://', $userurl); //protocol lowercase
//deem to have domain if a dot is found before a /
elseif (preg_match('#^[^/]+\\.[^/]+#', $userurl))
$url = "http://".$userurl;
else { //no protocol or domain
$url = "http://default.domain" . (($userurl[0] != "/") ? "/" : "") . $userurl;
}
$url = filter_var($url, FILTER_VALIDATE_URL);
if ($url === false)
die("User gave invalid url");
It appears I have solved my own problem:
function reltoabs($host, $path) {
$resulting = array();
$hostparts = parse_url($host);
$pathparts = parse_url($path);
if (array_key_exists("host", $pathparts)) return $path; // Absolute
// Relative
$opath = "";
if (array_key_exists("scheme", $hostparts)) $opath .= $hostparts["scheme"] . "://";
if (array_key_exists("user", $hostparts)) {
if (array_key_exists("pass", $hostparts)) $opath .= $hostparts["user"] . ":" . $hostparts["pass"] . "@";
else $opath .= $hostparts["user"] . "@";
} elseif (array_key_exists("pass", $hostparts)) $opath .= ":" . $hostparts["pass"] . "@";
if (array_key_exists("host", $hostparts)) $opath .= $hostparts["host"];
if (!array_key_exists("path", $pathparts) || $pathparts["path"][0] != "/") {
$dirname = explode("/", $hostparts["path"]);
$opath .= implode("/", array_slice($dirname, 0, count($dirname) - 1)) . "/" . basename($pathparts["path"]);
} else $opath .= $pathparts["path"];
if (array_key_exists("query", $pathparts)) $opath .= "?" . $pathparts["query"];
if (array_key_exists("fragment", $pathparts)) $opath .= "#" . $pathparts["fragment"];
return $opath;
}
Which seems to work pretty well, for my purposes.