I have a list of absolute URLs. I need to make sure that they all have trailing slashes, as applicable. So:
http://www.domain.com/ <-- does not need a trailing slash
http://www.domain.com <-- needs a trailing slash
http://www.domain.com/index.php <-- does not need a trailing slash
http://www.domain.com/?message=hello <-- does not need a trailing slash
I'm guessing I need to use regex, but matching URLs is a pain. I was hoping for an easier solution. Ideas?
For this very specific problem, not using a regex at all might be an option as well. If your list is long (several thousand URLs) and time is of any concern, you could choose to hand-code this very simple manipulation.
This will do the same:
$str .= (substr($str, -1) == '/' ? '' : '/');
It is of course not nearly as elegant or flexible as a regular expression, but it avoids the overhead of parsing the regular expression string and it will run as fast as PHP is able to do it.
It is arguably less readable than the regex, though this depends on how comfortable the reader is with regex syntax (some people might actually find it more readable).
It will certainly not check that the string is really a well-formed URL (such as e.g. zerkms' regex), but you already know that your strings are URLs anyway, so that is a bit redundant.
Though, if your list is something like 10 or 20 URLs, forget this post. Use a regex, the difference will be zero.
Rather than doing this using regex, you could use parse_url() to do this.
For example:
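$url = parse_url("http://www.example.com/ab/abc.html?a=b#xyz");
if(!isset($url['path'])) $url['path'] = '/';
// Only append the query and fragment when they are actually present.
$surl = $url['scheme']."://".$url['host'].$url['path']
    .(isset($url['query']) ? '?'.$url['query'] : '')
    .(isset($url['fragment']) ? '#'.$url['fragment'] : '');
echo $surl;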
$url = 'http://www.domain.com';
$need_to_add_trailing_slash = preg_match('~^https?://[^/]+$~', $url);
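A minimal sketch of how that check could be applied to the whole list, assuming the URLs sit in a plain array. The function name add_root_slash() is made up for this example; the pattern is the one from the snippet above.

```php
<?php
// Hypothetical helper: appends "/" only when the URL is a bare scheme://host.
function add_root_slash(string $url): string
{
    // Same pattern as above: scheme, "//", then no further "/" up to the end.
    return preg_match('~^https?://[^/]+$~', $url) ? $url . '/' : $url;
}

// The four sample URLs from the question.
$urls = [
    'http://www.domain.com/',
    'http://www.domain.com',
    'http://www.domain.com/index.php',
    'http://www.domain.com/?message=hello',
];

foreach ($urls as $url) {
    echo add_root_slash($url), "\n";
}
```

One caveat: a URL with a query string but no path, such as http://www.domain.com?x=1, would get the slash appended after the query string. parse_url() is probably the safer route if your list can contain those.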
Try this:
if (!preg_match("/.*\/$/", $url)) {
$url = "$url" . "/";
}
This may not be the most elegant solution, but it works like a charm. First we get the full URL, then check to see if it has a trailing slash. If not, we check that there is no query string, that it isn't an actual file, and that it isn't an actual directory. If the URL meets all these conditions, we do a 301 redirect with the trailing slash added.
If you're unfamiliar with PHP headers... note that there cannot be any output - not even whitespace - before this code.
$url = $_SERVER['REQUEST_URI'];
$lastchar = substr( $url, -1 );
if ( $lastchar != '/' ):
if ( !$_SERVER['QUERY_STRING'] and !is_file( $_SERVER['DOCUMENT_ROOT'].$url ) and !is_dir( $_SERVER['DOCUMENT_ROOT'].$url ) ):
header("HTTP/1.1 301 Moved Permanently");
header( "Location: $url/" );
endif;
endif;
function getHost($Address) {
$parseUrl = parse_url(trim($Address));
// Array keys must be quoted strings; bare host/path are undefined constants.
return trim(!empty($parseUrl['host'])
? $parseUrl['host']
: explode('/', $parseUrl['path'], 2)[0]
);
}
$httpreferer = getHost($_SERVER['HTTP_REFERER']);
$httpreferer = preg_replace('#^(http(s)?://)?w{3}\.#', '$1', $httpreferer);
echo $httpreferer;
I am using this to strip http://, www and subdomains to return just the host; however, it returns the following:
http://site.google.com ==> google.com
http://google.com ==> com
How do I get it to remove the subdomain only when one exists, instead of stripping down to the TLD when there isn't one?
Start with parse_url, specifically parse_url($url)['host']:
$arr = parse_url($url);
echo preg_replace('/^www\./', '', $arr['host'])."\n";
Output
site.google.com
google.com
Sandbox
The regex here just matches www. when it is at the start of the string. You could probably do this part a few ways, such as with
No subdomain
If you don't want any subdomain at all:
$host = parse_url($url)['host'];
echo preg_replace('/^(?:[-a-z0-9_]+\.)?([-a-z0-9_]+\..+)$/', '$1', $host)."\n";
Sandbox
No subdomain, no Country Code
$host = parse_url($url)['host'];
echo preg_replace('/^(?:[-a-z0-9_]+\.)?([-a-z0-9_]+)(\.[^.]+).*?$/', '$1$2', $host)."\n";
Sandbox
How it works:
Same as the previous one, but the domain is captured separately from the rest of the host. Instead of capturing everything after the domain, the second group (\.[^.]+) captures a dot followed by everything up to the next dot. Outside that group, .*? matches whatever is left (confusingly, the . means "any character" here): * means 0 or more times, and ? makes it non-greedy, so it doesn't take characters from the previous expressions.
Or, to put it another way: match anything 0 or more times, without stealing characters from the earlier groups. If nothing follows, as in www.google.com where there is nothing after .com, it matches 0 characters. But for www.google.com.uk it matches the .uk.
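To make the difference between the two patterns concrete, here is a small sketch that wraps each one in a function and runs a few sample hosts through them. The function names are hypothetical and only exist for this example; the patterns are copied from above.

```php
<?php
// Hypothetical wrapper: strips one leading subdomain label, keeps the rest.
function without_subdomain(string $host): string
{
    return preg_replace('/^(?:[-a-z0-9_]+\.)?([-a-z0-9_]+\..+)$/', '$1', $host);
}

// Hypothetical wrapper: also drops anything after the first TLD-like label.
function without_subdomain_or_cc(string $host): string
{
    return preg_replace('/^(?:[-a-z0-9_]+\.)?([-a-z0-9_]+)(\.[^.]+).*?$/', '$1$2', $host);
}

foreach (['google.com', 'www.google.com', 'www.google.com.uk'] as $host) {
    echo $host, ' => ', without_subdomain($host),
        ' | ', without_subdomain_or_cc($host), "\n";
}
```

Keep in mind that both patterns only guess at what the subdomain is. A host like google.co.uk with no subdomain comes out of the first pattern as co.uk, because google. is taken for the optional leading label.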
Single Line Answer.
Newer versions of PHP (5.4 and later, which added array dereferencing of function results) let you do this:
$host = parse_url($url)['host'];
So taking the last example we can compress that into one line and remove the variable assignment.
echo preg_replace('/^(?:[-a-z0-9_]+\.)?([-a-z0-9_]+)(\.[^.]+).*?$/', '$1$2',parse_url($url)['host'])."\n";
See it in action
That was just for fun!
Summary
Using parse_url is really the "correct" way to do it, or at least the proper way to start, as it removes a lot of the other "stuff" and gives you a good starting place. Anyway, this was fun for me ... :) ... and I needed a break from coding my website, because it's tedious for me now (it was 8 years old, so I'm redoing it in WordPress, and I've done about a zillion WordPress sites) ...
Cheers, hope it helps!
Found the Answer
$testAdd = "https://testing.google.co.uk";
$parse = parse_url($testAdd);
$httpreferer = preg_replace("/^([a-zA-Z0-9].*\.)?([a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z.]{2,})$/", '$2', $parse['host']);
echo $httpreferer;
This will also deal with domains that have a country-code TLD.
Thanks for all your help.
I still don't understand how regular expressions work with preg_replace. I have some URLs in text:
site.com/user/login.php?valid=tru
site.com/eng/page/some_page.php?valid=tru&anothervar=1
I want to change them so they become this:
site.com/user/login/
site.com/eng/page/some_page/
preg_replace(" 'no_mater_what_1'.php'no_mater_what_2' " , 'no_mater_what_1'/ , $some_var);
To avoid traps, like another .php substring in the path, you can use this replacement:
$url = preg_replace('~\.php(?:[?#]\N*|\z)~i', '', $url, -1, $c);
if (!$c) // not a php file, do something else
or in this way:
if (preg_match('~[^?#]+\.php(?=[?#]|\z)~Ai', $url, $m))
$url = $m[0];
else
// not a php file, do something else
This way ensures that the .php matched is the extension of the file because the regex engine will find the leftmost result that is followed by either a ? for the query part, a # for the fragment part or the end of the string.
pattern elements:
\N: a character that isn't a newline.
\z: anchor for the end of the string.
A: modifier that anchors the pattern at the start of the string
(?=...): lookahead assertion
The advantage of this approach is that it is safe while remaining efficient.
Another way, with parse_url:
You can use parse_url to separate a URL into its parts. This way is a little tedious because you need to rebuild the URL afterwards (and how you rebuild it depends on which elements are present in the URL), but it is far from impossible and it is also safe.
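A rough sketch of what the parse_url route could look like, under a few assumptions: the helper name php_to_pretty() is made up, user/password parts are ignored, and the target format (drop .php plus query and fragment, append /) is taken from the question.

```php
<?php
// Hypothetical helper: drops ".php" plus any query/fragment and appends "/".
// For a scheme-less string such as "site.com/user/login.php", parse_url()
// reports everything before "?" as the path, which this sketch relies on.
function php_to_pretty(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['path'])) {
        return $url; // nothing to rewrite
    }

    $path = $parts['path'];
    // Only rewrite when the path itself ends in ".php" (case-insensitive).
    if (strcasecmp(substr($path, -4), '.php') !== 0) {
        return $url;
    }
    $path = substr($path, 0, -4) . '/';

    // Rebuild only the pieces that were present in the input.
    $prefix = '';
    if (isset($parts['scheme'])) {
        $prefix .= $parts['scheme'] . '://';
    }
    if (isset($parts['host'])) {
        $prefix .= $parts['host'];
    }
    if (isset($parts['port'])) {
        $prefix .= ':' . $parts['port'];
    }
    return $prefix . $path;
}

echo php_to_pretty('site.com/eng/page/some_page.php?valid=tru&anothervar=1'), "\n";
```

Because only the end of the path is inspected, a .php that appears earlier in the path or inside the query string is left alone.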
But why not simply do this:
$replace = explode('.php',$some_var);
$replace = $replace[0] . '/';
I don't find it necessary to use a regular expression here, because ".php" is not repeated in the string.
This should work
$subject = 'site.com/eng/page/some_page.php?valid=tru&anothervar=1';
if (preg_match('/(.*)\.php(?:\?.*)?/', $subject, $regs)) {
$result = $regs[1] .'/';
echo $subject .' => '. $result;
} else {
echo 'NOT FOUND';
}
The regular expression doing the magic is this
/(.*)\.php(?:\?.*)?/
by parts:
(.*)\.php
Capture everything until (excluding) ".php"
(?:\?.*)
Search for the pattern "?..."
?
Make that last pattern optional
Because your two examples show up on the same line, this looks a bit confusing. However, it appears that you want to replace everything from .php to the end of the line with a /. So, use:
$new_link = preg_replace('/\.php.*$/', '/', $old_link);
You need the \ in front of the . because . is a special character that needs to be escaped to make it work like a period. Then, you look for php, in that order, followed by anything to the end of the line ($ means end of the line). You replace all of that with /.
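As a quick check, here is that replacement wrapped in a function (the name strip_php_suffix() is just for this example) and applied to the sample URLs from the question.

```php
<?php
// Hypothetical wrapper around the replacement shown above.
function strip_php_suffix(string $link): string
{
    // Everything from the first ".php" to the end of the line becomes "/".
    return preg_replace('/\.php.*$/', '/', $link);
}

echo strip_php_suffix('site.com/user/login.php?valid=tru'), "\n";
echo strip_php_suffix('site.com/eng/page/some_page.php?valid=tru&anothervar=1'), "\n";
```

Note that the match starts at the first .php in the string, so a path like site.com/a.php/b.php?x=1 would be cut down to site.com/a/.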
I'm importing data from a csv and I've been looking high and low for a particular regular expression to remove trailing slashes from domain names without a directory after it. See the following example:
example.com/ (remove trailing slash)
example.co.uk/ (remove trailing slash)
example.com/gb/ (do not remove trailing slash)
Can anyone help me out with this or at least point me in the right direction?
Edit: This is my progress so far, I've only matched the extension at the moment but it's picking up those domains with trailing directories.
[a-z0-9\-]+[a-z0-9]\/[a-z]
Many thanks
I don't know how it would compare to a regular expression performance-wise, but you can do it without one.
A simple example:
$string = rtrim ($string, '/');
$string .= (strpos($string, '/') === false) ? '' : '/';
In the second line I'm only adding a / at the end if the string already contains one (to separate domain from folder).
A more solid approach would probably be to only rtrim if the first / found, is the last character of the string.
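A minimal sketch of that more solid variant, assuming the strings carry no scheme (an http:// prefix would contain slashes of its own). The function name trim_domain_slash() is made up for this example.

```php
<?php
// Hypothetical helper: removes the trailing slash only from a bare domain,
// i.e. when the first "/" in the string is also its last character.
function trim_domain_slash(string $s): string
{
    $first = strpos($s, '/');
    if ($first !== false && $first === strlen($s) - 1) {
        return substr($s, 0, -1);
    }
    return $s; // no slash at all, or a directory follows the domain
}

foreach (['example.com/', 'example.co.uk/', 'example.com/gb/'] as $s) {
    echo trim_domain_slash($s), "\n";
}
```

Unlike the two-line version, this never adds a slash that wasn't there, so example.com/gb stays as it is.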
Not sure, but you can try this:
if the value is only a $_SERVER['SERVER_NAME'], remove the slash; otherwise keep it,
because $_SERVER['SERVER_NAME'] returns the host without any directory.
try this
/^(http|https|ftp)\:\/\/[a-z0-9\-\.]+\.[a-z]{2,3}(:[a-z0-9]*)?\/?([a-z0-9\-\._\?\,\'\/\\\+&%\$#\=~])*$/i
You could test for a match on /[a-z]/, then remove the last character if it's not found.
This is JavaScript, but it'd be similar in PHP.
/\/[a-z]+\//
var txt = 'example.com/gb/';
var match = txt.match(/\/[a-z]+\//);
if (!match) {
alert(txt.substring(0, txt.length - 1));
}
else {
alert(txt);
}
http://jsfiddle.net/xjKTS/
Try this, it works:
<?php
$result = preg_replace('/^([^\/]+)(\/)$/','$1',$your_data);
?>
I have tested like this:
<?php
$str1 = 'example.com/';
$str2 = 'example.co.uk/';
$str3 = 'example.com/gb/';
$reg = '/^([^\/]+)(\/)$/';
echo preg_replace($reg,'$1',$str1);//example.com
echo preg_replace($reg,'$1',$str2);//example.co.uk
echo preg_replace($reg,'$1',$str3);//example.com/gb/
?>
I have some URLs, like www.amazon.com/, www.digg.com or www.microsoft.com/ and I want to remove the trailing slash, if it exists, so not just the last character. Is there a trim or rtrim for this?
You put rtrim in your question, why not just look it up?
$url = rtrim($url,"/");
As a side note, look up any PHP function by doing the following:
http://php.net/functionname
http://php.net/rtrim
http://php.net/trim
(rtrim stands for 'Right trim')
Simple and works across both Windows and Unix:
$url = rtrim($url, '/\\');
I came here looking for a way to remove trailing slash and redirect the browser, I have come up with an answer that I would like to share for anyone coming after me:
//remove trailing slash from uri
if( ($_SERVER['REQUEST_URI'] != "/") and preg_match('{/$}',$_SERVER['REQUEST_URI']) ) {
header ('Location: '.preg_replace('{/$}', '', $_SERVER['REQUEST_URI']));
exit();
}
The ($_SERVER['REQUEST_URI'] != "/") will avoid host URI e.g www.amazon.com/ because web browsers always send a trailing slash after a domain name, and preg_match('{/$}',$_SERVER['REQUEST_URI']) will match all other URI with trailing slash as last character. Then preg_replace('{/$}', '', $_SERVER['REQUEST_URI']) will remove the slash and hand over to header() to redirect. The exit() function is important to stop any further code execution.
$urls="www.amazon.com/ www.digg.com/ www.microsoft.com/";
echo preg_replace("/\b\//","",$urls);
I have a regular expression that I use to reduce multiple slashes to single slashes. The purpose is to read a url that is previously converted to a human readable link using mod_rewrite in apache, like this :
http://www.website.com/about/me
This works :
$uri = 'about//me';
$uri = preg_replace('#//+#', '/', $uri);
echo $uri; // echoes 'about/me'
This doesn't work :
$uri = '/about//me';
$uri = preg_replace('#//+#', '/', $uri);
echo $uri; // echoes '/about/me'
I need to be able to work with each url parameter alone, but in the second example, if I explode the trailing slash, it would return me 3 segments instead of 2 segments. I can verify in PHP if any of the parameters is empty, but as I'm using that regular expression, it would be nice if the regular expression already took care of that for me, so that I don't need to worry about segment validation.
Any thoughts?
str_replace may be faster in this case
$uri = str_replace("//","/",$uri);
Secondly: use trim: http://hu.php.net/manual/en/function.trim.php
$uri = trim($uri,"/");
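One thing to watch with str_replace() here: a single pass only shortens longer runs ("///" becomes "//", not "/"). A small sketch that repeats the replacement until no double slash is left and then trims, with collapse_slashes() being a made-up name for this example:

```php
<?php
// Hypothetical helper: collapse runs of "/" using str_replace() only,
// then strip leading and trailing slashes with trim().
function collapse_slashes(string $uri): string
{
    // Each pass shortens every run of slashes, so the loop always terminates.
    while (strpos($uri, '//') !== false) {
        $uri = str_replace('//', '/', $uri);
    }
    return trim($uri, '/');
}

echo collapse_slashes('////about///me//'), "\n";
```

Whether this is actually faster than a single preg_replace() call would need measuring on your own data.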
This converts double slashes in a string to a single slash, but the advantage of this code is that the slashes in the protocol portion of the string (http://) are kept.
preg_replace("#(^|[^:])//+#", "\\1/", $str);
How about running a second replace on $uri?
$uri = preg_replace('#^/#', '', $uri);
That way a leading slash is removed. Doing it all in one preg_replace beats me :)
Using ltrim could also be a way to go (probably even faster).
I need to be able to work with each
url parameter alone, but in the second
example, if I explode the trailing
slash, it would return me 3 segments
instead of 2 segments.
One fix for this is to use preg_split with the flags argument (the fourth one; the third is the limit) set to PREG_SPLIT_NO_EMPTY:
$uri = '/about//me';
// -1 means "no limit"; the flag drops the empty pieces.
$uri_segments = preg_split('#/#', $uri, -1, PREG_SPLIT_NO_EMPTY);
// $uri_segments[0] == 'about';
// $uri_segments[1] == 'me';
you can combine all three alternatives into one regexp
$urls = array(
'about/me',
'/about//me',
'/about///me/',
'////about///me//'
);
print_r(
preg_replace('~^/+|/+$|/(?=/)~', '', $urls)
);
You may split the string via preg_split instead, skipping the sanitizing altogether. You still have to deal with the empty chunks, though.
Late, but most of the methods above will collapse the slashes in http:// too. This one does not:
function to_single_slashes($input) {
return preg_replace('~(^|[^:])//+~', '\\1/', $input);
}
# out: http://localhost/lorem-ipsum/123/456/
print to_single_slashes('http:///////localhost////lorem-ipsum/123/////456/');