Regex to Remove Everything After 4th Slash in URL

I'm working in PHP with friendly URL paths in the form of:
/2011/09/here-is-the-title
/2011/09/here-is-the-title/2
I need to standardize these URL paths to remove anything after the 4th slash, including the slash itself. The value after the 4th slash is sometimes a number, but can also be any parameter.
Any thoughts on how I could do this? I imagine regex could handle it, but I'm terrible with it. I also thought a combination of strpos and substr might be able to handle it, but cannot quite figure it out.

You can use the explode() function:
$parts = explode('/', '/2011/09/here-is-the-title/2');
$output = implode('/', array_slice($parts, 0, 4));

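Replace
%^((/[^/]*){3}).*%
with $1. In PHP, do not append a g flag as you would in some other regex tools: it is not a valid PCRE modifier, and preg_replace already replaces every match.
See http://regexr.com?2vlr8 for a live example.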

If your regex implementation supports arbitrary-length look-behind assertions, you could replace
(?<=^[^/]*(/[^/]*){3})/.*$
with an empty string.
If it does not (PHP's PCRE, for one, rejects unbounded look-behinds like this), you can replace
^([^/]*(?:/[^/]*){3})/.*$
with the contents of the first capturing group. A PHP example for the second one can be found at ideone.com.
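
In PHP, the second pattern might be applied roughly like this. It is only a sketch: the function name trimAfterFourthSlash is made up for illustration, and ~ is used as the delimiter so the slashes in the pattern need no escaping.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the answers above).
function trimAfterFourthSlash(string $path): string
{
    // Keep the text before the 4th slash; drop that slash and everything after it.
    // Paths with fewer than four slashes do not match and are returned unchanged.
    return preg_replace('~^([^/]*(?:/[^/]*){3})/.*$~', '$1', $path);
}

echo trimAfterFourthSlash('/2011/09/here-is-the-title/2'), "\n"; // /2011/09/here-is-the-title
echo trimAfterFourthSlash('/2011/09/here-is-the-title'), "\n";   // /2011/09/here-is-the-title
```

Because [^/]* can never cross a slash, the group always ends just before the 4th slash, no matter how many segments follow.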

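You could also use a loop:
$result = ''; // $url holds the path to trim
$count = 0;
foreach (str_split($url) as $c) {
    if ($c === '/') $count++;   // count the slashes seen so far
    if ($count >= 4) break;     // stop at the 4th slash, leaving it out
    $result .= $c;
}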

Related

regex to clean up url

I am looking for a way to get a valid url out of a string like:
$string = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';
My original solution was:
preg_match('#^[^:|]*#', str_replace('//', '/', $string), $modifiedPath);
But obviously it's going to remove a slash from the http:// instead of the one in the middle of the string.
My expected output that I want from the original is:
http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg
I could always break off the http part of the string first but would like a more elegant solution in the form of regex if possible. Thanks.

This will do exactly what you are asking:
<?php
$string = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';
preg_match('/^([^|]+)/', $string, $m); // get everything up to and NOT including the first pipe (|)
$string = $m[1];
$string = preg_replace('/(?<!:)\/\//', '/' ,$string); // replace all occurrences of // as long as they are not preceded by :
echo $string; // outputs: http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg
exit;
?>
EDIT:
(?<!X) in regular expressions is the syntax for what is called a negative lookbehind. The X is replaced with the character(s) we are testing for.
The following expression would match every instance of double slashes (//):
\/\/
But we need to make sure that the match we are looking for is NOT preceded by the : character, so we need to 'look behind' our match to see if the : character is there. If it is, then we don't want it to be counted as a match:
(?<!:)\/\/
The ! is what says NOT to match in our lookbehind. If we changed it to (?<=:)\/\/ then it would only match the double slashes that did have the : preceding them.
Here is a quick tutorial that can explain it all better than I can: lookahead and lookbehind tutorial
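
To see the two kinds of lookbehind side by side, here is a small sketch. The shortened URL is made up for the demonstration; the negative form is (?<!:) and the positive form is (?<=:).

```php
<?php
$url = 'http://somesite.com/directory//sites'; // shortened, made-up sample

// Negative lookbehind: collapse "//" only where it is NOT preceded by ":".
$cleaned = preg_replace('~(?<!:)//~', '/', $url);
echo $cleaned, "\n"; // http://somesite.com/directory/sites

// Positive lookbehind: match "//" only where it IS preceded by ":".
$schemeSlashes = preg_match_all('~(?<=:)//~', $url);
echo $schemeSlashes, "\n"; // 1
```

The lookbehind itself consumes nothing, so the : is tested but never becomes part of the match or the replacement.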

Assuming all your strings are in the form given, you don't need any but the simplest of regexes to do this; if you want an elegant solution, then a regex is definitely not what you need. Also, double slashes are legal in a URL path, and most servers treat them the same as a single slash, just like in a Unix path, so you may not need to get rid of them at all.
Why not just
$url = preg_split('/\|/', $string)[0];
?
If you really, really care about getting rid of the double slashes in the URL, then you can follow this with
$url = preg_replace('/([^:])\/\//', '$1/', $url);
or even combine them into
$url = preg_replace('/([^:])\/\//', '$1/', preg_split('/\|/', $string)[0]);
although that last form gets a little bit hairy.

Since this is a quite strictly defined situation, I'd consider just one preg to be the most elegant solution.
Off the top of my head:
$sanitizedURL = preg_replace('~((?<!:)/(?=/)|\\|.+)~', '', $rawURL);
Basically, what this does is look for any forward slash that IS NOT preceded by a colon (:) and IS followed by another forward slash. It also searches for any pipe character and any characters following it.
Anything found is removed from the result.
I can explain the RegEx in more detail if you like.
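
For a more detailed breakdown, the same pattern can be written with the x modifier, which lets PCRE ignore whitespace and allows # comments inside the pattern. This is a sketch of one way to lay it out (the outer capturing group is dropped because nothing uses it), run against the sample string from the question:

```php
<?php
$rawURL = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';

// Same alternatives as the one-line pattern above, spread out and commented.
$pattern = '~
    (?<!:) / (?=/)   # a slash that is not preceded by ":" and is followed by another slash
  |                  # or
    \| .+            # the first pipe character plus everything after it
~x';

$sanitizedURL = preg_replace($pattern, '', $rawURL);
echo $sanitizedURL, "\n"; // http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg
```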

how do I match a url in php using regex?

I'm trying to match the value of query v in the following regex:
http:\/\/www\.domain\.com\/videos\/video.php\?.*v=([a-z0-9-_]+)
A sample url:
http://www.domain.com/videos/video.php?v=9Gu0sd2dmm91B9b1
The url is always www and I'm only trying to match the v value. Does anyone know what's wrong with my syntax?

Use the parse_url() function. It's way easier to use:
$url_components = parse_url("http://www.domain.com/videos/video.php?v=9Gu0sd2dmm91B9b1");
echo $url_components['query'];
From there I think you can do the rest and slice off the first couple of letters. Once you do that you're left with only the stuff after v=.
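
Rather than slicing the query string by hand, "the rest" could be done with parse_str(), which turns a query string into an array. A minimal sketch, using the sample URL from the question:

```php
<?php
$url = 'http://www.domain.com/videos/video.php?v=9Gu0sd2dmm91B9b1';

// With PHP_URL_QUERY, parse_url() returns just the query string (or null if there is none).
$query = parse_url($url, PHP_URL_QUERY);

// parse_str() decodes the query string into an associative array.
$params = [];
parse_str((string) $query, $params);

$v = $params['v'] ?? null;
echo $v, "\n"; // 9Gu0sd2dmm91B9b1
```

This keeps working when v is not the first parameter, since the lookup is by name rather than by position.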

You forgot the capital letters:
http:\/\/www\.domain\.com\/videos\/video.php\?.*v=([a-zA-Z0-9-_]+)

You are not escaping the period '.' in video.php. I also use a different delimiter if I am escaping paths/URLs, like this:
preg_match( "#http://www\.domain\.com/videos/video\.php\?.*v=([^&]*)#", $url, $matches );
If the v= is in the middle of the query string,
v=([^&]*)
will match everything up to the next & symbol, just in case characters other than alphanumerics, _ and - end up in there for some reason.

PHP if string contains URL isolate it

In PHP, I need to be able to figure out if a string contains a URL. If there is a URL, I need to isolate it as another separate string.
For example: "SESAC showin the Love! http://twitpic.com/1uk7fi"
I need to be able to isolate the URL in that string into a new string. At the same time the URL needs to be kept intact in the original string. Follow?
I know this is probably really simple but it's killing me.

Something like
preg_match('/[a-zA-Z]+:\/\/[0-9a-zA-Z;.\/?:@=_#&%~,+$]+/', $string, $matches);
$matches[0] will hold the result.
(Note: this regex is certainly not RFC compliant; it may fetch malformed (per the spec) URLs. See http://www.faqs.org/rfcs/rfc1738.html).

This doesn't account for dashes (-); I needed to add \- to the character class:
preg_match('/[a-zA-Z]+:\/\/[0-9a-zA-Z;.\/\-?:@=_#&%~,+$]+/', $_POST['string'], $matches);

URLs can't contain spaces, so...
\b(?:https?|ftp)://\S+
Should match any URL-like thing in a string.
The above is the pure regex. PHP preg_* and string escaping rules apply before you can use it.
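
As a sketch of what that looks like once PHP's rules are applied: choosing ~ as the delimiter means the slashes need no escaping, and a single-quoted string leaves \b and \S untouched. The sample text is the one from the question.

```php
<?php
$text = 'SESAC showin the Love! http://twitpic.com/1uk7fi';

// $matches[0] holds the whole match; $url stays null when no URL is found.
$url = preg_match('~\b(?:https?|ftp)://\S+~', $text, $matches) ? $matches[0] : null;

echo $url, "\n";  // http://twitpic.com/1uk7fi
echo $text, "\n"; // the original string is left untouched
```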

$test = "SESAC showin the Love! http://twitpic.com/1uk7fi";
$myURL = strstr($test, "http"); // everything from the first "http" to the end of the string
echo $myURL; // prints http://twitpic.com/1uk7fi

Regular Expression to collect everything after the last /

I'm new at regular expressions and wonder how to phrase one that collects everything after the last /.
I'm extracting an ID used by Google's GData.
my example string is
http://spreadsheets.google.com/feeds/spreadsheets/p1f3JYcCu_cb0i0JYuCu123
Where the ID is: p1f3JYcCu_cb0i0JYuCu123
Oh and I'm using PHP.

This matches at least one of (anything not a slash) followed by end of the string:
[^/]+$
Notes:
No parens because it doesn't need any groups - result goes into group 0 (the match itself).
Uses + (instead of *) so that if the last character is a slash it fails to match (rather than matching empty string).
But, most likely a faster and simpler solution is to use your language's built-in string list processing functionality - i.e. ListLast( Text , '/' ) or equivalent function.
For PHP, the closest function is strrchr which works like this:
strrchr( Text , '/' )
This includes the slash in the results - as per Teddy's comment below, you can remove the slash with substr:
substr( strrchr( Text, '/' ), 1 );

Generally:
/([^/]*)$
The data you want would then be the match of the first group.
Edit: Since you’re using PHP, you could also use strrchr, which returns everything from the last occurrence of a character in a string up to the end. Or you could use a combination of strrpos and substr: first find the position of the last occurrence and then get the substring from that position up to the end. Or explode and array_pop: split the string at the / and get just the last part.

You can also get the "filename", or the last part, with the basename function.
<?php
$url = 'http://spreadsheets.google.com/feeds/spreadsheets/p1f3JYcCu_cb0i0JYuCu123';
echo basename($url); // "p1f3JYcCu_cb0i0JYuCu123"
On my box I could just pass the full URL. It's possible you might need to strip off http:/ from the front.
Basename and dirname are great for moving through anything that looks like a unix filepath.

/^.*\/(.*)$/
^ = start of the row
.*\/ = greedy match from the start of the row to the last occurrence of /
(.*) = group of everything that comes after the last occurrence of /

You can also use a normal string split:
$str = "http://spreadsheets.google.com/feeds/spreadsheets/p1f3JYcCu_cb0i0JYuCu123";
$s = explode("/",$str);
print end($s);

This pattern will not capture the last slash in $0, and it won't match anything if there are no characters after the last slash.
/(?<=\/)([^\/]+)$/
Edit: but it requires lookbehind, which older ECMAScript engines (JavaScript before ES2018, ActionScript), Ruby 1.8 and a few other flavors do not support. If you are using one of those flavors, you can use:
/\/([^\/]+)$/
But it will capture the last slash in $0.

Not a PHP programmer, but strrpos seems a more promising place to start. Find the rightmost '/', and everything past that is what you are looking for. No regex used.
Find position of last occurrence of a char in a string
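
A minimal sketch of that strrpos idea, paired with substr and using the sample URL from the question:

```php
<?php
$url = 'http://spreadsheets.google.com/feeds/spreadsheets/p1f3JYcCu_cb0i0JYuCu123';

// strrpos() returns the index of the last "/" or false if there is none.
$pos = strrpos($url, '/');

// Take everything after that slash; fall back to the whole string when no slash exists.
$id = ($pos === false) ? $url : substr($url, $pos + 1);
echo $id, "\n"; // p1f3JYcCu_cb0i0JYuCu123
```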

Based on @Mark Rushakoff's answer, the best solution for different cases:
<?php
$path = "http://spreadsheets.google.com/feeds/spreadsheets/p1f3JYcCu_cb0i0JYuCu123?var1&var2#hash";
$vars = strrchr($path, "?"); // ?var1&var2#hash
var_dump(preg_replace('/'. preg_quote($vars, '/') . '$/', '', basename($path))); // p1f3JYcCu_cb0i0JYuCu123
?>
Regular Expression to collect everything after the last /
How to get file name from full path with PHP?

How can I match the domain part of a URL in PHP?

I'm so bad at regexp, but I'm trying to get some/path/image.jpg out of http://somepage.com/some/...etc and trying this method:
function removeDomain($string) {
return preg_replace("/http:\/\/.*\//", "", $string);
}
It isn't working -- so far as I can tell it's just returning a blank string. How do I write this regexp?

You should use parse_url.

You might want to use this rather than regex:
http://cz2.php.net/manual/en/function.parse-url.php
This will break up the URL for you, so you can just read the domain name (or the path) from the resulting array.
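
A short sketch of the parse_url route. The full URL here is a made-up stand-in, since the question only shows an abbreviated one:

```php
<?php
$url = 'http://somepage.com/some/path/image.jpg'; // hypothetical full URL

// PHP_URL_PATH gives "/some/path/image.jpg"; it is null when the URL has no path.
$path = parse_url($url, PHP_URL_PATH);

// Drop the leading slash to get the relative path the question asks for.
$relative = is_string($path) ? ltrim($path, '/') : '';
echo $relative, "\n"; // some/path/image.jpg

// The domain is available the same way.
$host = parse_url($url, PHP_URL_HOST);
echo $host, "\n"; // somepage.com
```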

Use parse_url as other people have already said.
But to answer your question about why your regex isn't working: .* is greedy, so it matches everything up to the last / in the URL, which can be the entire URL. That whole match is then replaced with an empty string, hence your results. Try the following instead, which will only match the scheme and hostname (anything up to the first '/' after the host):
function removeDomain($string) {
return preg_replace("#^https?://[^/]+/#", "", $string);
}

While SilentGhost is right, the reason your regex is failing is that .* is greedy and will eat everything as long as there is a / afterwards.
If you put a ? after your .* it becomes lazy and will only match up to the first /:
function removeDomain($string) {
return preg_replace("/http:\/\/.*?\//", "", $string);
}
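
The difference is easy to see by running both versions on the same input. This is just a sketch with a made-up URL, using ~ as the delimiter to avoid escaping the slashes:

```php
<?php
$url = 'http://somepage.com/some/path/image.jpg'; // hypothetical example URL

// Greedy: .* runs to the LAST slash, so the directories are removed too.
$greedy = preg_replace('~http://.*/~', '', $url);
echo $greedy, "\n"; // image.jpg

// Lazy: .*? stops at the FIRST slash after the scheme, i.e. the end of the host.
$lazy = preg_replace('~http://.*?/~', '', $url);
echo $lazy, "\n"; // some/path/image.jpg
```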

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
