How can I match the domain part of a URL in PHP? - php

I'm so bad at regexp, but I'm trying to get some/path/image.jpg out of http://somepage.com/some/...etc and trying this method:
function removeDomain($string) {
    return preg_replace("/http:\/\/.*\//", "", $string);
}
It isn't working -- so far as I can tell it's just returning a blank string. How do I write this regexp?

You should use parse_url.

You might want to use parse_url rather than a regex:
http://cz2.php.net/manual/en/function.parse-url.php
It breaks the URL up for you, so you can just read the part you need (such as the domain name or the path) from the resulting array.
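As a minimal sketch of that approach, assuming the goal is the path without its leading slash (the helper name pathWithoutDomain is hypothetical, not from the question):

```php
<?php
// Hypothetical helper: return the path part of a URL without the leading slash.
function pathWithoutDomain(string $url): string
{
    // With PHP_URL_PATH, parse_url returns only the path component
    // (null if the URL has no path, false if the URL is seriously malformed).
    $path = parse_url($url, PHP_URL_PATH);

    return is_string($path) ? ltrim($path, '/') : '';
}

echo pathWithoutDomain('http://somepage.com/some/path/image.jpg'), "\n";
// prints: some/path/image.jpg
```

Unlike the regex versions, this does not depend on the scheme being http.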

Use parse_url as other people have already said.
But to answer your question about why your regex isn't working: .* is greedy, so it matches as much as it can. The match runs from http:// through the last / in the URL (the whole URL, if it ends in a slash), and that entire match is replaced with an empty string, hence your results. Try the following instead, which matches only the scheme and hostname (everything up to and including the first '/' after the host):
function removeDomain($string) {
    return preg_replace("#^https?://[^/]+/#", "", $string);
}

While SilentGhost is right, the reason your regex is failing is because .* is greedy, and will eat everything, as long as there is a / afterwards.
If you put a ? after your .*, it becomes lazy and will only match up to the first /
function removeDomain($string) {
    return preg_replace("/http:\/\/.*?\//", "", $string);
}
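
To make the difference concrete, here is a small side-by-side sketch of the greedy and lazy patterns. The sample URL is an assumption (the question's URL is elided), and # is used as the delimiter only to avoid escaping the slashes:

```php
<?php
// Assumed sample URL; the one in the question is truncated.
$url = 'http://somepage.com/some/path/image.jpg';

// Greedy: .* runs to the LAST slash, so the directories are removed too.
$greedy = preg_replace('#http://.*/#', '', $url);

// Lazy: .*? stops at the FIRST slash after the host.
$lazy = preg_replace('#http://.*?/#', '', $url);

echo $greedy, "\n"; // image.jpg
echo $lazy, "\n";   // some/path/image.jpg
```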

Related

regex to clean up url

I am looking for a way to get a valid url out of a string like:
$string = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';
My original solution was:
preg_match('#^[^:|]*#', str_replace('//', '/', $string), $modifiedPath);
But obviously it's going to remove a slash from the http:// as well as the one in the middle of the string.
My expected output that I want from the original is:
http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg
I could always break off the http part of the string first but would like a more elegant solution in the form of regex if possible. Thanks.
This will do exactly what you are asking:
<?php
$string = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';
preg_match('/^([^|]+)/', $string, $m); // get everything up to and NOT including the first pipe (|)
$string = $m[1];
$string = preg_replace('/(?<!:)\/\//', '/' ,$string); // replace all occurrences of // as long as they are not preceded by :
echo $string; // outputs: http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg
exit;
?>
EDIT:
(?<!X) in regular expressions is the syntax for what is called a lookbehind. The X is replaced with the character(s) we are testing for.
The following expression would match every instance of double slashes (//):
\/\/
But we need to make sure that the match we are looking for is NOT preceded by the : character so we need to 'lookbehind' our match to see if the : character is there. If it is then we don't want it to be counted as a match:
(?<!:)\/\/
The ! is what says NOT to match in our lookbehind. If we changed it to (?<=:)\/\/ (a positive lookbehind) then it would only match the double slashes that did have the : preceding them.
Here is a quick tutorial that can explain it all better than I can: lookahead and lookbehind tutorial
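A short sketch contrasting the negative and positive lookbehind on a shortened version of the URL (the shortened string is an assumption made for readability):

```php
<?php
$s = 'http://somesite.com/directory//sites';

// Negative lookbehind: only the "//" NOT preceded by ":" is replaced.
$collapsed = preg_replace('#(?<!:)//#', '/', $s);
echo $collapsed, "\n"; // http://somesite.com/directory/sites

// Positive lookbehind: only the "//" that IS preceded by ":" matches.
preg_match_all('#(?<=:)//#', $s, $m, PREG_OFFSET_CAPTURE);
echo count($m[0]), ' match at offset ', $m[0][0][1], "\n"; // 1 match at offset 5
```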
Assuming all your strings are in the form given, you don't need any but the simplest of regexes to do this; if you want an elegant solution, then a regex is definitely not what you need. Also, double slashes are legal in a URL, and most servers treat them the same as a single slash (much like a Unix path), so you may not need to get rid of them at all.
Why not just
$url = preg_split('/\|/', $string)[0]; // index the result directly: array_shift() expects a variable, not a function result
?
If you really, really care about getting rid of the double slashes in the URL, then you can follow this with
$url = preg_replace('/([^:])\/\//', '$1/', $url);
or even combine them into
$url = preg_replace('/([^:])\/\//', '$1/', preg_split('/\|/', $string)[0]);
although that last form gets a little bit hairy.
Since this is a quite strictly defined situation, I'd consider just one preg to be the most elegant solution.
Off the top of my head:
$sanitizedURL = preg_replace('~((?<!:)/(?=/)|\\|.+)~', '', $rawURL);
Basically, this looks for any forward slash that is NOT preceded by a colon (:) and IS followed by another forward slash. It also matches the first pipe character and everything after it.
Anything found is removed from the result.
I can explain the RegEx in more detail if you like.
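As a quick check, here is that single-pattern approach run against the string from the question. This is a sketch with the redundant outer group dropped; the behaviour is otherwise intended to be the same:

```php
<?php
$rawURL = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';

// Branch 1: a slash not preceded by ":" and followed by another slash.
// Branch 2: the first pipe and everything after it.
$sanitizedURL = preg_replace('~(?<!:)/(?=/)|\|.+~', '', $rawURL);

echo $sanitizedURL, "\n";
// prints: http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg
```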

Preg replace URL with links: MIME types failure

I use the following regexp in a php function to replace URLs with proper HTML links:
return preg_replace('#(https?://([-\w\.]+[-\w])+(:\d+)?(/([\w/_\.#-]*(\?\S+)?[^\.\s])?)?)#', '$1', $s);
But when $s has for value a string like
<li>http://www.link.com/something.pdf</li>
the function returns
<li>http://www.link.com/something.pdf</li></li>
Does anyone know how to modify the regexp to get the intended string, i.e.
<li>http://www.link.com/something.pdf</li> ?
without excluding from the replacement substrings of the URL introduced by '%', '?' or '&' ?
Really easy solution:
return '<li>'.preg_replace('#(https?://([-\w.]+[-\w])+(:\d+)?(/([\w\-.~:/?\#\[\]!$&\'()*+,;=%]*)?)?)#', '$1', $s).'</li>';
If you really want a regex:
return preg_replace('#(https?://([-\w.]+[-\w])+(:\d+)?(/([\w\-.~:/?\#\[\]!$&\'()*+,;=%]*)?)?)#', '$1', $s);
Your pattern is not sufficient to catch all links, but in any case, instead of \S+ you might want [^\s<>]+, because the former matches every non-space character, including the < of the closing tag.
Same applies to [^\.\s]. Make this [^\s<>.]. You don't need to escape the dot when used in a character class, so my addition to this group was basically the greater than and less than signs.
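A deliberately simplified sketch of the [^\s<>]+ idea. Both the helper name linkify and the anchor-tag replacement are assumptions for illustration, and the pattern is much looser than the one in the question:

```php
<?php
// Hypothetical helper: wrap each URL in an anchor tag.
// The URL is taken to end at the first whitespace character, "<" or ">".
function linkify(string $s): string
{
    return preg_replace('~https?://[^\s<>]+~', '<a href="$0">$0</a>', $s);
}

echo linkify('<li>http://www.link.com/something.pdf</li>'), "\n";
// prints: <li><a href="http://www.link.com/something.pdf">http://www.link.com/something.pdf</a></li>
```

Because < is excluded, the closing </li> is no longer swallowed into the match, while %, ? and & remain part of the URL.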

Strip out rest of query string after first ampersand

I am trying to remove the query string from a URL, but I need to leave the first key/value pair intact. So I know that the first occurrence of an ampersand is the point from which I want to discard the query string. What would be the best way to do this? Below is my code, which currently just keeps appending to the query string.
<a href="<?php echo $_SERVER["REQUEST_URI"] ?>&sortkey=year&sortval=asc">
You could simply match for everything that is not an ampersand until we hit the first ampersand. E.g.
$incomingURI = 'http://www.example.com/?id=12&left=right&up=down';
preg_match('/[^&]+/', $incomingURI, $match);
$outgoingURI = $match[0];
The above code will output the following in variable $outgoingURI:
http://www.example.com/?id=12
This will be much quicker than using a preg_replace.
If I understood your question correctly, you want to strip out everything after the first occurrence of an ampersand. You can use something like this:
<?php
$uri = 'blah.blah.com?a=b&sortkey=year&sortval=asc';
$new_uri = preg_replace("/([^&]+)&(.*)/", "$1", $uri);
?>
The pattern:
([^&]+) : Match everything except '&'
& : First '&'
(.*) : Any thing after that
Is replaced by first group ($1), which is anything before first occurrence of &.
With the strpos function find the location of the ampersand. Then with the substr function get the part of the URL until that point.
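A minimal sketch of the strpos/substr approach (the helper name keepFirstParam is hypothetical):

```php
<?php
// Hypothetical helper: keep everything before the first "&".
function keepFirstParam(string $uri): string
{
    $pos = strpos($uri, '&');

    // strpos returns false when there is no ampersand; compare strictly,
    // and return the URI unchanged in that case.
    return $pos === false ? $uri : substr($uri, 0, $pos);
}

echo keepFirstParam('http://www.example.com/?id=12&left=right&up=down'), "\n";
// prints: http://www.example.com/?id=12
```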
strpos will work, but unless you are using RewriteRule, $_SERVER['SCRIPT_NAME'] or $_SERVER['PHP_SELF'] should suffice (and is presumably more efficient).
If the URL IS being rewritten, then $_SERVER['REDIRECT_URL'] is more appropriate.
EDIT: I missed the bit about keeping first part of query string :s

how do I match a url in php using regex?

I'm trying to match the value of the query parameter v with the following regex:
http:\/\/www\.domain\.com\/videos\/video.php\?.*v=([a-z0-9-_]+)
A sample url:
http://www.domain.com/videos/video.php?v=9Gu0sd2dmm91B9b1
The url is always www and I'm only trying to match the v value. Does anyone know what's wrong with my syntax?
Use the parse_url() function. It's way easier to use:
$url_components = parse_url("http://www.domain.com/videos/video.php?v=9Gu0sd2dmm91B9b1");
echo $url_components['query'];
From there I think you can do the rest and slice off the first couple of letters. Once you do that you're left with only the stuff after v=.
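Rather than slicing the string by hand, one option is to pass the query string to parse_str, which splits it into an array. A minimal sketch using the sample URL from the question:

```php
<?php
$url = 'http://www.domain.com/videos/video.php?v=9Gu0sd2dmm91B9b1';

// With PHP_URL_QUERY, parse_url returns just the query string: "v=9Gu0sd2dmm91B9b1".
$query = (string) parse_url($url, PHP_URL_QUERY);

// parse_str decodes the query string into an associative array.
parse_str($query, $params);

$v = $params['v'] ?? null; // null if the URL has no v parameter
echo $v, "\n"; // 9Gu0sd2dmm91B9b1
```

This works wherever v appears in the query string.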
You forgot the capital letters:
http:\/\/www\.domain\.com\/videos\/video\.php\?.*v=([a-zA-Z0-9_-]+)
You are not escaping the period '.' in video.php. I also use a different delimiter if I am escaping paths/URL's - like this:
preg_match( "#http://www\.domain\.com/videos/video\.php\?.*v=([^&]*)#", $url, $matches );
If the v= is in the middle of the query string,
v=([^&]*)
.. will match everything up to another & symbol, just in case characters other than alphas and _,- end up in there for some reason.
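Putting those points together, a hedged sketch. The helper name extractVideoId is hypothetical, and the (?:.*&)? part is my own addition so that v= only matches at the start of a parameter:

```php
<?php
// Hypothetical helper: return the value of v, or null if the URL does not match.
function extractVideoId(string $url): ?string
{
    $pattern = '#^http://www\.domain\.com/videos/video\.php\?(?:.*&)?v=([^&]*)#';

    return preg_match($pattern, $url, $m) === 1 ? $m[1] : null;
}

echo extractVideoId('http://www.domain.com/videos/video.php?v=9Gu0sd2dmm91B9b1'), "\n";
// prints: 9Gu0sd2dmm91B9b1
echo extractVideoId('http://www.domain.com/videos/video.php?foo=1&v=abc-DEF_9&bar=2'), "\n";
// prints: abc-DEF_9
```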

Regex to Remove Everything After 4th Slash in URL

I'm working in PHP with friendly URL paths in the form of:
/2011/09/here-is-the-title
/2011/09/here-is-the-title/2
I need to standardize these URL paths to remove anything after the 4th slash, including the slash itself. The value after the 4th slash is sometimes a number, but can also be any parameter.
Any thoughts on how I could do this? I imagine regex could handle it, but I'm terrible with it. I also thought a combination of strpos and substr might be able to handle it, but cannot quite figure it out.
You can use explode() function:
$parts = explode('/', '/2011/09/here-is-the-title/2');
$output = implode('/', array_slice($parts, 0, 4));
Replace
%^((/[^/]*){3}).*%
with $1. (PHP's preg_replace needs no g flag; it replaces all matches by default.)
see http://regexr.com?2vlr8 for a live example
If your regex implementation supports arbitrary-length look-behind assertions, you could replace
(?<=^[^/]*(/[^/]*){3})/.*$
with an empty string.
If it does not, you can replace
^([^/]*(?:/[^/]*){3})/.*$
with the contents of the first capturing group. A PHP example for the second one can be found at ideone.com.
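A minimal PHP sketch of that second pattern (the helper name truncatePath is hypothetical; % is used as the delimiter so the slashes need no escaping):

```php
<?php
// Hypothetical helper: drop the 4th slash and everything after it.
function truncatePath(string $path): string
{
    // Group 1 captures everything up to, but not including, the 4th slash.
    return preg_replace('%^([^/]*(?:/[^/]*){3})/.*$%', '$1', $path);
}

echo truncatePath('/2011/09/here-is-the-title/2'), "\n"; // /2011/09/here-is-the-title
echo truncatePath('/2011/09/here-is-the-title'), "\n";   // /2011/09/here-is-the-title
```

Paths with fewer than four slashes do not match and are returned unchanged.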
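You could also use a loop (assuming $url holds the path):
$result = '';
$count = 0;
foreach (str_split($url) as $c) {
    if ($c === '/') $count++;  // count the slashes seen so far
    if ($count >= 4) break;    // stop at the 4th slash, which is not copied
    $result .= $c;
}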
