How can I make this URL validation regular expression less greedy? - php

So I have the following regular expression:
https?://(www\.)?flickr\.com/photos/(.+)/?
To match against the following URL:
http://www.flickr.com/photos/username/
How can I stop the final forward slash (/) from being included in the username sub-pattern (.+)?
I have tried:
https?://(www\.)?flickr\.com/photos/(.+?)/?
But then it only matches the first letter of the username.

https?://(?:www\.)?flickr\.com/photos/([^/]+)/?
I added ?: to the first group so it's not capturing, then used [^/] instead of the dot in the last match. This assures you that everything between "photos/" and the very next "/" is captured.
If you need to capture the first www just use this:
https?://(www\.)?flickr\.com/photos/([^/]+)/?

You need to make sure it doesn't match the forward slash:
https?://(?:www\.)?flickr\.com/photos/([^/]+)/?
You could also make the regex lazy (which is what I guess you were doing with the (.+?) syntax), but the above will work just fine

Change (.+) to ([^/]+). This will match until it encounters a /, so you might want to throw some other stuff in the class too.

There are generally two ways to do this:
Append a question mark, to make the matching non-greedy. .* will match as much as possible, .*? will match as little as possible.
Exclude the character you want to match next. If you want to stop on /, use [^/]*.

If you know there will be a trailing slash, take out the final ?.

Related

Match string not followed by another string

I'm trying to match a string. For example, in the following strings, first should match, the second should not.
/users/akinuri/
/users/akinuri/asd/
I've tried a negative lookahead, but it didn't work. Obviously I'm doing something wrong.
if (preg_match("/\/users\/.*?\/(?!.*\/)/", "/users/akinuri/asd/")) {
echo "match";
}
I'm experimenting (trying to create a route system). Right now, I'm just trying to determine if the requested uri is valid. If a visitor requests the second string, it should return false and I'll return a not found page. For example, test this on SO. The second url returns a not found page.
https://stackoverflow.com/users/2202732/akinuri/
https://stackoverflow.com/users/2202732/akinuri/asdasd
So, can I do this check using regex? If so, how? What am I doing wrong?
Or should I split the text and then do further checks? This seems a bit redundant.
You don't need a lookahead for this. You can use a negated character class [^/]+ to match 1+ of any character except /. You also need to use anchors in regex to make sure you match complete input.
For PHP code, you can use this regex:
'~^/users/[^/]+/?$~'
Note that /?$ makes trailing slash an optional match.
RegEx Demo

Regex - Match Word Aslong As Nothing Follows It

Having a little trouble with regex. I'm trying to test for a match but only if nothing follows it. So in the below example if I go to test/create/1/2 - it still matches. I only want to match if it's explicitally test/create/1 (but the one is dynamic).
if(preg_match('^test/create/(.*)^', 'test/create/1')):
// do something...
endif;
I've found some answers that suggest using $ before my delimiter but it doesn't appear to do anything. Or a combination of ^ and $ but I can't quite figure it out. Regex confuses the hell out of me!
EDIT:
I didn't really explain this well enough so just to clarify:
I need the if statement to return true if a URL is test/create/{id} - the {id} being dynamic (and of any length). If the {id} is followed by a forward slash the if statement should fail. So that if someone types in test/create/1/2 - it will fail because of the forward slash after the 1.
Solution
I went for thedarkwinter's answer in the end as it's what worked best for me, although other answers did work as well.
I also had to add an little extra in the regex to make sure that it would work with hyphens as well so the final code looked like this:
if(preg_match('^test/create/[\w-]*$^', 'test/create/1')):
// do something...
endif;
/w matches word characters, and $ matches end of string
if(preg_match('^test/create/\w*$^', 'test/create/1'))
will match test/create/[word/num] and nothing following.
I think thats what you are after.
edit added * in \w*
Here you go:
"/^test\\/create\\/([^\\/]*)$/"
This says:
The string that starts with "test" followed by a forward slash (remember the first backslash escapes the second so PHP puts a letter backslash in the input, which escapes the / to regex) followed by create followed by a forward slash followed by and capture everything that isn't a slash which is then the end of the string.
Comment if you need more detail
I prefer my expressions to always start with / because it has no meaning as a regex character, I've seen # used, I believe some other answer uses ^, this means "start of string" so I wouldn't use it as my regex delimiters.
Use following regular expression (use $ to denote end of the input):
'|test/create/[^/]+$|'
If you want only match digits, use folloiwng instead (\d match digit character):
'^test/create/\d+$^'
The ^ is an anchor for the beginning of the line, i.e. no characters occurring before the ^ . Use a $ to designate the end of the string, or end of the line.
EDIT: wanted to add a suggestion as well:
Your solution is fine and works, but in terms of style I'd advise against using the carat (^) as a delimiter -- especially because it has special meaning as either negation or as a start of line anchor so it's a bit confusing to read it that way. You can legally use most special characters as long as they don't occur (or are escaped) in the regex itself. Just talking about a matter of style/maintainability here.
Of course nearly every potential delimiter has some special meaning, but you also often tend to see the ^ at the beginning of a regex so I might chose another alternative. For example # is a good choice here :
if(preg_match('#test/create/[\w-]*$#', $mystring)) {
//etc
}
The regex abc$ will match abc only when it's the last string.
abcd # no match
dabc # match
abc # match

A preg_replace puzzle: replacing zero or more of a char at the end of the subject

Say $d is a directory path and I want to ensure that it starts and ends with exactly one slash (/). It may initially have zero, one or more leading and/or trailing slashes.
I tried:
preg_replace('%^/*|/*$', '/', $d);
which works for the leading slash but to my surprise yields two trailing slashes if $d has at least one trailing slash. If the subject is, e.g., 'foo///' then preg_replace() first matches and replaces the three trailing slashes with one slash and then it matches zero slashes at the end and replaces that with with a slash. (You can verify this by replacing the second argument with '[$0]'.) I find this rather counterintuitive.
While there are many other ways to solve the underlying problem (and I implemented one) this became a PCRE puzzle for me: what (scalar) pattern in a single preg_replace does this job?
ADDITIONAL QUESTION (edit)
Can anyone explain why this pattern matches the way it does at the end of the string but does not behave similarly at the start?
$path = '/' . trim($path, '/') . '/';
This first removes all slashes at beginning or end and then adds single ones again.
Given a regex like /* that can legitimately match zero characters, the regex engine has to make sure that it never matches more than once in the same spot, or it would get stuck in an infinite loop. Thus, if it does consume zero characters, the engine jumps forward one position before attempting another match. As far as I know, that's the only situation in which the regex engine does anything on its own initiative.
What you're seeing is the opposite situation: the regex consumes one or more characters, then on the next go-round it tries to start matching at the spot where it left off. Never mind that this particular regex can't match anything but the one character, and it already matched as many of those as it could; it still has the option of matching nothing, so that's what it does.
So, why doesn't your regex match twice at the beginning, like it does at the end? Because of the start anchor (^). If the subject starts with one or more slashes, it consumes them and then tries to match zero slashes, but it fails because it's not at the beginning of the string any more. And if there are no slashes at the beginning, the manual bump-along has the same affect.
At the end of the subject it's a different story. If there are no slashes there, it matches nothing, tries to bump along and fails; end of story. But if it does match one or more slashes, it consumes them and tries to match again--and succeeds because the $ anchor still matches.
So in general, if you want to prevent this kind of double match, you can either add a condition to the beginning of the match to prevent it, like the ^ anchor does for the first alternative:
preg_replace('%^/*|(?<!/)/*$%', '/', $d);
...or make sure that part of the regex has to consume at least one character:
preg_replace('%^/*|([^/])/*$%', '$1/', $d);
But in this case you have a much simpler option, as demonstrated by John Kugelman: just capture the part you want to keep and chuck the rest.
preg_replace('%^/*(.*?)/*$%', '/\1/', $d)
it can be done in a single preg_replace
preg_replace('/^\/{2,}|\/{2,}$|^([^\/])|([^\/])$/', '\2/\1', $d);
A small change to your pattern would be to separate out the two key concerns at the end of the string:
Replace multiple slashes with one slash
Replace no slashes with one slash
A pattern for that (and the existing part for matching at the start of the string) would look like:
#^/*|/+$|$(?<!/)#
A slightly less concise, but more precise, option would be to be very explicit about only matching zero or two-or-more slashes; the notion being, why replace one slash with one slash?
#^(?!/)|^/{2,}|/{2,}$|$(?<!/)#
Aside: nikic's suggestion to use trim (to remove leading/trailing slashes, then add your own) is a good one.

php regular expression help finding multiple filenames only not full URL

I am trying to fix a regular expression i have been using in php it finds all find filenames within a sentence / paragraph. The file names always look like this: /this-a-valid-page.php
From help i have received on SOF my old pattern was modified to this which avoids full urls which is the issue i was having, but this pattern only finds one occurance at the beginning of a string, nothing inside the string.
/^\/(.*?).php/
I have a live example here: http://vzio.com/upload/reg_pattern.php
Remove the ^ - the carat signifies the beginning of a string/line, which is why it's not matching elsewhere.
If you need to avoid full URLs, you might want to change the ^ to something like (?:^|\s) which will match either the beginning of the string or a whitespace character - just remember to strip whitespace from the beginning of your match later on.
The last dot in your expression could still cause problems, since it'll match "one anything". You could match, for example, /somefilename#php with that pattern. Backslash it to make it a literal period:
/\/(.*?)\.php/
Also note the ? to make .* non-greedy is necessary, and Arda Xi's pattern won't work. .* would race to the end of the string and then backup one character at a time until it can match the .php, which certainly isn't what you'd want.
To find all the occurrences, you'll have to remove the start anchor and use the preg_match_all function instead of preg_match :
if(preg_match_all('/\/(.*?)\.php/',$input,$matches)) {
var_dump($matches[1]); // will print all filenames (after / and before .php)
}
Also . is a meta char. You'll have to escape it as \. to match a literal period.

Regex Question Again!

I really don't know what my problem is lately, but Regex seems to be giving me the most trouble.
Very simple thing I need to do, but can't seem to get it:
I have a uri that returns either /xmlfeed or /xmlfeed/what/is/this
I want to match /xmlfeed on any occasion.
I've tried many variations of the following:
preg_match('/(\/.*?)\/?/', $_SERVER['REQUEST_URI'], $match);
I would read this as: Match forwardslash then match any character until you come to an optional forwardslash.
Why not:
preg_match ('#/[^/]+#', _SERVER['REQUEST_URI'], $match);
?
$match[0] will give you what you need
why do you need regex that make you confused??
$string = "/xmlfeed/what/is/this";
$s = explode("/",$string,3);
print "/".$s[1]."\n";
output
$ php test.php
/xmlfeed
Your problem is the reluctant quantifier. After the initial slash is matched, .*? consumes the minimum number of characters it's allowed to, which is zero. Then /? takes over; it doesn't see a slash in the next position (which is immediately after the first slash), but that's okay because it's optional. The result: the regex always matches a single slash, and group #1 always matches an empty string.
Obviously, you can't just replace the reluctant quantifier with a greedy one. But if you replace the .* with something that can't match a slash, you don't have to worry about greediness. That's what K Prime's regex, '#/[^/]+#' does. Notice as well how it uses # as the regex delimiter and avoids the necessity of escaping slashes within the regex.
In PHP: '/(\/.*?)\/?/' is a string containing a regular expression.
First you have to decode the string: /(/.*?)\/?/
So you have a forward slash that starts the result expression. An opening brace. A forward slash that ends the matching part of the expression … and I'm pretty sure that it will then error since you haven't closed the brace.
So, to get this working:
Remember to escape characters with special meanings in strings and regular expressions
Don't confuse the forward slash / with the backslash \
You want to match everything after and including the first slash, but before any (optional) second slash (so we don't want the ? that makes it non-greedy):
/(\/[^\/]*)/
Which, expressed as a PHP string is:
'/(\\/([^\\/]*)/'
I know this is avoiding the regex, and therefore avoids the question, but how about splitting the uri (at slashes) into an array.
Then you can deal with the elements of the array, and ignore the bits of the uri you don't want.
Using the suggestions posted, I ended up trying this:
echo $_SERVER['REQUEST_URI'];
preg_match("/(\/.*)[^\/]/", $_SERVER['REQUEST_URI'], $match);
$url = "http://".$_SERVER['SERVER_NAME'].$match[0];
foreach($match as $k=>$v){
echo "<h1>$k - $v</h1>";
}
I also tried it without the .* and without the parentheses.
Without the .* AND () it returns the / with the next character ONLY.
Like it is, it just returns the entire URI everytime
So, when ran with the code above, the output is
/tea-time-blog/post/20
0 - /tea-time-blog/post/20
1 - /tea-time-blog/post/2
This code is being eval()'d by the way. I don't think that should make any differnce in the way PHP handles the regular expression.

Categories