Adding any pattern to match in regex - php

I have a regex that works pretty well except in one particular situation;
$message = preg_replace("#(^(http(s)?://)(?!img.youtube.com/vi/)([-a-zA-Z?-??-?()0-9#:%_+.~\#?&;//=,]+(\.jpg|\.jpeg|\.gif|\.bmp|\.png)))#i",
"<p><a href='/viewpost.php?messageid=$message_id'><img src='$1' width=100%></a>", $message);
This pattern does several things: 1) it exactly matches http or https, 2) it ignores any string that includes img.youtube.com/vi/, and 3) it looks for popular image file types in the links. It works the way it should only if there are no characters before the string (a string like http://exampleaddress/exampleimage.jpeg). If the string is in the middle of a paragraph, it fails.
I need to keep the ^(http(s)?://) as an exact match (removing ^ fixes my problem but causes a conflict with a subsequent regex rule). So it looks like the problem is that this exact match does not allow any carriage returns, spaces, or anything else to precede ^(http(s)?://). How can I make the regex work so that nothing before or after the string is relevant, but the rule is applied whenever exactly http or https is seen?

As you know, the ^ anchor requires the string to appear exactly at the beginning of the input string. You can achieve a similar restriction anywhere inside the input string with a \b word boundary. It matches a zero-length position at the start (or end) of a word, for example right after whitespace, without consuming the whitespace itself.
I'll note also that you do not need to surround the s in a () group, since the ? will match only the single preceding character.
\bhttps?://...
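A minimal sketch of how the \b suggestion could be applied to the replacement from the question. The character class is a simplified stand-in for the original one, and the $message_id value and sample text are made up for illustration:

```php
<?php
// Hypothetical values, for illustration only.
$message_id = 42;
$message = "Look at this: http://example.com/pic.jpeg nice, right?\n"
         . "Video thumb: http://img.youtube.com/vi/abc/0.jpg";

// \b replaces ^, so the URL may appear anywhere in the text.
// The character class is simplified compared to the one in the question.
$pattern = '#\bhttps?://(?!img\.youtube\.com/vi/)[-\w.~%/+:=&;,?]+\.(?:jpe?g|gif|bmp|png)#i';

// $0 is the whole match; it is escaped so PHP does not try to interpolate it.
$replacement = "<p><a href='/viewpost.php?messageid=$message_id'><img src='\$0' width='100%'></a>";

$result = preg_replace($pattern, $replacement, $message);
echo $result, "\n";
```

The image link in the middle of the first line is wrapped, while the img.youtube.com/vi/ link is left alone because of the negative lookahead.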


greedy character matching at end of string

I'm trying to match the following string:
controller1/action1/something
With the following regex:
(?P<controller>[[:alnum:]]+)/(?P<action>[[:alnum:]]+)/(.*)
For some reason it doesn't find the last part of the string: something. But it works when I change the * to + at the end of the regex:
(?P<controller>[[:alnum:]]+)/(?P<action>[[:alnum:]]+)/(.+)
With that regex it does find the something string. But I want to use .* (or .*?) because I want this regex to succeed also when there is nothing at the end.
So it should also succeed when the string is: controller1/action1/
So why doesn't it work with (.*) or (.*?) but works with .+? The difference should simply be that the first says "zero or more characters" and the last "one or more". I simply want to check for "zero or more".
PS. I don't want to use ^ and $ to denote the beginning and end of the string because of a more complex problem. Simply stated, this pattern doesn't always occur at the end of the string.
I suspect this input is part of some bigger string, and that's why .* isn't working for you. I suggest you post some real examples of your input text.
Meanwhile can you try this regex:
"#(?P<controller>[^/]+)/(?P<action>[^/]+)/([^/]*)#"
You just have to make the last group optional to make it match controller1/action1/
(?P<controller>[[:alnum:]]+)/(?P<action>[[:alnum:]]+)/(.+)?
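As a rough check of the [^/]* variant suggested above, the following sketch runs it against both inputs from the question. The parse_route helper name is made up for this example:

```php
<?php
// Hypothetical helper: returns [controller, action, rest] or null if no match.
function parse_route(string $input): ?array {
    $pattern = '#(?P<controller>[^/]+)/(?P<action>[^/]+)/([^/]*)#';
    if (preg_match($pattern, $input, $m) !== 1) {
        return null;
    }
    // Group 3 is the unnamed trailing part; it is '' when nothing follows.
    return [$m['controller'], $m['action'], $m[3]];
}

var_dump(parse_route('controller1/action1/something'));
var_dump(parse_route('controller1/action1/'));
```

Because [^/]* is allowed to match zero characters, the second call still succeeds and simply yields an empty third element.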

Regex - Match Word As Long As Nothing Follows It

Having a little trouble with regex. I'm trying to test for a match but only if nothing follows it. So in the below example if I go to test/create/1/2 - it still matches. I only want to match if it's explicitly test/create/1 (but the 1 is dynamic).
if(preg_match('^test/create/(.*)^', 'test/create/1')):
// do something...
endif;
I've found some answers that suggest using $ before my delimiter but it doesn't appear to do anything. Or a combination of ^ and $ but I can't quite figure it out. Regex confuses the hell out of me!
EDIT:
I didn't really explain this well enough so just to clarify:
I need the if statement to return true if a URL is test/create/{id} - the {id} being dynamic (and of any length). If the {id} is followed by a forward slash the if statement should fail. So that if someone types in test/create/1/2 - it will fail because of the forward slash after the 1.
Solution
I went for thedarkwinter's answer in the end as it's what worked best for me, although other answers did work as well.
I also had to add a little extra to the regex to make sure that it would work with hyphens as well, so the final code looked like this:
if(preg_match('^test/create/[\w-]*$^', 'test/create/1')):
// do something...
endif;
\w matches word characters, and $ matches end of string
if(preg_match('^test/create/\w*$^', 'test/create/1'))
will match test/create/[word/num] and nothing following.
I think that's what you are after.
Edit: added * in \w*
Here you go:
"/^test\\/create\\/([^\\/]*)$/"
This says:
The string starts with "test", followed by a forward slash (remember the first backslash escapes the second, so PHP puts a literal backslash in the pattern, which escapes the / for the regex engine), followed by "create", followed by another forward slash. It then captures everything that isn't a slash, up to the end of the string.
Comment if you need more detail
I prefer my expressions to always start with / because it has no meaning as a regex character, I've seen # used, I believe some other answer uses ^, this means "start of string" so I wouldn't use it as my regex delimiters.
Use following regular expression (use $ to denote end of the input):
'|test/create/[^/]+$|'
If you want to match only digits, use the following instead (\d matches a digit character):
'^test/create/\d+$^'
The ^ is an anchor for the beginning of the line, i.e. no characters occurring before the ^ . Use a $ to designate the end of the string, or end of the line.
EDIT: wanted to add a suggestion as well:
Your solution is fine and works, but in terms of style I'd advise against using the caret (^) as a delimiter, especially because it has special meaning as either negation or as a start-of-line anchor, so it's a bit confusing to read it that way. You can legally use most special characters as long as they don't occur (or are escaped) in the regex itself. Just talking about a matter of style/maintainability here.
Of course nearly every potential delimiter has some special meaning, but you also often tend to see the ^ at the beginning of a regex, so I might choose another alternative. For example, # is a good choice here:
if(preg_match('#test/create/[\w-]*$#', $mystring)) {
//etc
}
The regex abc$ will match abc only when it is at the end of the string.
abcd # no match
dabc # match
abc # match
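To tie the answers in this thread together, here is a small sketch of the anchored check using # as the delimiter. The is_create_url function name is made up, and + is used instead of * on the assumption that an empty id should be rejected:

```php
<?php
// Hypothetical helper: true only for test/create/{id} with nothing after the id.
function is_create_url(string $url): bool {
    // ^ and $ anchor both ends; [\w-]+ allows letters, digits, _ and hyphens.
    return preg_match('#^test/create/[\w-]+$#', $url) === 1;
}

var_dump(is_create_url('test/create/1'));         // bool(true)
var_dump(is_create_url('test/create/my-post-7')); // bool(true)
var_dump(is_create_url('test/create/1/2'));       // bool(false)
```

The third call fails because the slash after the 1 is not in the character class, so $ cannot be reached.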

regex validation

I am trying to validate a string of 3 numbers followed by / then 5 more numbers
I thought this would work
(/^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])/i)
but it doesn't, any ideas what i'm doing wrong
Try this
preg_match('#^\d{3}/\d{5}#', $string)
The reason yours is not working is due to the + symbols which match "one or more" of the nominated character or character class.
Also, when using forward-slash delimiters (the characters at the start and end of your expression), you need to escape any forward-slashes in the pattern by prefixing them with a backslash, eg
/foo\/bar/
PHP allows you to use alternate delimiters (as in my answer) which is handy if your expression contains many forward-slashes.
First of all, you're using / as the regexp delimiter, so you can't use it in the pattern without escaping it with a backslash. Otherwise, PHP will think that your pattern ends at the / in the middle (you can see that even StackOverflow's syntax highlighting thinks so).
Second, the + means "one or more" and is "greedy", so it will match as many characters as it can. The first [0-9]+ would initially grab all 3 leading digits, and although the engine can backtrack to give the next two their share, each + still allows extra digits, so the pattern does not enforce exactly 3 and 5.
Third, there's no need to use i, since you're dealing with numbers which aren't upper- or lowercase, so case-sensitivity is a moot point.
Try this instead
/^\d{3}\/\d{5}$/
The \d is shorthand for writing [0-9], and the {3} and {5} means repeat 3 or 5 times, respectively.
(This pattern is anchored to the start and the end of the string. Your pattern was only anchored to the beginning, and if that was on purpose, then remove the $ from my pattern.)
I recently found this site useful for debugging regexes:
http://www.regextester.com/index2.html
It assumes use of /.../ (meaning you should not include those slashes in the regex you paste in).
So, after I put your regex ^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]) in the Regex box and 123/45678 in the Test box I see no match. When I put a backslash in front of the forward slash in the middle, then it recognizes the match. You can then try matching 1234/567890 and discover it still matches. Then you go through and remove all the plus signs and then it correctly stops matching.
What I particularly like about this particular site is the way it shows the partial matches in red, allowing you to see where your regex is working up to.
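The behaviour described in this thread can be reproduced directly in PHP. In this sketch the is_valid_code function name and the sample inputs are made up. The loose pattern is the original one, wrapped in # delimiters so the slash needs no escaping:

```php
<?php
// Hypothetical helper: exactly 3 digits, a slash, then exactly 5 digits.
function is_valid_code(string $s): bool {
    return preg_match('/^\d{3}\/\d{5}$/', $s) === 1;
}

// The original pattern from the question, with # delimiters and no i flag.
$loose = '#^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])#';

var_dump(is_valid_code('123/45678'));                 // bool(true)
var_dump(is_valid_code('1234/567890'));               // bool(false)
var_dump(preg_match($loose, '1234/567890') === 1);    // bool(true): + accepts extra digits
```

The last line shows why the + quantifiers are the problem: once the delimiter issue is fixed, the original pattern accepts too many digits rather than too few.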

A preg_replace puzzle: replacing zero or more of a char at the end of the subject

Say $d is a directory path and I want to ensure that it starts and ends with exactly one slash (/). It may initially have zero, one or more leading and/or trailing slashes.
I tried:
preg_replace('%^/*|/*$%', '/', $d);
which works for the leading slash but to my surprise yields two trailing slashes if $d has at least one trailing slash. If the subject is, e.g., 'foo///' then preg_replace() first matches and replaces the three trailing slashes with one slash and then it matches zero slashes at the end and replaces that with a slash. (You can verify this by replacing the second argument with '[$0]'.) I find this rather counterintuitive.
While there are many other ways to solve the underlying problem (and I implemented one) this became a PCRE puzzle for me: what (scalar) pattern in a single preg_replace does this job?
ADDITIONAL QUESTION (edit)
Can anyone explain why this pattern matches the way it does at the end of the string but does not behave similarly at the start?
$path = '/' . trim($path, '/') . '/';
This first removes all slashes at beginning or end and then adds single ones again.
Given a regex like /* that can legitimately match zero characters, the regex engine has to make sure that it never matches more than once in the same spot, or it would get stuck in an infinite loop. Thus, if it does consume zero characters, the engine jumps forward one position before attempting another match. As far as I know, that's the only situation in which the regex engine does anything on its own initiative.
What you're seeing is the opposite situation: the regex consumes one or more characters, then on the next go-round it tries to start matching at the spot where it left off. Never mind that this particular regex can't match anything but the one character, and it already matched as many of those as it could; it still has the option of matching nothing, so that's what it does.
So, why doesn't your regex match twice at the beginning, like it does at the end? Because of the start anchor (^). If the subject starts with one or more slashes, it consumes them and then tries to match zero slashes, but it fails because it's not at the beginning of the string any more. And if there are no slashes at the beginning, the manual bump-along has the same effect.
At the end of the subject it's a different story. If there are no slashes there, it matches nothing, tries to bump along and fails; end of story. But if it does match one or more slashes, it consumes them and tries to match again--and succeeds because the $ anchor still matches.
So in general, if you want to prevent this kind of double match, you can either add a condition to the beginning of the match to prevent it, like the ^ anchor does for the first alternative:
preg_replace('%^/*|(?<!/)/*$%', '/', $d);
...or make sure that part of the regex has to consume at least one character:
preg_replace('%^/*|([^/])/*$%', '$1/', $d);
But in this case you have a much simpler option, as demonstrated by John Kugelman: just capture the part you want to keep and chuck the rest.
preg_replace('%^/*(.*?)/*$%', '/\1/', $d)
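The double match and the fixes above can be seen side by side in a short sketch. The variable names are made up for illustration, and 'foo///' is the sample subject from the question:

```php
<?php
$d = 'foo///';

// Original pattern: '[$0]' marks every match, including the empty ones.
$marked  = preg_replace('%^/*|/*$%', '[$0]', $d);       // []foo[///][]
$doubled = preg_replace('%^/*|/*$%', '/', $d);          // /foo//

// Fix 1: a lookbehind blocks the empty match right after the consumed slashes.
$lookbehind = preg_replace('%^/*|(?<!/)/*$%', '/', $d); // /foo/

// Fix 2: match the whole subject once and keep only the captured middle.
$captured = preg_replace('%^/*(.*?)/*$%', '/\1/', $d);  // /foo/

// Non-regex alternative from the first answer.
$trimmed = '/' . trim($d, '/') . '/';                   // /foo/

echo implode("\n", [$marked, $doubled, $lookbehind, $captured, $trimmed]), "\n";
```

The first line of output makes the second, empty match at the end of the subject visible as the trailing [].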
It can be done in a single preg_replace:
preg_replace('/^\/{2,}|\/{2,}$|^([^\/])|([^\/])$/', '\2/\1', $d);
A small change to your pattern would be to separate out the two key concerns at the end of the string:
Replace multiple slashes with one slash
Replace no slashes with one slash
A pattern for that (and the existing part for matching at the start of the string) would look like:
#^/*|/+$|$(?<!/)#
A slightly less concise, but more precise, option would be to be very explicit about only matching zero or two-or-more slashes; the notion being, why replace one slash with one slash?
#^(?!/)|^/{2,}|/{2,}$|$(?<!/)#
Aside: nikic's suggestion to use trim (to remove leading/trailing slashes, then add your own) is a good one.
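A quick way to compare the two patterns from this answer is to run them over a handful of inputs. This is only a sketch: the function names and the sample paths are made up.

```php
<?php
// Hypothetical wrappers around the two patterns from the answer.
function normalize_concise(string $d): string {
    return preg_replace('#^/*|/+$|$(?<!/)#', '/', $d);
}

function normalize_explicit(string $d): string {
    // Only touches zero slashes or two-or-more slashes at either end.
    return preg_replace('#^(?!/)|^/{2,}|/{2,}$|$(?<!/)#', '/', $d);
}

foreach (['foo', 'foo/', 'foo///', '///foo', '/foo/'] as $path) {
    printf("%-8s => %s | %s\n", $path, normalize_concise($path), normalize_explicit($path));
}
```

For these sample paths both functions produce /foo/, and the explicit variant leaves an already normalized /foo/ untouched instead of replacing one slash with one slash.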

php regular expression help finding multiple filenames only not full URL

I am trying to fix a regular expression I have been using in PHP. It finds all filenames within a sentence / paragraph. The file names always look like this: /this-a-valid-page.php
With help I received on SO, my old pattern was modified to the one below. It avoids full URLs, which was the issue I was having, but it only finds one occurrence at the beginning of a string and nothing inside the string.
/^\/(.*?).php/
I have a live example here: http://vzio.com/upload/reg_pattern.php
Remove the ^ - the caret signifies the beginning of a string/line, which is why it's not matching elsewhere.
If you need to avoid full URLs, you might want to change the ^ to something like (?:^|\s) which will match either the beginning of the string or a whitespace character - just remember to strip whitespace from the beginning of your match later on.
The last dot in your expression could still cause problems, since it'll match "one anything". You could match, for example, /somefilename#php with that pattern. Backslash it to make it a literal period:
/\/(.*?)\.php/
Also note that the ? making .* non-greedy is necessary, and Arda Xi's pattern won't work. .* would race to the end of the string and then back up one character at a time until it can match the .php, which certainly isn't what you'd want.
To find all the occurrences, you'll have to remove the start anchor and use the preg_match_all function instead of preg_match :
if(preg_match_all('/\/(.*?)\.php/',$input,$matches)) {
var_dump($matches[1]); // will print all filenames (after / and before .php)
}
Also . is a meta char. You'll have to escape it as \. to match a literal period.
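Combining the suggestions in this thread (no ^ anchor, an escaped dot, preg_match_all, and a (?:^|\s) prefix to skip full URLs) might look like the sketch below. The sample text is made up, and [\w-]+ is an assumption about which characters the filenames may contain:

```php
<?php
// Made-up sample text containing two bare filenames and one full URL.
$input = 'See /this-a-valid-page.php and http://example.com/other-page.php or /second-page.php';

// (?:^|\s) requires the slash to start the string or follow whitespace,
// which skips the path inside the full URL. # delimiters avoid escaping /.
$pattern = '#(?:^|\s)/([\w-]+)\.php#';

$names = [];
if (preg_match_all($pattern, $input, $matches)) {
    // $matches[1] holds the captured names, without the slash or whitespace.
    $names = $matches[1];
}
var_dump($names);
```

Since the name is captured in its own group, there is no leading whitespace to strip from the results.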
