Optional Group Expression - php

Today I was working with regular expressions at work and during some experimentation I noticed that a regex such as (\w|) compiled. This seems to be an optional group but looking online didn't yield any results.
Is there any practical use of having a group that matches something, but otherwise can match anything? What's the difference between that and (\w|.*)? Thanks.

(\w|) is a verbose way of writing \w?, which checks for \w first, then the empty string.
I removed the capturing group, since it seems that () is used here only for grouping. If you actually need the capturing group, use (\w?).
In the same vein, (|\w) is a verbose way of writing \w??, which tries the empty string first, before trying \w.
(\w|.*) is a different regex altogether. It tries to match (in that order) one word character \w, or 0 or more of any character (except line terminators) .*.
I can't imagine how this regex fragment would be useful, though.
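A quick way to see these differences is to run each variant against the same input. The following is a minimal sketch; the sample subjects and the leading literal a are assumptions added only to make the matches visible.

```php
<?php
// Each pattern starts with a literal "a" so the optional part has something to follow.
preg_match('/a(?:\w|)/', 'ab', $m1);    // tries \w first, then the empty string
preg_match('/a\w?/', 'ab', $m2);        // greedy optional: same behaviour
preg_match('/a(?:|\w)/', 'ab', $m3);    // tries the empty string first
preg_match('/a\w??/', 'ab', $m4);       // lazy optional: same behaviour
preg_match('/a(?:\w|.*)/', 'a!!', $m5); // \w fails on "!", so .* takes the rest

echo $m1[0], ' ', $m2[0], ' ', $m3[0], ' ', $m4[0], ' ', $m5[0], "\n";
// prints "ab ab a a a!!"
```

The lazy forms only take the extra character when the rest of the pattern forces them to.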

Related

PHP PCRE regex with multiple SKIP FAIL in a pattern

I have a simple string:
$string = '--#--%--%2B--';
I want to percent-encode all characters (including the "lonely" %), except the - character and the triplets of the form %xy. So I wrote the following pattern alternatives:
$pattern1 = '/(?:[\-]+|%[A-Fa-f0-9]{2})(*SKIP)(*FAIL)|./us';
$pattern2 = '/(?:[\-]+)(*SKIP)(*FAIL)|(?:%[A-Fa-f0-9]{2})(*SKIP)(*FAIL)|./us';
Please notice the use of (multiple) (*SKIP)(*FAIL) and of (?:).
The result of matching and replacing is the same - and the correct one too:
--%23--%25--%2B--
I would like to ask:
Are the two patterns equivalent? If not, which one would be the proper one to use for URL-encoding? Could you please explain in a few words why?
Would you suggest other alternatives (involving backtracking control verbs), or are my patterns a good choice?
Can I apply only one (?:) around the whole (chosen) pattern, even if the (multiple) (*SKIP)(*FAIL) will be inside it?
I know that I request a little too much from you by asking more questions at once. Please accept my apology! Thank you very much.
P.S: I've tested with the following PHP code:
$result = preg_replace_callback($patternX, function($matches) {
    return rawurlencode($matches[0]);
}, $string);
echo $result;
First of all, both the patterns leverage the SKIP-FAIL PCRE verb sequence that is quite a well-known "trick" to match some text and skip it. See How do (*SKIP) or (*F) work on regex? for some more details.
The two patterns yield the same results. (?:[\-]+|%[A-Fa-f0-9]{2})(*SKIP)(*FAIL) matches either [\-]+ or %[A-Fa-f0-9]{2} and then skips the match. (?:[\-]+)(*SKIP)(*FAIL)|(?:%[A-Fa-f0-9]{2})(*SKIP)(*FAIL) first tries to match [\-]+ and skips it if found, and then tries to match %[A-Fa-f0-9]{2} and skips that match if found. The (?:...) non-capturing groups in the second pattern are redundant, as there is no alternation inside them and the groups are not quantified. You may use any number of (*SKIP)(*FAIL) sequences in your pattern; just make sure each one comes before the | so that it skips the relevant match.
The SKIP-FAIL technique is used when you want to match some text only in a specific context: when a char should be skipped ("avoided") if it is preceded and followed by certain chars, or when you need to avoid matching a whole sequence of chars, as in this scenario. So SKIP-FAIL is a good fit here.
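The equivalence can be checked by running the variants side by side. This is a sketch rather than a definitive test: the first two patterns are the ones from the question, and the third (an assumption added here) is the second pattern with its redundant groups removed.

```php
<?php
$string = '--#--%--%2B--';

$patterns = [
    'pattern1'       => '/(?:[\-]+|%[A-Fa-f0-9]{2})(*SKIP)(*FAIL)|./us',
    'pattern2'       => '/(?:[\-]+)(*SKIP)(*FAIL)|(?:%[A-Fa-f0-9]{2})(*SKIP)(*FAIL)|./us',
    'without groups' => '/[\-]+(*SKIP)(*FAIL)|%[A-Fa-f0-9]{2}(*SKIP)(*FAIL)|./us',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // Only characters that reach the final "." alternative are encoded.
    $results[$label] = preg_replace_callback($pattern, function (array $m): string {
        return rawurlencode($m[0]);
    }, $string);
    echo $label, ': ', $results[$label], "\n";
}
// every line ends with --%23--%25--%2B--
```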

(PHP) How to find words beginning with a pattern and replace all of them?

I have a string. An example might be "Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2"
I want to replace any instance of "/r/x" and "/u/x" with "[/r/x](http://reddit.com/r/x)" and "[/u/x](http://reddit.com/u/x)" basically.
So I'm not sure how to 1) find "/r/" and then expand that to the rest of the word (until there's a space), then 2) take that full "/r/x" and replace with my pattern, and most importantly 3) do this for all "/r/" and "/u/" matches in a single go...
The only way I know to do this would be to write a function to walk the string, character by character, until I found "/", then look for "r" and "/" to follow; then keep going until I found a space. That would give me the beginning and ending characters, so I could do a string replacement; then calculate the new end point, and continue walking the string.
This feels... dumb. I have a feeling there's a relatively simple way to do this, and I just don't know how to google to get all the relevant parts.
A simple preg_replace will do what you want.
Try:
$string = preg_replace('#(/(?:u|r)/[a-zA-Z0-9_-]+)#', '[\1](http://reddit.com\1)', $string);
Here is an example: http://ideone.com/dvz2zB
You should check which characters are valid in a subreddit name or a Reddit username and modify the [a-zA-Z0-9_-] character class accordingly.
You are looking for a regular expression.
A basic pattern starts out as a fixed string: /u/ or /r/, which would match those exactly. This can be simplified to match one or the other with /(?:u|r)/, which matches the same text as those two patterns. Next you would want to match everything from that point up to a space. You would use a negated character class, [^ ], which matches any character that is not a space, and apply a quantifier, *, to match as many such characters as possible: /(?:u|r)/[^ ]*
You can take that pattern further and add a lookbehind, (?<= ), to ensure your match is preceded by a space, so you're not matching part of a longer path. That results in (?<= )/(?:u|r)/[^ ]*. Note that this lookbehind also rejects a match at the very start of the string, since nothing precedes it. You wrap all of that in parentheses to make a capturing group: ((?<= )/(?:u|r)/[^ ]*). This captures the contents within the parentheses so they can be used in a replacement pattern. You can express your chosen replacement using the \1 reference to the first captured group as [\1](http://reddit.com\1).
In PHP you would pass the matching pattern, replacement pattern, and subject string to the preg_replace function.
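Put together, the steps above might look like the following sketch. The # delimiter is an assumption chosen here so that the slashes in the pattern do not need escaping; the sample text is the one from the question.

```php
<?php
$text = 'Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2';

// Lookbehind for a space, then /u/ or /r/, then everything up to the next space.
$pattern = '#((?<= )/(?:u|r)/[^ ]*)#';
$replacement = '[\1](http://reddit.com\1)';

$linked = preg_replace($pattern, $replacement, $text);
echo $linked, "\n";
// prints "Contact [/u/someone](http://reddit.com/u/someone) on reddit, or visit
// [/r/subreddit](http://reddit.com/r/subreddit) or [/r/subreddit2](http://reddit.com/r/subreddit2)"
// (on a single line)
```

Because [^ ]* stops only at a space, trailing punctuation such as a comma directly after a name would end up inside the link; a stricter character class avoids that.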
In my opinion, regex would be overkill for such a simple operation. If you just want to replace instances of "/r/x" with "[/r/x](http://reddit.com/r/x)" and "/u/x" with "[/u/x](http://reddit.com/u/x)", you should use str_replace, although preg_replace would shorten the code.
str_replace("/r/x","[/r/x](http://reddit.com/r/x)","whatever_string");
Use regex for intricate search-and-replace operations. You can also use the regular expression generator at http://www.jslab.dk/tools.regex.php if you have something complex to capture in the string.

Regex question mark

To match a string with pattern like:
-TEXT-someMore-String
To get -TEXT-, I came to know that this works:
/-(.+?)-/ // -TEXT-
As far as I know, ? makes the preceding token optional, as in:
colou?r matches both colour and color
I initially put in regex to get -TEXT- part like this:
/-(.+)-/
But it gave -TEXT-someMore-.
How does adding ? make the regex stop at the -TEXT- part correctly? I thought it only made the preceding token optional, rather than making the match stop at a certain point as in the example above.
As you say, ? sometimes means "zero or one", but in your regex +? is a single unit meaning "one or more — and preferably as few as possible". (This is in contrast to bare +, which means "one or more — and preferably as many as possible".)
As the documentation puts it:
However, if a quantifier is followed by a question mark,
then it becomes lazy, and instead matches the minimum
number of times possible, so the pattern /\*.*?\*/
does the right thing with the C comments. The meaning of the
various quantifiers is not otherwise changed, just the preferred
number of matches. Do not confuse this use of
question mark with its use as a quantifier in its own right.
Because it has two uses, it can sometimes appear doubled, as
in \d??\d which matches one digit by preference, but can match two if
that is the only way the rest of the pattern matches.
Alternatively, you can use the Ungreedy modifier (U) so that every quantifier in the regular expression prefers the shortest possible match:
/-(.+)-/U
? after a token is shorthand for {0,1}, which means zero or one occurrence of that token.
But + is not a token; it is a quantifier, shorthand for {1,}: one or more occurrences, with no upper limit.
A ? after a quantifier puts it into non-greedy mode. In greedy mode, a quantifier matches as much of the string as possible; in non-greedy mode, it matches as little as possible.
Another, perhaps the underlying, error in your regex is that you try to match a run of arbitrary characters via .+?. What you really want is probably "any character except -", which you can get via [^-]+. In this case it doesn't matter whether the match is greedy or not: the repeated match terminates as soon as it encounters the second "-" in your string.
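The variants discussed in this thread can be compared directly. A small sketch, using the string from the question (the array labels are just illustrative names):

```php
<?php
$subject = '-TEXT-someMore-String';

$patterns = [
    'greedy'        => '/-(.+)-/',     // as many characters as possible
    'lazy'          => '/-(.+?)-/',    // as few characters as possible
    'ungreedy flag' => '/-(.+)-/U',    // U flips the default greediness
    'negated class' => '/-([^-]+)-/',  // cannot run past the next "-"
];

$found = [];
foreach ($patterns as $label => $pattern) {
    preg_match($pattern, $subject, $m);
    $found[$label] = $m[0];
    echo $label, ': ', $m[0], "\n";
}
// greedy yields "-TEXT-someMore-"; the other three yield "-TEXT-"
```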

Why does this regex not validate in the same way in PHP?

When I try preg_match with the following expression: /.{0,5}/, it still matches strings longer than 5 characters.
It does, however, work properly when I try it in an online regexp matcher.
The site you reference, myregexp.com, is focussed on Java.
Java has a specific function for matching an exact pattern, without needing to use anchor characters. This is the function which myregexp.com uses.
In most other languages, in order to match an exact pattern, you would need to add the anchoring characters ^ and $ at the start and end of the pattern respectively, otherwise the regex assumes it only needs to find the matched pattern somewhere within the string, rather than the whole string being the match.
This means that without the anchors, your pattern will match any string, of any length, because whatever the string, it will contain within it somewhere a match for "zero to five of any character".
So in PHP, and Perl, and virtually any other language, you need your pattern to look like this:
/^.{0,5}$/
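The effect of the anchors is easy to check. In this sketch the sample strings are assumptions chosen for illustration:

```php
<?php
$long = 'abcdefgh'; // 8 characters

// Unanchored: succeeds, because the first five characters already satisfy it.
$unanchored = preg_match('/.{0,5}/', $long, $m);

// Anchored: the whole string has to fit between ^ and $.
$anchoredLong  = preg_match('/^.{0,5}$/', $long);
$anchoredShort = preg_match('/^.{0,5}$/', 'abc');

var_dump($unanchored, $m[0], $anchoredLong, $anchoredShort);
// int(1), string(5) "abcde", int(0), int(1)
```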
Having explained all that, I would make one final observation though: this specific pattern really doesn't need to be a regular expression -- you could achieve the same thing with strlen(). In addition, the dot character in regex may not work exactly as you expect: it typically matches almost any character; some characters, including new line characters, are excluded by default, so if your string contains five characters, but one of them is a new line, it will fail your regex when you might have expected it to pass. With this in mind, strlen() would be a safer option (or mb_strlen() if you expect to have unicode characters).
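A minimal sketch of the length-based check, assuming UTF-8 input and that the mbstring extension is available for mb_strlen(). Note that strlen() counts bytes, which is why the two functions disagree on non-ASCII text:

```php
<?php
$ascii    = 'hello';
$accented = "h\u{e9}llo";   // "héllo": 5 characters, 6 bytes in UTF-8
$newline  = "ab\ncd";       // 5 characters, one of them a line feed

$asciiBytes    = strlen($ascii);                 // 5
$accentedBytes = strlen($accented);              // 6
$accentedChars = mb_strlen($accented, 'UTF-8');  // 5
$newlineBytes  = strlen($newline);               // 5: the newline counts like any other byte

$fits = mb_strlen($accented, 'UTF-8') <= 5;      // true
```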
If you need to match any character in regex, and the default behaviour of the dot isn't good enough, there are two options. One is to add the s modifier at the end of the expression (i.e. it becomes /^.{0,5}$/s). The s modifier tells the regex engine to include new line characters in the dot "any character" match.
The other option (which is useful for languages that don't support the s modifier) is to use an expression and its negation together in a character class, e.g. [\s\S], instead of the dot. \s matches any white space character, and \S is the negation of \s, so it matches any character not matched by \s. Together in a character class they match any character. It's more long-winded and less readable than a dot, but in some languages it's the only way to be sure.
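Both options can be compared against the plain dot. A short sketch, with a sample string (an assumption for illustration) that has a newline in the middle:

```php
<?php
$subject = "ab\ncd"; // five characters, newline in the middle

$plainDot  = preg_match('/^.{0,5}$/', $subject);      // dot stops at the newline
$dotAll    = preg_match('/^.{0,5}$/s', $subject);     // s modifier: dot matches newlines too
$charClass = preg_match('/^[\s\S]{0,5}$/', $subject); // class that matches any character

var_dump($plainDot, $dotAll, $charClass);
// int(0), int(1), int(1)
```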
You can find out more about this here: http://www.regular-expressions.info/dot.html
Hope that helps.
You need to anchor it with ^$. These symbols match the beginning and end of the string respectively, so it must be 0-5 characters between the beginning and end. Leaving out the anchors will match anywhere in the string so it could be longer.
/^.{0,5}$/
For better readability, I would probably also enclose the . in (), but that's kind of subjective.
/^(.){0,5}$/

Regex/PHP check if group of characters appears only once

I am trying to validate an input in PHP with REGEX. I want to check whether the input has the %s character group inside it and that it appears only once. Otherwise, the rule should fail.
Here's what I've tried:
preg_match('|^[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$|u', $value); (there are also some other rules besides this; I've tried the (%s){1} part and it doesn't work).
I believe there is a very easy solution to this, but I'm not really into regexes... Thank you for your help!
If I understand your question, you need a positive lookahead. The lookahead causes the expression to only match if it finds a single %s.
preg_match('|^(?=[^%s].*?[%s][^%s]*$)[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$|u', $value);
I'll explain how each part works
^(?=[^%s].*?[%s][^%s]*$) is a zero-width assertion -- (?=regex) a positive lookahead -- (meaning it must match, but does not "eat" any characters). It means that the whole line can have only 1 %s.
[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$ The remaining part of the regex also looks at the entire string and ensures that the whole string is composed only of the characters in the character class (like your original regex).
I managed to do this with PHP's substr_count() function, following Johnsyweb's suggestion to use an alternative way to perform the validation, and because the suggested regexes seem pretty complicated.
Thank you again!
Alternatively, you can use preg_match_all with your pattern and check the number of matches. If it's 1, then you're ok - something like this:
$result = (preg_match_all('|^[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$|u', $value) == 1);
Try this:
'|^(?=(?:(?!%s).)*%s(?:(?!%s).)*$)[0-9_\s:;,.?!()%\p{L}-]+$|u'
The (%s){1} sequence inside the square brackets probably doesn't do what you think it does, but never mind, the solution is more complex. In fact, {1} should never appear anywhere in a regex. It doesn't ensure that there's only one of something, as many people assume. As a matter of fact, it doesn't do anything; it's pure clutter.
EDIT (in answer to the comment): To ensure that only one of a particular sequence is present in a string, you have to actively examine every single character, classifying it as either part-of-%s or not part-of-%s. To that end, (?:(?!%s).)* consumes one character at a time, after the negative lookahead has confirmed that the character is not the start of %s.
When that part of the lookahead expression quits matching, the next thing in the string has to be %s. Then the second (?:(?!%s).)*$ kicks in to confirm that there are no more %s sequences until the end of the string.
And don't forget that the lookahead expression must be anchored at both ends. Because the lookahead is the first thing after the main regex's start anchor you don't need to add another ^. But the lookahead must end with its own $ anchor.
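A sketch of how this lookahead behaves in practice. The function name is hypothetical, and the ~ delimiter is an assumption chosen for readability. Note that the final character class has to allow % itself, otherwise no string containing %s could pass; that also means a stray % elsewhere in the input is accepted.

```php
<?php
// Hypothetical helper: true when $value contains "%s" exactly once
// and otherwise consists only of the allowed characters.
function hasExactlyOnePlaceholder(string $value): bool
{
    $pattern = '~^(?=(?:(?!%s).)*%s(?:(?!%s).)*$)[0-9_\s:;,.?!()%\p{L}-]+$~u';

    return preg_match($pattern, $value) === 1;
}

var_dump(hasExactlyOnePlaceholder('Hello %s!'));  // bool(true)
var_dump(hasExactlyOnePlaceholder('Hello!'));     // bool(false): no %s at all
var_dump(hasExactlyOnePlaceholder('%s and %s'));  // bool(false): two occurrences
```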
If you're not "into" regular expressions, why not solve this with PHP?
One call to the builtin strpos() will tell you if the string has a match. A second call will tell you if it appears more than once.
This will be easier for you to read and for others to maintain.
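A minimal sketch of that approach, alongside the substr_count() variant the asker ended up using. Both function names are hypothetical, and neither checks the other allowed characters from the original pattern.

```php
<?php
// Two strpos() calls: one to find the first occurrence, one to look for another.
function appearsOnceStrpos(string $value, string $needle = '%s'): bool
{
    $first = strpos($value, $needle);
    if ($first === false) {
        return false; // no occurrence at all
    }

    // Search again, starting just past the first occurrence.
    return strpos($value, $needle, $first + strlen($needle)) === false;
}

// Counting the occurrences directly.
function appearsOnceCount(string $value, string $needle = '%s'): bool
{
    return substr_count($value, $needle) === 1;
}

var_dump(appearsOnceStrpos('Hello %s!'), appearsOnceCount('%s and %s'));
// bool(true), bool(false)
```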
