I want my regex to match ?ver and ?v, but not ?version
This is what I have so far: $parts = preg_split( "(\b\?ver\b|\b\?v\b)", $src );
I think the trouble might be how I escape the ?.
Your pattern tries to match a ? that is preceded by a word char (the \b before \? requires one), and since there is none, you do not get a match.
Use the following pattern:
'/\?v(?:er)?\b/'
See the regex demo
Pattern details:
\? - a literal ? char
v(?:er)? - v or ver
\b - a word boundary (i.e. there must be a non-word char (not a digit, letter or _) or end of string after v or ver).
Note you do not need the first (initial) word boundary as it is already there, between a ? (a non-word char) and v (a word char). You would need a word boundary there if the ? were optional.
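As a rough check of the pattern above, here is a minimal Python sketch. The sample strings are invented for illustration, and re.split stands in for PHP's preg_split, which should behave the same way for this pattern.

```python
import re

# Pattern from the answer: literal "?", then "v" or "ver", then a word boundary.
pattern = re.compile(r"\?v(?:er)?\b")

# Hypothetical sample inputs (not from the question).
samples = ["style.css?ver=1.2", "style.css?v=3", "style.css?version=4"]

for src in samples:
    # The string is split only where ?v or ?ver ends at a word boundary.
    print(src, "->", pattern.split(src))
```

The third sample comes back unsplit, because neither ?v nor ?ver is followed by a word boundary inside ?version.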
Try the following regex pattern;
(\?v(?:\b|(?:er(?!sion))))
Demo
This will allow ?ver and ?v, but will use a negative look-ahead to prevent matching if ?ver is followed by sion, as in your case ?version.
Building upon the answers above: to match a word only when it is not part of another word, you can try
\b(WORD_HERE)\b, for example \b(ver)\b. For ?ver, drop the leading \b, because ? is a non-word character: (\?ver)\b
This allows ver and rejects version and average.
Related
I have a regular expression to escape all special characters in a search string. This works great, however I can't seem to get it to work with word boundaries. For example, with the haystack
add +
or
add (+)
and the needle
+
the regular expression /\+/gi matches the "+". However the regular expression /\b\+/gi doesn't. Any ideas on how to make this work?
Using
add (plus)
as the haystack and /\bplus/gi as the regex, it matches fine. I just can't figure out why the escaped characters are having problems.
\b is a zero-width assertion: it doesn't consume any characters, it just asserts that a certain condition holds at a given position. A word boundary asserts that the position is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. (A "word character" is a letter, a digit, or an underscore.) In your string:
add +
...there's a word boundary at the beginning because the a is not preceded by a word character, and there's one after the second d because it's not followed by a word character. The \b in your regex (/\b\+/) is trying to match between the space and the +, which doesn't work because neither of those is a word character.
Try changing it to:
/\b\s?\+/gi
Edit:
Extend this concept as far as you want. If you want the first + after any word boundary:
/\b[^+]*\+/gi
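The boundary behaviour described above can be checked with a small Python sketch. It assumes the + in each pattern is escaped (\+), and Python's re is used only for illustration in place of the JavaScript-style /…/gi literals.

```python
import re

# A space and "+" are both non-word characters, so there is no word
# boundary directly in front of the "+".
no_match = re.search(r"\b\+", "add +")
print(no_match)  # None

# Optional whitespace lets \b sit right after "add".
with_space = re.search(r"\b\s?\+", "add +")
print(repr(with_space.group()))  # ' +'

# Skipping any non-"+" characters after a boundary also handles "add (+)".
first_plus = re.search(r"\b[^+]*\+", "add (+)")
print(repr(first_plus.group()))  # 'add (+'
```

Note that the last pattern consumes everything from the word boundary up to the +, so the match contains more than the + itself.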
Boundaries are very conditional assertions; what they anchor depends on what they touch. See this answer for a detailed explanation, along with what else you can do to deal with it.
I am aware that the definition of a word boundary is (?<!\w)(?=\w)|(?<=\w)(?!\w)
and I wish to (optionally) add the underscore to that definition.
One way of doing it is to simply modify the definition,
so the new one would be (_)?((?<!\w)(?=\w)|(?<=\w)(?!\w))
, but I don't wish to use such a long expression.
An easy approach could be:
If I could write a word boundary inside a character class, then adding the underscore would be very easy, just like [\b_]. The problem is that \b inside a character class, i.e. [\b], means the backspace character, not a word boundary.
Please tell me how to put \b inside a character class without losing its original meaning.
You may use lookarounds:
(?:\b|(?<=_))word(?=\b|_)
^^^^^^^^^^^^^    ^^^^^^^^
See the regex demo where (?:\b|(?<=_)) is a non-capturing group matching either a word boundary or a location preceded with _, and (?=\b|_) is a positive lookahead matching either a word boundary or a _ symbol.
Unfortunately, Python re won't allow using (?<=\b|_) as the lookbehind pattern should be of fixed width (else, you will get look-behind requires fixed-width pattern error).
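Both limitations can be seen directly in Python. This is only a small sketch, and the variable names are made up for illustration.

```python
import re

# \b is zero characters wide and _ is one, so the lookbehind is not fixed width.
try:
    re.compile(r"(?<=\b|_)word")
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("re.error:", exc)

# Inside a character class, \b means backspace (U+0008), not a word boundary.
backspace_hit = re.fullmatch(r"[\b]", "\x08") is not None

print(lookbehind_ok, backspace_hit)  # False True
```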
A Python demo:
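import re
# Left side: word boundary or a preceding "_"; right side: word boundary or "_"
rx = r"(?:\b|(?<=_))word(?=\b|_)"
s = "some_word_here and a word there"
print(re.findall(rx, s))  # ['word', 'word']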
An alternative solution is to use custom word boundaries like (?<![^\W_]) / (?![^\W_]) (see online demo):
rx = r"(?<![^\W_])word(?![^\W_])"
The (?<![^\W_]) negative lookbehind fails the match if the search word is immediately preceded by a letter or digit ([^\W_] matches any word char except _), so it requires the start of string, a non-word char, or _ before the search word. The (?![^\W_]) negative lookahead fails the match if a letter or digit immediately follows, so it requires the end of string, a non-word char, or _ after the search word.
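To see how the custom boundaries differ from a plain \b, here is a short comparison. The second test string is invented for illustration.

```python
import re

s = "some_word_here and a word there"

# With plain \b, "_" counts as a word character, so "some_word_here" has no
# boundary around "word" and only the standalone word is found.
plain = re.findall(r"\bword\b", s)

# With the custom boundaries, only letters and digits block a match.
custom = re.findall(r"(?<![^\W_])word(?![^\W_])", s)

# Letters or digits next to "word" still prevent a match.
glued = re.findall(r"(?<![^\W_])word(?![^\W_])", "swordfish word2")

print(plain, custom, glued)  # ['word'] ['word', 'word'] []
```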
I am trying to build regex for the expression to get values for either Boost Mobile or BoostMobile whichever is present.
Any suggestions please ?
In NFA regexes, in unanchored alternation groups, the first branch matched stops the group processing, the other branches located further on the right are not checked against the string. You may read more on that at Alternation with The Vertical Bar or Pipe Symbol.
So, swapping the values and simplifying the pattern you could use
/\b(Boost\s*Mobile|Boost)\b/i
However, the most effective way here is through using an optional group:
/\bBoost(?:\s*Mobile)?\b/i
        ^^^         ^^
See the regex demo
The i case insensitive modifier is set on the whole regex. You need not switch it on and off at the beginning/end of the pattern. Also, \W* can match an empty string, so your way of checking a word boundary may fail here when \b will work.
Pattern details:
\b - leading word boundary
Boost - a literal substring
(?:\s*Mobile)? - an optional group matching 1 or 0 sequences of
\s* - 0+ whitespaces
Mobile - a literal substring
\b - trailing word boundary
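The effect of branch order can be illustrated with a small Python sketch. The sample text is invented, and re.I plays the role of the /i modifier.

```python
import re

text = "BoostMobile and Boost Mobile"

# Shorter branch first: for "Boost Mobile" the group stops at "Boost",
# because the trailing \b already succeeds there.
short_first = re.findall(r"\b(?:Boost|Boost\s*Mobile)\b", text, flags=re.I)

# Optional group: "Mobile" is consumed whenever it is present.
optional = re.findall(r"\bBoost(?:\s*Mobile)?\b", text, flags=re.I)

print(short_first)  # ['BoostMobile', 'Boost']
print(optional)     # ['BoostMobile', 'Boost Mobile']
```

For BoostMobile the short-first version still finds the full name, but only because the \b after Boost fails and forces the engine to backtrack into the second branch.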
I'm looking for a regex that matches words in which a letter is repeated more than once in a row.
Here's an example:
This is an exxxmaple oooonnnnllllyyyyy!
So far I haven't found anything that matches exactly:
exxxmaple and oooonnnnllllyyyyy
I need to find it and place them in an array, like this:
preg_match_all('/\b(???)\b/', $str, $arr);
Can somebody explain what regexp I have to use?
You can use a very simple regex like
\S*(\w)(?=\1+)\S*
See how the regex matches at http://regex101.com/r/rF3pR7/3
\S matches anything other than whitespace
* quantifier, zero or more occurrences of \S
(\w) matches a single word character and captures it in \1
(?=\1+) positive lookahead. Asserts that the captured character is followed by itself (\1)
+ quantifier, one or more occurrences of the repeated character
\S* matches zero or more non-whitespace characters
EDIT
If the repeating must be more than once, a slight modification of the regex would do the trick
\S*(\w)(?=\1{2,})\S*
for example http://regex101.com/r/rF3pR7/5
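Here is a hedged Python sketch of the two patterns above. The helper full_matches and the second sample string are made up for illustration. It collects whole matches rather than the single captured letter.

```python
import re

def full_matches(pattern, text):
    # Return whole matches; re.findall would return only group 1 here.
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "This is an exxxmaple oooonnnnllllyyyyy!"

doubled = full_matches(r"\S*(\w)(?=\1+)\S*", text)
tripled = full_matches(r"\S*(\w)(?=\1{2,})\S*", text)
print(doubled)  # ['exxxmaple', 'oooonnnnllllyyyyy!']
print(tripled)  # ['exxxmaple', 'oooonnnnllllyyyyy!']

# Only the stricter pattern skips ordinary double letters such as "apple".
print(full_matches(r"\S*(\w)(?=\1+)\S*", "apple exxxmaple"))     # ['apple', 'exxxmaple']
print(full_matches(r"\S*(\w)(?=\1{2,})\S*", "apple exxxmaple"))  # ['exxxmaple']
```

Because \S also matches punctuation, the trailing ! is part of the second match.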
Use this if you want to discard words like apple, etc.
\b\w*(\w)(?=\1\1+)\w*\b
or
\b(?=[^\s]*(\w)\1\1+)\w+\b
Try this. See the demos:
http://regex101.com/r/kP8uF5/20
http://regex101.com/r/kP8uF5/21
You can use this pattern:
\b\w*?(\w)\1{2}\w*
The \w class and the word-boundary \b limit the search to words. Note that the word boundary can be removed, however, it reduces the number of steps to obtain a match (as the lazy quantifier). Note too, that if you are looking for words (in the common meaning), you need to remove the word boundary and to use [a-zA-Z] instead of \w.
(\w)\1{2} checks if a repeated character is present. A word character is captured in group 1 and must be followed with the content of the capture group (the backreference \1).
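A quick Python check of the \w-based patterns follows, using the sentence from the question. The helper name whole is made up for illustration. Unlike the \S-based version, these stop at punctuation.

```python
import re

def whole(pattern, text):
    # Collect whole matches instead of the captured letter.
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "This is an exxxmaple oooonnnnllllyyyyy!"

lazy_backref = whole(r"\b\w*?(\w)\1{2}\w*", text)
greedy_lookahead = whole(r"\b\w*(\w)(?=\1\1+)\w*\b", text)

print(lazy_backref)      # ['exxxmaple', 'oooonnnnllllyyyyy']
print(greedy_lookahead)  # ['exxxmaple', 'oooonnnnllllyyyyy']
```

Both patterns require the same character three times in a row, so a word like apple is not returned.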
I have tried to use the regex /^(\S+)(?:\?$|$)/
with yolo and yolo?
It works with both, but on the second string (yolo?) the ? is included in the capturing group (\S+).
Is it a bug in the regex engine, or have I made a mistake?
Edit: I don't want the '?' to be included in the capturing group. Sorry for my bad English.
If what you want to capture can't have a ? in it, use a negated character class [^...] (see demo here):
^([^\s?]+)\??$
If what you want to capture can have ? in it (for example, yolo?yolo? and you want yolo?yolo), you need to make your quantifier + lazy by adding ? (see demo here):
^(\S+?)\??$
There is BTW no need for a capturing group here, you can use a look ahead (?=...) instead and look at the whole match (see demo here):
^[^\s?]+(?=\??$)
What was happening
The rules are: quantifiers (like +) are greedy by default, and the regex engine will return the first match it finds.
Consider what this means here:
\S+ will first match everything in yolo?, then the engine will try to match (?:\?$|$).
\?$ fails (we're already at the end of the string, so we now try to match an empty string and there's no ? left), but $ matches.
The regex has successfully reached its end, so the engine returns the match in which \S+ has matched the whole string and everything is in the first capturing group.
To match what you want you have to make the quantifier lazy (+?), or prevent the character class (yeah, \S is a character class) from matching your ending delimiter ? (with [^\s?] for example).
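The difference between the greedy, lazy and negated-class variants can be checked with a minimal Python sketch. The helper name capture is made up for illustration.

```python
import re

def capture(pattern, s):
    # Return group 1 of the match, or None when the pattern does not match.
    m = re.match(pattern, s)
    return m.group(1) if m else None

greedy = capture(r"^(\S+)(?:\?$|$)", "yolo?")
lazy = capture(r"^(\S+?)\??$", "yolo?")
negated = capture(r"^([^\s?]+)\??$", "yolo?")
print(greedy, lazy, negated)  # yolo? yolo yolo

# Only the lazy version accepts a ? inside the captured text.
print(capture(r"^(\S+?)\??$", "yolo?yolo?"))     # yolo?yolo
print(capture(r"^([^\s?]+)\??$", "yolo?yolo?"))  # None
```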
This is the correct behavior, as \S+ greedily matches one or more non-whitespace characters, of which ? is one.
Thus the question mark is matched in the (\S+) group and the non-capturing group resolves to $. You could make it work as you expect by making the match non-greedy with:
/^(\S+?)(?:\?$|$)/
demo
Alternatively, you could restrict the character class:
/^([^\s?]+)(?:\?$|$)/
demo
Make the + non greedy:
^(\S+?)\??$
The regex below captures all the non-space characters followed by an optional ?,
^([\S]+)\??$
DEMO
OR
^([\w]+)\??$
DEMO
If you use \S+, it also matches the ? character, because ? is a non-space character. To keep the ? out of the group, use the second regex: \w+ captures only the word characters, and \?? then matches the optional ? that follows them.
It is doing that because \S matches any non-white space character and it is being greedy.
Following the + quantifier with ? for a non-greedy match will prevent this.
^(\S+?)\??$
Or use \w here which matches any word character.
^(\w+)\??$