Regex prevent selecting characters from previous match - php

My title probably doesn't explain exactly what I mean. Take the following string:
POWERSTART9^{{2|3}}POWERENDx{{3^EXSTARTxEXEND}}=POWERSTART27^{{1|4}}POWEREND
What I want to do here is isolate the parts that are like this:
{{2|3}} or {{1|4}}
The following expression works to an extent, it selects the first one {{2|3}} with no issue:
\{\{(.*?)\|(.*?)\}\}
The problem is that it doesn't select just {{2|3}} and then {{1|4}}. After the first match comes {{3^EXSTARTxEXEND}}, so the second match starts at {{3 and runs all the way to the end of the part I actually want, |4}}.
I've never been great with regex and can't work out how to stop it doing that. Any ideas? I basically want it to only match the exact pattern and not something that contains it.

You may use
\{\{((?:(?!{{).)*?)\|(.*?)}}
See the regex demo.
If there can be no { and } inside the {{...}} substrings, you may use a simpler \{\{([^{}|]*)\|([^{}]*)}} expression (see demo).
Details
\{\{ - a {{ substring
((?:(?!{{).)*?) - Capturing group 1: any char (.), as few as possible (*?), that does not start a {{ char sequence (tempered greedy token)
[^{}|]* - any 0 or more chars other than {, } and |
\| - a | char
(.*?) - Capturing group 2: any 0 or more chars, as few as possible
[^{}]* - any 0 or more chars other than { and }
}} - a }} substring.
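To see the difference on the string from the question, here is a minimal PHP sketch. The variable names and the ~ delimiter are my own choices, and the braces are escaped throughout just to be safe; treat it as an illustration rather than the only way to write it.

```php
<?php
// Sample string from the question.
$subject = 'POWERSTART9^{{2|3}}POWERENDx{{3^EXSTARTxEXEND}}=POWERSTART27^{{1|4}}POWEREND';

// Original lazy pattern: the second match starts at "{{3" and runs to "|4}}".
preg_match_all('~\{\{(.*?)\|(.*?)\}\}~', $subject, $lazy);

// Tempered pattern: group 1 is not allowed to run across another "{{".
preg_match_all('~\{\{((?:(?!\{\{).)*?)\|(.*?)\}\}~', $subject, $tempered);

// Negated character classes: usable when no braces occur inside {{...}}.
preg_match_all('~\{\{([^{}|]*)\|([^{}]*)\}\}~', $subject, $negated);

print_r($lazy[0]);     // {{2|3}} and {{3^EXSTARTxEXEND}}=POWERSTART27^{{1|4}}
print_r($tempered[0]); // {{2|3}} and {{1|4}}
print_r($negated[0]);  // {{2|3}} and {{1|4}}
```

With preg_match_all in its default order, index 1 and index 2 of the result hold the two captured parts of each match.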

Try this \{\{([^\^|]*)\|([^\^|]*)\}\}
https://regex101.com/r/bLF8Oq/1

Related

preg - Difference between Search Patterns with [] and without

It seems I am not able to understand something very basic with preg regex Patterns in PHP.
What is the difference between these Regex Patterns:
\b([A-Z...]...)
[\b]{1}([A-Z...]...)
The pattern should start with a word boundary, but why is the result different when I put it in []{1}?
The first one works as I expected, but the second does not. The problem is that I want to put more into the [], so that the pattern can start with a word boundary OR a lowercase character [a-z].
Thank you!
Example Text:
Race1529/05/201512:45K4 Senior Men 1000m
LaneName(s)NFBib(s)TimeRank250m500m750m
152
Martin SCHUBERT / Lukas REUSCHENBACH155
11
153
151Kostja STROINSKI / Kai SPENNER
03:07.740
GER
8
I want to find the names of the racers. Sometimes they have a word boundary (\b) at the beginning, sometimes not. (But I need the word boundary.)
$pattern = '#\b(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
($GB is a variable with all Uppercase Letters, $KB with lower case letters)
preg_match_all gives me all racers whose name has a word boundary at the beginning (in this example Schubert, Reuschenbach, Spenner), but of course not Stroinski. So I tried this:
$pattern = '#[\b0-9]+(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
It does not work. Even if I remove the 0-9 and put only [\b]{1} at the beginning, it doesn't find any hits.
I don't see the difference between \b and [\b]{1}. It seems to be a very basic misunderstanding.
The [\b] is a character class that only matches a backspace char (\u0008).
See PHP regex reference:
note that "\b" has a different meaning, namely the backspace character, inside a character class
Also, .{1} is equivalent to a plain . since the {1} limiting quantifier is always redundant; it only makes sense when your patterns are built dynamically from variables.
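A small PHP sketch of the point about [\b]. The last pattern is my own assumption about what the asker is after (a word boundary OR a preceding digit, written as an alternation in a group instead of a character class); it is not taken from the answer above.

```php
<?php
// Inside a character class, \b means the backspace character (0x08).
$matchesBackspace = preg_match('/[\b]/', "\x08");  // 1: a backspace is present
$matchesBoundary  = preg_match('/[\b]/', 'a b');   // 0: no backspace in the text

// "151Kostja": there is no word boundary between "1" and "K",
// because both are word characters.
$plainBoundary = preg_match('/\bKostja/', '151Kostja');  // 0

// Hypothetical alternative: a boundary OR a preceding digit, using a group.
$boundaryOrDigit = preg_match('/(?:\b|(?<=[0-9]))Kostja/', '151Kostja');  // 1

var_dump($matchesBackspace, $matchesBoundary, $plainBoundary, $boundaryOrDigit);
```

The lookbehind (?<=[0-9]) only checks for the digit without consuming it, so the name capture can still start right at the "K".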

Retrieve 0 or more matches from comma separated list inside parenthesis using regex

I am trying to retrieve matches from a comma separated list that is located inside parenthesis using regular expression. (I also retrieve the version number in the first capture group, though that's not important to this question)
What's worth noting is that the expression should ideally handle all possible cases, where the list could be empty or could have more than 3 entries = 0 or more matches in the second capture group.
The expression I have right now looks like this:
SomeText\/(.*)\s\(((,\s)?([\w\s\.]+))*\)
The string I am testing this on looks like this:
SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)
Result of this is:
1. [6-11] `1.0.4`
2. [32-52] `, Macbook Pro Retina`
3. [32-34] `, `
4. [34-52] `Macbook Pro Retina`
The desired result would look like this:
1. [6-11] `1.0.4`
2. [32-52] `debug`
3. [32-34] `OS X 10.11.2`
4. [34-52] `Macbook Pro Retina`
According to the image above (as far as I can see), the expression should work on the test string. What is the cause of the weird results and how could I improve the expression?
I know there are other ways of solving this problem, but I would like to use a single regular expression if possible. Please don't suggest other options.
When dealing with a varying number of groups, regex ain't the best. Solve it in two steps.
First, break down the statement using a simple regex:
SomeText\/([\d.]*) \(([^)]*)\)
1. [9-14] `1.0.4`
2. [16-55] `debug, OS X 10.11.2, Macbook Pro Retina`
Then just explode the second result by ',' to get your groups.
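The two steps could look roughly like this in PHP. The helper name parseLine and the returned array shape are made up for the illustration, and the ~ delimiter is used so the / needs no escaping.

```php
<?php
// Hypothetical helper: returns the version and the list items, or null if
// the line does not have the expected shape.
function parseLine(string $line): ?array
{
    // Step 1: grab the version and the raw text between the parentheses.
    if (!preg_match('~SomeText/([\d.]*) \(([^)]*)\)~', $line, $m)) {
        return null;
    }

    // Step 2: split the raw list. explode() on an empty string would
    // return [''], so an empty list is handled separately.
    $raw = $m[2] ?? '';
    $items = $raw === '' ? [] : array_map('trim', explode(',', $raw));

    return ['version' => $m[1], 'items' => $items];
}

print_r(parseLine('SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)'));
print_r(parseLine('SomeText/1.0.4 ()'));
```

trim() takes care of the space after each comma, so the items come back clean regardless of how the list was spaced.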
Probably the \G anchor works best here for binding the match to an entry point. This regex is designed for input that is always similar to the sample that is provided in your question.
(?<=SomeText\/|\G(?!^))[(,]? *\K[^,)(]+
(?<=SomeText\/|\G) the lookbehind is the part where matches should be glued to
\G matches where the previous match ended (?!^) but don't match start
[(,]? * matches an optional opening parenthesis or comma followed by any amount of space
\K resets beginning of the reported match
[^,)(]+ matches the wanted characters, that are none of ( ) ,
Demo at regex101 (grab matches of $0)
Another idea with use of capture groups.
SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)
This one, without the lookbehind, is a bit more accurate (it also requires the opening parenthesis), performs better (needs fewer steps), and is probably easier to understand and maintain.
SomeText\/([^(]*)\( the entry anchor and version is captured here to $1
|\G(?!^),? *([^,)]+) or glued to previous match: capture to $2 one or more characters, that are not , ) preceded by optional space or comma.
Another demo at regex101
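For reference, a sketch of how the capture-group variant might be consumed from PHP with preg_match_all. The variable names are assumptions, and the filtering step reflects that, in the default result order, group 2 is reported as an empty string for the first match.

```php
<?php
$line = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';

preg_match_all('~SomeText/([^(]*)\(|\G(?!^),? *([^,)]+)~', $line, $m);

// Group 1 is only filled by the first match; it includes the space
// before "(", hence the trim().
$version = trim($m[1][0] ?? '');

// Group 2 is empty for the first match, so drop empty strings.
$items = array_values(array_filter($m[2], function ($s) {
    return $s !== '';
}));

print_r($version); // 1.0.4
print_r($items);   // debug, OS X 10.11.2, Macbook Pro Retina
```

Because every item is its own match, this handles any number of entries without repeating a group in the pattern.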
Actually, stribizhev was close:
(?:SomeText\/([^() ]*)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\))
Just had to make that one class expect at least one match
(?:SomeText\/([0-9.]+)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\)) is a little more clear as long as the version number is always numbers and periods.
I wanted to come up with something more elegant than this (though this does actually work):
SomeText\/(.*)\s\(([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?\)
Obviously, the
([^\,]+)?\,?\s?
is repeated 6 times.
(It can be repeated any number of times and it will work for any number of comma-separated items equal to or below that number of times).
I tried to shorten the long, repetitive list of ([^\,]+)?\,?\s? above to
(?:([^\,]+)\,?\s?)*
but it doesn't work and my knowledge of regex is not currently good enough to say why not.
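A likely reason, which also explains the result in the question: a capturing group inside a repetition has only one slot, so each iteration overwrites the previous one and only the last item survives. The sketch below uses a simplified variant of the pattern (my own, with ) excluded from the item class) to show the effect.

```php
<?php
// The group runs three times, but there is only one capture slot for it.
preg_match(
    '~\((?:([^,)]+),?\s?)*\)~',
    '(debug, OS X 10.11.2, Macbook Pro Retina)',
    $m
);

var_dump($m[0]); // the whole parenthesised list
var_dump($m[1]); // only the last iteration: "Macbook Pro Retina"
```

This is why the answers above either split in a second step or use \G to turn every item into a separate match.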
This should solve your problem. Use the code you already have and add something like this. It will determine where commas are in your string and delete them.
Use trim() to delete white spaces at the start or the end.
$a = strpos($line, ",");
$line = trim(substr($line, 55-$a));
I hope, this helps you!

Regex for word not followed by asterisk

I need a regex (for PHP) matching any 1 or 2 characters that start with a + and do not end with a *.
So far I have this one: /\+\b\w{1,2}\b/, which finds +a3 but also finds +a3*, as the asterisk is seen as coming after the word.
In a string like +find +in +me* I only want to find the +in but not the +me*.
I tried /\+\b[\w\*]{1,2}\b/ but that does not seem to make any difference.
preg_replace($regex,'','+do+find +in +me*'); //expected result: '+do+find +me*'
How about:
/\+\w{1,2}\b(?!\*)/
(?!\*) is a negative lookahead that ensures a * doesn't follow the one or two characters.
The \b isn't mandatory between \+ and \w.
Edit according to comment:
This matches the "+2c" in "whatever+2c". What would I need to change so that it won't match this, but only matches in "whatever +2c" or "+2c whatever"?
Use this one:
/(?:^|\s)\+\w{1,2}(?:\s|$)/
According to comments:
/(?<=^|\s)\+\w{1,2}(?:\s|$)/
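Running the first and the last pattern against the example from the question shows how they differ. This is only a quick check with made-up variable names, not part of the original answer.

```php
<?php
$input = '+do+find +in +me*';

// Lookahead version: removes every +xx not followed by "*",
// including the "+do" that is glued to "+find".
$loose = preg_replace('/\+\w{1,2}\b(?!\*)/', '', $input);

// Lookbehind version: the token must be preceded by start-of-string or
// whitespace and followed by whitespace or end-of-string. The trailing
// whitespace is part of the match, so it is removed as well.
$strict = preg_replace('/(?<=^|\s)\+\w{1,2}(?:\s|$)/', '', $input);

var_dump($loose);  // "+find  +me*" (two spaces are left behind)
var_dump($strict); // "+do+find +me*"
```

Only the lookbehind version gives the result the question expects, because it leaves "+do" alone.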

Regex - matching all between second set of brackets ([])

I have the following string, in which I need to match only the last seven digits between [] brackets. The string looks like this:
[15211Z: 2012-09-12] ([5202900])
I only need to match 5202900 in the string contained between ([]). A similar number could appear anywhere in the string, so something like (\d{7}) won't work.
I also tried the following regex
([[0-9]{1,7}])
but this includes the [] in the string?
If you just want the 7 digits, not the brackets, but want to make sure that the digits are surrounded with brackets:
(?<=\[)\d{7}(?=\])
FYI: (?<=\[) is a positive lookbehind and (?=\]) is a positive lookahead.
Good source on the topic: http://www.regular-expressions.info/lookaround.html
Try matching \(\[(\d{7})\]\), so you match this whole regular expression, then you take group 1, the one between unescaped parentheses. You can replace {7} with a '*' for zero or more, + for 1 or more or a precise range like you already showed in your question.
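In PHP the two approaches so far could be used like this. It is a minimal sketch; the variable names are illustrative only.

```php
<?php
$str = '[15211Z: 2012-09-12] ([5202900])';

// Capture group: the brackets are matched, but only group 1 is used.
preg_match('/\(\[(\d{7})\]\)/', $str, $m);
$viaGroup = $m[1] ?? null;

// Lookaround: the brackets are only asserted, so the whole match
// consists of the digits.
preg_match('/(?<=\[)\d{7}(?=\])/', $str, $m2);
$viaLookaround = $m2[0] ?? null;

var_dump($viaGroup);      // "5202900"
var_dump($viaLookaround); // "5202900"
```

The first bracketed part, [15211Z: ...], is skipped by both patterns because it does not contain seven consecutive digits.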
You can try to use
\[(\d{1,7})\]
If first pattern looks like yours (not only digits), then this should work for you to extract group of digits surrounded by brackets like ([123]):
\(\[(\d+)\]\)
From your details, lookbehind and lookahead seem to be a good fit. You can also use this one:
(\d{7})\]\)$
Since the seven-digit pattern is expected at the end of the line, the engine needs to do less work to find the match.
Hope it helps!
Here is a benchmark (in Perl, but I think the result is much the same in PHP) that compares the lookaround approach with a capture group:
use Benchmark qw(:all);

my $str = q/[15211Z: 2012-09-12] ([5202900])/;
my $count = -3;
cmpthese($count, {
    'lookaround' => sub {
        $str =~ /(?<=\[)\d{7}(?=\])/;
    },
    'capture group' => sub {
        $str =~ /\[(\d{7})\]/;
    },
});
result:
                  Rate    lookaround capture group
lookaround    274914/s            --          -70%
capture group 931043/s          239%            --
As we can see, capture is more than 3 times faster than lookaround.

Regex optional groups

I'd like to capture up to four groups of text between <p> and </p>. I can do that using the following regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
The text to match on:
<h5>Trivia</h5><p>Was discovered by a freelance photographer while sunbathing on Bournemouth Beach in August 2003.</p><p>Supports Southampton FC.</p><p>She has 11 GCSEs and 2 'A' Levels.</p><p>Listens to soul, R&B, Stevie Wonder, Aretha Franklin, Usher Raymond, Michael Jackson and George Michael.</p>
It outputs the four lines of text. It also works as intended if there are more trivia items or <p> occurrences.
But if there are fewer than 4 trivia items or <p> groups, it outputs nothing, since it cannot find the fourth group. How do I make that group optional?
I've tried: <h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)? and that works according to http://gskinner.com/RegExr/ but it doesn't work if I put it inside PHP code. It only detects one group and puts everything in it.
The magic word is either 'escaping' or 'delimiters', read on.
The first regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
worked because you escaped the / characters in tags like </h5> to <\/h5>.
But in your second regex (correctly enclosing each paragraph in an optional non-capturing group, fetching 1 to 5 paragraphs):
<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?
you forgot to escape those / characters.
It should then have been:
$pattern = '/<h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?/';
The above is assuming you were putting your regex between two / "delimiters" characters (out of conventional habit).
To dive a little deeper into the rabbit hole, note that in PHP the first and last characters of a regular expression are usually "delimiters", so that one can add modifiers at the end (such as case-insensitive).
So instead of escaping your regex, you could also use a ~ character (or #, etc) as a delimiter.
Thus you could also use the same identical (second) regex that you posted and enclose for example like this:
$pattern = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Here is a working (web-based) example of that, using # as delimiter (just because we can).
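As a quick check of the ~ variant, here is a sketch with a shortened, made-up input that has only two paragraphs. The filtering at the end is a defensive assumption on my part, in case optional groups that did not take part in the match show up as empty entries.

```php
<?php
// Shortened, hypothetical input with only two trivia paragraphs.
$html = '<h5>Trivia</h5><p>Supports Southampton FC.</p><p>She has 11 GCSEs.</p>';

// "~" as delimiter, so the "/" in the closing tags needs no escaping.
$pattern = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';

$trivia = [];
if (preg_match($pattern, $html, $m)) {
    // Drop the full match and keep only the non-empty captures.
    $trivia = array_values(array_filter(array_slice($m, 1), 'strlen'));
}

print_r($trivia); // the two paragraph texts
```

The same pattern keeps working with one to five paragraphs, since every group after the first is optional.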
You can use the question mark to make each <p>...</p> optional:
$pattern = '~<h5>Trivia</h5>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Using the DOM is a good option too.
