Why does this regex pattern not match? [duplicate] - php

This question already has answers here:
Greedy vs. Reluctant vs. Possessive Qualifiers
(7 answers)
Can someone explain Possessive Quantifiers to me? (Regular Expressions)
(1 answer)
Closed 5 years ago.
Regex101 link: https://regex101.com/r/MsZy0A/2
I have the following regex pattern; .++b with the following test data; aaaaaaaacaeb.
What I don't understand is the "Possessive quantifier". I've read that it doesn't backtrack, which it normally does. However, I don't think it has to backtrack anyways? It only has to match anything up to and including "b", "b" would be matched twice, as .+ matches everything (including "b"), and the "b" after would also match "b".
Could someone please explain the possessive quantifier's role in this?
This question is not a duplicate of the one noted, I'm asking about this particular case because I still didn't get it after reading the other answer.

++ Matches between one and unlimited times, as many times as possible, without giving back - means, if you write .++, it matches everything including the final b. So the additional b in your regex will never matched.
You could get around this, if you don't use possessive quantifiers or simply remove the b from the matching class [^b]++b - but I would suggest the first. Possessive quantifiers are almost everytime unneccessary.

Related

Is it safe to mass replace legacy ASP tags with <?= ?> [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 4 years ago.
I have this RegEx:
('.+')
It has to match character literals like in C. For example, if I have 'a' b 'a' it should match the a's and the ''s around them.
However, it also matches the b also (it should not), probably because it is, strictly speaking, also between ''s.
Here is a screenshot of how it goes wrong (I use this for syntax highlighting):
I'm fairly new to regular expressions. How can I tell the regex not to match this?
It is being greedy and matching the first apostrophe and the last one and everything in between.
This should match anything that isn't an apostrophe.
('[^']+')
Another alternative is to try non-greedy matches.
('.+?')
Have you tried a non-greedy version, e.g. ('.+?')?
There are usually two modes of matching (or two sets of quantifiers), maximal (greedy) and minimal (non-greedy). The first will result in the longest possible match, the latter in the shortest. You can read about it (although in perl context) in the Perl Cookbook (Section 6.15).
Try:
('[^']+')
The ^ means include every character except the ones in the square brackets. This way, it won't match 'a' b 'a' because there's a ' in between, so instead it'll give both instances of 'a'
You need to escape the qutoes:
\'[^\']+\'
Edit: Hmm, we'll I suppose this answer depends on what lang/system you're using.

Extract all words between two phrases using regex [duplicate]

This question already has an answer here:
Simple AlphaNumeric Regex (single spacing) without Catastrophic Backtracking
(1 answer)
Closed 4 years ago.
I'm trying to extract all the words between two phrases using the following regex:
\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b(.*)\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b
The documents I'm running this regex on are 10-K filings. The filings are too long to post here (see regex101 url below for example), but basically they are something like this:
ITEM 1. BUSINESS
lots of words
ITEM 2. PROPERTIES
lots of words
ITEM 3. LEGAL PROCEEDINGS
I want to extract all the words between ITEM 1 and ITEM 3. Note that the subtitles for each ITEM may be slightly different for each 10-K filing, hence I'm allowing for a few words between each word.
I keep getting catastrophic backtracking error, and I cannot figure out why. For example, please see https://regex101.com/r/zgTiyb/1.
What am I doing wrong?
Catastrophic backtracking has almost one main reason:
A possible match is found but can't finish.
You made too many positions available for regex to try. This hits backtracking limit on PCRE. A quick work around would be removing the only dot-star in regex in order to replace it with a restrictive quantifier i.e.
.{0,200}
See live demo here
But the better approach is re-constructing the regular expression:
\bitem\b.*?\b(?:1|one)\b(*COMMIT)\W+(?:\w+\W+){0,2}?business\b\h*\R+(?:(?!item\h+(?:3|three)\b)[\s\S])*+item\h+(?:3|three)\b\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings\b
See live demo here
Your own regex needs ~45K steps on given input string to find those two matches. In contrast, this modified regex needs ~8K steps to accomplish the task. That's a huge improvement.
The latter doesn't need s flag (and it shouldn't be enabled). I used (*COMMIT) backtracking verb to cause an early failure if a possible match is found but is likely to not finish.
#Sebastian Proske's solution matches three sub-strings but I don't think the third match is an expected match. This huge third match is the only reason for your regex to break.
Please read this answer to have a better insight into this problem.
This isn't really catastrophic backtracking, just a whole lot of text and a comparedly low backtracking limit in regex101. In this scenario the use of .* isn't optimal, as it will match the whole remainder of the textfile once it is reached and then backtrack character after character to match the parts after it - which means a lot of characters to process.
Seems you can stick to \w+\W+ at that place as well and use lazy matching instead of greedy to get your result, like
\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b\W+(?:\w+\W+)*?\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b
Note that the pcre engine optimizes (?:\w+\W+) to (?>\w++\W++) thus working by word-no-word-chunks instead of single characters.

Regular Expression - Find all matches after string but before character [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

Regex find a string with hypones in it [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

What does this PHP regular expression mean? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I'm learning PHP regular expressions, and I came across something I'm having trouble making sense of.
The book gives this example in validating an e-mail address.
if (ereg("^[^#]+#([a-z0-9\-]+\.)+[a-z]{2,4}$", $email))
I'm not clear on a couple elements of this expression.
what does this mean [^#]+#
What is the purpose of the parentheses in ([a-z0-9\-]+\.)?
[^#]+# means:
[ - Match this group of characters
^# - Anything that is NOT an at sign
]
+ - One or more times
# - Match an at sign
So, it's essentially matching every character before the first at sign.
The purpose of parenthesis in ([a-z0-9-]+.) is to create a capturing group, which you should be able to reference later on once the group captures some amount of text.
Also note that ereg_* functions are deprecated, and your book must be a bit dated. Nowadays, we use the preg_* family of functions. A tutorial on converting them can be found in this SO question.

Categories