Regex: Match start of string after (*SKIP)(*F) - php

The expression <[^>]*>(*SKIP)(*F)|(\/|\s|^|\()(Dakota Ridge.*?)(,|\.|\s|\b|\)|<) matches Dakota Ridge in the string The Dakota Ridge Trail is open. as expected.
If I wrap Dakota Ridge Trail in HTML tags, however, the string is no longer matched: The <b>Dakota Ridge Trail</b> is open.
I thought the ^ alternative would assert that the string is anchored at the start since (*SKIP) prevents the engine from backtracking past that point but apparently it doesn't work that way.
How can I modify this expression to match if the string is anchored at the first position after a skipped and failed match?
Edit to clarify: The purpose of <[^>]*>(*SKIP)(*F) is to skip HTML tags that could potentially contain the pattern within.

Your regex does not match the second occurrence because the substring you want to match is preceded by a > that is consumed and discarded once SKIP-FAIL has done its job. At that point there is nothing for the (\/|\s|^|\() group to match before Dakota: the preceding char is not /, whitespace or (, and the position is not the start of the string (^ asserts the real start of the subject, not the point where the engine resumes after (*SKIP)).
Since you already have a \b word boundary in the trailing position, you may use it in the leading position, too, and further restrict the context with lookarounds (e.g. a lookbehind) if needed.
For the current scenario, the following will do:
<[^>]*>(*SKIP)(*F)|\b(Dakota Ridge.*?)\b
See the regex demo.
Details
<[^>]*>(*SKIP)(*F) - match <, then 0+ chars other than > and then a >, and discard the match keeping the regex index right at the end of the match
| - or
\b - a word boundary
(Dakota Ridge.*?) - Group 1: Dakota Ridge, and then any 0+ chars (other than line break chars), as few as possible, up to the first
\b - word boundary.
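The behaviour described above can be checked with a short PHP sketch. The helper name findOutsideTags and the third sample string are made up for illustration; only the pattern comes from the answer.

```php
<?php
// Pattern from the answer: skip whole tags, then match on word boundaries.
$pattern = '~<[^>]*>(*SKIP)(*F)|\b(Dakota Ridge.*?)\b~';

// Hypothetical helper: returns every Group 1 capture found outside of tags.
function findOutsideTags(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $matches);
    return $matches[1];
}

$samples = [
    'The Dakota Ridge Trail is open.',
    'The <b>Dakota Ridge Trail</b> is open.',
    'The <a title="Dakota Ridge">trail</a> is open.', // text only inside a tag
];

foreach ($samples as $sample) {
    echo json_encode(findOutsideTags($pattern, $sample)), PHP_EOL;
}
```

The first two samples should each yield a single Dakota Ridge capture, while the third yields nothing because the only occurrence sits inside a skipped tag.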

Related

PHP regex: zero or more whitespace not working

I'm trying to apply a regex constraint to a Symfony form input. The requirement for the input is that the start of the string and all commas must be followed by zero or more whitespace, then a # or @ symbol, except when it's the empty string.
As far as I can tell, there is no way to tell the constraint to use preg_match_all instead of just preg_match, but it does have the ability to negate the match. So, I need a regular expression that preg_match will NOT MATCH for the given scenario: any string containing the start of the string or a comma, followed by zero or more whitespace, followed by any character that is not a # or @ and is not the end of the string, but will match for everything else. Here are a few examples:
preg_match(..., ''); // No match
preg_match(..., '#yolo'); // No match
preg_match(..., '#yolo, #swag'); // No match
preg_match(..., '#yolo,#swag'); // No match
preg_match(..., '#yolo, #swag,'); // No match
preg_match(..., 'yolo'); // Match
preg_match(..., 'swag,#yolo'); // Match
preg_match(..., '#swag, yolo'); // Match
I would've thought for sure that /(^|,)\s*[^#@]/ would work, but it's failing in every case with 1 or more spaces and it appears to be because of the asterisk. If I get rid of the asterisk, preg_match('/(^|,)\s[^#@]/', '#yolo, #swag') does not match (as desired) when there's exactly one space, but as soon as I reintroduce the asterisk it breaks for any quantity of spaces > 0.
My theory is that the regex engine is interpreting the second space as a character that is not in the character set [#@], but that's just a theory and I don't know what to do about it. I know that I could create a custom constraint to use preg_match_all instead to get around this, but I'd like to avoid that if possible.
You may use
'~(?:^|,)\s*+[^#@]~'
Here, the + after \s* turns it into the *+ possessive quantifier: it matches 0 or more whitespace chars and does not let the regex engine backtrack into the \s* pattern if [^#@] cannot match the subsequent char.
See the regex demo.
Details
(?:^|,) - either start of string or ,
\s*+ - zero or more whitespace chars, possessively matched (i.e. if the next char is not matched with [^#@] pattern, the whole pattern match will fail)
[^#@] - a negated character class matching any char but # and @.
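A minimal sketch of the difference between the greedy and the possessive quantifier, run against the inputs from the question. It assumes the two allowed symbols are # and @; the helper name violatesRule is hypothetical.

```php
<?php
// Assumption: the two allowed leading symbols are "#" and "@".
$greedy     = '~(?:^|,)\s*[^#@]~';  // \s* may give a space back to [^#@]
$possessive = '~(?:^|,)\s*+[^#@]~'; // \s*+ never gives anything back

// Hypothetical helper: true when the input breaks the "#"/"@" rule.
function violatesRule(string $pattern, string $input): bool
{
    return preg_match($pattern, $input) === 1;
}

$inputs = ['', '#yolo', '#yolo, #swag', '#yolo,#swag', '#yolo, #swag,',
           'yolo', 'swag,#yolo', '#swag, yolo'];

foreach ($inputs as $input) {
    printf(
        "%-16s greedy=%s possessive=%s\n",
        var_export($input, true),
        var_export(violatesRule($greedy, $input), true),
        var_export(violatesRule($possessive, $input), true)
    );
}
```

With the greedy version, '#yolo, #swag' is reported as a violation because \s* backtracks to zero width and [^#@] then matches the space itself; the possessive version should reject that path.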

Regexp for matching all @property in PHPDoc

I am currently writing a regexp to extract all @property content from the PHPDoc above a class. I have no fields defined in the class because I am using the magic __get() method.
I have made the (?<=@property)(.+(?=))((?<= ).+) regexp (Live example)
But I can't cut out the spaces around the first group, so if I have something like:
* @property string external_order_id
With the regex I presented above I will get:
Full match string external_order_id
Group 1. string (notice spaces around word)
Group 2. external_order_id (there are no unnecessary spaces)
I tried to use (?= ) inside the first group (.+(?=)) but it stops the regex after that match.
Can you help me with that? I would be thankful.
Since properties with no type may exist you have to do this:
(?m)@property *\K(?>(\S+) *)??(\S+)$
(?m) sets multiline flag
(?>...) constructs an atomic group that's used for its non-capturing purpose.
?? ungreedy optional quantifier (you may use ? alone however)
$ asserts end of line
Live demo
Now that you are familiar with the end-of-line assertion, you can use \s with no worries:
(?m)@property\s*\K(?>(\S+)\s*)?(\S+)$
The whitespace lands in Group 1 since the .+ captures those spaces that appear right after the @property substring (the lookbehind (?<=@property) finds a location right after @property, and . matches any chars but line break chars).
Besides, (?=) is redundant as it requires an empty string right after the current location, i.e. "does nothing".
You may use
(?m)@property\h*\K(?:(\S+)\h+)?(\S+)$
See the regex demo.
Pattern details
(?m) - a multiline modifier making $ match the end of a line rather than a whole string
@property - a literal substring
\h* - 0+ horizontal whitespaces
\K - a match reset operator that discards all text matched so far from the Group 0 buffer
(?: - matches a sequence of patterns:
(\S+) - Group 1: one or more non-whitespace chars
\h+ - 1+ horizontal whitespaces
)? - 1 or 0 times
(\S+) - Group 2: one or more non-whitespace chars
$ - end of a line.
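As a rough illustration of the pattern breakdown above, the following sketch runs the \h-based pattern over a small docblock. It assumes the tag is written @property, as in standard PHPDoc; the docblock content and the helper name extractProperties are invented for the example.

```php
<?php
// Hypothetical docblock; the last property has no type on purpose.
$docblock = <<<'DOC'
/**
 * @property string external_order_id
 * @property int $quantity
 * @property untyped
 */
DOC;

// Hypothetical helper: maps each property name to its type (null when absent).
function extractProperties(string $docblock): array
{
    $pattern = '~(?m)@property\h*\K(?:(\S+)\h+)?(\S+)$~';
    preg_match_all($pattern, $docblock, $matches, PREG_SET_ORDER);

    $properties = [];
    foreach ($matches as $match) {
        // An unmatched optional Group 1 comes back as an empty string.
        $properties[$match[2]] = $match[1] === '' ? null : $match[1];
    }
    return $properties;
}

var_export(extractProperties($docblock));
```

Because of \K, the full match ($match[0]) starts after @property and its trailing spaces, so neither group carries surrounding whitespace.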

Exact word conflict word with dash

Originally, I was just using a word boundary for exact word matching - https://regex101.com/r/M97FkV/4
Update 1:
1) Exact word match with punctuation inside the word, like 20-year-old
If I search year's, only the exact year's is matched
-- Searching year alone will not match year's
If I search 20-year-old, only the exact 20-year-old is matched
-- Searching 20 or year or old will not match 20-year-old
2) Exact word match before or after punctuation
If I search old, the exact word matches, also when punctuation comes before or after it: old .old old. -old old- _old old_ old' 'old will all match.
-- old will not match inside a word with punctuation in it, such as 20-year-old.
Our latest progress
https://regex101.com/r/M97FkV/15 - solves (2) but not (1)
https://regex101.com/r/M97FkV/16 - solves (1) but not (2)
Including case-insensitivity and unicode curly single quotes...
Pattern: /(?:^|\s)[-_,'’.]*\Kold(?=[-_,'’.]*(?:\s|$))/ui
Replace: young
Demo: https://regex101.com/r/M97FkV/20
This input: old 20-year-old _old-maid _old- -old old-’’’ 'old' 20-year-old old’ ....old
Will become: young 20-year-old _old-maid _young- -young young-’’’ 'young' 20-year-old young’ ....young
(?:^|\s) matches the start of the string or a white space character.
[-_,'’.]* matches zero or more of the characters in the character class (list)
\K restarts the fullstring match. This is done to avoid using a capture group and, more importantly, a "variable-width look-behind", which PHP doesn't allow.
old is the literal string that is being searched for. You can apply your variable in this position.
(?=[-_,'’.]*(?:\s|$)) is a two-part look-ahead expression. It matches zero or more of the characters in the character class (list) then requires either a white-space character or the end of the string.
All of this convolution is done to match a targeted substring that has optional leading and/or trailing punctuation, but beyond that NO non-white-space characters.
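The replacement described above could be wrapped as follows. This is only a sketch: the function name replaceExactWord is hypothetical, and preg_quote is added on the assumption that the search word may come from a variable containing regex metacharacters.

```php
<?php
// Hypothetical wrapper around the pattern from the answer.
function replaceExactWord(string $word, string $replacement, string $text): string
{
    $punct = '[-_,\'’.]*'; // optional leading/trailing punctuation from the answer
    $pattern = '/(?:^|\s)' . $punct . '\K' . preg_quote($word, '/')
             . '(?=' . $punct . '(?:\s|$))/ui';
    return preg_replace($pattern, $replacement, $text);
}

$input = "old 20-year-old _old-maid _old- -old old-’’’ 'old' 20-year-old old’ ....old";
echo replaceExactWord('old', 'young', $input), PHP_EOL;
```

Since the trailing punctuation and whitespace are only inspected by the look-ahead, the space between two neighbouring words stays available as the leading \s of the next match.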

regex to match of the occurrence for either "this" or "that" at least twice in a sentence

I want to create a regex in PHP that finds the sentences in a text which contain "this" or "that" at least twice (so at least twice "this" or at least twice "that").
We got stuck at:
([^.?!]*(\bthis|that\b){2,}[^.?!]*[.|!|?]+)
Use this Pattern (\b(?:this|that)\b).*?\1 Demo
( # Capturing Group (1)
\b # <word boundary>
(?: # Non Capturing Group
this # "this"
| # OR
that # "that"
) # End of Non Capturing Group
\b # <word boundary>
) # End of Capturing Group (1)
. # Any character except line break
*? # (zero or more)(lazy)
\1 # Back reference to group (1)
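A small sketch of how the backreference in this pattern behaves. The two sample sentences are made up; the second one illustrates that \1 is not followed by a word boundary here, so the repeated text may be found inside a longer word.

```php
<?php
// Pattern from the breakdown above (case-sensitive, no sentence limits).
$pattern = '/(\b(?:this|that)\b).*?\1/';

// Hypothetical helper: returns the full match or null.
function firstRepeat(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

var_dump(firstRepeat($pattern, 'Keep this and drop this.'));   // whole word repeated
var_dump(firstRepeat($pattern, 'Take this to the thistles.')); // repeat found inside "thistles"
```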
This is mostly Wiktor's pattern with a deviation to isolate the sentences and omit the leading white-space characters from the fullstring matches.
Pattern: /\b[^.?!]*\b(th(?:is|at))\b[^.?!]*(\b\1\b)[^.?!]*\b[.!?]/i
Here is a sample text that will demonstrate how the other answers will not correctly disqualify unwanted matches for "word boundary" or "case-insensitive" reasons: (Demo - capture group applied to \b\1\b in the demo to show which substrings are qualifying the sentences for matching)
This is nothing.
That is what that will be.
The Indian policeman hit the thief with his lathis before pushing him into the thistles.
This Indian policeman hit the thief with this lathis before pushing him into the thistles. This is that and that.
The Indian policeman hit the thief with this lathis before pushing him into the thistles.
To see the official breakdown of the pattern, refer to the demo link.
In plain terms:
/ #start of pattern
\b #match start of a sentence on a "word character"
[^.?!]* #match zero or more characters not a dot, question mark, or exclamation
\b(th(?:is|at))\b #match whole word "this" or "that" (not thistle)
[^.?!]* #match zero or more characters not a dot, question mark, or exclamation
\b\1\b #match the earlier captured whole word "this" or "that"
[^.?!]* #match zero or more characters not a dot, question mark, or exclamation
\b #match second last character of sentence as "word character"
[.!?] #match the end of a sentence: dot, question mark, exclamation
/ #end of pattern
i #make pattern case-insensitive
The pattern will match three of the five sentences from the above sample text:
That is what that will be.
This Indian policeman hit the thief with this lathis before pushing him into the thistles.
This is that and that.
*note, previously I was using \s*\K at the start of my pattern to omit the white-space characters. I've elected to alter my pattern to use additional word boundary meta-characters for improved efficiency. If this doesn't work with your project text, it may be better to revert to my original pattern.
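To try the sentence-level pattern, a sketch along these lines may help. It joins the sample lines above with newlines (an assumption about the original layout) and uses a hypothetical helper name, matchingSentences.

```php
<?php
// Sample text from above, one entry per line of the sample.
$text = implode("\n", [
    'This is nothing.',
    'That is what that will be.',
    'The Indian policeman hit the thief with his lathis before pushing him into the thistles.',
    'This Indian policeman hit the thief with this lathis before pushing him into the thistles. This is that and that.',
    'The Indian policeman hit the thief with this lathis before pushing him into the thistles.',
]);

// Hypothetical helper: returns every sentence that repeats "this" or "that" as a whole word.
function matchingSentences(string $text): array
{
    $pattern = '/\b[^.?!]*\b(th(?:is|at))\b[^.?!]*(\b\1\b)[^.?!]*\b[.!?]/i';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

print_r(matchingSentences($text));
```

Words such as lathis and thistles should not qualify a sentence, because the word boundaries around the captured word and around \1 reject partial-word hits.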
Use this
.*(this|that).*(this|that).*
http://regexr.com/3ggq5
UPDATE:
This is another way, based on your regex:
.*(this\s?|that\s?){2,}.*[\.\n]*
http://regexr.com/3ggq8

PHP Regex display either abc or abc xyz format

I am trying to build regex for the expression to get values for either Boost Mobile or BoostMobile whichever is present.
Any suggestions, please?
In NFA regexes, in unanchored alternation groups, the first branch that matches stops the group processing; the other branches further to the right are not checked against the string. You may read more on that at Alternation with The Vertical Bar or Pipe Symbol.
So, swapping the values and simplifying the pattern, you could use
/\b(Boost\s*Mobile|Boost)\b/i
However, the most effective way here is through using an optional group:
/\bBoost(?:\s*Mobile)?\b/i
See the regex demo
The i case insensitive modifier is set on the whole regex. You need not switch it on and off at the beginning/end of the pattern. Also, \W* can match an empty string, so your way of checking a word boundary may fail here when \b will work.
Pattern details:
\b - leading word boundary
Boost - a literal substring
(?:\s*Mobile)? - an optional group matching 1 or 0 sequences of
\s* - 0+ whitespaces
Mobile - a literal substring
\b - trailing word boundary
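The effect of branch order, and the optional-group alternative, can be compared with a sketch like the one below. The $shortFirst pattern is a hypothetical stand-in for the "wrong" branch order, since the question's original code is not shown; firstMatch is likewise an invented helper.

```php
<?php
$shortFirst = '/\b(Boost|Boost\s*Mobile)\b/i'; // hypothetical: short branch listed first
$longFirst  = '/\b(Boost\s*Mobile|Boost)\b/i'; // branches swapped, as suggested above
$optional   = '/\bBoost(?:\s*Mobile)?\b/i';    // optional group, as suggested above

// Hypothetical helper: returns the first full match or null.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

foreach (['Get Boost Mobile today', 'Get BoostMobile today', 'Get boost today'] as $subject) {
    printf(
        "%s | short first: %s | long first: %s | optional: %s\n",
        $subject,
        var_export(firstMatch($shortFirst, $subject), true),
        var_export(firstMatch($longFirst, $subject), true),
        var_export(firstMatch($optional, $subject), true)
    );
}
```

With the short branch first, 'Boost Mobile' stops at Boost because that branch already satisfies the trailing \b; the other two patterns should return the longer value.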
