Regex: skip a match if it is followed by whitespace and a keyword (PHP)

I am trying to match comments with a regex, but only if no function follows.
Currently I use a regex that also matches the keyword function.
I then check in the source code (PHP) whether this group is set or not.
/\/\*\*.*?\*\/\s*(function)?/sg
https://regex101.com/r/l0j1ip/1
Now the question is whether this can be done with pure regex.
I tried a simple negative lookahead, but without success.
The comment is then no longer matched on its own, but only together with the subsequent comment.
/\/\*\*.*?\*\/\s*(?!function)/sg
https://regex101.com/r/PuUUw6/1
Next I tried a non-capturing group, also without success.
/(?:\/\*\*.*?\*\/\s*function)|\/\*\*.*?\*\/\s*/sg
https://regex101.com/r/wkQE7E/1
Following a comment that pointed me to (*SKIP)(*FAIL), I tried that as well, without success.
All matches above this keyword are skipped, including the standalone comments.
/\/\*\*.*?\*\/\s*function(*SKIP)(*FAIL)|\/\*\*.*?\*\//sg
https://regex101.com/r/OJSFrF/1

After reading the question again: this should be doable with a negative lookahead, but the whitespace repetition (\s*) must be inside the negative lookahead:
/\/\*\*((?!\*\/).)*\*\/(?!\s*function)/sg
It helps to understand how backtracking works here. With .*? instead of .*, the regex engine first tries to match what follows .*? before consuming more characters. However, when the negative lookahead fails, the engine backtracks and .*? goes on consuming, past the first */ and up to a later one. ((?!\*\/).)* can never match across \*\/, whereas .*? can after backtracking.
Another solution is to use an atomic group: (?>\/\*\*.*?\*\/)(?!\s*function).
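A minimal PHP sketch of the difference between the three variants. The sample source string and the variable names are made up for illustration and are not taken from the question. PHP has no g flag; preg_match_all plays that role.

```php
<?php
// Illustrative input: two standalone doc comments and one that precedes a function.
$source = <<<'PHP'
/** standalone */
$a = 1;

/**
 * Doc for foo
 */
function foo() {}

/** trailing */
PHP;

// Lazy dot: when the lookahead fails, .*? keeps consuming up to a later */.
preg_match_all('~/\*\*.*?\*/(?!\s*function)~s', $source, $lazy);

// Tempered dot: ((?!\*/).)* can never run across a */.
preg_match_all('~/\*\*((?!\*/).)*\*/(?!\s*function)~s', $source, $tempered);

// Atomic group: once the comment is matched, the engine may not backtrack into it.
preg_match_all('~(?>/\*\*.*?\*/)(?!\s*function)~s', $source, $atomic);

echo count($lazy[0]), " matches with the lazy dot\n";
print_r($tempered[0]);
print_r($atomic[0]);
```

With this input, the lazy variant merges the function's comment with the trailing one into a single match, while the tempered and atomic variants return only the two standalone comments.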

Another option without the /s flag could be
/\*\*(?:[^*]*+|\*(?!/)[^*]*+)*\*/(?!\s*function)
The pattern matches:
/\*\* Match /**
(?: Non capture group
[^*]*+ Match any char except * using a possessive quantifier
| Or
\*(?!/) Match * not followed by /
[^*]*+ Match any char except * using a possessive quantifier
)* Close non capture group and optionally repeat
\*/ Match */
(?!\s*function) Negative lookahead, assert that what is to the right is not optional whitespace chars followed by function
Regex demo
Note that you don't have to escape the forward slash when using a different delimiter.
$regex = '~/\*\*(?:[^*]*+|\*(?!/)[^*]*+)*\*/(?!\s*function)~';
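As a usage sketch, the pattern can be passed to preg_match_all. The input string below is an assumption made up for this illustration.

```php
<?php
$regex = '~/\*\*(?:[^*]*+|\*(?!/)[^*]*+)*\*/(?!\s*function)~';

// Illustrative input, not taken from the question:
// one standalone comment and one comment directly in front of a function.
$source = "/** keep me */\n\$x = 1;\n/**\n * skip me\n */\nfunction foo() {}\n";

preg_match_all($regex, $source, $matches);
print_r($matches[0]);
```

Because [^*] also matches newlines, no /s flag is needed for multi-line comments.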

Related

Why is non-greedy match consuming entire pattern even when followed by another non-greedy match

Using PHP8, I'm struggling to figure out how to conditionally match some key that may or may not appear in a string.
I would like to match both
-----------key=xyz---------------
AND
--------------------------
The dashes ("-") could be any non-space character; they are only used here to make the example easier to read.
The regex matches "key=..." if its containing group is required, as below.
But this isn't adequate, because the full match fails if "key=xyz" is missing from the subject string.
/
(\S*)?
(key\=(?<foundkey>[[:alnum:]-]*))
\S*
/x
If that capture group is made optional with ?, then the regex just ignores any "key=xyz":
/
(\S*)?
(key\=(?<foundkey>[[:alnum:]-]*))?
\S*
/x
I tried debugging in this regex101 example but couldn't figure it out.
I sorted this out using multiple regexes, but I am hoping someone can help address my misunderstandings so I learn how to make this work as a single regex.
Thanks
You may use:
/
^
\S*?
(?:
key=(?<foundkey>\w+)
\S*
)?
$
/xm
RegEx Demo
RegEx Breakdown:
^: Start
\S*?: Match 0 or more non-whitespace characters, non-greedy
(?:: Start non-capturing group
key=(?<foundkey>\w+): Match key= text followed by 1+ word characters as capture group foundkey
\S*: Match 0 or more non-whitespace characters
)?: End non-capturing group. ? makes it an optional match
$: End
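A short PHP sketch of how this pattern behaves on the two subjects from the question. The variable names are illustrative assumptions.

```php
<?php
$re = '/
    ^
    \S*?
    (?:
        key=(?<foundkey>\w+)
        \S*
    )?
    $
/xm';

// Sample subjects modelled on the question.
$subjects = ['-----------key=xyz---------------', '--------------------------'];

$found = [];
foreach ($subjects as $subject) {
    if (preg_match($re, $subject, $m)) {
        // Fall back to null when the optional group did not take part in the match.
        $found[] = $m['foundkey'] ?? null;
    }
}
var_dump($found);
```

Both subjects match as a whole; only the first one yields a value for foundkey.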

Using Regex to detect if string exists

I need to use PHP's preg_match() and Regex to detect the following conditions:
If a URL path is one of the following:
products/new items
new items/products
new items/products/brand name
Do something...
I can't seem to figure out how to check whether a string exists before or after the word products. The closest I can get is:
if (preg_match("~([a-zA-Z0-9_ ]+)/products/([a-zA-Z0-9_ ]+)~", $url_path)) {
    // Do something
}
Would anyone know a way to check if the first part of the string exists within the one regex line?
You could use an alternation with an optional group for the last item making the / part of the optional group.
If you are only looking for a match, you can omit the capturing groups.
(?:[a-zA-Z0-9_ ]+/products(?:/[a-zA-Z0-9_ ]+)?|products/[a-zA-Z0-9_ ]+)
Explanation
(?: Non-capturing group
[a-zA-Z0-9_ ]+/products Match 1+ times what is listed in the character class, / followed by products
(?:/[a-zA-Z0-9_ ]+)? Optionally match / and what is listed in the character class
| Or
products/[a-zA-Z0-9_ ]+ Match products/, match 1+ times what is listed
) Close group
Regex demo
Note that [a-zA-Z0-9_ ]+ might be shortened to [\w ]+
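A minimal sketch of using this alternation with preg_match. The helper name isProductPath and the two non-matching sample paths are assumptions added for illustration.

```php
<?php
// The alternation from above, wrapped in ~ delimiters so / needs no escaping.
$re = '~(?:[a-zA-Z0-9_ ]+/products(?:/[a-zA-Z0-9_ ]+)?|products/[a-zA-Z0-9_ ]+)~';

// Hypothetical helper, used only for this illustration.
function isProductPath(string $urlPath, string $re): bool
{
    return preg_match($re, $urlPath) === 1;
}

$paths = [
    'products/new items',
    'new items/products',
    'new items/products/brand name',
    'products',       // nothing before or after: should not match
    'about/contact',  // no products segment: should not match
];

foreach ($paths as $path) {
    echo $path, ' => ', isProductPath($path, $re) ? 'match' : 'no match', "\n";
}
```

The pattern is unanchored here, as in the answer; if the whole path has to match, adding ^ and $ around the group would be one way to enforce that.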
You can use alternation
([\w ]+)\/products|products\/([\w ]+)
Regex Demo
Note: I am not sure how you are using the matched values. If you don't need a backreference to any specific value, you can drop the capturing groups, i.e.
[\w ]+\/products|products\/[\w ]+

PHP regex - match everything but not exactly one or more word

I am trying to match any string that is not exactly one of a few given words.
My pattern:
(?!(^ignoreme$)|(^ignoreme2$))
I am looking for:
ignoreme - no
ignoreme2 - no
ignoremex - match
ignorem - match
gnoreme - match
ignoreme22 - match
But it returns many empty matches. How can I do that? Thanks.
https://regex101.com/r/u4EsNv/1
You may use this corrected regex:
^(?!ignoreme2?$).*$
Updated RegEx Demo
RegEx Details:
^: Start
(?!ignoreme2?$): Negative lookahead that fails the match when ignoreme or ignoreme2 is all that remains up to the end.
.*: Match 0 or more of any characters
$: End
Note that the regex (?!(^ignoreme$)|(^ignoreme2$)) still matches in the first 2 (invalid) cases because the ^ is inside the negative lookahead rather than outside it. This lets the regex engine start a match after the 1st character, where the lookahead assertion is satisfied. (You can see that in the matches highlighted on regex101.)
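A quick PHP check of the corrected regex against the six inputs from the question. Using preg_grep for the filtering is just one convenient choice.

```php
<?php
$re = '/^(?!ignoreme2?$).*$/';

// The six test strings listed in the question.
$inputs = ['ignoreme', 'ignoreme2', 'ignoremex', 'ignorem', 'gnoreme', 'ignoreme22'];

// preg_grep keeps the entries that match; array_values reindexes the result.
$matched = array_values(preg_grep($re, $inputs));
print_r($matched);
```

Only ignoreme and ignoreme2 are filtered out.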

Match multiple times a group only in single regex

Hi my question is simple:
I want to match all the possible hashtags in an article only if they are in a <figcaption> with PCRE regex. E.g:
<figcaption>blah blah #hashtag1, #hashtag2</figcaption>
I made an attempt here https://regex101.com/r/aL9vS8/1 and removing the last ? would change the capture from #hashtag1 to #hashtag2 but can't get both.
I am not even sure it is doable in one single regex in PHP.
Any idea to help me? :)
If there is no way in one single regex (really? even working with recursion (?R)?? :p), please suggest the most efficient way possible performance wise.
Thank you!
[EDIT]
If there is no way, my PHP next idea is to:
Match every figcaption with preg_replace_callback
In the callback match every instance of #hashtag.
Can I get your opinions on this? Is there a better way? my articles are not very long.
Please suggest the most efficient way possible performance wise
The most reliable way to match some text in between some delimiters with PCRE regex is by using the custom boundaries with \G operator. However, the trailing boundary is a multicharacter string, and to match any text but the </figcaption> you'd need a tempered greedy token. Since this token is very resource consuming, it must be unrolled.
Here is a fast, reliable PCRE regex for your task:
(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+
See the regex demo
Details:
(?:<figcaption|(?!^)\G) - Matches <figcaption or the end of the previous successful match
More details: (?:<figcaption|(?!^)\G) is a non-capturing group ((?:...)) that only groups, without keeping track of what was matched (no value is stored for the group). It matches 2 alternatives (| is the alternation operator): 1) the literal text <figcaption or 2) (?!^)\G, the location after the previous successful match (note that \G also matches at the start of the string, so the negative lookahead (?!^) is added to exclude that behavior).
[^<#]* - 0+ chars other than < and #
(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)* - 0+ sequences of:
(?:<(?!\/figcaption>)|#\B) - a < not followed with /figcaption> or # not followed with a word char
[^<#]* - 0+ chars other than < and #
\K - omit the text matched so far
#\w+ - # and 1+ word chars
Even more details:
\K:
The escape sequence \K causes any previously matched characters not to be included in the final matched sequence. For example, the pattern:
foo\Kbar
matches foobar, but reports that it has matched bar. This feature is similar to a lookbehind assertion.
(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*: Here, we have an outer non-capturing group (?:...)* that lets a sequence of subpatterns match zero or more times (a quantifier such as * can only repeat a sequence of subpatterns when they are grouped). The inner non-capturing group in (?:<(?!\/figcaption>)|#\B)[^<#]* is just a way to shorten the longer <(?!\/figcaption>)[^<#]*|#\B[^<#]*, grouping the 2 alternatives <(?!\/figcaption>) and #\B before the common "suffix" [^<#]*.
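The \K behavior described above can be tried in isolation with a small sketch; the replacement text BAZ is an arbitrary choice for illustration.

```php
<?php
// \K resets the start of the reported match: "foo" must be present but is not returned.
preg_match('~foo\Kbar~', 'foobar', $m);
echo $m[0], "\n";

// In a replacement, only the part matched after \K is replaced.
$replaced = preg_replace('~foo\Kbar~', 'BAZ', 'foobar');
echo $replaced, "\n";
```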
Wrapping in a tag: just use preg_replace with the <span class="highlight">$0</span> replacement pattern:
Code:
$re = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';
$str = "<figcaption>blah # blah #hashtag1, #hashtag2</figcaption> #ee <figcaption>#ddddd";
$subst = "<span class=\"highlight\">$0</span>";
$result = preg_replace($re, $subst, $str);
echo $result;
See the PHP IDEONE demo
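For comparison, the two-step idea from the question's edit (match every figcaption with preg_replace_callback, then match the hashtags in the callback) could look roughly like this. This is only a sketch: the helper name highlightCaptionHashtags and the sample string are assumptions, and unlike the \G pattern it only handles captions that have a closing </figcaption> tag.

```php
<?php
// Hypothetical helper: wrap hashtags in a span, but only inside <figcaption> elements.
function highlightCaptionHashtags(string $html): string
{
    $result = preg_replace_callback(
        // Outer pass: find each complete figcaption element.
        '~<figcaption\b[^>]*>.*?</figcaption>~s',
        // Inner pass: wrap every hashtag found inside this one caption.
        fn (array $m): string => preg_replace('~#\w+~', '<span class="highlight">$0</span>', $m[0]),
        $html
    );

    // preg_replace_callback returns null on error; fall back to the input then.
    return $result ?? $html;
}

$str = "<figcaption>blah # blah #hashtag1, #hashtag2</figcaption> #ee";
$highlighted = highlightCaptionHashtags($str);
echo $highlighted, "\n";
```

The lone # and the #ee outside the caption are left untouched.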

Regex group include if condition

I tried to use the regex /^(\S+)(?:\?$|$)/
with yolo and yolo?
It works with both, but for the second string (yolo?) the ? is included in the capturing group (\S+).
Is this a regex bug, or have I made a mistake?
Edit: I don't want the '?' included in the capturing group. Sorry for my bad English.
If what you want to capture can't have a ? in it, use a negated character class [^...] (see demo here):
^([^\s?]+)\??$
If what you want to capture can have ? in it (for example, yolo?yolo? and you want yolo?yolo), you need to make your quantifier + lazy by adding ? (see demo here):
^(\S+?)\??$
There is BTW no need for a capturing group here, you can use a look ahead (?=...) instead and look at the whole match (see demo here):
^[^\s?]+(?=\??$)
What was happening
The rules are: quantifiers (like +) are greedy by default, and the regex engine will return the first match it finds.
Consider what this means here:
\S+ will first match everything in yolo?, then the engine will try to match (?:\?$|$).
\?$ fails (we're already at the end of the string, so we now try to match an empty string and there's no ? left), but $ matches.
The regex has successfully reached its end, so the engine returns the match where \S+ has matched the whole string and everything is in the first capturing group.
To match what you want you have to make the quantifier lazy (+?), or prevent the character class (yeah, \S is a character class) from matching your ending delimiter ? (with [^\s?] for example).
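The variants discussed above can be compared side by side in PHP. The helper firstGroup is a hypothetical name introduced only for this sketch.

```php
<?php
// Hypothetical helper: return capture group 1 (or the whole match if there is
// no group), or null when the pattern does not match at all.
function firstGroup(string $re, string $subject): ?string
{
    return preg_match($re, $subject, $m) ? ($m[1] ?? $m[0]) : null;
}

// Greedy \S+ from the question: the ? ends up inside the group.
var_dump(firstGroup('/^(\S+)(?:\?$|$)/', 'yolo?'));

// Negated character class: the group cannot contain ?.
var_dump(firstGroup('/^([^\s?]+)\??$/', 'yolo?'));

// Lazy quantifier: inner ? characters are kept, the trailing one is left out.
var_dump(firstGroup('/^(\S+?)\??$/', 'yolo?yolo?'));

// Lookahead variant: no capturing group, the whole match is the result.
var_dump(firstGroup('/^[^\s?]+(?=\??$)/', 'yolo?'));
```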
This is the correct behavior, as \S+ greedily matches one or more non-whitespace characters, of which ? is one.
Thus the question mark is matched in the (\S+) group and the non-capturing group resolves to $. You could make it work as you expect by making the match non-greedy with:
/^(\S+?)(?:\?$|$)/
demo
Alternatively, you could restrict the character class:
/^([^\s?]+)(?:\?$|$)/
demo
Make the + non greedy:
^(\S+?)\??$
The regex below captures all the non-space characters, followed by an optional ?. Note that \S also matches ?, so on yolo? the greedy group still captures the trailing ?:
^([\S]+)\??$
DEMO
OR
^([\w]+)\??$
DEMO
If you use \S+, it also matches the ? character. So to separate word and non-word characters, you could use the \w version above. It captures only the word characters and then matches the optional ? that follows them.
It is doing that because \S matches any non-white space character and it is being greedy.
Following the + quantifier with ? for a non-greedy match will prevent this.
^(\S+?)\??$
Or use \w here which matches any word character.
^(\w+)\??$
