Match multiple times a group only in single regex - php

Hi my question is simple:
I want to match all the possible hashtags in an article only if they are in a <figcaption> with PCRE regex. E.g:
<figcaption>blah blah #hashtag1, #hashtag2</figcaption>
I made an attempt here https://regex101.com/r/aL9vS8/1; removing the last ? changes the capture from #hashtag1 to #hashtag2, but I can't get both.
I am not even sure it is doable in one single regex in PHP.
Any idea to help me? :)
If there is no way in one single regex (really? even working with recursion (?R)?? :p), please suggest the most efficient way possible performance wise.
Thank you!
[EDIT]
If there is no way, my next idea in PHP is to:
Match every figcaption with preg_replace_callback.
In the callback, match every instance of #hashtag.
Can I get your opinions on this? Is there a better way? My articles are not very long.
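For reference, the two-step idea described above could look roughly like the sketch below. The function name highlightFigcaptionHashtags and the highlight span are assumptions made for illustration; only preg_replace_callback and preg_replace are standard PHP.

```php
<?php
// Hypothetical helper (not from the original post): wraps every #hashtag
// that occurs inside a <figcaption> element in a highlight span.
function highlightFigcaptionHashtags(string $html): string
{
    $result = preg_replace_callback(
        // Step 1: match each complete <figcaption ...>...</figcaption> element.
        '~<figcaption\b[^>]*>.*?</figcaption>~s',
        static function (array $m): string {
            // Step 2: wrap the hashtags found inside this one element.
            return preg_replace('~#\w+~', '<span class="highlight">$0</span>', $m[0]) ?? $m[0];
        },
        $html
    );
    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}

$html = '<p>#outside</p><figcaption>blah #hashtag1, #hashtag2</figcaption>';
echo highlightFigcaptionHashtags($html), "\n";
```

Note that this simple version would also wrap a # that appears inside the opening tag's attributes or in a numeric entity such as &#39;, so it is only a starting point.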

Please suggest the most efficient way possible performance wise
The most reliable way to match some text between delimiters with a PCRE regex is to use custom boundaries with the \G operator. However, the trailing boundary here is a multicharacter string, so to match any text other than </figcaption> you'd need a tempered greedy token. Since this token is very resource consuming, it must be unrolled.
Here is a fast, reliable PCRE regex for your task:
(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+
See the regex demo
Details:
(?:<figcaption|(?!^)\G) - Matches <figcaption or the end of the previous successful match
More details: (?:<figcaption|(?!^)\G) is a non-capturing group ((?:...)) that only groups its alternatives without storing what it matched (no capture is kept). It matches one of 2 alternatives (| is the alternation operator): 1) the literal text <figcaption or 2) (?!^)\G, the location right after the previous successful match (note that \G also matches at the start of the string, so the negative lookahead (?!^) is added to exclude that behavior).
[^<#]* - 0+ chars other than < and #
(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)* - 0+ sequences of:
(?:<(?!\/figcaption>)|#\B) - a < not followed with /figcaption> or # not followed with a word char
[^<#]* - 0+ chars other than < and #
\K - omit the text matched so far
#\w+ - # and 1+ word chars
Even more details:
\K:
The escape sequence \K causes any previously matched characters not to be included in the final matched sequence. For example, the pattern:
foo\Kbar
matches foobar, but reports that it has matched bar. This feature is similar to a lookbehind assertion.
(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*: Here, the outer non-capturing group (?:...)* lets us match a sequence of subpatterns zero or more times (a quantifier such as * can only apply to a sequence of subpatterns if they are grouped). The inner non-capturing group in (?:<(?!\/figcaption>)|#\B)[^<#]* is just a way to shorten the longer <(?!\/figcaption>)[^<#]*|#\B[^<#]*: it groups the 2 alternatives <(?!\/figcaption>) and #\B before their common "suffix" [^<#]*.
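To see the pieces explained above in action, here is a small sketch that first checks the \K behavior on the foo\Kbar example and then collects the hashtags with preg_match_all instead of replacing them. The variable names are illustrative only.

```php
<?php
// \K demo: the engine matches "foobar" but reports only "bar".
preg_match('~foo\Kbar~', 'foobar', $k);
echo $k[0], "\n";

// Same pattern as in the answer; \G continues from the previous match,
// so only hashtags that follow a "<figcaption" are collected.
$re = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';
$str = '<figcaption>blah # blah #hashtag1, #hashtag2</figcaption> #ee <figcaption>#ddddd';

preg_match_all($re, $str, $matches);
print_r($matches[0]);
```

With this input the list holds #hashtag1, #hashtag2 and #ddddd. The lone # and the #ee outside any figcaption are skipped, while #ddddd is picked up because the pattern only requires an opening <figcaption, not a closing tag.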
Wrapping in a tag: just use preg_replace with the <span class="highlight">$0</span> replacement pattern:
Code:
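// Match each #hashtag that follows a <figcaption (see the pattern details above).
$re = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';
$str = "<figcaption>blah # blah #hashtag1, #hashtag2</figcaption> #ee <figcaption>#ddddd";
// $0 in the replacement stands for the whole match, i.e. the hashtag itself.
$subst = '<span class="highlight">$0</span>';
$result = preg_replace($re, $subst, $str);
echo $result;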
See the PHP IDEONE demo

Related

regex skip match if its follows by whitespace and a keyword

Currently trying to match comments with regexes but only if no function follows.
Currently I use a regex which also matches the keyword function.
And then check in the source code (php) if this group is set or not.
/\/\*\*.*?\*\/\s*(function)?/sg
https://regex101.com/r/l0j1ip/1
Now the question is whether it is possible to realize with pure regex.
I have tried it with a simple negative lookahead but without success.
With it, the comment is no longer matched on its own, but only together with the subsequent comment.
/\/\*\*.*?\*\/\s*(?!function)/sg
https://regex101.com/r/PuUUw6/1
Next I tried non capture group. But also there without success.
/(?:\/\*\*.*?\*\/\s*function)|\/\*\*.*?\*\/\s*/sg
https://regex101.com/r/wkQE7E/1
After a comment suggested (*SKIP)(*FAIL), I tried that as well, without success.
All matches above this keyword are skipped, and so are the standalone matches.
/\/\*\*.*?\*\/\s*function(*SKIP)(*FAIL)|\/\*\*.*?\*\//sg
https://regex101.com/r/OJSFrF/1
After reading the question again, it should be doable using a negative lookahead; the repetition must be inside the negative expression:
/\/\*\*((?!\*\/).)*\*\/(?!\s*function)/sg
The difference comes down to backtracking. A lazy .*? first stops at the nearest \*\/, but when the negative lookahead then fails, the engine backtracks into .*? and lets it grow past that \*\/ up to a later one. ((?!\*\/).)* can never match across \*\/, whereas .*? can after backtracking.
Another solution is to use atomic group (?>\/\*\*.*?\*\/)(?!\s*function).
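The backtracking difference described above can be checked with a short sketch. The sample source string and the array keys are made up for illustration, and the patterns use ~ as the delimiter so the slashes need no escaping.

```php
<?php
// Two doc comments: the first is followed by a function, the second is not.
$src = "/** a */\nfunction foo() {}\n/** b */\n\$x = 1;\n";

$patterns = [
    // Lazy dot: can backtrack past the first */ when the lookahead fails.
    'lazy'     => '~/\*\*.*?\*/(?!\s*function)~s',
    // Tempered dot: can never run across a */.
    'tempered' => '~/\*\*((?!\*/).)*\*/(?!\s*function)~s',
    // Atomic group: the comment is matched once and never re-entered.
    'atomic'   => '~(?>/\*\*.*?\*/)(?!\s*function)~s',
];

$found = [];
foreach ($patterns as $name => $re) {
    preg_match_all($re, $src, $m);
    $found[$name] = $m[0];
    echo $name, ': ', json_encode($m[0], JSON_UNESCAPED_SLASHES), "\n";
}
```

The lazy variant returns one long match that runs from the first comment through the second, while the tempered and atomic variants return only the second comment.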
Another option without the /s flag could be
/\*\*(?:[^*]*+|\*(?!/)[^*]*+)*\*/(?!\s*function)
The pattern matches:
/\*\* Match /**
(?: Non capture group
[^*]*+ Match any char except * using a possessive quantifier
| Or
\*(?!/) Match * not followed by /
[^*]*+ Match any char except * using a possessive quantifier
)* Close non capture group and optionally repeat
\*/ Match */
(?!\s*function) Negative lookahead, assert that what is to the right is not optional whitespace chars followed by function
Regex demo
Note that you don't have to escape the forward slash when using a different delimiter.
$regex = '~/\*\*(?:[^*]*+|\*(?!/)[^*]*+)*\*/(?!\s*function)~';
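A minimal usage sketch for this pattern follows, assuming a made-up source string with one multi-line doc comment and one comment in front of a function. It illustrates that no /s flag is needed because [^*] already matches newlines.

```php
<?php
$regex = '~/\*\*(?:[^*]*+|\*(?!/)[^*]*+)*\*/(?!\s*function)~';

// Hypothetical sample: a multi-line doc comment before a statement,
// and a one-line doc comment before a function.
$src = "/**\n * Keep me.\n */\n\$x = 1;\n/** Skip me. */\nfunction foo() {}\n";

preg_match_all($regex, $src, $m);
$comments = $m[0];
print_r($comments);
```

Only the multi-line comment is returned; the comment directly in front of function is rejected by the lookahead.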

How to match "have", but not "had"

I would like to match phrases like this:
having the same issue
facing the same problem
have the same question
I am getting the same issue
I see the same issue
I have same issue
But I do not want to match them if they are in the past tense, which means for example that anything containing the word had should be excluded:
I had the same issue
have had the same question
Later, I will add other words in past tense.
I tried this regex, but it still matches "the same issue" even if preceded by the word "had"
((?:i\s)?(?:have\s)?(?<!had\s)(?:(?:the\s|a\s)?same\s(?:(?:problem|question|issue)|here)))
https://regex101.com/r/Nvjtqj/1
Why is this regex still finding phrase "same issue" even if it contains word "had" in front of it?
You need to match and skip all the phrases containing the past-tense verbs you want to exclude, and then match what you need:
(\b(?:i\s+)?(?:have\s+)?)(?:had|faced)\s+((?:the\s+)?same\s+(?:problem|question|issue|here))(*SKIP)(*F)|(?1)(?2)
See the regex demo
Details
(\b(?:i\s+)?(?:have\s+)?)(?:had|faced)\s+((?:the\s+)?same\s+(?:problem|question|issue|here))(*SKIP)(*F) - (*SKIP)(*F) will make the regex engine drop the text matched with the following patterns and go on looking for a match at the failed location:
(\b(?:i\s+)?(?:have\s+)?) - Group 1:
\b - word boundary
(?:i\s+)? - an optional group matching an i and then 1+ whitespaces
(?:have\s+)? - an optional group matching a have and then 1+ whitespaces
(?:had|faced) - had or faced
\s+ - 1+ whitespaces
((?:the\s+)?same\s+(?:problem|question|issue|here)) - Group 2:
(?:the\s+)? - an optional group matching a the and then 1+ whitespaces
same\s+ - same and 1+ whitespaces
(?:problem|question|issue|here) - one of the words in the group
| - or match and return the following match:
(?1) - Group 1 pattern repeated
(?2) - Group 2 pattern repeated
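The pattern broken down above can be tried against the phrases from the question with a sketch like the following. The i flag and the $results array are assumptions added for the demonstration.

```php
<?php
$re = '~(\b(?:i\s+)?(?:have\s+)?)(?:had|faced)\s+((?:the\s+)?same\s+(?:problem|question|issue|here))(*SKIP)(*F)|(?1)(?2)~i';

$phrases = [
    'having the same issue',
    'I have same issue',
    'I had the same issue',
    'have had the same question',
];

$results = [];
foreach ($phrases as $phrase) {
    // Store the matched text, or null when the phrase is rejected.
    $results[$phrase] = preg_match($re, $phrase, $m) ? $m[0] : null;
}
var_export($results);
```

The two past-tense phrases yield null because the first alternative consumes them and (*SKIP)(*F) discards that text, so the second alternative never gets a chance to match "the same issue" inside them.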
When you don't anchor your lookarounds the regex engine will simply give up a word in order to make the expression match - 'the' in this case, since 'same' does not have the problem of being preceded by 'had'.
Note that this is stretching the limits of what you can and should do with one expression and entering the territory of multiple checks and parsers. If you need to do this with an expression, it could be something like:
^(?!.*\b(?:had)\b)(?=.*same (?:problem|question|issue)).*
where you make a positive and a negative assertion from the same fixed position.
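As a rough sketch of this two-assertion approach, the expression can be used as a plain filter over whole phrases. The i flag and the variable names are assumptions, not part of the original answer.

```php
<?php
// Both lookaheads are evaluated from the start of the string:
// reject any phrase containing the word "had", require "same <noun>".
$re = '~^(?!.*\b(?:had)\b)(?=.*same (?:problem|question|issue)).*~i';

$phrases = [
    'having the same issue',
    'I have same issue',
    'I had the same issue',
    'have had the same question',
];

$kept = array_values(array_filter(
    $phrases,
    static fn (string $p): bool => preg_match($re, $p) === 1
));
print_r($kept);
```

Because both assertions start at the same anchored position, the engine cannot drop a leading word to sidestep the "had" check.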

NOT words in Regex Pattern

I am trying to grab the text after the first hyphen in a pattern
<title>.*?-(.*?)(-|<\/title>)
which then grabs DesiredText from the pattern below:
<title>Stuff - DesiredText - Other Stuff</title>
However in this pattern:
<title>Stuff - Unwanted - DesiredText - Otherstuff</title>
I want it to skip the 'Unwanted' text and match the text after the next hyphen instead (DesiredText). I made a regex101 with both patterns and need to modify my basic regex so that if a word or words I don't want to match are present in that capture group it then matches the second hyphen text instead:
https://regex101.com/r/veSqH3/1
I believe this is what you are looking for. The key is the caret (^) placed first inside the square brackets ([]). Together they form a negated character class, which acts as a blacklist: it only matches characters that are NOT in the list.
https://regex101.com/r/alAZhj/3
Pattern: <title>.*?-\s*([^-\s]*)\s*- End<\/title>
This matches anything in between the middle hyphens that is not a hyphen or space. You can of course modify the pattern to include such characters by using the following pattern.
Pattern: <title>.*?-\s*([^-]*)\s*- End<\/title>
This will match anything in between the middle hyphens that is not a hyphen, so that you can have less restricted text in there.
This will use a negative lookahead to disqualify Note. There may be ways to optimize the pattern, but I cannot do so with confidence because I don't know how variable your input strings are.
Pattern: /<title>.*?- (?P<title>(?!Note).*?)(?= -|<])/
Demo
I am using a positive lookahead to ensure the captured match doesn't have any unwanted trailing characters.
If you just want the second last delimited value, you could do something like this to return the value as the fullstring match:
~- \K[^-]*(?= - [^-]*?</title>)~
Or faster with a capture group:
~- ([^-]*) - [^-]*?</title>~
This assumes there are no hyphens in the value.
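Both variants can be compared on the two titles from the question with a sketch like this one. The helper names secondLastFull and secondLastCaptured are hypothetical.

```php
<?php
// Hypothetical helper: second-to-last value as the full match (uses \K).
function secondLastFull(string $title): ?string
{
    return preg_match('~- \K[^-]*(?= - [^-]*?</title>)~', $title, $m) ? $m[0] : null;
}

// Hypothetical helper: the same value taken from capture group 1.
function secondLastCaptured(string $title): ?string
{
    return preg_match('~- ([^-]*) - [^-]*?</title>~', $title, $m) ? $m[1] : null;
}

$titles = [
    '<title>Stuff - DesiredText - Other Stuff</title>',
    '<title>Stuff - Unwanted - DesiredText - Otherstuff</title>',
];
foreach ($titles as $t) {
    echo secondLastFull($t), ' | ', secondLastCaptured($t), "\n";
}
```

For both titles each helper returns DesiredText, as long as the values themselves contain no hyphens.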
I took a different approach and focused on returning the capture prior to the last word, rather than any sort of negation. In this way it's highly generic.
This pattern will match what you want in the capture group:
\s-\s([a-zA-Z]+)\s-\s[a-zA-Z]+<\/title>
If you are concerned that this only match between title tags, then you can add:
<title>.*?\s-\s([a-zA-Z]+)\s-\s[a-zA-Z]+<\/title>
Here's a link to the Test
The only limitation to this I see, is that it uses words and whitespace, so if your desired match is "- Some phrase -" then this won't work with it, but that was not indicated in your example. It's a bit unclear because you used "other stuff" and then "otherstuff".

PHP Regex display either abc or abc xyz format

I am trying to build regex for the expression to get values for either Boost Mobile or BoostMobile whichever is present.
Any suggestions, please?
In NFA regexes, in unanchored alternation groups, the first branch matched stops the group processing, the other branches located further on the right are not checked against the string. You may read more on that at Alternation with The Vertical Bar or Pipe Symbol.
So, swapping the values and simplifying the pattern you could use
/\b(Boost\s*Mobile|Boost)\b/i
However, the most effective way here is through using an optional group:
/\bBoost(?:\s*Mobile)?\b/i
        ^^^         ^^
See the regex demo
The i case insensitive modifier is set on the whole regex. You need not switch it on and off at the beginning/end of the pattern. Also, \W* can match an empty string, so your way of checking a word boundary may fail here where \b will work.
Pattern details:
\b - leading word boundary
Boost - a literal substring
(?:\s*Mobile)? - an optional group matching 1 or 0 sequences of
\s* - 0+ whitespaces
Mobile - a literal substring
\b - trailing word boundary
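A small sketch of the points above follows: it shows the optional-group pattern on a few inputs and what happens when the shorter alternative is listed first. The sample strings and the firstMatch helper are invented for illustration.

```php
<?php
// Hypothetical helper: returns the first match of $re in $s, or null.
function firstMatch(string $re, string $s): ?string
{
    return preg_match($re, $s, $m) ? $m[0] : null;
}

$optional = '/\bBoost(?:\s*Mobile)?\b/i';
echo firstMatch($optional, 'Boost Mobile plans'), "\n";
echo firstMatch($optional, 'BoostMobile plans'), "\n";
echo firstMatch($optional, 'boost your speed'), "\n";

// With the shorter branch first, the alternation stops at "Boost"
// and never tries the longer "Boost Mobile" branch.
$shortFirst = '/\b(Boost|Boost\s*Mobile)\b/i';
echo firstMatch($shortFirst, 'Boost Mobile plans'), "\n";
```

The last call returns only Boost, which is why the longer alternative has to come first or be expressed as an optional group.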

REGEX - match words that contain letters repeating next to each other

I'm looking for a regex that matches words in which one or more letters are repeated consecutively.
Here's an example:
This is an exxxmaple oooonnnnllllyyyyy!
So far I haven't found anything that can exactly match:
exxxmaple and oooonnnnllllyyyyy
I need to find it and place them in an array, like this:
preg_match_all('/\b(???)\b/', $str, $arr);
Can somebody explain what regexp I have to use?
You can use a very simple regex like
\S*(\w)(?=\1+)\S*
See how the regex matches at http://regex101.com/r/rF3pR7/3
\S matches anything other than whitespace
* quantifier, zero or more occurrences of \S
(\w) matches a single word character, captured in \1
(?=\1+) positive lookahead. Asserts that the captured character is followed by itself (\1)
+ quantifier, one or more occurrences of the repeated character
\S* matches zero or more non-whitespace characters
EDIT
If the repeating must be more than once, a slight modification of the regex would do the trick
\S*(\w)(?=\1{2,})\S*
for example http://regex101.com/r/rF3pR7/5
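Plugged into preg_match_all as the question asks, the two patterns above behave as in the sketch below. The word apple was appended to the sample sentence as an assumption, to show the difference between the two.

```php
<?php
$str = 'This is an exxxmaple oooonnnnllllyyyyy! apple';

// A character followed by at least one copy of itself (double letters count).
preg_match_all('/\S*(\w)(?=\1+)\S*/', $str, $twoOrMore);
print_r($twoOrMore[0]);

// A character followed by at least two copies of itself (triples and up).
preg_match_all('/\S*(\w)(?=\1{2,})\S*/', $str, $threeOrMore);
print_r($threeOrMore[0]);
```

Because \S also matches punctuation, the trailing ! stays attached to oooonnnnllllyyyyy in both results, and the first pattern additionally picks up apple.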
Use this if you want to discard words like apple, etc.
\b\w*(\w)(?=\1\1+)\w*\b
or
\b(?=[^\s]*(\w)\1\1+)\w+\b
Try this. See demo.
http://regex101.com/r/kP8uF5/20
http://regex101.com/r/kP8uF5/21
You can use this pattern:
\b\w*?(\w)\1{2}\w*
The \w class and the word boundary \b limit the search to words. Note that the word boundary can be removed; however, it reduces the number of steps needed to obtain a match (as does the lazy quantifier). Note too that if you are looking for words (in the common meaning), you need to remove the word boundary and use [a-zA-Z] instead of \w.
(\w)\1{2} checks if a repeated character is present. A word character is captured in group 1 and must be followed with the content of the capture group (the backreference \1).
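The difference between \w and a letters-only class can be seen in a short sketch. The token v1000 is an invented sample, and the letters-only pattern is one possible reading of the note above rather than the answerer's own pattern.

```php
<?php
$str = 'v1000 exxxmaple oooonnnnllllyyyyy! apple';

// \w also covers digits and underscores, so "v1000" qualifies (three zeros).
preg_match_all('/\b\w*?(\w)\1{2}\w*/', $str, $wordChars);
print_r($wordChars[0]);

// Letters only, without the word boundary (an assumed variant).
preg_match_all('/[a-zA-Z]*?([a-zA-Z])\1{2}[a-zA-Z]*/', $str, $lettersOnly);
print_r($lettersOnly[0]);
```

Here the punctuation is left out of the matches, and apple is ignored by both patterns because its p appears only twice in a row.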
