PHP Regex display either abc or abc xyz format - php

I am trying to build a regex that matches either Boost Mobile or BoostMobile, whichever is present.
Any suggestions, please?

In NFA regexes, in unanchored alternation groups, the first branch matched stops the group processing, the other branches located further on the right are not checked against the string. You may read more on that at Alternation with The Vertical Bar or Pipe Symbol.
So, swapping the values and simplifying the pattern, you could use
/\b(Boost\s*Mobile|Boost)\b/i
However, the most effective way here is through using an optional group:
/\bBoost(?:\s*Mobile)?\b/i
See the regex demo
The i case-insensitive modifier is set on the whole regex, so you need not switch it on and off at the beginning/end of the pattern. Also, \W* can match an empty string, so your way of checking a word boundary may fail here, whereas \b will work.
Pattern details:
\b - leading word boundary
Boost - a literal substring
(?:\s*Mobile)? - an optional group matching 1 or 0 sequences of:
  \s* - 0+ whitespaces
  Mobile - a literal substring
\b - trailing word boundary
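To see the difference between the optional group and branch order in practice, here is a minimal sketch. The helper name firstBoostMatch and the $shortFirst pattern are assumptions made for illustration only; $shortFirst is not the asker's pattern (which is not shown above), just a "shorter branch first" alternation.

```php
<?php
// $optional is the pattern from the answer above.
// $shortFirst is a hypothetical alternation with the shorter branch listed first.
$optional   = '/\bBoost(?:\s*Mobile)?\b/i';
$shortFirst = '/\b(?:Boost|Boost\s*Mobile)\b/i';

// Hypothetical helper: return the first match of $pattern in $subject, or null.
function firstBoostMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

foreach (['Boost Mobile', 'BoostMobile', 'boost', 'Booster'] as $subject) {
    var_dump(firstBoostMatch($optional, $subject));
}

// With the shorter branch first, the group stops at "Boost" and never tries the longer branch.
var_dump(firstBoostMatch($shortFirst, 'Boost Mobile'));
```

The optional-group pattern returns Boost Mobile, BoostMobile and boost for the first three inputs and nothing for Booster, while the shorter-first alternation returns only Boost for Boost Mobile.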

Related

PHP regex: zero or more whitespace not working

I'm trying to apply a regex constraint to a Symfony form input. The requirement for the input is that the start of the string and all commas must be followed by zero or more whitespace, then a # or # symbol, except when it's the empty string.
As far as I can tell, there is no way to tell the constraint to use preg_match_all instead of just preg_match, but it does have the ability to negate the match. So, I need a regular expression that preg_match will NOT MATCH for the given scenario: any string containing the start of the string or a comma, followed by zero or more whitespace, followed by any character that is not a # or # and is not the end of the string, but will match for everything else. Here are a few examples:
preg_match(..., ''); // No match
preg_match(..., '#yolo'); // No match
preg_match(..., '#yolo, #swag'); // No match
preg_match(..., '#yolo,#swag'); // No match
preg_match(..., '#yolo, #swag,'); // No match
preg_match(..., 'yolo'); // Match
preg_match(..., 'swag,#yolo'); // Match
preg_match(..., '#swag, yolo'); // Match
I would've thought for sure that /(^|,)\s*[^##]/ would work, but it's failing in every case with 1 or more spaces and it appears to be because of the asterisk. If I get rid of the asterisk, preg_match('/(^|,)\s[^##]/', '#yolo, #swag') does not match (as desired) when there's exactly one space, but as soon as I reintroduce the asterisk it breaks for any quantity of spaces > 0.
My theory is that the regex engine is interpreting the second space as a character that is not in the character set [##], but that's just a theory and I don't know what to do about it. I know that I could create a custom constraint to use preg_match_all instead to get around this, but I'd like to avoid that if possible.
You may use
'~(?:^|,)\s*+[^##]~'
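Here, the + after * turns \s* into the possessive quantifier \s*+: it matches 0 or more whitespace chars and does not let the regex engine backtrack into the \s* pattern if [^##] cannot match the subsequent char. With a plain \s*, the engine gives one whitespace char back and [^##] then matches that whitespace, which is why your pattern matched whenever spaces were present.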
See the regex demo.
Details
(?:^|,) - either start of string or ,
\s*+ - zero or more whitespace chars, possessively matched (i.e. if the next char is not matched with [^##] pattern, the whole pattern match will fail)
[^##] - a negated character class matching any char but # and #.
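The examples from the question can be run against both quantifiers to compare them. This is a sketch only: the helper name violatesHashRule and the variable names are assumptions, and the character class is copied as it appears above.

```php
<?php
$possessive = '~(?:^|,)\s*+[^##]~'; // pattern from the answer above
$greedy     = '~(?:^|,)\s*[^##]~';  // same pattern with a plain, backtracking \s*

// Hypothetical helper: true when preg_match finds the "invalid input" pattern.
function violatesHashRule(string $input, string $pattern): bool
{
    return preg_match($pattern, $input) === 1;
}

$inputs = ['', '#yolo', '#yolo, #swag', '#yolo,#swag', '#yolo, #swag,',
           'yolo', 'swag,#yolo', '#swag, yolo'];

foreach ($inputs as $input) {
    printf(
        "%-18s possessive: %-8s greedy: %s\n",
        var_export($input, true),
        violatesHashRule($input, $possessive) ? 'match' : 'no match',
        violatesHashRule($input, $greedy) ? 'match' : 'no match'
    );
}
```

The two columns differ only for inputs such as '#yolo, #swag', where the plain \s* backtracks and lets the character class consume the space.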

Regex: Match start of string after (*SKIP)(*F)

The expression <[^>]*>(*SKIP)(*F)|(\/|\s|^|\()(Dakota Ridge.*?)(,|\.|\s|\b|\)|<) matches Dakota Ridge in the string The Dakota Ridge Trail is open. as expected.
If I wrap Dakota Ridge Trail in HTML tags, however, the string is no longer matched: The <b>Dakota Ridge Trail</b> is open.
I thought the ^ alternative would assert that the string is anchored at the start since (*SKIP) prevents the engine from backtracking past that point but apparently it doesn't work that way.
How can I modify this expression to match if the string is anchored at the first position after a skipped and failed match?
Edit to clarify: The purpose of <[^>]*>(*SKIP)(*F) is to skip HTML tags that could potentially contain the pattern within.
Your regex does not match the second occurrence because the substring you want to match is preceded with a > that is consumed and discarded after SKIP-FAIL does its job. That means there is no way for the (\/|\s|^|\() pattern to match the empty space before Dakota as it is not /, nor a whitespace, start of string or (.
Since you have a \b word boundary in the trailing position, you may use it in the leading position, too, and further restrict the context with lookarounds (e.g. lookbehind).
For the current scenario, the following will do:
<[^>]*>(*SKIP)(*F)|\b(Dakota Ridge.*?)\b
See the regex demo.
Details
<[^>]*>(*SKIP)(*F) - match <, then 0+ chars other than > and then a >, and discard the match keeping the regex index right at the end of the match
| - or
\b - a word boundary
(Dakota Ridge.*?) - Group 1: Dakota Ridge, and then any 0+ chars (other than line break chars) as few as possible, up to the first
\b - word boundary.
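A short sketch comparing the original expression with the simplified one on the two sample strings. The function name findOutsideTags, the variable names, and the third sample string with a title attribute are assumptions added for illustration.

```php
<?php
$original = '~<[^>]*>(*SKIP)(*F)|(\/|\s|^|\()(Dakota Ridge.*?)(,|\.|\s|\b|\)|<)~';
$fixed    = '~<[^>]*>(*SKIP)(*F)|\b(Dakota Ridge.*?)\b~';

// Hypothetical helper: collect Group 1 of every match of the simplified pattern.
function findOutsideTags(string $pattern, string $html): array
{
    preg_match_all($pattern, $html, $m);
    return $m[1];
}

$plain  = 'The Dakota Ridge Trail is open.';
$tagged = 'The <b>Dakota Ridge Trail</b> is open.';
$inAttr = '<a title="Dakota Ridge">map</a>'; // text inside a tag must be skipped

var_dump(preg_match($original, $tagged)); // the original finds nothing right after a tag
print_r(findOutsideTags($fixed, $plain));
print_r(findOutsideTags($fixed, $tagged));
print_r(findOutsideTags($fixed, $inAttr));
```

With the word boundary in place of the consuming left-hand group, the text right after the skipped tag is found, while text inside the tag itself is still ignored.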

Match multiple times a group only in single regex

Hi my question is simple:
I want to match all the possible hashtags in an article only if they are in a <figcaption> with PCRE regex. E.g:
<figcaption>blah blah #hashtag1, #hashtag2</figcaption>
I made an attempt here https://regex101.com/r/aL9vS8/1 and removing the last ? would change the capture from #hashtag1 to #hashtag2 but can't get both.
I am not even sure it is doable in one single regex in PHP.
Any idea to help me? :)
If there is no way in one single regex (really? even working with recursion (?R)?? :p), please suggest the most efficient way possible performance wise.
Thank you!
[EDIT]
If there is no way, my PHP next idea is to:
Match every figcaption with preg_replace_callback
In the callback match every instance of #hashtag.
Can I get your opinions on this? Is there a better way? my articles are not very long.
Please suggest the most efficient way possible performance wise
The most reliable way to match some text in between some delimiters with PCRE regex is by using the custom boundaries with \G operator. However, the trailing boundary is a multicharacter string, and to match any text but the </figcaption> you'd need a tempered greedy token. Since this token is very resource consuming, it must be unrolled.
Here is a fast, reliable PCRE regex for your task:
(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+
See the regex demo
Details:
(?:<figcaption|(?!^)\G) - Matches <figcaption or the end of the previous successful match
More details: (?:<figcaption|(?!^)\G) is a non-capturing group ((?:...)) that only groups subpatterns and does not keep track of what it matched (i.e. no value is stored for this group). It matches 2 alternatives (| is an alternation operator): 1) literal text <figcaption or 2) (?!^)\G - a location after the previous successful match (note that \G also matches the start of the string, thus, we must add the negative lookahead (?!^) to exclude that behavior).
[^<#]* - 0+ chars other than < and #
(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)* - 0+ sequences of:
(?:<(?!\/figcaption>)|#\B) - a < not followed with /figcaption> or # not followed with a word char
[^<#]* - 0+ chars other than < and #
\K - omit the text matched so far
#\w+ - # and 1+ word chars
Even more details:
\K:
The escape sequence \K causes any previously matched characters not to be included in the final matched sequence. For example, the pattern:
foo\Kbar
matches foobar, but reports that it has matched bar. This feature is similar to a lookbehind assertion.
(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*: Here, we have an outer non-capturing group (?:...)* to enable matching a sequence of subpatterns zero or more times (we can set a quantifier * only to a grouping if we need to repeat a sequence of subpatterns), and the inner non-capturing group in (?:<(?!\/figcaption>)|#\B)[^<#]* is just a way to shorten the longer <(?!\/figcaption>)[^<#]*|#\B[^<#]* (it groups the 2 different alternatives <(?!\/figcaption>) and #\B before a common "suffix" [^<#]*).
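The \K behavior quoted above can be checked directly. A minimal sketch, with variable names chosen here for illustration:

```php
<?php
// \K resets the start of the reported match: "foo" must be present but is not returned.
preg_match('/foo\Kbar/', 'foobar', $withK);

// A lookbehind reports the same text for this input.
preg_match('/(?<=foo)bar/', 'foobar', $withLookbehind);

echo $withK[0], "\n";          // bar
echo $withLookbehind[0], "\n"; // bar
```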
Wrapping in a tag: just use preg_replace with the <span class="highlight">$0</span> replacement pattern:
Code:
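// Highlight every #hashtag that appears after an opening <figcaption tag
$re = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';
$str = "<figcaption>blah # blah #hashtag1, #hashtag2</figcaption> #ee <figcaption>#ddddd";
// Single quotes keep $0 (the whole match) literal for preg_replace
$subst = '<span class="highlight">$0</span>';
$result = preg_replace($re, $subst, $str);
echo $result;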
See the PHP IDEONE demo
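If the goal is to collect the hashtags rather than wrap them, the same pattern can be passed to preg_match_all. A sketch under that assumption; the sample string and variable names are made up for illustration:

```php
<?php
$re = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';

// Hypothetical article fragment: hashtags outside <figcaption> should be ignored.
$article = '<p>#outside</p><figcaption>blah # blah #hashtag1, #hashtag2</figcaption> #ee';

preg_match_all($re, $article, $matches);

// $matches[0] holds the full matches, which thanks to \K are only the hashtags.
print_r($matches[0]);
```

For this input, only #hashtag1 and #hashtag2 are collected; #outside and #ee are not, because neither follows an opening <figcaption or the end of a previous match.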

REGEX - match words that contain letters repeating next to each other

I'm looking for a regex that matches words in which a letter (or letters) is repeated more than once in a row.
Here's an example:
This is an exxxmaple oooonnnnllllyyyyy!
So far I haven't found anything that can exactly match:
exxxmaple and oooonnnnllllyyyyy
I need to find it and place them in an array, like this:
preg_match_all('/\b(???)\b/', $str, $arr);
Can somebody explain what regexp I have to use?
You can use a very simple regex like
\S*(\w)(?=\1+)\S*
See how the regex matches at http://regex101.com/r/rF3pR7/3
\S matches anything other than whitespace
* quantifier, zero or more occurrences of \S
(\w) matches a single word character, captured in \1
(?=\1+) positive lookahead. Asserts that the captured character is followed by itself (\1)
+ quantifier, one or more occurrences of the repeated character
\S* matches zero or more non-whitespace characters
EDIT
If the repeating must be more than once, a slight modification of the regex would do the trick
\S*(\w)(?=\1{2,})\S*
for example http://regex101.com/r/rF3pR7/5
Use this if you want to discard words like apple, etc.
\b\w*(\w)(?=\1\1+)\w*\b
or
\b(?=[^\s]*(\w)\1\1+)\w+\b
Try this. See demo.
http://regex101.com/r/kP8uF5/20
http://regex101.com/r/kP8uF5/21
You can use this pattern:
\b\w*?(\w)\1{2}\w*
The \w class and the word boundary \b limit the search to words. Note that the word boundary can be removed; however, it reduces the number of steps needed to obtain a match (as does the lazy quantifier). Note too that if you are looking for words (in the common meaning), you need to remove the word boundary and use [a-zA-Z] instead of \w.
(\w)\1{2} checks if a repeated character is present. A word character is captured in group 1 and must be followed with the content of the capture group (the backreference \1).
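Applied to the sample sentence from the question, the last pattern can be used like this (a minimal sketch; the variable names follow the question's preg_match_all call):

```php
<?php
$str = 'This is an exxxmaple oooonnnnllllyyyyy!';

// \1{2} requires the captured word character to be repeated two more times,
// so only words with at least three identical characters in a row are collected.
preg_match_all('/\b\w*?(\w)\1{2}\w*/', $str, $arr);

print_r($arr[0]); // full words; $arr[1] holds the repeated character of each word
```

A word such as apple, with only a doubled letter, is not matched by this pattern.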

Regex group include if condition

I have tried to use the regex /^(\S+)(?:\?$|$)/
with yolo and yolo?
It works with both, but on the second string (yolo?) the ? is included in the capturing group (\S+).
Is it a bug in the regex engine, or have I made a mistake?
Edit: I don't want the '?' to be included in the capturing group. Sorry for my bad English.
If what you want to capture can't have a ? in it, use a negated character class [^...] (see demo here):
^([^\s?]+)\??$
If what you want to capture can have a ? in it (for example, given yolo?yolo? you want yolo?yolo), you need to make your quantifier + lazy by adding ? (see demo here):
^(\S+?)\??$
There is BTW no need for a capturing group here, you can use a look ahead (?=...) instead and look at the whole match (see demo here):
^[^\s?]+(?=\??$)
What was happening
The rules are: quantifiers (like +) are greedy by default, and the regex engine will return the first match it finds.
Consider what this means here:
\S+ will first match everything in yolo?, then the engine will try to match (?:\?$|$).
\?$ fails (we're already at the end of the string, so there is no ? left to match), but $ matches.
The regex has successfully reached its end, so the engine returns the match where \S+ has matched the whole string and everything is in the first capturing group.
To match what you want you have to make the quantifier lazy (+?), or prevent the character class (yeah, \S is a character class) from matching your ending delimiter ? (with [^\s?] for example).
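The greedy, lazy and negated-class variants can be compared side by side. A minimal sketch; the helper name captureWord is an assumption made for illustration:

```php
<?php
// Hypothetical helper: return Group 1 of the match, or null when there is no match.
function captureWord(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[1] : null;
}

$greedy  = '/^(\S+)(?:\?$|$)/'; // pattern from the question
$lazy    = '/^(\S+?)\??$/';     // lazy quantifier
$negated = '/^([^\s?]+)\??$/';  // negated character class

var_dump(captureWord($greedy, 'yolo?'));        // the ? ends up inside the group
var_dump(captureWord($lazy, 'yolo?'));          // the ? is left to \??
var_dump(captureWord($lazy, 'yolo?yolo?'));     // inner ? kept, trailing ? dropped
var_dump(captureWord($negated, 'yolo?'));       // the class cannot consume ?
var_dump(captureWord($negated, 'yolo?yolo?'));  // no match: ? is not allowed inside
```

The last call shows the trade-off between the two fixes: the negated class rejects any input with a ? before the end, whereas the lazy version accepts it and only drops the final ?.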
This is the correct behavior, as \S+ greedily matches one or more non-whitespace characters, of which ? is one.
Thus the question mark is matched by the (\S+) group and the non-capturing group resolves to $. You could make it work as you expect by making the match non-greedy with:
/^(\S+?)(?:\?$|$)/
demo
alternatively you could restrict the character group:
/^([^\s?]+)(?:\?$|$)/
demo
Make the + non greedy:
^(\S+?)\??$
The regex below would capture all the non-space characters followed by an optional ?,
^([\S]+)\??$
DEMO
OR
^([\w]+)\??$
DEMO
If you use \S+, it matches the ? character as well, so the first regex still captures it. To separate word and non-word characters, use the second regex: it captures only the word characters and matches an optional ? that follows them.
It is doing that because \S matches any non-white space character and it is being greedy.
Following the + quantifier with ? for a non-greedy match will prevent this.
^(\S+?)\??$
Or use \w here which matches any word character.
^(\w+)\??$
