How to match "have", but not "had" - php

I would like to match phrases like this:
having the same issue
facing the same problem
have the same question
I am getting the same issue
I see the same issue
I have same issue
But I do not want to match them if they are in the past tense, which means for example that anything containing the word had should be excluded:
I had the same issue
have had the same question
Later, I will add other words in past tense.
I tried this regex, but it still matches "same issue" even when the phrase is preceded by the word "had":
((?:i\s)?(?:have\s)?(?<!had\s)(?:(?:the\s|a\s)?same\s(?:(?:problem|question|issue)|here)))
https://regex101.com/r/Nvjtqj/1
Why does this regex still find the phrase "same issue" even when the word "had" comes before it?

First match, and discard, every phrase that contains one of the past-tense verbs you want to exclude, and then match what you need:
(\b(?:i\s+)?(?:have\s+)?)(?:had|faced)\s+((?:the\s+)?same\s+(?:problem|question|issue|here))(*SKIP)(*F)|(?1)(?2)
See the regex demo
Details
(\b(?:i\s+)?(?:have\s+)?)(?:had|faced)\s+((?:the\s+)?same\s+(?:problem|question|issue|here))(*SKIP)(*F) - the first alternative matches the phrases to exclude. (*SKIP)(*F) then makes the regex engine discard the text matched so far and resume the search at the position where the failure occurred, so that text cannot be matched again. The subpatterns are:
(\b(?:i\s+)?(?:have\s+)?) - Group 1:
\b - word boundary
(?:i\s+)? - an optional group matching an i and then 1+ whitespaces
(?:have\s+)? - an optional group matching a have and then 1+ whitespaces
(?:had|faced) - had or faced
\s+ - 1+ whitespaces
((?:the\s+)?same\s+(?:problem|question|issue|here)) - Group 2:
(?:the\s+)? - an optional group matching a the and then 1+ whitespaces
same\s+ - same and 1+ whitespaces
(?:problem|question|issue|here) - one of the words in the group
| - or match and return the following match:
(?1) - Group 1 pattern repeated
(?2) - Group 2 pattern repeated
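To see the (*SKIP)(*F) pattern at work, it can be run over a few phrases from the question. This is only a minimal PHP sketch: the `$phrases` list and the `$results` array are illustrative names, and the `i` flag is an assumption (the phrases in the question start with a capital "I").

```php
<?php
// Pattern from the answer above; the /i flag is an assumption.
$pattern = '/(\b(?:i\s+)?(?:have\s+)?)(?:had|faced)\s+((?:the\s+)?same\s+(?:problem|question|issue|here))(*SKIP)(*F)|(?1)(?2)/i';

// Sample phrases taken from the question.
$phrases = [
    'having the same issue',
    'I have same issue',
    'I had the same issue',
    'have had the same question',
];

$results = [];
foreach ($phrases as $phrase) {
    // Store the matched text, or null when the phrase is rejected.
    $results[$phrase] = preg_match($pattern, $phrase, $m) === 1 ? $m[0] : null;
}

var_dump($results);
```

The two past-tense phrases should come back as null, while the others report the matched part ("the same issue" and "I have same issue").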

When you don't anchor your lookarounds, the regex engine is free to start the match somewhere else in order to make the expression succeed. In this case it gives up the optional 'the' and starts at 'same', which is preceded by 'the ' rather than 'had ', so the lookbehind no longer blocks the match.
Note that this is stretching the limits of what you can and should do with one expression and entering the territory of multiple checks and parsers. If you need to do this with an expression, it could be something like:
^(?!.*\b(?:had)\b)(?=.*same (?:problem|question|issue)).*
where you make a positive and a negative assertion from the same fixed position.
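The difference between the two approaches can be checked with a short PHP sketch. It assumes the `i` flag for both patterns, and the variable names (`$original`, `$anchored`, `$checks`) are only illustrative.

```php
<?php
// The regex from the question: the lookbehind is not anchored, so the
// engine can skip the optional "the " and start the match at "same".
$original = '/((?:i\s)?(?:have\s)?(?<!had\s)(?:(?:the\s|a\s)?same\s(?:(?:problem|question|issue)|here)))/i';
preg_match($original, 'I had the same issue', $m);
$leaked = $m[0]; // the part that still matches despite "had"

// The anchored version: both assertions are checked from the start of the string.
$anchored = '/^(?!.*\b(?:had)\b)(?=.*same (?:problem|question|issue)).*/i';

$checks = [];
foreach (['I had the same issue', 'have had the same question', 'I have the same issue'] as $phrase) {
    $checks[$phrase] = preg_match($anchored, $phrase) === 1;
}

var_dump($leaked, $checks);
```

With the anchored pattern, any occurrence of the whole word "had" anywhere in the string rejects it, which is stricter than only checking the word right before "same".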

Related

Regex optional groups and digit length

Maybe some regex-Master can solve my problem.
I have a big list with many addresses with no separators ( , ; ).
The address string contains following Information:
The first group is the street name
The second group is the street number
The third group is the zipcode (optional)
The last group is the town name (optional)
As you can see on the image above the last two test strings are not matching.
I need the last two regex groups to be optional and the third group should be either 4 or 5 digits.
I tried (\d{4,5}) to allow 4 and 5 digits, but this only works halfway, as you can see here: https://regex101.com/r/ZurqHh/1
(This sometimes mixes the street number and zipcode together)
I also tried (?:\d{5})? to make the third and fourth group optional. But this destroys my whole group layout...
https://regex101.com/r/EgxeMy/1
This is my current regex:
/^([a-zäöüÄÖÜß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+\/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/im
Try it out yourself:
https://regex101.com/r/zC8NCP/1
My brain is only farting at this moment and i can't think straight anymore.
Please help me fix this problem so i can die in peace.
You can use
^(.*?)(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?(?:\s+(\d{4,5})(?:\s+(.*))?)?$
See the regex demo (note all \s are replaced with \h to only match horizontal whitespaces).
Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))? - an optional non-capturing group matching
\s+ - one or more whitespaces
(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b) - Group 2:
\d+ - one or more digits
(?:\s*[-|+\/]\s*\d+)* - zero or more sequences of zero or more whitespaces, -, +, | or /, zero or more whitespaces, one or more digits
\s* - zero or more whitespaces
[a-z]?\b - an optional lowercase ASCII letter and a word boundary
(?:\s+(\d{4,5})(?:\s+(.*))?)? - an optional non-capturing group matching
\s+ - one or more whitespaces
(\d{4,5}) - Group 3: four or five digits
(?:\s+(.*))? - an optional sequence of one or more whitespaces and then any zero or more chars other than line break chars as many as possible
$ - end of string.
Please note that the (?:\s+(.*))? optional group must be inside the (?:\s+(\d{4,5})...)? group to work.
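As a rough check of the pattern above, here is a small PHP sketch. The sample addresses and the `parseAddress` helper are made up for illustration (the test strings from the question were only shown in an image), and the `i` flag is an assumption.

```php
<?php
// Pattern from the answer; "~" is used as delimiter because the class contains "/".
$pattern = '~^(.*?)(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?(?:\s+(\d{4,5})(?:\s+(.*))?)?$~i';

// Hypothetical helper: returns the four parts, with null for missing optional groups.
function parseAddress(string $pattern, string $line): ?array
{
    if (preg_match($pattern, $line, $m) !== 1) {
        return null;
    }
    // Groups that did not take part in the match are either absent or empty.
    $part = fn(int $i) => (isset($m[$i]) && $m[$i] !== '') ? $m[$i] : null;

    return [
        'street' => $m[1],
        'number' => $part(2),
        'zip'    => $part(3),
        'town'   => $part(4),
    ];
}

$full  = parseAddress($pattern, 'Hauptstrasse 12 12345 Berlin');
$short = parseAddress($pattern, 'Am Markt 5a 3011 Bern');
$bare  = parseAddress($pattern, 'Hauptstrasse 12');

var_dump($full, $short, $bare);
```

The third address has neither zip code nor town, so both optional parts should come back as null.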
It is difficult to parse addresses because we are halfway between formatted text and natural language. Here is a pattern that tries to reduce the number of optional parts as much as possible, so that it succeeds with the examples offered without asking too much of the regex engine. To do this, I mainly rely on character classes, atomic groups, and a relatively accurate description of the street names. Obviously, the examples in the question cannot be representative of every real address, and characters could be added to or removed from the classes to deal with new cases. Nevertheless, the structure of this pattern is a good starting point.
~
^
(?<strasse> [\pL\d-]+ \.? (?> \h+ [\pL\d-]+ \.? )*? ) \h*
(?<nummer> \b (?> \d+ | [-+/\h]+ | [a-z] \b )*? )
(?: \h+ (?<plz> \d{4,5} )
\h+ (?<stadt> .+ ) )?
$
~mxui
demo
Note that in the above link you can also see a previous version of this pattern with a more accurate description of the street number (a bit more efficient but longer).

PHP regex - match everything but not exactly one or more word

I am trying to match any string that is not exactly one of several given words.
My pattern:
(?!(^ignoreme$)|(^ignoreme2$))
I am looking for:
ignoreme - no
ignoreme2 - no
ignoremex - match
ignorem - match
gnoreme - match
ignoreme22 - match
But it returns many empty matches. How can I do that? Thanks.
https://regex101.com/r/u4EsNv/1
You may use this corrected regex:
^(?!ignoreme2?$).*$
Updated RegEx Demo
RegEx Details:
^: Start
(?!ignoreme2?$): Negative lookahead that fails the match when the text from the current position to the end is ignoreme or ignoreme2.
.*: Match 0 or more of any characters
$: End
Note that the regex (?!(^ignoreme$)|(^ignoreme2$)) consists only of a negative lookahead, and the ^ anchors are inside it rather than outside. At the start of the string the lookahead fails for the first 2 invalid cases, but at every later position ^ cannot match, so the negative lookahead succeeds and the regex engine returns an empty match there. (You can see that in the matches highlighted on regex101.)
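The behaviour of both patterns can be verified with a few lines of PHP. This is just a sketch using the sample words from the question; the variable names are illustrative.

```php
<?php
// Corrected pattern: the anchor sits outside the lookahead.
$fixed = '/^(?!ignoreme2?$).*$/';

$verdicts = [];
foreach (['ignoreme', 'ignoreme2', 'ignoremex', 'ignorem', 'gnoreme', 'ignoreme22'] as $word) {
    $verdicts[$word] = preg_match($fixed, $word) === 1;
}

// Original pattern: only a lookahead, so every match it finds is empty.
preg_match_all('/(?!(^ignoreme$)|(^ignoreme2$))/', 'ignoreme', $empty);
$emptyMatches = $empty[0];

var_dump($verdicts, $emptyMatches);
```

The original pattern still "matches" inside ignoreme, but only as empty strings at positions after the first character, which is what shows up as many empty results.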

PHP Regex display either abc or abc xyz format

I am trying to build regex for the expression to get values for either Boost Mobile or BoostMobile whichever is present.
Any suggestions please ?
In NFA regex engines, an unanchored alternation group stops at the first branch that matches; the branches further to the right are not tried against the string unless the rest of the pattern fails and the engine backtracks. You may read more on that at Alternation with The Vertical Bar or Pipe Symbol.
So, after swapping the values and simplifying the pattern, you could use
/\b(Boost\s*Mobile|Boost)\b/i
However, the most effective way here is through using an optional group:
/\bBoost(?:\s*Mobile)?\b/i
See the regex demo
The i case insensitive modifier is set on the whole regex. You need not switch it on and off at the beginning/end of the pattern. Also, \W* can match an empty string, so your way of checking a word boundary may fail here when \b will work.
Pattern details:
\b - leading word boundary
Boost - a literal substring
(?:\s*Mobile)? - an optional group matching 1 or 0 sequences of
\s* - 0+ whitespaces
Mobile - a literal substring
\b - trailing word boundary
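The effect of branch order, and of the optional group, can be compared in a short PHP sketch. The asker's original pattern is not shown, so the "short branch first" pattern below is only a hypothetical stand-in for it; the sample inputs are also made up.

```php
<?php
// Hypothetical pattern with the shorter branch first: it stops at "Boost".
$shortFirst = '/\b(Boost|Boost\s*Mobile)\b/i';
// Longer branch first, as suggested in the answer.
$longFirst  = '/\b(Boost\s*Mobile|Boost)\b/i';
// Optional-group version from the answer.
$optional   = '/\bBoost(?:\s*Mobile)?\b/i';

// Small helper: returns the matched text or null.
$match = function (string $pattern, string $text): ?string {
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
};

$a = $match($shortFirst, 'Boost Mobile');
$b = $match($longFirst, 'Boost Mobile');
$c = $match($optional, 'Boost Mobile');
$d = $match($optional, 'BoostMobile');
$e = $match($optional, 'Boost');
$f = $match($optional, 'Booster');

var_dump($a, $b, $c, $d, $e, $f);
```

Only the first pattern cuts "Boost Mobile" down to "Boost"; the trailing \b also keeps the optional-group version from matching inside "Booster".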

Match multiple times a group only in single regex

Hi my question is simple:
I want to match all the possible hashtags in an article only if they are in a <figcaption> with PCRE regex. E.g:
<figcaption>blah blah #hashtag1, #hashtag2</figcaption>
I made an attempt here https://regex101.com/r/aL9vS8/1 and removing the last ? would change the capture from #hashtag1 to #hashtag2 but can't get both.
I am not even sure it is doable in one single regex in PHP.
Any idea to help me? :)
If there is no way in one single regex (really? even working with recursion (?R)?? :p), please suggest the most efficient way possible performance wise.
Thank you!
[EDIT]
If there is no way, my PHP next idea is to:
Match every figcaption with preg_replace_callback
In the callback match every instance of #hashtag.
Can I get your opinions on this? Is there a better way? my articles are not very long.
Please suggest the most efficient way possible performance wise
The most reliable way to match some text in between some delimiters with PCRE regex is by using the custom boundaries with \G operator. However, the trailing boundary is a multicharacter string, and to match any text but the </figcaption> you'd need a tempered greedy token. Since this token is very resource consuming, it must be unrolled.
Here is a fast, reliable PCRE regex for your task:
(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+
See the regex demo
Details:
(?:<figcaption|(?!^)\G) - Matches <figcaption or the end of the previous successful match
More details: (?:<figcaption|(?!^)\G) is a non-capturing group ((?:...)), which only groups subpatterns and does not store what it matched. It matches one of 2 alternatives (| is the alternation operator): 1) the literal text <figcaption or 2) (?!^)\G - the location right after the previous successful match (note that \G also matches at the start of the string, thus we must add the negative lookahead (?!^) to exclude that behavior).
[^<#]* - 0+ chars other than < and #
(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)* - 0+ sequences of:
(?:<(?!\/figcaption>)|#\B) - a < not followed with /figcaption> or # not followed with a word char
[^<#]* - 0+ chars other than < and #
\K - omit the text matched so far
#\w+ - # and 1+ word chars
Even more details:
\K:
The escape sequence \K causes any previously matched characters not to be included in the final matched sequence. For example, the pattern:
foo\Kbar
matches foobar, but reports that it has matched bar. This feature is similar to a lookbehind assertion.
(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*: Here, we have an outer non-capturing group (?:...)* that lets us match a sequence of subpatterns zero or more times (a quantifier such as * can only be applied to a sequence of subpatterns if they are grouped). The inner non-capturing group in (?:<(?!\/figcaption>)|#\B)[^<#]* is just a way to shorten the longer <(?!\/figcaption>)[^<#]*|#\B[^<#]*: it groups the 2 alternatives <(?!\/figcaption>) and #\B before their common "suffix" [^<#]*.
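To collect the hashtags into an array rather than replace them, the same pattern can be passed to preg_match_all. The following is a minimal PHP sketch; the `$html` sample is made up, and the first two lines only illustrate the \K behaviour described above.

```php
<?php
// \K drops everything matched before it from the reported match.
preg_match('/foo\Kbar/', 'foobar', $k);
$kept = $k[0];

// Pattern from the answer, with "~" as the delimiter.
$re = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';

// Made-up input: hashtags before and after the figcaption must be ignored.
$html = '<p>#outside</p><figcaption>blah blah #hashtag1, #hashtag2</figcaption> #after';

preg_match_all($re, $html, $found);
$hashtags = $found[0];

var_dump($kept, $hashtags);
```

Each new match either starts at <figcaption or continues from where the previous match ended, which is why #outside and #after are not picked up.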
Wrapping in a tag: just use preg_replace with the <span class="highlight">$0</span> replacement pattern:
Code:
// Unrolled \G-based pattern from above, with ~ as the delimiter
$re = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';
$str = "<figcaption>blah # blah #hashtag1, #hashtag2</figcaption> #ee <figcaption>#ddddd";
// $0 in the replacement stands for the whole match, i.e. the hashtag
$subst = '<span class="highlight">$0</span>';
$result = preg_replace($re, $subst, $str);
echo $result;
See the PHP IDEONE demo
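For comparison, the two-step idea from the question (first isolate each figcaption, then look for hashtags inside it) could look roughly like the PHP sketch below. It is not the answer's method: both patterns and the `$html` sample are assumptions, and it uses preg_match_all for both steps instead of preg_replace_callback because it only collects the hashtags.

```php
<?php
// Made-up input with hashtags inside and outside figcaption elements.
$html = '<figcaption>blah #a, #b</figcaption> #c <figcaption>#d</figcaption>';

// Step 1: capture the content of every closed <figcaption>...</figcaption>.
preg_match_all('~<figcaption\b[^>]*>(.*?)</figcaption>~s', $html, $captions);

// Step 2: collect the hashtags found inside each caption.
$tags = [];
foreach ($captions[1] as $caption) {
    preg_match_all('/#\w+/', $caption, $m);
    $tags = array_merge($tags, $m[0]);
}

var_dump($tags);
```

Unlike the single-regex version, this variant only looks at figcaption elements that have a closing tag.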

REGEX - match words that contain letters repeating next to each other

I'm looking for a regex that matches words in which a letter (or several letters) is repeated more than once in a row.
Here's an example:
This is an exxxmaple oooonnnnllllyyyyy!
So far I haven't found anything that can exactly match:
exxxmaple and oooonnnnllllyyyyy
I need to find it and place them in an array, like this:
preg_match_all('/\b(???)\b/', $str, $arr);
Can somebody explain which regex I have to use?
You can use a very simple regex like
\S*(\w)(?=\1+)\S*
See how the regex matches at http://regex101.com/r/rF3pR7/3
\S matches anything other than whitespace
* quantifier, zero or more occurrences of \S
(\w) matches a single word character and captures it in \1
(?=\1+) positive lookahead. Asserts that the captured character is followed by itself (\1)
+ quantifier, one or more occurrences of the repeated character
\S* matches zero or more characters other than whitespace
EDIT
If the character must be repeated more than once (that is, appear at least three times in a row), a slight modification of the regex would do the trick
\S*(\w)(?=\1{2,})\S*
for example http://regex101.com/r/rF3pR7/5
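Both variants can be tried with preg_match_all, as the question asks. This is a small PHP sketch; the sample sentence is the one from the question with the word "apple" added as an assumption, to show the difference between the two patterns.

```php
<?php
// Sentence from the question, with "apple" added for illustration.
$str = 'This is an exxxmaple apple oooonnnnllllyyyyy!';

// Any character directly followed by itself (two or more in a row).
preg_match_all('/\S*(\w)(?=\1+)\S*/', $str, $twoOrMore);

// A character followed by itself at least twice (three or more in a row).
preg_match_all('/\S*(\w)(?=\1{2,})\S*/', $str, $threeOrMore);

$doubles = $twoOrMore[0];
$triples = $threeOrMore[0];

var_dump($doubles, $triples);
```

Because \S also matches punctuation, the trailing "!" ends up inside the last match.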
Use this if you want to discard words like apple, etc.
\b\w*(\w)(?=\1\1+)\w*\b
or
\b(?=[^\s]*(\w)\1\1+)\w+\b
Try this. See the demos:
http://regex101.com/r/kP8uF5/20
http://regex101.com/r/kP8uF5/21
You can use this pattern:
\b\w*?(\w)\1{2}\w*
The \w class and the word boundary \b limit the search to words. Note that the word boundary can be removed; however, like the lazy quantifier, it reduces the number of steps needed to obtain a match. Note too that if you are looking for words in the common meaning (letters only), you need to remove the word boundary and use [a-zA-Z] instead of \w, because \b is defined in terms of \w.
(\w)\1{2} checks if a repeated character is present. A word character is captured in group 1 and must be followed twice by the content of the capture group (the backreference \1).
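The three \w-based patterns from the last two answers can be compared side by side. A minimal PHP sketch follows; the sample sentence is the one from the question with "apple" added as an assumption, and the array keys are just labels.

```php
<?php
// Sentence from the question, with "apple" added for illustration.
$str = 'This is an exxxmaple apple oooonnnnllllyyyyy!';

$patterns = [
    'lookahead after capture' => '/\b\w*(\w)(?=\1\1+)\w*\b/',
    'lookahead at word start' => '/\b(?=[^\s]*(\w)\1\1+)\w+\b/',
    'lazy prefix'             => '/\b\w*?(\w)\1{2}\w*/',
];

$words = [];
foreach ($patterns as $label => $pattern) {
    preg_match_all($pattern, $str, $m);
    $words[$label] = $m[0]; // full matches only
}

var_dump($words);
```

All three require the same character three times in a row, so "apple" is skipped, and since they use \w rather than \S the trailing "!" is not part of the last match.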
