How can I capture phrases in regex [duplicate] - php

I have this regex, have (?=(an|the) agreement), to find "have an/the agreement", but I also want to capture "have agreement". How can I achieve that?

You may make the middle portion of the regular expression optional
\bhave(?: (?:an|the))? agreement\b

Your regex matches neither "have an agreement" nor "have the agreement"; it matches just "have " when that is followed by "an agreement" or "the agreement".
A proper regex for all your needs would be
have (?:(?:an|the) )?agreement
where you have two non-capturing groups (started with (?:)):
the inner group (?:an|the) matches either "an" or "the",
the outer (?:(?:an|the) ) matches the inner group plus a space after it
The outer non-capturing group is completely optional itself, marked by the ? after it, so you also can have just "have agreement" - with only one space between the two words.
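As a quick check of the pattern above, here is a minimal PHP sketch. The sample phrases and variable names are made up for illustration, and the \b word boundaries are borrowed from the other answer.

```php
<?php
// Optional article: "an " or "the " (with its trailing space) may be absent.
$pattern = '/\bhave (?:(?:an|the) )?agreement\b/';

$phrases = [
    'we have an agreement',
    'we have the agreement',
    'we have agreement',
    'we have no agreement',   // "no" is not one of the allowed articles
];

$results = [];
foreach ($phrases as $phrase) {
    // preg_match() returns 1 when the pattern matches, 0 when it does not
    $results[$phrase] = preg_match($pattern, $phrase);
    echo $phrase, ' => ', $results[$phrase], PHP_EOL;
}
```

Because the space sits inside the optional group, "have agreement" needs only the single space that follows "have".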

Related

How to Retrieve Overlapping Matches with Complex Regex and Preg_Match_All in PHP

I have read the following posts, which have some overlap (pun intended!) with the issue I am facing:
preg_match_all how to get *all* combinations? Even overlapping ones
Overlapping matches with preg_match_all and pattern ending with repeated character
However, I don’t really know how to apply their answers to my issue which is a little more complicated.
My regex that I use with preg_match_all():
/.{240}[^\[]Order[^ ][^\(].{9}/u
With the following string:
56A.  Subject to the provisions of this Act, any decision of the Court or the Appeal Board shall be final and conclusive, and no decision or order of the Court or the Appeal Board shall be challenged, appealed against, reviewed, quashed or called into question in any court and shall not be subject to any Quashing Order, Prohibiting Order, Mandatory Order or injunction in any court on any account.[20/99; 42/2005]
I intended it to match exactly 3 times. The first match has “Quashing Order” 9 characters before the end. The second match has “Prohibiting Order” 9 characters before the end. The third match has “Mandatory Order” 9 characters before the end.
However, as expected it’s only matching the first one, as the expected matches are overlapping.
I applied what I read in the other posts, I tried this:
(?=(.{240}[^\[]Order[^ ][^\(].{9}))
I still don’t get what I need.
How do I solve this?
You can use
\w+\s+Order\b
Regex details
\w+ - one or more word chars
\s+ - 1 or more whitespaces
Order\b - a whole word Order, as \b is a word boundary.
You will need to use a positive look-behind assertion for .{240}, just like the answer you found suggests using a positive look-ahead assertion for .{9}:
/(?<=.{240})[^\[]Order[^ ][^\(](?=.{9})/u
This RE matches your string only twice because of [^ ], as @bobblebubble said. Adjust that part as necessary.
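To see why a consuming pattern misses overlapping matches while the lookahead-wrapped form tried in the question finds them, here is a scaled-down PHP sketch. The shortened subject, the context widths (10 characters before and 9 after instead of 240 and 9), and the omission of the [^\[], [^ ] and [^\(] parts are all simplifications assumed for illustration.

```php
<?php
// Shortened subject; the real string in the question is much longer.
$subject = 'any Quashing Order, Prohibiting Order, Mandatory Order or injunction';

// Consuming pattern: each match swallows text the next match would need.
preg_match_all('/.{10}Order.{9}/u', $subject, $plain);

// Same pattern inside a lookahead: the match itself is empty, so the engine
// retries at every position, and the text of interest is captured in group 1.
preg_match_all('/(?=(.{10}Order.{9}))/u', $subject, $overlap);

echo count($plain[0]), ' consuming matches', PHP_EOL;
echo count($overlap[1]), ' overlapping matches', PHP_EOL;
print_r($overlap[1]);
```

With the lookahead form the results are read from $overlap[1], since $overlap[0] contains only empty strings.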

Identify the (weat)her using regexr tool

I am using this tool http://regexr.com/3fvg9
I want to mark this (weat)her in the regexr tool.
(weath)er is good. // I want to mark this word
(weather is go)od. // I want to mark this word
Please help me.
Since there is no way to check with a regex if a word is "known" or not, I suggest extracting these parts you need first and then use a kind of a spelling dictionary to check if the words are correct. It won't be 100% accurate, but still better than pure regex.
The expression you need to extract the parts of glued words with parentheses is
(?|([a-zA-Z0-9]+)\(([a-zA-Z\s]+)\)|\(([a-zA-Z\s]+)\)([a-zA-Z0-9]+))
See the regex demo at regex101 that supports PHP regex.
The regex matches 2 alternatives inside a branch reset group inside which all capturing groups in different branches are numbered starting with the same ID:
([a-zA-Z0-9]+)\(([a-zA-Z\s]+)\) - Group 1 (([a-zA-Z0-9]+)) matching 1+ alphanumeric chars, then (, and then Group 2 (([a-zA-Z\s]+)) matching 1+ letters and whitespaces and then a ) is matched
| - or
\(([a-zA-Z\s]+)\)([a-zA-Z0-9]+) - a (, then Group 1 (([a-zA-Z\s]+)) matching 1+ letters and whitespaces, ), and then Group 2 (([a-zA-Z0-9]+)) matching 1+ alphanumeric chars
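Here is a small PHP sketch of how the branch reset group behaves with preg_match(). The first two samples come from the question; 'weat(her)' is a made-up sample that exercises the first alternative.

```php
<?php
// Branch reset (?|...): both alternatives number their groups starting from 1.
$pattern = '/(?|([a-zA-Z0-9]+)\(([a-zA-Z\s]+)\)|\(([a-zA-Z\s]+)\)([a-zA-Z0-9]+))/';

$samples = ['(weath)er is good.', '(weather is go)od.', 'weat(her)'];

$parts = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m)) {
        // $m[1] and $m[2] hold the two glued parts in the order they appear,
        // regardless of which alternative matched
        $parts[$sample] = [$m[1], $m[2]];
    }
}
print_r($parts);
```

The joined parts ($m[1] . $m[2]) could then be passed to whatever dictionary check you choose.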

How to construct a regex expression for username validation? [closed]

I'm working on constructing a PHP regex expression for a username validation that has the following constraints:
-must be 10-16 characters long, with a combo of alphabetic, numeric, and at least one special character (*&^-_$)
-can't start with a numeric or special character
CORRECTION: the last SIX digits must be a month/date birthday (MMYYYY format). In order to further validate the username, the month/date must show the username is over 18 - if not, the username will not validate. Thank you in advance for any assistance! I've been stuck on this for a while.
Solution
You can do this with the following regex:
/(?=^.{10,16}$)(?=.+?[*&^_$-])[a-z].+?[01]\d{3}$/i
Explanation
/ delimiter
(?=^.{10,16}$) ensures there are 10-16 characters, start to finish
(?= starts a lookahead group
^ start of the string
.{10,16} ten to sixteen characters
$ end of the string
) ends the lookahead group
(?=.+?[*&^_$-]) ensures there is at least one special character in the set *&^_$-, and it's not first
(?= starts a lookahead group
.+? one or more characters, non-greedy
[*&^_$-] any character in the set *&^_$- (note the order; you must put - first or last, or escape it as \-)
) ends the lookahead group
[a-z] start with a letter
.+? match any characters in a non-greedy fashion, giving back as needed
[01]\d{3} match a 0 or 1 then 3 more digits
$ match the end of the string
/ closing delimiter
i make the match case-insensitive
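The breakdown above can be tried out with a short PHP sketch. The usernames are invented test cases, and the array name is only for illustration.

```php
<?php
// Regex from the answer, unchanged.
$pattern = '/(?=^.{10,16}$)(?=.+?[*&^_$-])[a-z].+?[01]\d{3}$/i';

$usernames = [
    'john_doe0199',   // letter first, has "_", ends in 0199
    '1john_doe0199',  // starts with a digit
    'johndoexx0199',  // no special character
    'jo_0199',        // shorter than 10 characters
    'john_doe2199',   // last four characters do not start with 0 or 1
];

$valid = [];
foreach ($usernames as $name) {
    // 1 if the username passes every check in the pattern, 0 otherwise
    $valid[$name] = preg_match($pattern, $name);
    echo $name, ' => ', $valid[$name], PHP_EOL;
}
```

The pattern only checks the shape of the last four characters; any age check on the birthday would have to happen in code after the match.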
Some Notes on Regex Construction
Note that there are multiple valid ways to do this. For pure efficiency, the solution above could be simplified somewhat to cut out a few steps for the processor.
But for readability, I like to go with something like the above. It's clear what each block, character set, or group does, which makes it readable and maintainable.
Something like /^[a-z](?=.*?[*&^_$-])[a-z0-9*&^_$-]{5,11}[01]\d{3}$/i is hard to read and understand. What if you want to allow a 17 character username? You have to do a bunch of math to determine that you should change {5,11} to {5,12}. Or if you decide to allow the character #, you'd have to add it in two places (which, by the way, means that the regex already violates the DRY principle).
Bonus: Why Your Attempt Failed
You said in a comment that you tried this:
(?=^.{10,16}$)^[a-z][\d]*[_$^&*]?[a-z0-9]+
The first part, (?=^.{10,16}$), is fine. So is ^[a-z].
But [\d]* only matches zero or more digits; it wouldn't match a letter or special character. So, for example, a&a... would fail.
And [_$^&*]? only matches zero or one special characters. It would allow a username with no special characters to pass, but would fail one with 2 special characters.
[a-z0-9]+ only matches those characters, and you omit your last-four-characters-must-be-digits requirement.
You might find the explanation on regex101.com of your regex helpful. (Note: I have no affiliation with that site.)
You can use this regex :
^[a-zA-Z](?=.*[*&^_$-])[\w*&^$-]{5,11}[01]\d{3}$
Regex Breakup:
^ # Line start
[a-zA-Z] # match a letter
(?=.*[*&^_$-]) # lookahead to ensure there is at least one special char
[\w*&^$-]{5,11} # match 5 to 11 of allowed chars
[01]\d{3} # match digits 0/1 followed by 3 digits
$ # Line end
I used the quantifier {5,11} because one character is matched at the start and four are matched at the end, which takes 5 positions out of the desired {10,16} length.

PHP regex non-capture non-match group

I'm making a date matching regex, and it's all going pretty well, I've got this so far:
"/(?:[0-3])?[0-9]-(?:[0-1])?[0-9]-(?:20)[0-1][0-9]/"
It will (hopefully) match single or double digit days and months, and double or quadruple digit years in the 21st century. A few trials and errors have gotten me this far.
But, I've got two simple questions regarding these results:
(?: ) what is a simple explanation for this? Apparently it's a non-matching group. But then...
What is the trailing ? for? e.g. (?: )?
This is a comment and an answer.
The answer part... I do agree with alex' earlier answer.
(?: ), in contrast to ( ), is used to avoid capturing text, generally so as to have fewer back references thrown in with those you do want or to improve speed performance.
The ? following the (?: ) -- or when following anything except * + ? or {} -- means that the preceding item may or may not be found within a legitimate match. Eg, /z34?/ will match z3 as well as z34, but it won't match z on its own; given z35, it matches only the z3 part.
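A small PHP sketch of both points follows. The anchored variant /^z34?$/ and the sample strings are additions for illustration; the two date fragments are cut down from the pattern in the question.

```php
<?php
// The ? makes only the item right before it (the "4") optional.
$anchored = [];
foreach (['z3', 'z34', 'z35', 'z'] as $s) {
    // ^ and $ force the whole string to fit the pattern
    $anchored[$s] = preg_match('/^z34?$/', $s);
}

// Unanchored, the pattern still finds "z3" inside "z35".
preg_match('/z34?/', 'z35', $partial);

// (?: ) groups without capturing; ( ) groups and captures.
preg_match('/(?:[0-3])?[0-9]-(?:[0-1])?[0-9]/', '17-11', $nonCapturing);
preg_match('/([0-3])?[0-9]-([0-1])?[0-9]/', '17-11', $capturing);

print_r($nonCapturing); // only the whole match is stored
print_r($capturing);    // whole match plus the two optional leading digits
```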
The comment part... I made what might be considered improvements to the regex you were working on:
(?:^|\s)(0?[1-9]|[1-2][0-9]|30|31)-(0?[1-9]|10|11|12)-((?:20)?[0-9][0-9])(?:\s|$)
-- First, it avoids things like 0-0-2011
-- Second, it avoids things like 233443-4-201154564
-- Third, it includes things like 1-1-2022
-- Fourth, it includes things like 1-1-11
-- Fifth, it avoids things like 34-4-11
-- Sixth, it allows you to capture the day, month, and year so you can refer to these more easily in code. Such code could, for example, do a further check to see whether a Feb 29 date qualifies: if the second captured group is 2, then either the first captured group is 29 and the year is a leap year, or the first captured group is less than 29.
Finally, note that you'll still get dates that won't exist, eg, 31-6-11. If you want to avoid these, then try:
(?:^|\s)(?:(?:(0?[1-9]|[1-2][0-9]|30|31)-(0?[13578]|10|12))|(?:(0?[1-9]|[1-2][0-9]|30)-(0?[469]|11))|(?:(0?[1-9]|[1-2][0-9])-(0?2)))-((?:20)?[0-9][0-9])(?:\s|$)
Also, I assumed the dates would be preceded and followed by a space (or beginning/end of line), but you may want to adjust that (eg, to allow punctuation).
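The "further check in code" idea can be sketched in PHP using the captured groups and the built-in checkdate(). The helper name parseDmy and the rule that two-digit years mean 20xx are assumptions made for this example, not part of the answer.

```php
<?php
// First pattern from the answer: day, month and year land in groups 1, 2 and 3.
const DATE_PATTERN = '/(?:^|\s)(0?[1-9]|[1-2][0-9]|30|31)-(0?[1-9]|10|11|12)-((?:20)?[0-9][0-9])(?:\s|$)/';

// Hypothetical helper: returns [day, month, year], or null if the text holds
// no date of the expected shape or the date does not exist on the calendar.
function parseDmy(string $text): ?array
{
    if (!preg_match(DATE_PATTERN, $text, $m)) {
        return null;
    }
    $day   = (int) $m[1];
    $month = (int) $m[2];
    $year  = (int) $m[3];
    if ($year < 100) {
        $year += 2000; // assumption: two-digit years are read as 20xx
    }
    // checkdate(month, day, year) knows month lengths and leap years
    return checkdate($month, $day, $year) ? [$day, $month, $year] : null;
}

var_dump(parseDmy('29-2-2012'), parseDmy('31-6-11'));
```

Leaving the calendar rules to checkdate() keeps the regex short, at the cost of a second step after matching.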
A commenter elsewhere referenced this resource which you might find useful:
http://rubular.com/
It is a non-capturing group. You cannot back-reference it. It is usually used to declutter backreferences and/or increase performance.
It means the preceding group is optional.
Subpatterns
Subpatterns are delimited by parentheses (round brackets), which can be nested. Marking part of a pattern as a subpattern does two things:
It localizes a set of alternatives. For example, the pattern cat(aract|erpillar|) matches one of the words "cat", "cataract", or "caterpillar". Without the parentheses, it would match "cataract", "erpillar" or the empty string.
It sets up the subpattern as a capturing subpattern (as defined above). When the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain the numbers of the capturing subpatterns.
For example, if the string "the red king" is matched against the pattern the ((red|white) (king|queen)) the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3.
The fact that plain parentheses fulfill two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by "?:", the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns. For example, if the string "the white queen" is matched against the pattern the ((?:red|white) (king|queen)) the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of captured substrings is 65535. It may not be possible to compile such large patterns, however, depending on the configuration options of libpcre.
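The two manual examples can be reproduced with preg_match(); this is just a sketch, and the variable names are arbitrary.

```php
<?php
// Plain parentheses group and capture: three numbered groups.
preg_match('/the ((red|white) (king|queen))/', 'the red king', $capturing);

// (?: ) groups without capturing, so only two groups are numbered.
preg_match('/the ((?:red|white) (king|queen))/', 'the white queen', $mixed);

print_r($capturing); // index 0 is the whole match, 1..3 are the groups
print_r($mixed);     // index 0 is the whole match, 1..2 are the groups
```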
As a convenient shorthand, if any option settings are required at the start of a non-capturing subpattern, the option letters may appear between the "?" and the ":". Thus the two patterns
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
match exactly the same set of strings. Because alternative branches are tried from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday".
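A short PHP sketch to confirm that both spellings behave the same way; the subject strings are chosen for illustration.

```php
<?php
$patterns = ['/(?i:saturday|sunday)/', '/(?:(?i)saturday|sunday)/'];
$subjects = ['Saturday', 'SUNDAY', 'sunday'];

$hits = [];
foreach ($patterns as $pattern) {
    foreach ($subjects as $subject) {
        // The case-insensitive option covers both branches in either form
        $hits[$pattern][$subject] = preg_match($pattern, $subject);
    }
}
print_r($hits);
```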
It is possible to name a subpattern using the syntax (?P<name>pattern). This subpattern will then be indexed in the matches array by its normal numeric position and also by name. PHP 5.2.2 introduced two alternative syntaxes (?<name>pattern) and (?'name'pattern).
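For illustration, here is a sketch that uses all three naming syntaxes in one pattern. The group names and the date string are made up.

```php
<?php
// One group per syntax: (?P<name>...), (?<name>...) and (?'name'...).
$pattern = '/(?P<day>\d{1,2})-(?<month>\d{1,2})-(?\'year\'\d{4})/';

preg_match($pattern, '29-2-2012', $m);

// Each named group is available under its name and its numeric position.
echo $m['day'], ' ', $m[1], PHP_EOL;
echo $m['month'], ' ', $m[2], PHP_EOL;
echo $m['year'], ' ', $m[3], PHP_EOL;
```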
Sometimes it is necessary to have multiple matching, but alternating subgroups in a regular expression. Normally, each of these would be given their own backreference number even though only one of them would ever possibly match. To overcome this, the (?| syntax allows having duplicate numbers. Consider the following regex matched against the string Sunday:
(?:(Sat)ur|(Sun))day
Here Sun is stored in backreference 2, while backreference 1 is empty. Matching Saturday yields Sat in backreference 1 while backreference 2 does not exist. Changing the pattern to use the (?| fixes this problem:
(?|(Sat)ur|(Sun))day
Using this pattern, both Sun and Sat would be stored in backreference 1.
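The difference between the two patterns shows up directly in the matches array; a minimal sketch with arbitrary variable names:

```php
<?php
// Without branch reset, each alternative gets its own group number.
preg_match('/(?:(Sat)ur|(Sun))day/', 'Sunday', $plainSun);
preg_match('/(?:(Sat)ur|(Sun))day/', 'Saturday', $plainSat);

// With (?| both alternatives share group number 1.
preg_match('/(?|(Sat)ur|(Sun))day/', 'Sunday', $resetSun);
preg_match('/(?|(Sat)ur|(Sun))day/', 'Saturday', $resetSat);

print_r($plainSun); // group 1 is an empty string, group 2 holds "Sun"
print_r($resetSun); // group 1 holds "Sun"
```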
Reference : http://php.net/manual/en/regexp.reference.subpatterns.php

Ignore common words (the, and) in MySQL REGEXP query

I'm trying to query a database of Book titles based on the first letter of the title. However, I want to ignore common words such as "The" and "A".
So when searching for books that start with the letter "T"
"The Adventures of Huck Finn" - would NOT be matched
"Transformation of a Runner" - would be matched
I'm not very experienced with REGEX, but this is what I have so far (where $first_letter could equal 't')
... WHERE title REGEXP '^[(a )(the )]*[$first_letter]' ...
This successfully matches book titles that start with a particular letter even after the words "A" or "The", but doesn't ignore those words. So if $first_letter='t', it would match BOTH books mentioned above.
I've tried googling it, but haven't found any solutions. Any help would be greatly appreciated.
Thanks in advance.
Kevin
Read about MySQL full text search
The regular expression you've written isn't valid. []s are used to denote what is called a character class. Everything you enter between the brackets (with some characters potentially needing to be escaped, such as the literal characters [ and ]) is treated as standing-in for a single character.
Edit: after re-reading my answer, I realized lookaround isn't a good way to approach this.
The functionality you're looking for is called negative lookahead, negative lookbehind, or some similar variant. I'm unsure whether MySQL's regex flavor supports it, and I don't think it would be a good fit for this problem anyway.
Alternatively, you could do a regex that looks like this:
^((a|the|of|and) )?[letter of interest]
The breakdown:
There are two groups
The inner-most group looks for instances of words you want to ignore
The outer-most group just adds a space to the end of that
The ? asserts that there could be 0 or 1 instances of this group
You'll have to do the legwork of translating this into MySQL regex syntax yourself. My apologies.
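Before translating, it may help to try the pattern in PHP (PCRE, not MySQL). This is a sketch only: the titles beyond the two in the question are invented, $letter stands in for $first_letter, and the possessive ?+ variant is a PCRE technique that is not part of the answer. Whether MySQL's REGEXP accepts it depends on the server version and is not verified here.

```php
<?php
$titles = [
    'The Adventures of Huck Finn',
    'Transformation of a Runner',
    'A Tale of Two Cities',
    'The Tempest',
];
$letter = 't'; // hypothetical stand-in for $first_letter

// Pattern from the answer. Because the group is optional, the engine may skip
// it, so the "T" of a leading "The" can still satisfy the letter.
$optional = '/^((a|the|of|and) )?' . preg_quote($letter, '/') . '/i';

// Possessive variant (?+): once a leading word has been consumed, the engine
// is not allowed to go back and try the match without it.
$possessive = '/^(?:(?:a|the|of|and) )?+' . preg_quote($letter, '/') . '/i';

$optionalHits   = preg_grep($optional, $titles);
$possessiveHits = preg_grep($possessive, $titles);

print_r($optionalHits);
print_r($possessiveHits);
```

With the optional group alone, "The Adventures of Huck Finn" still matches for "t"; the possessive form is one way to rule it out while keeping "The Tempest".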
