Building a regex to capture INT., EXT., INT./EXT., etc - php

I'm working through a bunch of text in which I'm looking for the following strings:
INT.
EXT.
INT./EXT.
EXT./INT.
The text under analysis is, for instance,
17 INT. BLOOM HOUSE - NIGHT 17
27 INT./EXT. BLOOM HOUSE - (PRESENT) DAY 27
Calls in php to, for instance,
preg_match("/^\w.*(INT\.\/EXT\.|EXT\.\/INT\.|EXT\.|INT\.)(.*)$/", $a_line, $matches);
and variants of that don't quite handle the greediness right (or so I think, anyway), and something gets left out, usually INT./EXT. or EXT./INT. items. Any advice? Thanks!

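True, you need lazy dot matching with \w.*?. With the greedy .*, the engine first runs to the end of the line and then backtracks, so it stops at the last position where an alternative matches, which is why INT./EXT. gets captured as just EXT. You can also optimize the pattern to shorten the alternation group like this: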
/^\w.*?(INT\.(?:\/EXT\.)?|EXT\.(?:\/INT\.)?)(.*)$/
See the regex demo
Also, if you are processing the text as a whole, you will need the /m multiline modifier.
Details:
^ - start of a string
\w - a word char
.*? - any 0+ chars other than line break chars as few as possible up to the first
(INT\.(?:\/EXT\.)?|EXT\.(?:\/INT\.)?) - Group 1 capturing either:
INT\.(?:\/EXT\.)? - INT. followed with optional /EXT. substring
| - or
EXT\.(?:\/INT\.)? - EXT. followed with optional /INT. substring
(.*) - Group 2: any 0+ chars other than line break chars up to the...
$ - end of string.
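To see the difference in practice, here is a minimal sketch that runs both the greedy pattern from the question and the lazy one on the two sample lines. The variable names ($lines, $greedy, $lazy, $results, $all) are only illustrative.

```php
<?php
// The two sample lines from the question.
$lines = [
    '17 INT. BLOOM HOUSE - NIGHT 17',
    '27 INT./EXT. BLOOM HOUSE - (PRESENT) DAY 27',
];

$greedy = '/^\w.*(INT\.\/EXT\.|EXT\.\/INT\.|EXT\.|INT\.)(.*)$/';
$lazy   = '/^\w.*?(INT\.(?:\/EXT\.)?|EXT\.(?:\/INT\.)?)(.*)$/';

$results = [];
foreach ($lines as $line) {
    preg_match($greedy, $line, $g);
    preg_match($lazy, $line, $l);
    $results[] = [
        'greedy' => $g[1],       // what the original pattern captures
        'lazy'   => $l[1],       // what the lazy pattern captures
        'rest'   => trim($l[2]), // text after the INT./EXT. marker
    ];
}
print_r($results);

// Processing the whole text at once: add the m modifier so that
// ^ and $ match at the start and end of every line.
$text = implode("\n", $lines);
preg_match_all($lazy . 'm', $text, $all);
print_r($all[1]); // INT. and INT./EXT.
```

On the second line the greedy pattern captures only EXT., while the lazy one captures INT./EXT. as intended.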

Related

Regex optional groups and digit length

Maybe some regex-Master can solve my problem.
I have a big list with many addresses with no separators ( , ; ).
The address string contains following Information:
The first group is the street name
The second group is the street number
The third group is the zipcode (optional)
The last group is the town name (optional)
As you can see on the image above the last two test strings are not matching.
I need the last two regex groups to be optional and the third group should be either 4 or 5 digits.
I tried (\d{4,5}) to allow 4 and 5 digits. But this only works halfway, as you can see here: https://regex101.com/r/ZurqHh/1
(This sometimes mixes the street number and zipcode together)
I also tried (?:\d{5})? to make the third and fourth group optional. But this destroys my whole group layout...
https://regex101.com/r/EgxeMy/1
This is my current regex:
/^([a-zäöüÄÖÜß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+\/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/im
Try it out yourself:
https://regex101.com/r/zC8NCP/1
My brain is only farting at this moment and I can't think straight anymore.
Please help me fix this problem so I can die in peace.
You can use
^(.*?)(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?(?:\s+(\d{4,5})(?:\s+(.*))?)?$
See the regex demo (note all \s are replaced with \h to only match horizontal whitespaces).
Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))? - an optional non-capturing group matching
\s+ - one or more whitespaces
(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b) - Group 2:
\d+ - one or more digits
(?:\s*[-|+\/]\s*\d+)* - zero or more sequences of zero or more whitespaces, -, +, | or /, zero or more whitespaces, one or more digits
\s* - zero or more whitespaces
[a-z]?\b - an optional lowercase ASCII letter and a word boundary
(?:\s+(\d{4,5})(?:\s+(.*))?)? - an optional non-capturing group matching
\s+ - one or more whitespaces
(\d{4,5}) - Group 3: four or five digits
(?:\s+(.*))? - an optional sequence of one or more whitespaces and then any zero or more chars other than line break chars as many as possible
$ - end of string.
Please note that the (?:\s+(.*))? optional group must be inside the (?:\s+(\d{4,5})...)? group to work.
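As a rough sketch of how the groups could be consumed in PHP: the function name parseAddress, the array keys and the sample addresses are my own assumptions (the question's test strings are not reproduced here), and the i flag is added on the assumption that house-number suffixes may be uppercase.

```php
<?php
// Hypothetical helper: returns the four address parts, null for a part that is absent.
function parseAddress(string $address): ?array
{
    $pattern = '~^(.*?)(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?(?:\s+(\d{4,5})(?:\s+(.*))?)?$~i';
    if (!preg_match($pattern, $address, $m)) {
        return null;
    }
    // Optional groups that did not participate are either missing from $m
    // (trailing groups) or empty strings, so normalise both cases to null.
    $get = fn(int $i) => ($m[$i] ?? '') !== '' ? $m[$i] : null;

    return [
        'street' => $get(1),
        'number' => $get(2),
        'zip'    => $get(3),
        'town'   => $get(4),
    ];
}

// Made-up sample addresses, for illustration only.
print_r(parseAddress('Hauptstrasse 12 12345 Berlin'));
print_r(parseAddress('Am Ring 7 1234 Wien'));
print_r(parseAddress('Hauptstrasse 12a'));
```

The last call returns null for zip and town, since the whole optional zipcode group is skipped.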
It is difficult to parse addresses because they sit halfway between formatted text and natural language. Here is a pattern that tries to keep the number of optional parts as small as possible while still succeeding on the examples given, without asking too much of the regex engine. To do this, I mainly rely on character classes, atomic groups, and a relatively accurate description of the street names. Obviously, the examples in the question cannot be representative of every real address, and characters may need to be added to or removed from the classes to deal with new cases. Nevertheless, the structure of this pattern is a good starting point.
~
^
(?<strasse> [\pL\d-]+ \.? (?> \h+ [\pL\d-]+ \.? )*? ) \h*
(?<nummer> \b (?> \d+ | [-+/\h]+ | [a-z] \b )*? )
(?: \h+ (?<plz> \d{4,5} )
\h+ (?<stadt> .+ ) )?
$
~mxui
demo
Note that in the above link you can also see a previous version of this pattern with a more accurate description of the street number (a bit more efficient but longer).

How do I get the number from the second match group?

Text for example
data=1 type=old
data=2 type=test (2)
type=test data=3 (3)
I need to get the data ID from lines 2 and 3
My code:
(data=([\d]+)|type=test)\s+(?!\1)((?1))
but it doesn't get data=3
You need the g (global) and m (multiline) flags in your regex:
/(data=([\d]+)|type=test)\s+(?!\1)((?1))/gm
In the most simple form you may use
^(?=.*type=test).*data=(\d+)
See the regex demo
You may add word/whitespace boundaries later if necessary, e.g.
^(?=.*\btype=test\b).*\bdata=(\d+)\b
^(?=.*(?<!\S)type=test(?!\S)).*(?<!\S)data=(\d+)(?!\S)
The point is
^ - start of string
(?=.*type=test) - there must be type=test after any 0+ chars as many as possible to the right of the current position
.* - any 0+ chars other than line break chars as many as possible
data= - a string
(\d+) - Group 1: 1+ digits
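A small PHP sketch of the lookahead approach, run on the three sample lines from the question. The m modifier is my addition so that ^ matches at the start of each line, and preg_match_all collects every match, which takes the place of a g flag in PHP.

```php
<?php
// Sample text from the question, one record per line.
$text = "data=1 type=old\ndata=2 type=test (2)\ntype=test data=3 (3)";

// The lookahead checks that the line contains type=test,
// then the main pattern captures the digits after data=.
preg_match_all('/^(?=.*type=test).*data=(\d+)/m', $text, $matches);

$ids = $matches[1];
print_r($ids); // contains "2" and "3"; the type=old line is skipped
```

Because the lookahead does not consume any text, the order of data= and type= within the line does not matter.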

Need help building a regex to accept two forms of strings

I am looking to build a regular expression to parse a string, which can be of one of the following two forms: -
Part 1 (Part 2 - Part 3)
or
Part 1 (Part 2)
The following regular expression matches first string and captures all three parts
(.*)\((.*)(?:-)(.*)\)
But I am unable to improve it so that it matches both strings. I want one regex to match both strings. Not sure if it is even possible.
You may use
'~(.*)\((.*?)(?:-(.*))?\)~'
See the regex demo
Details
(.*) - Group 1: any 0+ chars other than line break chars, as many as possible
\( - a ( char
(.*?) - Group 2: any 0+ chars other than line break chars, as few as possible
(?:-(.*))? - an optional group matching a - and then capturing into Group 3 any 0+ chars other than line break chars, as many as possible
\) - a ) char.
If there can be no other parentheses than those shown in the string, you may optimize the pattern to ^([^()]*)\(([^()-]*)(?:-([^()]*))?\)$.
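For illustration, a minimal sketch of using the first pattern from PHP. The function name splitParts is hypothetical, and trimming the captured parts is my own addition.

```php
<?php
// Hypothetical helper: returns [part1, part2, part3], with part3 null when absent.
function splitParts(string $s): ?array
{
    if (!preg_match('~(.*)\((.*?)(?:-(.*))?\)~', $s, $m)) {
        return null;
    }
    // Group 3 is the last group, so it is simply missing from $m
    // when the optional "-..." part did not match.
    return [trim($m[1]), trim($m[2]), isset($m[3]) ? trim($m[3]) : null];
}

print_r(splitParts('Part 1 (Part 2 - Part 3)'));
print_r(splitParts('Part 1 (Part 2)'));
```

The second call returns null as the third element, which makes it easy to tell the two forms apart.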

PHP Pattern Validation

I'm having a bit of trouble getting my pattern to validate the string entry correctly. The PHP portion of this assignment is working correctly, so I won't include that here as to make this easier to read. Can someone tell me why this pattern isn't matching what I'm trying to do?
This pattern has these validation requirements:
Should first have 3-6 lowercase letters
This is immediately followed by either a hyphen or a space
Followed by 1-3 digits
$codecheck = '/^([[:lower:]]{3,6}-)|([[:lower:]]{3,6} ?)\d{1,3}$/';
Currently this catches most of the requirements, but it only seems to validate the minimum character requirements - and doesn't return false when more than 6 or 3 characters (respectively) are entered.
Thanks in advance for any assistance!
The problem here lies in how you group the alternatives. Right now, the regex matches a string that
^([[:lower:]]{3,6}-) - starts with 3-6 lowercase letters followed with a hyphen
| - or
([[:lower:]]{3,6} ?)\d{1,3}$ - ends with 3-6 lowercase letters followed with an optional space and followed with 1-3 digits.
In fact, you can get rid of the alternation altogether:
$codecheck = '/^\p{Ll}{3,6}[- ]\d{1,3}$/';
See the regex demo
Explanation:
^ - start of string
\p{Ll}{3,6} - 3-6 lowercase letters
[- ] - a positive character class matching one character, either a hyphen or a space
\d{1,3} - 1-3 digits
$ - end of string
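A quick way to check the pattern against the three requirements is a small sketch like this. The function name isValidCode and the test values are made up for illustration.

```php
<?php
// Hypothetical wrapper around the pattern above.
function isValidCode(string $code): bool
{
    return preg_match('/^\p{Ll}{3,6}[- ]\d{1,3}$/', $code) === 1;
}

var_dump(isValidCode('abc-123'));   // true
var_dump(isValidCode('abcdef 7'));  // true
var_dump(isValidCode('abcdefg-1')); // false: 7 letters
var_dump(isValidCode('abc-1234'));  // false: 4 digits
var_dump(isValidCode('abc1'));      // false: no hyphen or space
```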
You need to delimit the scope of the | operator in the middle of your regex.
As it is now:
the right-side argument of that OR runs up until the very end of your regex, even including the $. So neither the digits nor the end-of-string condition applies to the left side of the |.
the left-side argument of the OR starts with ^, and only applies to the left side.
That is why you get a match when you supply 7 lowercase characters. The first character is ignored, and the rest matches with the right-side of the regex pattern.
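The effect of the unscoped | can be reproduced with a short sketch. The $grouped pattern is only one possible way to delimit the alternation, using a non-capturing group, and the test strings are made up.

```php
<?php
// Pattern from the question: the | splits the whole regex in two.
$original = '/^([[:lower:]]{3,6}-)|([[:lower:]]{3,6} ?)\d{1,3}$/';
// One possible fix (an assumption): wrap the alternatives in a group
// so that ^ and \d{1,3}$ apply to both of them.
$grouped  = '/^(?:[[:lower:]]{3,6}-|[[:lower:]]{3,6} )\d{1,3}$/';

$tests = ['abc-12', 'abcdefg1', 'abc-1234567'];
$report = [];
foreach ($tests as $t) {
    $report[$t] = [
        'original' => preg_match($original, $t),
        'grouped'  => preg_match($grouped, $t),
    ];
}
print_r($report);
```

With the original pattern, abcdefg1 matches through the right-hand alternative starting at the second letter, and abc-1234567 matches through the left-hand alternative, which never looks at the digits. The grouped pattern rejects both.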

Add minimum characters to 'bad word' regex?

I made a regex that captures 'bad words' and substitutes with *** so I can return to user in a form if bad words found, a simplified version can be found here:
https://regex101.com/r/alEb61/3
(?i)\b(Bitch)\b
I'd like to also require a minimum of 25 characters in the same regex, instead of running two separate passes over the input (1: bad words, 2: enough characters). Is that possible? I basically need to add a "fewer than 25 characters" alternative to the pattern above.
A regex quantifier has the form {min,max}, so {1,15} means a minimum of 1 and a maximum of 15 repetitions.
I'd make a list of "bad words" and then require that at least one of them is present.
As far as the length limit goes, /^[word]{1,15}$/ requires 1 to 15 characters, each taken from the character class [word] (w, o, r or d).
Check out this post: Profanity Filter using a Regular Expression (list of 100 words)
If you plan to replace any bad word on your list, as well as any whole string shorter than 25 chars, use
$s = preg_replace('~^.{0,24}$|\b(?:badWord1|badWordN)\b~i', 'CENSURED', $s);
See the regex demo.
Details
^.{0,24}$ - first alternative
| - or
\b(?:badWord1|badWordN)\b - the second alternative:
\b - leading word boundary
(?: - start of an alternation non-capturing group
badWord1 - bad word #1
| - or
badWordN - bad word N
) - end of the group
\b - a trailing word boundary.
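Here is a minimal sketch of that replacement wrapped in a function. The name censor and the sample strings are hypothetical, and badWord1 / badWordN remain placeholders for a real word list.

```php
<?php
// Hypothetical wrapper: replaces either a too-short string as a whole
// or each listed bad word inside a longer string.
function censor(string $s): string
{
    return preg_replace('~^.{0,24}$|\b(?:badWord1|badWordN)\b~i', 'CENSURED', $s);
}

echo censor('too short'), "\n";
// CENSURED
echo censor('This sentence is long enough but contains badWord1 in it.'), "\n";
// This sentence is long enough but contains CENSURED in it.
echo censor('This sentence is long enough and perfectly clean.'), "\n";
// unchanged
```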
If you plan to match any string longer than 24 chars and not having bad words in it, use
'/^(?!.*\bbadword\b).{25,}$/s'
It will match a string that has at least 25 chars and does not contain badword as a whole word.
See a regex demo.
Details
^ - start of string
(?!.*\bbadword\b) - a negative lookahead that fails the match if after any 0+ chars there is a whole word badword
.{25,} - any 25 or more chars (including line break chars, due to the s flag)
$ - end of string.
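The validation variant could be used like this. It is a sketch only: the function name isAcceptable and the sample sentences are made up, and badword stands in for the real word list.

```php
<?php
// Hypothetical check: at least 25 chars and no whole-word "badword".
function isAcceptable(string $s): bool
{
    return preg_match('/^(?!.*\bbadword\b).{25,}$/s', $s) === 1;
}

var_dump(isAcceptable('This is a perfectly fine long sentence.'));          // true
var_dump(isAcceptable('Too short'));                                        // false
var_dump(isAcceptable('This long sentence contains a badword inside it.')); // false
```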
In the end I created my own version, as what I wanted to do was only capture matches IF there was a "bad word" or if there were fewer than X characters.
^(?i)(?P<Words>\bBadWord1|BadWordN\b)|(?P<Characters>^.{0,25}$)$
which can be tested here
This served my purpose as
if there are no bad words and > 25 chars it returns no matches and the substitution is not even needed (but can be used)
If there are bad words, it indicates that and also substitutes them with *, so I can replace the user input text with an alert to replace 'Bad Words'. I know this is the error since the Capture Group is named Words
If there are no bad words but not enough characters it will return the Capture Group as Characters so I can return that alert instead.
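The three outcomes can be told apart in PHP by checking which named group took part in the match. This is a sketch under my own assumptions, not the exact pattern above: the bad words are wrapped in a group with word boundaries on both sides, the leading ^ is dropped so bad words are found anywhere, and {0,24} is used for "fewer than 25 characters". The function name classifyInput and the return labels are hypothetical.

```php
<?php
// Hypothetical classifier: 'ok', 'bad_words' or 'too_short'.
function classifyInput(string $text): string
{
    $pattern = '~(?P<Words>\b(?:BadWord1|BadWordN)\b)|(?P<Characters>^.{0,24}$)~i';
    if (!preg_match($pattern, $text, $m)) {
        return 'ok'; // long enough and no bad words
    }
    // When the Characters alternative matched, Words is an empty string.
    return !empty($m['Words']) ? 'bad_words' : 'too_short';
}

echo classifyInput('This text is long enough and perfectly clean.'), "\n";    // ok
echo classifyInput('This text is long enough but has BadWord1 in it.'), "\n"; // bad_words
echo classifyInput('too short'), "\n";                                        // too_short
```

Note that a short string containing a bad word can be reported under either label, depending on where the bad word sits, because the regex engine returns the leftmost match.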
