Need help building a regex to accept two forms of strings - php

I want to build a regular expression to parse a string that can take one of the following two forms:
Part 1 (Part 2 - Part 3)
or
Part 1 (Part 2)
The following regular expression matches the first string and captures all three parts:
(.*)\((.*)(?:-)(.*)\)
But I am unable to adapt it so that it matches both strings. I want a single regex to match both forms, and I am not sure whether that is even possible.

You may use
'~(.*)\((.*?)(?:-(.*))?\)~'
See the regex demo
Details
(.*) - Group 1: any 0+ chars other than line break chars, as many as possible
\( - a ( char
(.*?) - Group 2: any 0+ chars other than line break chars, as few as possible
(?:-(.*))? - an optional group matching a - and then capturing into Group 3 any 0+ chars other than line break chars, as many as possible
\) - a ) char.
If there can be no other parentheses than those shown in the string, you may optimize the pattern to ^([^()]*)\(([^()-]*)(?:-([^()]*))?\)$.
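As a minimal sketch of how the pattern above could be used from PHP (the helper name parseParts and the trimming of each part are assumptions for illustration, not part of the answer):

```php
<?php
// The single pattern from the answer; group 3 is optional.
$pattern = '~(.*)\((.*?)(?:-(.*))?\)~';

// Hypothetical helper: returns the trimmed parts, with null for a missing Part 3.
function parseParts(string $pattern, string $input): ?array
{
    if (!preg_match($pattern, $input, $m)) {
        return null;
    }
    return [
        trim($m[1]),
        trim($m[2]),
        // A trailing group that did not take part in the match is absent from $m.
        isset($m[3]) ? trim($m[3]) : null,
    ];
}

$withThree = parseParts($pattern, 'Part 1 (Part 2 - Part 3)');
$withTwo   = parseParts($pattern, 'Part 1 (Part 2)');

print_r($withThree);
print_r($withTwo);
```

Checking isset($m[3]) is what tells the two forms apart.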

Related

Regex optional groups and digit length

Maybe some regex-Master can solve my problem.
I have a big list of addresses with no separators ( , ; ).
The address string contains following Information:
The first group is the street name
The second group is the street number
The third group is the zipcode (optional)
The last group is the town name (optional)
As you can see on the image above the last two test strings are not matching.
I need the last two regex groups to be optional and the third group should be either 4 or 5 digits.
I tried (\d{4,5}) to allow 4 or 5 digits, but this only works halfway, as you can see here: https://regex101.com/r/ZurqHh/1
(This sometimes mixes the street number and zipcode together.)
I also tried (?:\d{5})? to make the third and fourth groups optional, but this destroys my whole group layout:
https://regex101.com/r/EgxeMy/1
This is my current regex:
/^([a-zäöüÄÖÜß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+\/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/im
Try it out yourself:
https://regex101.com/r/zC8NCP/1
My brain is only farting at this moment and I can't think straight anymore.
Please help me fix this problem so I can die in peace.
You can use
^(.*?)(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?(?:\s+(\d{4,5})(?:\s+(.*))?)?$
See the regex demo (note that in the demo all \s are replaced with \h so that only horizontal whitespace is matched).
Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))? - an optional non-capturing group matching
    \s+ - one or more whitespaces
    (\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b) - Group 2:
        \d+ - one or more digits
        (?:\s*[-|+\/]\s*\d+)* - zero or more sequences of zero or more whitespaces, -, +, | or /, zero or more whitespaces, one or more digits
        \s* - zero or more whitespaces
        [a-z]?\b - an optional lowercase ASCII letter and a word boundary
(?:\s+(\d{4,5})(?:\s+(.*))?)? - an optional non-capturing group matching
    \s+ - one or more whitespaces
    (\d{4,5}) - Group 3: four or five digits
    (?:\s+(.*))? - an optional sequence of one or more whitespaces and then Group 4: any zero or more chars other than line break chars, as many as possible
$ - end of string.
Please note that the (?:\s+(.*))? optional group must be inside the (?:\s+(\d{4,5})...)? group to work.
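A rough sketch of the pattern above in PHP. The sample addresses are made up (the question's test strings are only in the linked demo), and the i flag is an assumption so that [a-z] also accepts an uppercase letter:

```php
<?php
// Pattern from the answer; the "i" flag is an assumption.
$pattern = '/^(.*?)(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?(?:\s+(\d{4,5})(?:\s+(.*))?)?$/i';

// Hypothetical sample addresses, not taken from the question.
$samples = [
    'Hauptstrasse 12a 12345 Berlin',
    'Am Markt 5-7 1234 Wien',
    'Hauptstrasse 12',
];

$parsed = [];
foreach ($samples as $address) {
    if (preg_match($pattern, $address, $m)) {
        // Groups: 1 = street, 2 = number, 3 = zip code, 4 = town.
        // Unmatched optional groups are missing or empty; normalise both to null.
        $row = [];
        for ($i = 1; $i <= 4; $i++) {
            $row[] = ($m[$i] ?? '') === '' ? null : $m[$i];
        }
        $parsed[$address] = $row;
    }
}

print_r($parsed);
```

The third sample shows the optional zip code and town being left out without disturbing groups 1 and 2.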
Addresses are difficult to parse because they sit halfway between formatted text and natural language. Here is a pattern that keeps the number of optional parts as small as possible while still succeeding with the examples given, without asking too much of the regex engine. To do this, I mainly rely on character classes, atomic groups, and a relatively accurate description of the street names. Obviously, the examples in the question cannot be representative of every real address, and characters may have to be added to or removed from the classes to deal with new cases. Nevertheless, the structure of this pattern is a good starting point.
~
^
(?<strasse> [\pL\d-]+ \.? (?> \h+ [\pL\d-]+ \.? )*? ) \h*
(?<nummer> \b (?> \d+ | [-+/\h]+ | [a-z] \b )*? )
(?: \h+ (?<plz> \d{4,5} )
\h+ (?<stadt> .+ ) )?
$
~mxui
demo
Note that in the above link you can also see a previous version of this pattern with a more accurate description of the street number (a bit more efficient but longer).
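For illustration, the named groups of this pattern can be read from PHP as sketched below. The sample address is an assumption, not one of the question's test strings:

```php
<?php
// Pattern copied from the answer.
// Flags: m = multiline, x = free-spacing, u = UTF-8, i = case-insensitive.
$pattern = '~
^
(?<strasse> [\pL\d-]+ \.? (?> \h+ [\pL\d-]+ \.? )*? ) \h*
(?<nummer> \b (?> \d+ | [-+/\h]+ | [a-z] \b )*? )
(?: \h+ (?<plz> \d{4,5} )
    \h+ (?<stadt> .+ ) )?
$
~mxui';

// Hypothetical sample address.
$address = 'Hauptstrasse 12 12345 Berlin';

if (preg_match($pattern, $address, $m)) {
    // Named groups are available under their names as well as by index.
    printf(
        "street=%s number=%s zip=%s town=%s\n",
        $m['strasse'],
        $m['nummer'],
        $m['plz'] ?? '',
        $m['stadt'] ?? ''
    );
}
```

The ?? '' fallbacks cover addresses without a zip code and town, where the trailing named groups are not present in $m.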

How to get the number in the second match group?

Text for example
data=1 type=old
data=2 type=test (2)
type=test data=3 (3)
I need to get the data id from lines 2 and 3.
My code:
(data=([\d]+)|type=test)\s+(?!\1)((?1))
but it doesn't capture data=3.
You need the g (global) and m (multiline) flags in your regex:
/(data=([\d]+)|type=test)\s+(?!\1)((?1))/gm
In the most simple form you may use
^(?=.*type=test).*data=(\d+)
See the regex demo
You may add word/whitespace boundaries later if necessary, e.g.
^(?=.*\btype=test\b).*\bdata=(\d+)\b
^(?=.*(?<!\S)type=test(?!\S)).*(?<!\S)data=(\d+)(?!\S)
The point is
^ - start of string
(?=.*type=test) - there must be type=test after any 0+ chars as many as possible to the right of the current position
.* - any 0+ chars other than line break chars as many as possible
data= - a string
(\d+) - Group 1: 1+ digits
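A short sketch of the lookahead approach in PHP, run against the sample text from the question. The m flag is added here as an assumption so that ^ matches at the start of every line:

```php
<?php
// Sample text from the question, one record per line.
$text = "data=1 type=old\ndata=2 type=test (2)\ntype=test data=3 (3)";

// The lookahead checks for type=test anywhere on the line,
// then .*data=(\d+) captures the id regardless of the order of the fields.
preg_match_all('/^(?=.*type=test).*data=(\d+)/m', $text, $m);

$ids = $m[1];
print_r($ids);
```

Because the check for type=test consumes nothing, the id lands in Group 1 whichever field comes first.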

Match regular expression specific character quantities in any order

I need to match a series of strings that:
Contain at least 3 numbers
0 or more letters
0 or 1 - (not more)
0 or 1 \ (not more)
These characters can be in any position in the string.
The regular expression I have so far is:
([A-Z0-9]*[0-9]{3,}[\/]?[\-]?[0-9]*[A-Z]*)
This matches the following data in the following cases. The only one that does not match is the first one:
02ABU-D9435
013DFC
1123451
03323456782
ADS7124536768
03SDFA9433/0
03SDFA9433/
03SDFA9433/1
A41B03423523
O4AGFC4430
I think perhaps I am being too prescriptive about positioning. How can I update this regex to match all possibilities?
PHP PCRE
The following would not match:
01/01/2018 [multiple / or -]
AA-AA [no numbers]
Thanks
One option is to use lookaheads to assert that there are 3 digits, that there are not 2 backslashes, and that there are not 2 hyphens.
(?<!\S)(?=(?:[^\d\s]*\d){3})(?!(?:[^\s-]*-){2})(?!(?:[^\s\\]*\\){2})[A-Z0-9/\\-]+(?!\S)
About the pattern
(?<!\S) Assert that what is on the left is not a non-whitespace char
(?=(?:[^\d\s]*\d){3}) Assert that what is on the right is 3 times zero or more chars other than a digit or whitespace char, followed by a digit
(?!(?:[^\s-]*-){2}) Assert that what is on the right is not 2 times zero or more chars other than a whitespace char or hyphen, followed by a hyphen
(?!(?:[^\s\\]*\\){2}) Assert that what is on the right is not 2 times zero or more chars other than a whitespace char or backslash, followed by a backslash
[A-Z0-9/\\-]+ Match any of the listed chars 1+ times
(?!\S) Assert that what is on the right is not a non-whitespace char
Regex demo
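A sketch of how this whitespace-delimited pattern might be applied in PHP. The input string is a made-up mix of tokens, and a nowdoc is used (an implementation choice, not part of the answer) so the backslashes do not need extra escaping:

```php
<?php
// Nowdoc keeps the backslashes exactly as written in the answer.
// "~" is used as the delimiter because the pattern contains "/".
$pattern = <<<'RE'
~(?<!\S)(?=(?:[^\d\s]*\d){3})(?!(?:[^\s-]*-){2})(?!(?:[^\s\\]*\\){2})[A-Z0-9/\\-]+(?!\S)~
RE;

// Hypothetical input: tokens separated by whitespace.
$input = '02ABU-D9435 AA-AA 01-01-2018 013DFC';

preg_match_all($pattern, $input, $m);
$tokens = $m[0];
print_r($tokens);
```

AA-AA is rejected by the digit lookahead and 01-01-2018 by the hyphen lookahead.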
Your patterns can be checked with positive/negative lookaheads anchored at the start of the string:
at least 3 digits -> find (not necessarily consecutive) 3 digits
no more than 1 '-' -> assert absence of (not necessarily consecutive) 2 '-' characters
no more than 1 '/' -> assert absence of (not necessarily consecutive) 2 '/' characters
0 or more letters -> no check needed.
If these conditions are met, any content is permitted.
The regex implementing this:
^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$
Check out this Regex101 demo.
Remark
This solution assumes that each string tested resides on its own line, i.e. the strings are not merely separated by whitespace.
If the strings are separated by whitespace, choose the solution of user @TheFourthBird (which is essentially the same as this one but caters for the whitespace separation).
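A minimal sketch of the line-anchored version in PHP, testing one candidate string at a time. The candidates are taken from the question:

```php
<?php
// Pattern from this answer: three lookaheads anchored at the start, then any content.
$pattern = '/^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$/';

// Candidates taken from the question.
$candidates = ['02ABU-D9435', '03SDFA9433/1', '01/01/2018', 'AA-AA'];

// Keep only the strings that satisfy all three conditions.
$accepted = array_values(array_filter(
    $candidates,
    fn (string $s): bool => preg_match($pattern, $s) === 1
));

print_r($accepted);
```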
You can test the condition for both the hyphen and the slash in a single lookahead using a capture group and a backreference:
~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~
demo
Detailed:
~ # using the tilde as the pattern delimiter avoids having to escape the slashes in the pattern
\A # start of the string
(?! .* ([-/]) .* \1 ) # negative lookahead:
# check that no hyphen or slash occurs more than once
(?: [A-Z/-]* \d ){3,} # at least 3 digits
[A-Z/-]* # any remaining characters up to the end of the string
\z # end of the string.
~
To understand this better, if you are not familiar with lookarounds: these three subpatterns all start from the same position (in this case the beginning of the string):
\A
(?! .* ([-/]) .* \1 )
(?: [A-Z/-]* \d ){3,}
This is possible only because the first two are zero-width assertions: they are simple tests that don't consume any characters.
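The backreference variant could be tried out in PHP as sketched here. The strings A-1/23 and 1-2-3 are made-up additions to the question's examples, included to show that one hyphen plus one slash is accepted while a repeated hyphen is not:

```php
<?php
// Pattern from this answer; \1 refers to whichever of "-" or "/" was captured.
$pattern = '~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~';

// Mix of strings from the question and two hypothetical ones.
$candidates = ['02ABU-D9435', '03SDFA9433/', 'A-1/23', '1-2-3', '01/01/2018', 'AA-AA'];

$results = [];
foreach ($candidates as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate) === 1;
}

var_export($results);
```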

Building a regex to capture INT., EXT., INT./EXT., etc

I'm working through a bunch of text in which I'm looking for the following strings:
INT.
EXT.
INT./EXT.
EXT./INT.
The text under analysis is, for instance,
17 INT. BLOOM HOUSE - NIGHT 17
27 INT./EXT. BLOOM HOUSE - (PRESENT) DAY 27
Calls in php to, for instance,
preg_match("/^\w.*(INT\.\/EXT\.|EXT\.\/INT\.|EXT\.|INT\.)(.*)$/", $a_line, $matches);
and variants of that don't quite handle the greediness right (or so I think, anyway), and something gets left out, usually INT./EXT. or EXT./INT. items. Any advice? Thanks!
True, you need to use lazy dot matching with \w.*?, but you can also optimize the pattern to shorten the alternation group like this:
/^\w.*?(INT\.(?:\/EXT\.)?|EXT\.(?:\/INT\.)?)(.*)$/
See the regex demo
Also, if you are processing the text as a whole, you will need the /m multiline modifier.
Details:
^ - start of a string
\w - a word char
.*? - any 0+ chars other than line break chars as few as possible up to the first
(INT\.(?:\/EXT\.)?|EXT\.(?:\/INT\.)?) - Group 1 capturing either:
INT\.(?:\/EXT\.)? - INT. followed with optional /EXT. substring
| - or
EXT\.(?:\/INT\.)? - EXT. followed with optional /INT. substring
(.*) - Group 2: any 0+ chars other than line break chars up to the...
$ - end of string.
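A small sketch of the shortened pattern in PHP, applied to the two sample lines from the question as one block of text (hence the m modifier):

```php
<?php
// The two sample lines from the question, processed as a whole.
$lines = "17 INT. BLOOM HOUSE - NIGHT 17\n27 INT./EXT. BLOOM HOUSE - (PRESENT) DAY 27";

// The lazy .*? stops at the first INT./EXT. marker; the optional groups pick up the combined forms.
preg_match_all(
    '/^\w.*?(INT\.(?:\/EXT\.)?|EXT\.(?:\/INT\.)?)(.*)$/m',
    $lines,
    $matches,
    PREG_SET_ORDER
);

// Group 1 of each match is the marker, Group 2 the rest of the line.
$tags = array_column($matches, 1);
print_r($tags);
```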

PHP Regex display either abc or abc xyz format

I am trying to build regex for the expression to get values for either Boost Mobile or BoostMobile whichever is present.
Any suggestions please ?
In NFA regexes, in unanchored alternation groups, the first branch matched stops the group processing, the other branches located further on the right are not checked against the string. You may read more on that at Alternation with The Vertical Bar or Pipe Symbol.
So, swapping the values and simplifying the pattern you could use
/\b(Boost\s*Mobile|Boost)\b/i
However, the most effective way here is through using an optional group:
/\bBoost(?:\s*Mobile)?\b/i
        ^^^         ^^
See the regex demo
The i case insensitive modifier is set on the whole regex. You need not switch it on and off at the beginning/end of the pattern. Also, \W* can match an empty string, so your way of checking a word boundary may fail here when \b will work.
Pattern details:
\b - leading word boundary
Boost - a literal substring
(?:\s*Mobile)? - an optional group matching 1 or 0 sequences of
\s* - 0+ whitespaces
Mobile - a literal substring
\b - trailing word boundary
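To illustrate both points, here is a sketch comparing the optional-group pattern with an alternation that lists the shorter branch first. The sample sentence and the second pattern are assumptions for demonstration, since the pattern from the question is not shown:

```php
<?php
// Hypothetical sample text containing all three spellings.
$text = 'Plans from Boost Mobile, BoostMobile and plain Boost.';

// Optional-group version from the answer.
preg_match_all('/\bBoost(?:\s*Mobile)?\b/i', $text, $optional);

// Alternation with the shorter branch first: "Boost" wins whenever \b can follow it,
// so "Boost Mobile" is cut short.
preg_match_all('/\b(?:Boost|Boost\s*Mobile)\b/i', $text, $shortFirst);

print_r($optional[0]);
print_r($shortFirst[0]);
```

In the second pattern the longer branch is only reached for BoostMobile, where the word boundary after Boost fails and the engine backtracks into the next alternative.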
