How to optionally add a comma and whitespace to a capture group?


I am trying to match five substrings in each block of text (there are 100 blocks total).
I am matching 99% of the blocks of text, but with a few errors regarding groups 3 and 4.
Here is a demo link: https://regex101.com/r/cW2Is3/4
Group 3 is "parts of speech", and group 4 is an English translation.
In the first block of text, "det, pro" should all be in group 3, and "the; him, her, it, them" should be in group 4.
The same issue occurs again in the third block of text.
There, group 3 should be "adj, det, nm, pro" and group 4 should be "a, an, one".
This is my pattern:
([0-9]+)\s+(\w+(?:, \w+)?)\s+(\N+?)\s+(\H.+).*?\r?\n•\s+([\s\S]*?)\s+[0-9]+\s\|.*\s*

Here you go...
/^(\d+) +(\w+) +([acdefijlmnoprtv()]+(?:, ?[acdefijlmnoprtv()]+)*) +([\S\s]+?)\n\x{2022} +([\S\s]+?)\n\d+ \| [-\dn]+\s*/gum
Demo Link
I have done my best to optimize the pattern. I shaved nearly 10,000 steps off of your pattern and reached 100 matches as desired.
Starting anchor ^ is used to identify start of each block (Efficiency / Accuracy)
\d is used instead of [0-9] (Brevity)
\s is replaced with a literal space where applicable (Brevity)
A character class of specific letters and parentheses was used in place of \w for capture group 3. (Efficiency) *could be replaced with [\w()] for brevity with a loss of efficiency
The bullet was specified using the literal \x{2022} (Personal preference)
Character class used on trailing characters of each block [-\dn]. (Efficiency / Accuracy)
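To see how the third capture group absorbs a comma-separated list, here is a minimal PHP sketch that applies only the header-line part of the idea to two made-up lines. The sample lines (including the headwords "le" and "un"), the helper name splitHeader, and the shortened class [a-z()] are assumptions for illustration; they are not taken from the demo data.

```php
<?php
// Hypothetical helper: splits one header line into rank, word, parts of speech and translation.
function splitHeader(string $line): ?array
{
    // Group 3 is one tag followed by zero or more ", tag" repetitions,
    // so "det, pro" stays together instead of spilling into group 4.
    $pattern = '/^(\d+) +(\w+) +([a-z()]+(?:, ?[a-z()]+)*) +(.+)$/u';
    if (preg_match($pattern, $line, $m) !== 1) {
        return null;
    }
    return ['rank' => $m[1], 'word' => $m[2], 'pos' => $m[3], 'english' => $m[4]];
}

// Assumed sample lines modelled on the blocks described in the question.
print_r(splitHeader('1 le det, pro the; him, her, it, them'));
print_r(splitHeader('3 un adj, det, nm, pro a, an, one'));
```

The repetition only continues while the next token starts with a comma, which is why the translation (separated by a plain space) falls into group 4.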

When you have to describe a long string with many parts, a good first reflex is to use free-spacing mode (the x modifier) and named groups. Even if named groups aren't very useful in a replacement context, they make the pattern more readable and easier to debug:
~^
(?<No> [0-9]+ ) \h+
(?<word> \pL+ ) \h+
(?<type> [\pL()]+ (?: , \h* [\pL()]+ )* ) \h+
(?<wd_tr> [^•]* [^•\s] ) \h* \R
• \h*
(?<sent_fr> [^–]* [^\s–] ) \s* – \s*
(?<sent_eng> .* (?:\R .*)*? ) \h* \R
(?<num1> [0-9]+ ) \h* \| \h*
(?<num2> .*\S )
~xum
demo
There is no magic recipe for building a pattern for a string with a loosely defined format. All you can do is start out as restrictive as possible and add flexibility when you encounter cases that don't match.
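As a rough sketch of how the free-spacing pattern with named groups could be used from PHP, the snippet below runs it against a single block. The block itself is invented for illustration (the real blocks are only in the demo), so treat the sample sentence and the numbers as assumptions.

```php
<?php
// Free-spacing pattern with named groups; in x mode, whitespace in the pattern is ignored.
$pattern = '~^
(?<No> [0-9]+ ) \h+
(?<word> \pL+ ) \h+
(?<type> [\pL()]+ (?: , \h* [\pL()]+ )* ) \h+
(?<wd_tr> [^•]* [^•\s] ) \h* \R
• \h*
(?<sent_fr> [^–]* [^\s–] ) \s* – \s*
(?<sent_eng> .* (?:\R .*)*? ) \h* \R
(?<num1> [0-9]+ ) \h* \| \h*
(?<num2> .*\S )
~xum';

// Assumed sample block (not from the original data).
$block = "1 le det, pro the; him, her, it, them\n"
       . "• vive la politique, vive l'amour – long live politics, long live love\n"
       . "1 | 2359662";

preg_match($pattern, $block, $m);

// preg_match returns each named group under both its name and its index; keep the names only.
$fields = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
print_r($fields);
```

Reading the result by name ($fields['type'], $fields['wd_tr']) keeps the calling code stable even if a group is later added or reordered in the pattern.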

Related

Regex optional groups and digit length

Maybe some regex-Master can solve my problem.
I have a big list of addresses with no separators (, ;).
The address string contains the following information:
The first group is the street name
The second group is the street number
The third group is the zipcode (optional)
The last group is the town name (optional)
As you can see on the image above the last two test strings are not matching.
I need the last two regex groups to be optional and the third group should be either 4 or 5 digits.
I tried (\d{4,5}) to allow 4 or 5 digits, but this only half works, as you can see here: https://regex101.com/r/ZurqHh/1
(This sometimes mixes the street number and zipcode together)
I also tried (?:\d{5})? to make the third and fourth group optional. But this destroys my whole group layout...
https://regex101.com/r/EgxeMy/1
This is my current regex:
/^([a-zäöüÄÖÜß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+\/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/im
Try it out yourself:
https://regex101.com/r/zC8NCP/1
My brain is only farting at this moment and I can't think straight anymore.
Please help me fix this problem so I can die in peace.

You can use
^(.*?)(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?(?:\s+(\d{4,5})(?:\s+(.*))?)?$
See the regex demo (note all \s are replaced with \h to only match horizontal whitespaces).
Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))? - an optional non-capturing group matching
    \s+ - one or more whitespaces
    (\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b) - Group 2:
        \d+ - one or more digits
        (?:\s*[-|+\/]\s*\d+)* - zero or more sequences of zero or more whitespaces, one of -, |, + or /, zero or more whitespaces, one or more digits
        \s* - zero or more whitespaces
        [a-z]?\b - an optional lowercase ASCII letter and a word boundary
(?:\s+(\d{4,5})(?:\s+(.*))?)? - an optional non-capturing group matching
    \s+ - one or more whitespaces
    (\d{4,5}) - Group 3: four or five digits
    (?:\s+(.*))? - an optional sequence of one or more whitespaces and then Group 4: any zero or more chars other than line break chars, as many as possible
$ - end of string.
Please note that the (?:\s+(.*))? optional group must be inside the (?:\s+(\d{4,5})...)? group to work.
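A minimal PHP sketch of this pattern in use is shown below. The addresses are invented examples, the helper name parseAddress is hypothetical, and the i and u modifiers are an assumption about the intended flags. PREG_UNMATCHED_AS_NULL requires PHP 7.2 or later.

```php
<?php
// Pattern from the answer, wrapped in ~ delimiters with assumed "i" and "u" modifiers.
const ADDRESS_RE = '~^(.*?)(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?(?:\s+(\d{4,5})(?:\s+(.*))?)?$~iu';

// Hypothetical helper: returns street, number, zip and town.
// Optional groups that did not take part in the match come back as null.
function parseAddress(string $address): ?array
{
    if (preg_match(ADDRESS_RE, $address, $m, PREG_UNMATCHED_AS_NULL) !== 1) {
        return null;
    }
    return [
        'street' => $m[1],
        'number' => $m[2] ?? null,
        'zip'    => $m[3] ?? null,
        'town'   => $m[4] ?? null,
    ];
}

// Assumed sample addresses (not from the question's list).
print_r(parseAddress('Hauptstraße 12a 12345 Berlin'));
print_r(parseAddress('Bahnhofstr. 3-5 8000 Zürich'));
print_r(parseAddress('Am Markt 7'));
```

Because the town group is nested inside the zip group, a town can only be captured when a zip code was found first, which is what keeps the street number and the zip code from being mixed together.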

It is difficult to parse addresses because they sit halfway between formatted text and natural language. Here is a pattern that tries to reduce the number of optional parts as much as possible, so that it succeeds on the examples given without asking too much of the regex engine. To do this, I mainly rely on character classes, atomic groups, and a relatively accurate description of the street names. Obviously, the examples in the question cannot represent every real-world case, and characters may need to be added to or removed from the classes to deal with new cases. Nevertheless, the structure of this pattern is a good starting point.
~
^
(?<strasse> [\pL\d-]+ \.? (?> \h+ [\pL\d-]+ \.? )*? ) \h*
(?<nummer> \b (?> \d+ | [-+/\h]+ | [a-z] \b )*? )
(?: \h+ (?<plz> \d{4,5} )
\h+ (?<stadt> .+ ) )?
$
~mxui
demo
Note that in the above link you can also see a previous version of this pattern with a more accurate description of the street number (a bit more efficient but longer).

Match regular expression specific character quantities in any order

I need to match a series of strings that:
Contain at least 3 numbers
0 or more letters
0 or 1 - (not more)
0 or 1 \ (not more)
These characters can be in any position in the string.
The regular expression I have so far is:
([A-Z0-9]*[0-9]{3,}[\/]?[\-]?[0-9]*[A-Z]*)
This matches the following data in the following cases. The only one that does not match is the first one:
02ABU-D9435
013DFC
1123451
03323456782
ADS7124536768
03SDFA9433/0
03SDFA9433/
03SDFA9433/1
A41B03423523
O4AGFC4430
I think perhaps I am being too prescriptive about positioning. How can I update this regex to match all possibilities?
PHP PCRE
The following would not match:
01/01/2018 [multiple / or -]
AA-AA [no numbers]
Thanks

One option is to use lookaheads to assert that there are at least 3 digits, fewer than 2 backslashes, and fewer than 2 hyphens.
(?<!\S)(?=(?:[^\d\s]*\d){3})(?!(?:[^\s-]*-){2})(?!(?:[^\s\\]*\\){2})[A-Z0-9/\\-]+(?!\S)
About the pattern
(?<!\S) Assert that what is on the left is not a non-whitespace char
(?=(?:[^\d\s]*\d){3}) Assert that what is on the right is, 3 times, zero or more chars other than a digit or whitespace, followed by a digit
(?!(?:[^\s-]*-){2}) Assert that what is on the right is not, 2 times, zero or more chars other than whitespace or a hyphen, followed by a hyphen
(?!(?:[^\s\\]*\\){2}) Assert that what is on the right is not, 2 times, zero or more chars other than whitespace or a backslash, followed by a backslash
[A-Z0-9/\\-]+ Match any of the listed chars 1+ times
(?!\S) Assert that what is on the right is not a non-whitespace char
Regex demo

Your patterns can be checked with positive/negative lookaheads anchored at the start of the string:
at least 3 digits -> find (not necessarily consecutive) 3 digits
no more than 1 '-' -> assert absence of (not necessarily consecutive) 2 '-' characters
no more than 1 '/' -> assert absence of (not necessarily consecutive) 2 '/' characters
0 or more letters -> no check needed.
If these conditions are met, any content is permitted.
The regex implementing this:
^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$
Check out this Regex101 demo.
Remark
This solution assumes that each string tested resides on its own line, i.e. the strings are not just separated by whitespace.
In case the strings are separated by whitespace, choose the solution of user @TheFourthBird (which is essentially the same as this one but caters for the whitespace separation).
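For a quick check of the line-based variant in PHP, the sketch below runs the pattern with the m modifier over a handful of the strings from the question, joined with line breaks. The variable names are placeholders chosen for this example.

```php
<?php
// Line-based pattern from this answer; the "m" modifier makes ^ and $ match at every line.
$pattern = '/^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$/m';

// A few strings from the question, one per line.
$lines = ['02ABU-D9435', '013DFC', '03SDFA9433/', '01/01/2018', 'AA-AA'];

preg_match_all($pattern, implode("\n", $lines), $m);

// $m[0] holds the full matches, i.e. the lines that satisfied all three lookaheads.
$accepted = $m[0];
print_r($accepted);
```

The first three strings are accepted; 01/01/2018 is rejected by the second-slash lookahead and AA-AA by the three-digit lookahead.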

You can test the condition for both the hyphen and the slash in a single lookahead using a capture group and a backreference:
~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~
demo
Detailed:
~ # using the tilde as the pattern delimiter avoids having to escape the slashes in the pattern
\A # start of the string
(?! .* ([-/]) .* \1 ) # negative lookahead:
# check that there's no more than one hyphen and no more than one slash
(?: [A-Z/-]* \d ){3,} # at least 3 digits
[A-Z/-]* # any remaining characters until the end of the string
\z # end of the string.
~
To understand this better (if you are not familiar with lookaheads): these three subpatterns start from the same position (in this case the beginning of the string):
\A
(?! .* ([-/]) .* \1 )
(?: [A-Z/-]* \d ){3,}
This is possible only because the first two are zero-width assertions: they are simple tests and don't consume any characters.
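Wrapped in a small PHP function, the backreference pattern can be tried against individual strings. The function name isValidCode is made up for this sketch, and the last two inputs are extra examples that are not in the question.

```php
<?php
// Hypothetical helper wrapping the backreference pattern from this answer.
function isValidCode(string $s): bool
{
    // (?!.*([-/]).*\1) fails as soon as the same separator (- or /) occurs twice;
    // the rest requires at least three digits among uppercase letters, hyphens and slashes.
    return preg_match('~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~', $s) === 1;
}

var_dump(isValidCode('02ABU-D9435')); // one hyphen, enough digits
var_dump(isValidCode('01/01/2018'));  // two slashes
var_dump(isValidCode('AA-AA'));       // no digits
var_dump(isValidCode('12-3/4'));      // one hyphen and one slash together
var_dump(isValidCode('1--23'));       // two hyphens
```

Note that the backreference \1 only repeats whichever character the group captured, so one hyphen plus one slash is still allowed, while two of the same are not.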

How to add whitespace & punctuation marks to capture first group with regex? How to stop certain tabs dividing into two columns within LibreOffice?

Can anyone help me out? I’ve been trying to get this regex working, and it’s nearly there. The matches all seem to be correct, but the first one should be:
word: el, la
gender: art
word_en: the (+m, f)
The first test string is:
1
el, la art the (+m, f)
• el diccionario tenía también frases útiles – the dictionary also had
useful phrases
2055835 | 201481381
The other issue is that I’ve been trying to simply copy info. from the ‘Substitution’ section into LibreOffice. All I want to do is create 6 columns for the data. The Problem is that the 6th column (sent_en) can sometimes divide between columns ‘G’ and ‘A’, instead of all the data for sent_en being in column ‘G’. If you copy the data below ‘Substitution’ into LibreOffice Calc, you’ll get a better idea of what I mean. I just can’t figure this out, and if someone can help me out I’d really appreciate it. Thanks.
Here’s the link https://regex101.com/r/m3yySN/2/
^
(?<frequency>[0-9]+) \W+
(?<word>\pL+\W?) \h+
(?<gender> [\pL()]+ (?:, \h* [\pL()]+)* ) \h+
(?<word_en> [^•]*[^•\s]) \h* \R
• \h*
(?<sent_esp> [^–]*[^\s–] ) \s*–\s*
(?<sent_en> .* (?:\R .*)*? ) \h* \R
(?<num1> [0-9]+) \h* \| \h*
(?<num2> .*\S)
\1\t\2\t\3\t\4\t\5\t\6\t

This one was a bit hairy, but after all, just a small adjustment was needed:
^
(?<frequency>[0-9]+) \W+
(?<word>\pL+(?:,\h\pL+|\W)*) \h+
(?<gender> [\pL()]+ (?:, \h* [\pL()]+)* ) \h+
(?<word_en> [^•]*[^•\s]) \h* \R
• \h*
(?<sent_esp> [^–]*[^\s–] ) \s*–\s*
(?<sent_en> .* (?:\R .*)*? ) \h* \R
(?<num1> [0-9]+) \h* \| \h*
(?<num2> .*\S)
Results look good to me now.
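Here is a rough PHP sketch that runs the adjusted pattern on the first test string from the question and builds one tab-separated row from it. The x, u and m modifiers are assumed (the question does not show its flags). Collapsing the line break inside sent_en is also an assumption: presumably that embedded line break is what makes the spreadsheet start a new row partway through the last column.

```php
<?php
// Adjusted pattern from this answer, with assumed x, u and m modifiers.
$pattern = '~^
(?<frequency>[0-9]+) \W+
(?<word>\pL+(?:,\h\pL+|\W)*) \h+
(?<gender> [\pL()]+ (?:, \h* [\pL()]+)* ) \h+
(?<word_en> [^•]*[^•\s]) \h* \R
• \h*
(?<sent_esp> [^–]*[^\s–] ) \s*–\s*
(?<sent_en> .* (?:\R .*)*? ) \h* \R
(?<num1> [0-9]+) \h* \| \h*
(?<num2> .*\S)
~xum';

// First test string from the question.
$block = "1\n"
       . "el, la art the (+m, f)\n"
       . "• el diccionario tenía también frases útiles – the dictionary also had\n"
       . "useful phrases\n"
       . "2055835 | 201481381";

preg_match($pattern, $block, $m);

// Assumption: replace any whitespace run (including the wrapped line break) with one space
// so the English sentence stays in a single cell.
$sentEn = preg_replace('/\s+/u', ' ', $m['sent_en']);

$row = implode("\t", [$m['frequency'], $m['word'], $m['gender'], $m['word_en'], $m['sent_esp'], $sentEn]);
echo $row, "\n";
```

With the word group extended to accept ", la", the first entry now yields word "el, la", gender "art" and word_en "the (+m, f)".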

How do I write a regular expression that only matches if match three required capture groups

I'd like to match strings that are comprised of:
First Initial
Middle Name
Last Name + optional suffix (Jr. Sr. III, etc.)
and not match strings that are comprised of a First Name + Last Name and suffix.
I have the following sample data:
H. Graham Motion
T. James Kelly
J. Palacios Moli
A. Chadwick Box
H. Graham Motion III
T. James Kelly, Jr.
H. Graham Motion II
V. Barboza Jr.
I would like to match all of the strings except the last.
Here is what I have for a regular expression:
^(\w\.)(\s\w+\s[\sI\,\sJSr.]{0,5})*(\w+[\sI\,\sJSr.]{0,5})$
but it is not working. You can see the regular expression here at regex101.

I've tweaked your expression a bit and come up with ^(\w\.)\s(\w+)\s(\w+(?:,?\s(?:I{1,5}|Jr\.|Sr\.))?)$. For the sake of sanity and clarity, I moved the \s out of the capture groups, since I assume you don't define a middle name as a string of word characters with a leading and trailing space. I think I kept the spirit of your definition of a last name + suffix.
(Very verbose) Explanation:
^ start
( 1st group (1st initial)
\w\. one word char followed by a period
)
\s one whitespace char
( 2nd group (middle name)
\w+ 1 or more word chars
)
\s one whitespace char
( 3rd group (last name + optional suffix)
\w+ 1 or more word chars
(?: non-capturing group (optional suffix)
,? 0 or 1 commas
\s one whitespace char
(?:I{1,5}|Jr\.|Sr\.) one of: 1-5 I chars, "Jr." or "Sr."
)? match suffix group 0 or 1 times
)
$ end
You'll notice I made the change from I{0,5} to I{1,5} because 0 characters doesn't seem like much of a suffix to me. However, I don't see a lot of people with the suffix IIII or IIIII, so you may want to change it to I{1,3}|IV|V. You may also want to change the optional comma after the last name to require it before Jr./Sr. and disallow it before a Roman numeral.
Also, remember that \w also matches underscores and digits! And that \s matches most whitespace characters, and not just a regular space.
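To check the expression against the sample data from the question, a short PHP sketch using preg_grep could look like this. It uses the I{1,5} form described in the explanation; the constant and variable names are placeholders.

```php
<?php
// Expression from this answer, using the I{1,5} form described in the explanation.
const NAME_RE = '/^(\w\.)\s(\w+)\s(\w+(?:,?\s(?:I{1,5}|Jr\.|Sr\.))?)$/';

// Sample data from the question.
$names = [
    'H. Graham Motion',
    'T. James Kelly',
    'J. Palacios Moli',
    'A. Chadwick Box',
    'H. Graham Motion III',
    'T. James Kelly, Jr.',
    'H. Graham Motion II',
    'V. Barboza Jr.',
];

// preg_grep keeps the entries that match; array_values reindexes the result.
$matched = array_values(preg_grep(NAME_RE, $names));
print_r($matched);
```

'V. Barboza Jr.' drops out because, after the initial and one name, the remaining 'Jr.' cannot be a last name: \w+ does not match the trailing period.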

require whitespace in first group regex

In the process of writing a custom little template engine I want to match a block like
{foreach foo as bar}
{bar.name}
{endforeach}
//with regex
preg_match_all('/{(?!{)([\w\s]+)}(?!})(.*?){(?!{)(\w+)}(?!})/us', $string, $matches, PREG_SET_ORDER)
So the first group must have alnum and whitespace chars with [\w\s]+
the negative lookahead (?!{) is to not allow blocks that start with {{
//so a block like
{{foreach bla as bla}}
//would not be matched.
The problem is that this regex also matches {var} without whitespace.
I don't understand why, given that my first character class is defined as [\w\s]+

To match at least 2 word char sequences separated with at least 1 whitespace, and allow leading and trailing whitespaces, you may use
\s*\w+(?:\s+\w+)+\s*
In details:
\s* - 0+ whitespaces
\w+ - 1 or more word chars
(?: - start of a non-capturing group that is used for grouping subpattern sequences:
\s+ - 1 or more whitespaces
\w+ - 1 or more word chars
)+ - 1 or more occurrences of the group
\s* - trailing 0+ whitespace chars.
The entire regex will look like
{(?!{)(\s*\w+(?:\s+\w+)+\s*)}(?!})(.*?){(?!{)(\w+)}(?!})
See the updated regex demo
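A character class such as [\w\s]+ accepts any mix of the listed characters, so word characters alone satisfy it. The PHP sketch below compares the original and the updated pattern; the $plain input is an invented example of text that should not be treated as a block.

```php
<?php
// Original pattern from the question and the updated pattern from this answer.
$old = '/{(?!{)([\w\s]+)}(?!})(.*?){(?!{)(\w+)}(?!})/us';
$new = '/{(?!{)(\s*\w+(?:\s+\w+)+\s*)}(?!})(.*?){(?!{)(\w+)}(?!})/us';

// The block from the question, plus an assumed input that contains only plain variables.
$template = "{foreach foo as bar}\n  {bar.name}\n{endforeach}";
$plain    = '{var} some text {other}';

$oldMatchesPlain = preg_match($old, $plain);          // the old pattern accepts {var} as an opener
$newMatchesPlain = preg_match($new, $plain);          // the new pattern needs two words inside the braces
$newMatchesBlock = preg_match($new, $template, $m);   // the real block still matches

var_dump($oldMatchesPlain, $newMatchesPlain, $newMatchesBlock);
print_r($m);
```

In the updated pattern the opening tag must contain at least two word sequences separated by whitespace, so {var} can no longer open a block while {foreach foo as bar} still does.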
