Avoid possible statement with regex - php

How can I use a regex to remove the first part of a string while keeping any year numbers (e.g. 2019 or 2019-2020) that occur before the first slash?
//something is wrong here
preg_replace('/^[a-z0-9\-]+(-20[0-9]{2}(-20[0-9]{2})?)?/', '$1', $input_lines);
Needed:
abc-def/something/else/ [including the slash if there is no character before it]
abc-def-2019/something/else/
abc-def-2019-2020/something/else/
abc-def-125-2019/something/else/

My initial closure was insufficient to handle all requirements. Yes, you have a greedy quantifier problem, but there is more to handle.
Code: (Demo) (Regex101 Demo)
$tests = [
'abc-def/something/else/',
'abc-def-2019/something/else/',
'abc-def-2019-2020/something/else/',
'abc-def-125-2019/something/else/'
];
var_export(
preg_replace('~^(?:[a-z\d]+-?)*?(?:/|(?=20\d{2}-?){1,2})~', '', $tests)
);
Output:
array (
0 => 'something/else/',
1 => '2019/something/else/',
2 => '2019-2020/something/else/',
3 => '2019/something/else/',
)
My pattern matches alphanumeric sequences, each optionally followed by a hyphen -- a subpattern that may be repeated zero or more times (lazily, aka non-greedy, so it takes as few repetitions as possible).
Then the first non-capturing group must be followed by a slash (which is matched) or by one of your year substrings, which may also have a trailing hyphen (this is not matched, only checked via a lookahead).
If this doesn't suit your real project data, you will need to provide more, and more accurate, samples to test against that reveal the fringe cases.

If the forward slash has to be present and it should stop after the first occurrence of 2019 or 2020, you might use:
^(?=[a-z\d-]*/)[a-zA-Z013-9-]+(?>2(?!0(?:19|20)(?!\d))|[a-zA-Z013-9-]+)*/?
In separate parts that would look like
^ Start of string
(?=[a-z\d-]*/) Assert that a / is present
[a-zA-Z013-9-]+ Match 1+ times any of the listed (Note that the 2 is not listed)
(?> Atomic group
2(?!0(?:19|20)(?!\d)) Match 2 and assert what is on the right is not 019 or 020
| Or
[a-zA-Z013-9-]+ Match 1+ times any of the listed
)* Close group and repeat 0+ times
/? Match optional /
Regex demo | Php demo
Your code might look like
preg_replace('~^(?=[a-z\d-]*/)[a-zA-Z013-9-]+(?>2(?!0(?:19|20)(?!\d))|[a-zA-Z013-9-]+)*/?~', '', $input_lines);
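As a quick check, the pattern above can be run against the four sample strings from the question. This is only a sketch: the $tests array is copied from the other answer's demo, and the variable names are illustrative.

```php
<?php
// Sample inputs taken from the question.
$tests = [
    'abc-def/something/else/',
    'abc-def-2019/something/else/',
    'abc-def-2019-2020/something/else/',
    'abc-def-125-2019/something/else/',
];

$pattern = '~^(?=[a-z\d-]*/)[a-zA-Z013-9-]+(?>2(?!0(?:19|20)(?!\d))|[a-zA-Z013-9-]+)*/?~';

// preg_replace accepts an array subject and returns an array of results.
$result = preg_replace($pattern, '', $tests);
var_export($result);
```

For these four inputs the leading part is stripped up to the first year (or through the slash when there is no year), which is the same result as the lazy-quantifier pattern in the other answer.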

Related

Regex optional groups and digit length

Maybe some regex-Master can solve my problem.
I have a big list of addresses with no separators ( , ; ).
The address string contains following Information:
The first group is the street name
The second group is the street number
The third group is the zipcode (optional)
The last group is the town name (optional)
As you can see on the image above the last two test strings are not matching.
I need the last two regex groups to be optional and the third group should be either 4 or 5 digits.
I tried (\d{4,5}) for allowing 4 and 5 digits. But this only works halfways as you can see here: https://regex101.com/r/ZurqHh/1
(This sometimes mixes the street number and zipcode together)
I also tried (?:\d{5})? to make the third and fourth group optional. But this destroys my whole group layout...
https://regex101.com/r/EgxeMy/1
This is my current regex:
/^([a-zäöüÄÖÜß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+\/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/im
Try it out yourself:
https://regex101.com/r/zC8NCP/1
My brain is only farting at this moment and i can't think straight anymore.
Please help me fix this problem so i can die in peace.
You can use
^(.*?)(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?(?:\s+(\d{4,5})(?:\s+(.*))?)?$
See the regex demo (note all \s are replaced with \h to only match horizontal whitespaces).
Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars
(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))? - an optional non-capturing group matching
\s+ - one or more whitespaces
(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b) - Group 2:
\d+ - one or more digits
(?:\s*[-|+\/]\s*\d+)* - zero or more sequences of zero or more whitespaces, -, +, | or /, zero or more whitespaces, one or more digits
\s* - zero or more whitespaces
[a-z]?\b - an optional lowercase ASCII letter and a word boundary
(?:\s+(\d{4,5})(?:\s+(.*))?)? - an optional non-capturing group matching
\s+ - one or more whitespaces
(\d{4,5}) - Group 3: four or five digits
(?:\s+(.*))? - an optional sequence of one or more whitespaces and then any zero or more chars other than line break chars as many as possible
$ - end of string.
Please note that the (?:\s+(.*))? optional group must be inside the (?:\s+(\d{4,5})...)? group to work.
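A minimal sketch of how the pattern might be used from PHP. The helper name split_address and the two sample address lines are assumptions made for illustration (the question's own test strings are only visible in its linked demos), and the i flag is carried over from the question's regex so that [a-z] also accepts uppercase letters.

```php
<?php
// Pattern from the answer above, wrapped in ~ delimiters with the i flag.
$re = '~^(.*?)(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?(?:\s+(\d{4,5})(?:\s+(.*))?)?$~i';

// Hypothetical helper (not part of the original answer): splits one address line.
function split_address(string $re, string $line): ?array
{
    // PREG_UNMATCHED_AS_NULL reports optional groups that did not participate as null.
    if (!preg_match($re, $line, $m, PREG_UNMATCHED_AS_NULL)) {
        return null;
    }
    return [
        'street' => $m[1] ?? null,
        'number' => $m[2] ?? null,
        'zip'    => $m[3] ?? null,
        'town'   => $m[4] ?? null,
    ];
}

var_export(split_address($re, 'Hauptstrasse 12a 1234 Wien'));
var_export(split_address($re, 'Hauptstrasse 12'));
```

The `?? null` fallbacks are there because older PHP versions may omit trailing unmatched groups from the result array altogether.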
It is difficult to parse addresses because they sit halfway between formatted text and natural language. Here is a pattern that tries to keep the number of optional parts as low as possible while still succeeding with the examples offered, without asking too much of the regex engine. To do this, I mainly rely on character classes, atomic groups, and a relatively accurate description of the street names. Obviously, the examples in the question cannot be representative of every real address, and characters may need to be added to or removed from the classes to deal with new cases. Nevertheless, the structure of this pattern is a good starting point.
~
^
(?<strasse> [\pL\d-]+ \.? (?> \h+ [\pL\d-]+ \.? )*? ) \h*
(?<nummer> \b (?> \d+ | [-+/\h]+ | [a-z] \b )*? )
(?: \h+ (?<plz> \d{4,5} )
\h+ (?<stadt> .+ ) )?
$
~mxui
demo
Note that in the above link you can also see a previous version of this pattern with a more accurate description of the street number (a bit more efficient but longer).

Match regular expression specific character quantities in any order

I need to match a series of strings that:
Contain at least 3 numbers
0 or more letters
0 or 1 - (not more)
0 or 1 \ (not more)
These characters can be in any position in the string.
The regular expression I have so far is:
([A-Z0-9]*[0-9]{3,}[\/]?[\-]?[0-9]*[A-Z]*)
This matches the following data in the following cases. The only one that does not match is the first one:
02ABU-D9435
013DFC
1123451
03323456782
ADS7124536768
03SDFA9433/0
03SDFA9433/
03SDFA9433/1
A41B03423523
O4AGFC4430
I think perhaps I am being too prescriptive about positioning. How can I update this regex to match all possibilities?
PHP PCRE
The following would not match:
01/01/2018 [multiple / or -]
AA-AA [no numbers]
Thanks
One option could be using lookaheads to assert 3 digits, not 2 backslashes and not 2 times a hyphen.
(?<!\S)(?=(?:[^\d\s]*\d){3})(?!(?:[^\s-]*-){2})(?!(?:[^\s\\]*\\){2})[A-Z0-9/\\-]+(?!\S)
About the pattern
(?<!\S) Assert what is on the left is not a non whitespace char
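(?=(?:[^\d\s]*\d){3}) Assert what is on the right is 3 times: zero or more chars other than a digit or a whitespace char, followed by a digit
(?!(?:[^\s-]*-){2}) Assert what is on the right is not 2 times: zero or more chars other than a whitespace char or a hyphen, followed by a hyphen
(?!(?:[^\s\\]*\\){2}) Assert what is on the right is not 2 times: zero or more chars other than a whitespace char or a backslash, followed by a backslash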
[A-Z0-9/\\-]+ Match any of the listed 1+ times
(?!\S) Assert what is on the right is not a non whitespace char
Regex demo
Your patterns can be checked with positive/negative lookaheads anchored at the start of the string:
at least 3 digits -> find (not necessarily consecutive) 3 digits
no more than 1 '-' -> assert absence of (not necessarily consecutive) 2 '-' characters
no more than 1 '/' -> assert absence of (not necessarily consecutive) 2 '/' characters
0 or more letters -> no check needed.
If these conditions are met, any content is permitted.
The regex implementing this:
^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$
Check out this Regex101 demo.
Remark
This solution assumes that each string tested resides on its own line, i.e. the strings are not just separated by whitespace.
In case the strings are separated by whitespace, choose the solution of user #TheFourthBird (which is essentially the same as this one but caters for the whitespace separation).
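A short sketch of the line-based check in PHP, assuming the candidates arrive as one newline-separated string. The $input value is made up from strings in the question, and the m flag is what lets ^ and $ anchor at each line.

```php
<?php
// Line-anchored pattern from this answer; m makes ^ and $ match per line.
$pattern = '~^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$~m';

// Illustrative input: one candidate per line.
$input = "02ABU-D9435\n01/01/2018\nAA-AA\n03SDFA9433/0";

preg_match_all($pattern, $input, $matches);

// $matches[0] holds the full-line matches that passed all three lookaheads.
var_export($matches[0]);
```

Here 01/01/2018 is rejected because it has two slashes, and AA-AA because it has no digits.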
You can test the condition for both the hyphen and the slash into a same lookahead using a capture group and a backreference:
~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~
demo
Detailed:
~ # using the tilde as the pattern delimiter avoids having to escape all slashes in the pattern
\A # start of the string
(?! .* ([-/]) .* \1 ) # negative lookahead:
# check that there's no more than one hyphen and one slash
(?: [A-Z/-]* \d ){3,} # at least 3 digits
[A-Z/-]* # any other allowed characters until the end of the string
\z # end of the string.
~
To better understand this (if you are not familiar with it): these three subpatterns start from the same position (in this case the beginning of the string):
\A
(?! .* ([-/]) .* \1 )
(?: [A-Z/-]* \d ){3,}
This is possible only because the first two are zero-width assertions that are simple tests and don't consume any characters.
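To see the backreference check in action, here is a small sketch that runs the pattern over a few strings. The wrapper name is_valid_code is hypothetical, and the last two inputs are invented to show that one hyphen plus one slash is accepted while a repeated hyphen is not.

```php
<?php
$pattern = '~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~';

// Hypothetical wrapper: true when the whole string satisfies the rules.
function is_valid_code(string $pattern, string $s): bool
{
    return preg_match($pattern, $s) === 1;
}

$samples = ['02ABU-D9435', '03SDFA9433/', '01/01/2018', 'AA-AA', '12-3/4', '1-2-3'];
foreach ($samples as $s) {
    printf("%-12s %s\n", $s, is_valid_code($pattern, $s) ? 'match' : 'no match');
}
```

Because \1 refers to whichever of - or / was captured, the lookahead only fails when the same character occurs twice.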

Regex Preg_match for licence key 25 alphanumeric and 4 hyphens

I'm still trying to get to grips with regex patterns and just after a little double-checking if someone wouldn't mind obliging!
I have a string which should either contain:
A 10 digit (numbers and letters) licence key, for example: 1234567890 OR
A 25 digit (numbers and letters) licence key, for example: ABCD1EFGH2IJKL3MNOP4QRST5 OR
A 29 digit licence number (25 numbers and letters, separated into 5 groups by hyphens), for example: ABCD1-EFGH2-IJKL3-MNOP4-QRST5
I can match the first two fine, using ctype_alnum and strlen functions. However, for the last one I think I'll need to use regex and preg_match.
I had a go over at regex101.com and came up with the following:
preg_match('^([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})+-([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})', $str);
Which seems to match what I'm looking for.
I want the string to only contain an exact match for a string beginning with the licence number, and contain nothing other than mixed upper/lower case letters and numbers in any order and hyphens between each group of 5 characters (so a total of 29 characters - I don't want any further matches). No white space, no other characters and nothing else before or after the 29 digit key.
Will the above work, without allowing any other combinations? Will it stop checking at 29 characters? I'm not sure if there is a simpler way to express this in regex?
Thanks for your time!
The main point is that you need to use both ^ (start of string) and $ (end of string) anchors. Also, when you use + after (...), you allow 1 or more repetitions of the whole subpattern inside the (...). So, you need to remove the +s and add the $ anchor. Also, you need regex delimiters for your regex to work in PHP preg_match. I prefer ~ so as not to escape /. Maybe it is not the case here, but this is a habit.
So, the regex can look like
'~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}$~'
See the regex demo
The (?:-[A-Za-z0-9]{5}){4} matches 4 occurrences of -[A-Za-z0-9]{5} subpattern. The (?:...) is a non-capturing group whose matched text does not get stored in any buffer (unlike the capturing group).
See the IDEONE demo:
$re = '~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}$~';
$str = "ABCD1-EFGH2-IJKL3-MNOP4-QRST5";
if (preg_match($re, $str, $matches)) {
echo "Matched!";
}
How about:
preg_match('/^([a-z0-9]{5})(?:-(?1)){4}$/i', $str);
Explanation:
/ : regex delimiter
^ : beginning of string
( : begin group 1
[a-z0-9]{5} : exactly 5 alphanum.
) : end of group 1
(?: : begin NON capture group
- : a dash
(?1) : same as definition in group 1 (ie. [a-z0-9]{5})
){4} : this group must be repeated 4 times
$ : end of string
/i : regex delimiter with case insensitive modifier
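If all three key formats from the question should be checked in one go, a single alternation might do. This is a sketch under assumptions: the function name is_licence_key is made up, and \A ... \z is used instead of ^ ... $ because in PCRE a plain $ (without the D modifier) also matches just before a trailing newline.

```php
<?php
// Hypothetical helper: accepts a 10- or 25-character alphanumeric key,
// or five hyphen-separated groups of five alphanumeric characters.
function is_licence_key(string $key): bool
{
    $re = '~\A(?:[A-Za-z0-9]{10}|[A-Za-z0-9]{25}|[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4})\z~';
    return preg_match($re, $key) === 1;
}

var_export(is_licence_key('ABCD1-EFGH2-IJKL3-MNOP4-QRST5'));   // valid 29-character form
var_export(is_licence_key("ABCD1-EFGH2-IJKL3-MNOP4-QRST5\n")); // trailing newline is rejected
```

The ctype_alnum and strlen approach from the question remains fine for the first two formats; the alternation only saves the separate branches.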

How to match multiple substrings that occur after a specific substring?

I am trying to read out the server names from a nginx config file.
I need to use regex on a line like this:
server_name    this.com www.this.com someother-example.com;
I am using PHP's preg_match_all() and I've tried different things so far:
/^(?:server_name[\s]*)(?:(.*)(?:\s*))*;$/m
// no output
/^(?:server_name[\s]*)((?:(?:.*)(?:\s*))*);$/m
// this.com www.this.com someother-example.com
But I can't find the right one to list the domains as separate values.
[
0 => 'this.com',
1 => 'www.this.com',
2 => 'someother-example.com'
]
as Bob's your uncle wrote:
(?:server_name|\G(?!^))\s*\K[^;|\s]+
Does the trick!
The plain English requirement is to extract the space-delimited strings that immediately follow server_name then several spaces.
The dynamic duo of \G (start from the start / continue from the end of the last match) and \K (restart the fullstring match) will be the heroes of the day.
Code: (Demo)
$string = "server_name    this.com www.this.com someother-example.com;";
var_export(preg_match_all('~(?:server_name +|\G(?!^) )\K[^; ]+~', $string, $out) ? $out[0] : 'no matches');
Output:
array (
0 => 'this.com',
1 => 'www.this.com',
2 => 'someother-example.com',
)
Pattern Explanation:
(?: # start of non-capturing group (to separate piped expressions from end of the pattern)
server_name + # literally match "server_name" followed by one or more spaces
| # OR
\G(?!^) # continue searching for matches immediately after the previous match, then match a single space
) # end of the non-capturing group
\K # restart the fullstring match (aka forget any previously matched characters in "this run through")
[^; ]+ # match one or more characters that are NOT a semicolon or a space
The reason that you see \G(?!^) versus just \G (which, for the record, will work just fine on your sample input) is because \G can potentially match from two different points by its default behavior. https://www.regular-expressions.info/continue.html
If you were to use the naked \G version of my pattern AND add a single space to the front of the input string, you would not make the intended matches. \G would successfully start at the beginning of the string, then match the single space, then server_name via the negated character class [^; ].
For this reason, disabling \G's "start at the start of the string" ability makes the pattern more stable/reliable/accurate.
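The difference can be reproduced with a small sketch. The input string here is hypothetical: it has a leading space and, to keep the effect easy to see, only single spaces between the tokens.

```php
<?php
// Hypothetical input with a leading space and single spaces between names.
$string = " server_name this.com www.this.com;";

// Naked \G: it is also allowed to match at the very start of the string.
preg_match_all('~(?:server_name +|\G )\K[^; ]+~', $string, $naked);

// \G(?!^): it may only continue from the end of a previous match.
preg_match_all('~(?:server_name +|\G(?!^) )\K[^; ]+~', $string, $guarded);

var_export($naked[0]);
var_export($guarded[0]);
```

With the naked \G, the literal text server_name is picked up as if it were a domain; the guarded version returns only the two domains.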
preg_match_all() returns an array of matches. The first element [0] is a collection of fullstring matches (what is matched regardless of capture groups). If there are any capture groups, they begin from [1] and increment with each new group.
Because you need to match server_name before targeting the substrings to extract, using capture groups would mean a bloated output array and an unusable [0] subarray of fullstring matches.
To extract the desired space-delimited substrings and omit server_name from the results, \K is used to "forget" the characters that are matched prior to finding the desired substrings. https://www.regular-expressions.info/keep.html
Without the \K to purge the unwanted leading characters, the output would be:
array (
0 => 'server_name    this.com',
1 => ' www.this.com',
2 => ' someother-example.com',
)
If anyone is comparing my answer to user3776824's or HamZa's:
I am electing to be very literal with space character matching. There are 4 spaces after server_name, so I could have used an exact quantifier {4} but opted for a bit of flexibility here. \s* isn't the most ideal because when matching there will always be "one or more spaces" to match. I don't have a problem with \s, but to be clear it does match spaces, tabs, newlines, and line returns.
I am using (?!^) -- a negative lookahead -- versus (?<!^) -- a negative lookbehind -- because it does the same job with one less character. You will more commonly see \G(?!^) used by experienced regex craftsmen.
There is never a need to use "alternative" syntax (|) within a character class to separate values. user3776824's pattern will actually exclude pipes in addition to semicolons and spaces -- though I don't expect any negative impact in the outcome based on the sample data. The pipe in the pattern simply should not be written.
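A tiny sketch of the point about the pipe, using a made-up subject string that contains one. Inside a character class the pipe is just a literal character, so the two classes behave differently as soon as a pipe shows up in the input.

```php
<?php
// Hypothetical subject containing a pipe, to show the difference.
$subject = 'foo|bar;';

// With the pipe inside the negated class, matching stops at the pipe.
preg_match('~^[^;|\s]+~', $subject, $withPipe);

// Without it, the pipe is an ordinary allowed character.
preg_match('~^[^;\s]+~', $subject, $withoutPipe);

var_export([$withPipe[0], $withoutPipe[0]]);
```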

How does this PCRE pattern detect palindromes?

This question is an educational demonstration of the usage of lookahead, nested reference, and conditionals in a PCRE pattern to match ALL palindromes, including the ones that can't be matched by the recursive pattern given in the PCRE man page.
Examine this PCRE pattern in PHP snippet:
$palindrome = '/(?x)
^
(?:
(.) (?=
.*
(
\1
(?(2) \2 | )
)
$
)
)*
.?
\2?
$
/';
This pattern seems to detect palindromes, as seen in these test cases (see also on ideone.com):
$tests = array(
# palindromes
'',
'a',
'aa',
'aaa',
'aba',
'aaaa',
'abba',
'aaaaa',
'abcba',
'ababa',
# non-palindromes
'aab',
'abab',
'xyz',
);
foreach ($tests as $test) {
echo sprintf("%s '%s'\n", preg_match($palindrome, $test), $test);
}
So how does this pattern work?
Notes
This pattern uses a nested reference, which is a similar technique used in How does this Java regex detect palindromes?, but unlike that Java pattern, there's no lookbehind (but it does use a conditional).
Also, note that the PCRE man page presents a recursive pattern to match some palindromes:
# the recursive pattern to detect some palindromes from PCRE man page
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
The man page warns that this recursive pattern can NOT detect all palindromes (see: Why will this recursive regex only match when a character repeats 2n - 1 times? and also on ideone.com), but the nested reference/positive lookahead pattern presented in this question can.
Let's try to understand the regex by constructing it. Firstly, a palindrome must start and end with the same sequence of characters, in opposite order:
^(.)(.)(.) ... \3\2\1$
We want to rewrite this such that the ... is only followed by a pattern of finite length, so that we can convert the repeated part into a *. This is possible with a lookahead:
^(.)(?=.*\1$)
(.)(?=.*\2\1$)
(.)(?=.*\3\2\1$) ...
But the steps still differ from one another. What if we could "record" the previously captured groups? If that were possible, we could rewrite it as:
^(.)(?=.*(?<record>\1\k<record>)$) # \1 = \1 + (empty)
(.)(?=.*(?<record>\2\k<record>)$) # \2\1 = \2 + \1
(.)(?=.*(?<record>\3\k<record>)$) # \3\2\1 = \3 + \2\1
...
which could be converted into
^(?:
(.)(?=.*(\1\2)$)
)*
Almost good, except that \2 (the recorded capture) is not set initially, so a reference to it simply fails to match anything. We need it to match the empty string if the recorded capture doesn't exist yet. This is where the conditional expression comes in.
(?(2)\2|) # matches \2 if it exists, empty otherwise.
so our expression becomes
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*
Now it matches the first half of the palindrome. How about the 2nd half? Well, after the 1st half is matched, the recorded capture \2 will contain the 2nd half. So let's just put it in the end.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*\2$
We want to take care of odd-length palindrome as well. There would be a free character between the 1st and 2nd half.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2$
This works well except in one case: when there is only 1 character. This is again because \2 is unset and matches nothing. So
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2?$
# ^ since \2 must be at the end in the look-ahead anyway.
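To try the finished pattern, the sketch below compares it with a plain strrev check on a handful of strings from the question. The function names are hypothetical, and the pattern is simply the final form above written on one line.

```php
<?php
// Final pattern from the construction above, on a single line.
$palindrome = '~^(?:(.)(?=.*(\1(?(2)\2|))$))*.?\2?$~';

// Hypothetical helpers for the comparison.
function regex_is_palindrome(string $pattern, string $s): bool
{
    return preg_match($pattern, $s) === 1;
}

function strrev_is_palindrome(string $s): bool
{
    // Byte-wise reversal; fine for the single-byte test strings used here.
    return $s === strrev($s);
}

foreach (['', 'a', 'aba', 'abba', 'abcba', 'aab', 'abab', 'xyz'] as $test) {
    printf(
        "%-8s regex=%s strrev=%s\n",
        var_export($test, true),
        var_export(regex_is_palindrome($palindrome, $test), true),
        var_export(strrev_is_palindrome($test), true)
    );
}
```

Both columns should agree for these inputs; the strrev version is only there as an independent reference, not as part of the technique.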
I want to bring my very own solution to the table.
This is a regex that I wrote a while ago to match palindromes using PCRE/PCRE2:
^((\w)(((\w)(?5)\5?)*|(?1)|\w?)\2)$
Example:
https://regex101.com/r/xvZ1H0/1
