Variable sized lookahead consume - php

I'm trying to use a regular expression to parse a varying string in PHP. That string can be, for example:
"twoX // threeY"
or
"twoX /// threeY"
So there is a left-keyword, a divider consisting of 2 or 3 slashes and a right-keyword. These are also the parts I would like to consume separately.
"/((?<left>.+)?)(?=(?<divider>[\/]{2,3}))([\/]{2,3})((?<right>.+)?)/";
When I use this regular expression on the first string, everything gets parsed correctly, so;
left: twoX
divider: //
right: threeY
but when I run this expression on the second string, the left and the divider don't get parsed properly. The result I then get is;
left: twoX /
divider: //
right: threeY
I do use the {2,3} in the regular expression to either select 2 or 3 slashes for the divider. But this somehow doesn't seem to work with the match-all character .
Is there a way to get the regex to parse either 2 or 3 slashes without duplicating the entire sequence?

The (.+)? is a greedy dot pattern: it matches as many chars as possible, with 1 being the minimum, and gives characters back only as far as the rest of the pattern needs. Since the next pattern requires only 2 slashes, only 2 slashes are captured into the next group, and the first / stays in Group 1.
Use a lazy pattern in the first group:
'~(?<left>.*?)(?<divider>/{2,3})(?<right>.*)~'
          ^^^
See the regex demo. Add ^ and $ anchors around the pattern to match the whole string if necessary.
Note that you do not need to repeat the same pattern in the lookahead and in the consuming part; it only makes the pattern cumbersome: (?=(?<divider>[\/]{2,3}))([\/]{2,3}) is equivalent to (?<divider>[\/]{2,3}).
Details
(?<left>.*?) - Group "left" that matches any 0+ chars other than line break chars as few as possible
(?<divider>/{2,3}) - 2 or 3 slashes (no need to escape since ~ is used as a regex delimiter)
(?<right>.*) - Group "right" matching any 0+ chars other than line break chars as many as possible (up to the end of line).
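As a rough sketch of how the pattern above might be used from PHP: the helper name parseDivided, the ^ and $ anchors, and the trimming of surrounding whitespace are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: splits "left // right" or "left /// right" into named parts.
function parseDivided(string $input): ?array
{
    // Lazy left group, 2 or 3 slashes, greedy right group; anchored to the whole string.
    $pattern = '~^(?<left>.*?)(?<divider>/{2,3})(?<right>.*)$~';
    if (preg_match($pattern, $input, $m) !== 1) {
        return null; // no divider found
    }
    return [
        'left'    => trim($m['left']),
        'divider' => $m['divider'],
        'right'   => trim($m['right']),
    ];
}

print_r(parseDivided('twoX // threeY'));
print_r(parseDivided('twoX /// threeY'));
```

With the lazy left group, the second call should report /// as the divider rather than leaving one slash in the left part.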
And a more natural-looking splitting approach, see a PHP demo:
$s = "twoX // threeY";
print_r(preg_split('~\s*(/{2,3})\s*~', $s, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY));
// => Array ( [0] => twoX [1] => // [2] => threeY )
You lose the names, but you may add them at a later step.
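One possible way to add the names back afterwards is array_combine. This is only a sketch: the helper name splitWithNames and the rule "exactly three pieces or nothing" are assumptions for illustration.

```php
<?php
// Hypothetical helper: split on the divider, then attach names to the three parts.
function splitWithNames(string $s): ?array
{
    $parts = preg_split('~\s*(/{2,3})\s*~', $s, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    // Only label the result when it has exactly the expected three pieces.
    if ($parts === false || count($parts) !== 3) {
        return null;
    }
    return array_combine(['left', 'divider', 'right'], $parts);
}

print_r(splitWithNames('twoX /// threeY'));
```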

The + quantifier is greedy by default, meaning that it will try to match as many characters as possible. You want the first + to be lazy so that it does not swallow the first /. Adding ? after the quantifier makes it lazy: +?.
This will result in the following regex:
((?<left>.+?)?)(?=(?<divider>[\/]{2,3}))([\/]{2,3})((?<right>.+)?)
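To see the difference side by side, here is a small sketch that uses simplified patterns (named groups only, no lookahead), which is an assumption made to keep the comparison short:

```php
<?php
$input = 'twoX /// threeY';

// Greedy: .+ grabs as much as it can and gives back only the two slashes the divider needs.
preg_match('~(?<left>.+)(?<divider>/{2,3})(?<right>.+)~', $input, $greedy);

// Lazy: .+? stops at the first position where the divider can match.
preg_match('~(?<left>.+?)(?<divider>/{2,3})(?<right>.+)~', $input, $lazy);

printf("greedy: left=[%s] divider=[%s]\n", $greedy['left'], $greedy['divider']);
printf("lazy:   left=[%s] divider=[%s]\n", $lazy['left'], $lazy['divider']);
```

The greedy variant reproduces the result from the question (one slash ends up in the left part), while the lazy variant leaves all three slashes to the divider.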


Sanitize phone number: regular expression match all except first occurence is on first position

Regarding this post, "https://stackoverflow.com/questions/35413960/regular-expression-match-all-except-first-occurence", I'm wondering how to match the first occurrence in a string only if the string starts with a specific character, in PHP.
I would like to sanitize phone numbers. Example of a bad phone number:
+49+12423#23492#aosd#+dasd
Regex to remove all "+" except the first occurrence:
\G(?:\A[^\+]*\+)?+[^\+]*\K\+
Problem: it should remove the extra "+" characters only if the string starts with "+", not if the first occurrence is at a later position.
The regex to remove everything except numbers is easy:
[^0-9]*
But I don't know how to combine those two within one regex. I would just use preg_replace() twice.
Of course I would be able to use a workaround like if ($str[0] === '+') {...} but I prefer to learn some new stuff (regex :)
Thanks for helping.
You can use
(?:\G(?!\A)|^\+)[^+]*\K\+
See the regex demo. Details:
(?:\G(?!\A)|^\+) - either the end of the preceding successful match or a + at the start of string
[^+]* - zero or more chars other than +
\K - match reset operator discarding the text matched so far
\+ - a + char.
See the PHP demo:
$re = '/(?:\G(?!\A)|^\+)[^+]*\K\+/m';
$str = '+49+12423#23492#aosd#+dasd';
echo preg_replace($re, '', $str);
// => +4912423#23492#aosd#dasd
You seem to want to combine the two queries:
A regex to remove everything except numbers
A regex to remove all "+" except the first occurrence
Here is my two cents:
(?:^\+|\d)(*SKIP)(*F)|.
Replace what is matched with nothing. Here is an online demo
(?:^\+|\d) - A non-capture group to match a starting literal plus or any digit in the range from 0-9.
(*SKIP)(*F) - Fail the match for the characters matched so far and skip past them, so they are excluded from the result and not retried.
| - Or:
. - Any single character other than newline.
I'd like to think that this is a slight adaptation of what some consider "The best regex trick ever" where one would first try to match what you don't want, then use an alternation to match what you do want. With the use of the backtracking control verbs (*SKIP)(*F) we reverse the logic. We first match what we do want, exclude it from the results and then match what we don't want.
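A minimal PHP sketch of this approach, assuming a helper named sanitizePhone (the name is made up for illustration):

```php
<?php
// Keep a leading "+" and all digits; every other character is matched and removed.
function sanitizePhone(string $raw): string
{
    return preg_replace('/(?:^\+|\d)(*SKIP)(*F)|./', '', $raw);
}

echo sanitizePhone('+49+12423#23492#aosd#+dasd'), "\n";
echo sanitizePhone('49+123'), "\n";
```

Note that . does not match a newline without the s flag, so line breaks in the input would be left in place.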

Need help building a regex to accept two forms of strings

I am looking to build a regular expression to parse a string, which can be of one of the following two forms: -
Part 1 (Part 2 - Part 3)
or
Part 1 (Part 2)
The following regular expression matches first string and captures all three parts
(.*)\((.*)(?:-)(.*)\)
But I am unable to adapt it so that it matches both strings. I want one regex to match both strings. Not sure if it is even possible.
You may use
'~(.*)\((.*?)(?:-(.*))?\)~'
See the regex demo
Details
(.*) - Group 1: any 0+ chars other than line break chars, as many as possible
\( - a ( char
(.*?) - Group 2: any 0+ chars other than line break chars, as few as possible
(?:-(.*))? - an optional group matching a - and then capturing into Group 3 any 0+ chars other than line break chars, as many as possible
\) - a ) char.
If there can be no other parentheses than those shown in the string, you may optimize the pattern to ^([^()]*)\(([^()-]*)(?:-([^()]*))?\)$.
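A short sketch of how the first pattern could be used in PHP. The helper name parseParts, the trimming, and the choice to return null for a missing Part 3 are assumptions for illustration. Without extra flags, preg_match leaves a trailing unmatched group out of $matches, which is why isset is used here.

```php
<?php
// Hypothetical helper: returns [part1, part2, part3], with part3 null when absent.
function parseParts(string $s): ?array
{
    if (preg_match('~(.*)\((.*?)(?:-(.*))?\)~', $s, $m) !== 1) {
        return null; // neither form matched
    }
    return [
        trim($m[1]),
        trim($m[2]),
        // Group 3 is simply missing from $m when the optional "- Part 3" is absent.
        isset($m[3]) ? trim($m[3]) : null,
    ];
}

print_r(parseParts('Part 1 (Part 2 - Part 3)'));
print_r(parseParts('Part 1 (Part 2)'));
```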

regexp - match pattern and prefix before pattern

I need to match a specific pattern
(?<!\d|\d )(?:dk)?(\d{2})\D?(\d{2})\D?(\d{2})\D?(\d{2})(?!\d)
eg.
dk30344510
dk30 34 45 10
30344510
30 34 45 10
But I also need to fetch the "prefix" string before the pattern
This is my solution, but it doesn't always work
^(.*)(?<!\d|\d )(?:dk)?(\d{2})\D?(\d{2})\D?(\d{2})\D?(\d{2})(?!\d)
It's hard to explain so check it here.
https://regex101.com/r/fM1xD3/2
It's too "greedy" and swallows multiple patterns in the string: the first actual match ends up as part of the "prefix" of the second match.
The example should output two matches, one with dk30344510 and one with 62226420.
The first match should have CVR-nr. as prefix and dk30344510 as the pattern; the second match should have / Tlf. as prefix and 62226420 as the pattern.
Your regex doesn't output the expected results because it has a start-of-string anchor ^ and a greedy dot .*. This means a match can only begin at the start of the string, and the greedy .* then consumes as much as possible, so you get one successful match only.
Solution
Regex:
\s*(.*?)\s*\b((?i:dk)?(?:\d{2}\D?){3}\d{2})\b
I didn't change much in your main regex. I collapsed the repeated \d{2}\D? pattern and replaced the lookarounds with the word boundary token \b.
Live demo
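In PHP this could be tried with preg_match_all. The input string below is an assumption, modelled on the prefixes mentioned in the question, since the original test text is only available behind the regex101 link.

```php
<?php
// Hypothetical input modelled on the prefixes mentioned in the question.
$text = 'CVR-nr. dk30344510 / Tlf. 62226420';

$pattern = '~\s*(.*?)\s*\b((?i:dk)?(?:\d{2}\D?){3}\d{2})\b~';
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

// Group 1 is the prefix, Group 2 the number.
foreach ($matches as $m) {
    printf("prefix=[%s] number=[%s]\n", $m[1], $m[2]);
}
```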
You can try this one with the 'g' option to get multiple results:
^(.*?)\s(dk\d+)\s(.*?)\s(\d+)
demo

Match string that doesn't have number after letter

I've got a scenario as follows. Our system needs to pull filters from a string passed in as a query parameter, but also throw a 404 error if the string isn't correctly formatted. So let's take the following three strings as an example:
pf0pt1000r
pfasdfadf
pf2000pt2100
By the application requirements, only #3 is supposed to match as a "valid" string. My current regex for that is /([a-z]+)(\d+)/. But it also matches #1: not the entire string, but it still finds a match.
My problem thus is twofold - I need 2 patterns, 1 that will match only the 3rd string in this list, and another that will match the "not-acceptable" strings 1 and 2. I believe there must be some way to "negate" a pattern (so then I'd technically only need one pattern, I'm assuming), but I'm not sure how exactly to do that.
Thanks for any help!
EDIT
For clarity's sake, let me explain. The "filter parameters" present here take the following structure - 1 or 2 letters, followed by a number of, well, numbers. That structure can repeat itself however many times. So for example, valid filter strings could be
pf100pt2000
pf100pt2000r2wp0to1
etc.
Invalid strings could be
pf10000pt2000r
pf3000pt2123wpno
... anything not following the structure above.
After clarifying the question:
^([a-zA-Z]{1,2}\d+)*$
Explanation:
[a-zA-Z] - a lower or upper case letter
{1,2} - one or two of those
\d+ - one or more digits
()* - the whole thing repeated any number of times
^$ - match the entire string from start(^) to end($)
You can use this regex for valid input:
^([a-zA-Z]+\d+)+$
RegEx Demo 1
To find invalid inputs use:
^(?!([a-zA-Z]+\d+)+$).+$
RegEx Demo 2
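A quick sketch that runs both patterns against the three strings from the question (the variable names are made up for illustration):

```php
<?php
$valid   = '/^([a-zA-Z]+\d+)+$/';
$invalid = '/^(?!([a-zA-Z]+\d+)+$).+$/';

// preg_match returns 1 on a match and 0 otherwise.
foreach (['pf0pt1000r', 'pfasdfadf', 'pf2000pt2100'] as $s) {
    printf("%-14s valid=%d invalid=%d\n", $s, preg_match($valid, $s), preg_match($invalid, $s));
}
```

For non-empty input the second pattern is simply the negation of the first, so checking that preg_match with the first pattern did not return 1 would likely serve the 404 case just as well.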
/^(?:(?:[a-z]+)(?:\d+))*$/
You were hella close, man. Just need to repeat that pattern over and over again till the end.
Change the * to a + to reject the empty string.
Oh, you had more specific requirements, try this:
/^(?:[a-z]{1,2}\d+)*$/
Broken down:
^ - Matches the start of the string (an anchor)
(?: - start a non-capturing group
[a-z] - A to Z. This you already had.
{1,2} - Repeat 1 or 2 times
\d+ - a digit or more. You had this, too.
)* - Repeat that group ad nauseam
$ - Match the end of the string
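Put together as a validation check, this might look like the sketch below. The helper name isValidFilter is an assumption, and + is used instead of * so that the empty string is rejected too.

```php
<?php
// Hypothetical helper: true when the whole string is a run of "1-2 letters + digits" groups.
function isValidFilter(string $filter): bool
{
    // "+" instead of "*" so the empty string is rejected as well.
    return preg_match('/^(?:[a-z]{1,2}\d+)+$/', $filter) === 1;
}

foreach (['pf100pt2000', 'pf100pt2000r2wp0to1', 'pf10000pt2000r', 'pf3000pt2123wpno'] as $f) {
    echo $f, ' => ', isValidFilter($f) ? 'ok' : '404', "\n";
}
```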
If you only want digits at the end of the string, then
/\d$/
would do. \d = digit, $ = end of string.

Matching ugly extra abbreviations and numbers in titles with PHP regex

I have to create regex to match ugly abbreviations and numbers. These can be one of following "formats":
1) [any alphabet char length of 1 char][0-9]
2) [double][whitespace][2-3 length of any alphabet char]
I tried to match double:
preg_match("/^-?(?:\d+|\d*\.\d+)$/", $source, $matches);
But I couldn't get it to select the following example: 1.1 AA My test title. What is wrong with my regex and how can I add those others to my regex too?
In your regex you say: "start of string, followed by maybe a -, followed by either at least one digit or by 0 or more digits, a dot and at least one digit, followed by the end of string."
So your regex could match, for example, 4.5, -.1 etc. This is exactly what you tell it to do.
Your test input string does not match since there are other characters present after the number 1.1, and even if it somehow magically matched, your "double" matching regex is wrong.
For a double without scientific notation you usually use this regex :
[-+]?\b[0-9]+(\.[0-9]+)?\b
Now that we have this out of our way we need a whitespace \s and
[2-3 length of alphabet]
Now I have no idea what [2-3 length of alphabet] means but by combining the above you get a regex like this :
[-+]?\b[0-9]+(\.[0-9]+)?\b\s[2-3 length of alphabet]
You can also place anchors ^$ if you want the string to match entirely :
^[-+]?\b[0-9]+(\.[0-9]+)?\b\s[2-3 length of alphabet]$
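As a sketch only: if "[2-3 length of alphabet]" is read as two or three ASCII letters (an assumption, since the question does not define it), the placeholder could be written as [A-Za-z]{2,3}. The trailing \b is also an addition made here so that longer words are not cut off after three letters.

```php
<?php
// Assumption: "[2-3 length of alphabet]" means two or three ASCII letters.
$pattern = '/^[-+]?\b[0-9]+(\.[0-9]+)?\b\s[A-Za-z]{2,3}\b/';

// No trailing $ here, because the title continues after the abbreviation.
if (preg_match($pattern, '1.1 AA My test title', $m)) {
    echo $m[0], "\n";
}
```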
Feel free to ask if you are stuck! :)
I see multiple issues with your regex:
You try to match the whole string (as a number) by the anchors: ^ at the beginning and $ at the end. If you don't want that, remove those.
The number group is non-capturing. It will be checked for matches, but those won't be added to $matches. That's because of the ?: at the start of (?:...). Remove ?: to make that group capturing.
You place the shorter digit pattern before the longer one. If you swap the order, the regex engine will try the longer one first and, on success, prefer it over the shorter one.
Maybe this already solves your issue:
preg_match("/-?(\d*\.\d+|\d+)/", $source, $matches);
Demo
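The effect of the alternation order can be checked with a small comparison. Both patterns below are unanchored and use a capturing group; the variable names are made up for illustration.

```php
<?php
$source = '1.1 AA My test title';

// Shorter alternative first: the engine accepts "1" and never tries the decimal form.
preg_match('/-?(\d+|\d*\.\d+)/', $source, $shortFirst);

// Decimal alternative first: "1.1" is matched as a whole.
preg_match('/-?(\d*\.\d+|\d+)/', $source, $longFirst);

echo $shortFirst[1], "\n";
echo $longFirst[1], "\n";
```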
