Groups and quantifiers {m,n}

Is it possible to use quantifiers with groups?
For example, I want to match something like:
11%
09%
aa%
zy%
g1%
8b%
...
The pattern is: 2 letters or numbers (mixed, or not) and a % ending the string ...
<?php
echo preg_match('~^([a-z]+[0-9]+){2}%$~', 'a1%'); // 0, I expect 1.
I know, this example doesn't make much sense; a simple [list]{m,n} would solve it. It's kept as simple as possible just to get an answer.

You sure can apply quantifiers to groups. For example, I have the string:
HouseCatMouseDog
And I have the regex:
(Mouse|Cat|Dog){n}
Where n is any number. You can play around changing the value of n here.
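As for your example (yes, [list]{m,n} would be simpler): ([a-z]+[0-9]+){2} repeats the whole group, so it matches only when one or more letters followed by one or more digits occur twice in a row, as in a1b2%. A single letter-digit pair such as a1% or g1% satisfies the group only once, which is why you get 0.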
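A minimal sketch of how a quantifier behaves when it follows a group, using the strings from this answer. The variable names ($two, $three, $single, $double) are only illustrative, and a1b2% is an assumed extra sample that is not in the question.

```php
<?php
// A quantifier placed after ")" repeats the whole group, not one character.
$text = 'HouseCatMouseDog';

// {2}: two alternatives in a row.
preg_match('~(Mouse|Cat|Dog){2}~', $text, $two);

// {3}: three alternatives in a row.
preg_match('~(Mouse|Cat|Dog){3}~', $text, $three);

echo $two[0], "\n";   // CatMouse
echo $three[0], "\n"; // CatMouseDog

// The pattern from the question needs "letters then digits" twice.
$single = preg_match('~^([a-z]+[0-9]+){2}%$~', 'a1%');   // one pair only
$double = preg_match('~^([a-z]+[0-9]+){2}%$~', 'a1b2%'); // two pairs
echo $single, ' ', $double, "\n"; // 0 1
```

Note that after a repeated group matches, the capture ($two[1] here) holds only the last repetition.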

You don't need two character classes; a single one will do the job.
echo preg_match('~^([a-z0-9]{2})%$~', 'a1%');
RegExp meaning
^ => Matches at the beginning of the string/line
( => Start of the capture group
[a-z0-9] => Matches one character from either range: a-z (abcdefghijklmnopqrstuvwxyz) or 0-9 (0123456789)
{2} => The class above must match exactly 2 times
) => End of the capture group
% => A literal % must follow the two characters
$ => End of the string/line
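As a quick check, the single-class pattern can be run against the samples from the question. This is only a sketch: the last three samples (a%, abc%, A1%) are assumed counter-examples added for contrast, and $results is an illustrative name.

```php
<?php
$pattern = '~^([a-z0-9]{2})%$~';

// The first six samples come from the question; the last three are
// assumed counter-examples (too short, too long, uppercase letter).
$samples = ['11%', '09%', 'aa%', 'zy%', 'g1%', '8b%', 'a%', 'abc%', 'A1%'];

$results = [];
foreach ($samples as $sample) {
    // preg_match() returns 1 on a match and 0 when there is none.
    $results[$sample] = preg_match($pattern, $sample);
    echo $sample, ' => ', $results[$sample], "\n";
}
```

The uppercase sample fails because the class only lists a-z; adding the i modifier or A-Z to the class would accept it.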

Related

How preg_match return 1

I can't manage to create a regular expression for strings of the form xx.xx.xxx,
where x can be any Latin or Russian letter in either case, or a digit. There must be 2 characters, then a dot, then 2 characters, then a dot, then 3 characters.
I came up with the following expression:
var_dump(preg_match('/^([а-я]*[А-Я]*[A-Z]*[a-z]*ё*Ё*[0-9]*){2}.([а-я]*[А-Я]*[A-Z]*[a-z]*ё*Ё*[0-9]*){2}.([а-я]*[А-Я]*[A-Z]*[a-z]*ё*Ё*[0-9]*){3}$/u', 'd1.df.dfd'));
The expression matches valid input, but if you delete 1 character at the end, for example d1.df.df, it still returns 1, although it should return 0. Can you please tell me what the problem is?

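The ([а-я]*[А-Я]*[A-Z]*[a-z]*ё*Ё*[0-9]*){2} part matches 0 or more letters from а to я, then 0 or more letters from А to Я, and so on. Every element inside the group carries the * quantifier, so a single pass through the group can match any number of characters, including none. The {2} after ) only repeats the group twice; it does not limit the number of characters, and a repeated capturing group keeps only its last iteration, which here is often an empty string. Note also that the unescaped . matches any character other than a line break, not just a dot.
What you need is to merge all the character classes inside each group into a single character class, apply the limiting quantifier to that class, and escape the dots: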
'~^([а-яА-ЯA-Za-zёЁ0-9]{2})\.([а-яА-ЯA-Za-zёЁ0-9]{2})\.([а-яА-ЯA-Za-zёЁ0-9]{3})$~u'
See the regex demo
With a case insensitive modifier, it will be a bit shorter:
'~^([а-яa-zё0-9]{2})\.([а-яa-zё0-9]{2})\.([а-яa-zё0-9]{3})$~ui'
Also, you may shorten the pattern using subroutines:
'~^(([а-яa-zё0-9]){2})\.((?2){2})\.((?2){3})$~ui'
See another regex demo. Here, (?2) repeats the pattern inside capturing group #2, ([а-яa-zё0-9]).
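The difference between the original pattern and the merged-class versions can be checked with a short sketch. The variable names are illustrative, the sample Жё.1я.abc is an assumed extra input, and the file is assumed to be saved as UTF-8 so that the u modifier sees valid Cyrillic.

```php
<?php
// Original pattern from the question: every element is optional (*),
// so {2} and {3} do not limit how many characters are matched.
$loose = '/^([а-я]*[А-Я]*[A-Z]*[a-z]*ё*Ё*[0-9]*){2}.([а-я]*[А-Я]*[A-Z]*[a-z]*ё*Ё*[0-9]*){2}.([а-я]*[А-Я]*[A-Z]*[a-z]*ё*Ё*[0-9]*){3}$/u';

// Merged character class with the quantifier applied to the class.
$strict = '~^([а-яА-ЯA-Za-zёЁ0-9]{2})\.([а-яА-ЯA-Za-zёЁ0-9]{2})\.([а-яА-ЯA-Za-zёЁ0-9]{3})$~u';

// Same idea, with the class written once and reused through (?2).
$subroutine = '~^(([а-яa-zё0-9]){2})\.((?2){2})\.((?2){3})$~ui';

$looseShort  = preg_match($loose, 'd1.df.df');       // 1: too permissive
$strictShort = preg_match($strict, 'd1.df.df');      // 0: last part has 2 chars
$strictFull  = preg_match($strict, 'd1.df.dfd');     // 1
$strictMixed = preg_match($strict, 'Жё.1я.abc');     // 1
$subFull     = preg_match($subroutine, 'd1.df.dfd'); // 1
$subShort    = preg_match($subroutine, 'd1.df.df');  // 0

var_dump($looseShort, $strictShort, $strictFull, $strictMixed, $subFull, $subShort);
```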

Double regex matches

I'm preg_match_all looping through a string using different patterns. Sometimes these patterns look a lot like each other, but differ slightly.
Right now I'm looking for a way to stop pattern A from matching strings that only pattern B - which has a 'T' in front of the 4 digits - should match.
The problem I'm running into is that pattern A also matches pattern B:
A:
(\d{4})(A|B)?(C|D)?
... matches 1234, 1234A, 1234AD, etc.
B:
I also have another pattern:
T(\d{4})\/(\d{4})
... which matches strings like: T7878/6767
The result
When running a preg_match_all on "T7878/6767 1234AD", A will give the following matches:
7878, 6767, 1234AD
Does anyone have a suggestion how to prevent A from matching B in a string like "Some text T7878/6767 1234AD and some more text"?
Your help is greatly appreciated!

Scenario with boundaries
If you only want to match those specific strings within some boundaries, use those boundary patterns on each side of the pattern.
If you expect a whitespace boundary before each match, then add the (?<!\S) negative lookbehind at the start of the pattern. If you expect a whitespace boundary at the end of the match, add the (?!\S) negative lookahead. If there can be any chars (as is in your original question), then SKIP-FAIL is the only way (see below).
So, in this first case, you may use
(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)
and
(?<!\S)T(\d{4})\/(\d{4})(?!\S)
See Pattern 1 demo and Pattern 2 demo.
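A minimal sketch of the whitespace-boundary scenario, run against the sample sentence from the question. The variable names $a and $b are illustrative; the / is left unescaped because ~ is used as the delimiter.

```php
<?php
$text = 'Some text T7878/6767 1234AD and some more text';

// Pattern A: 4 digits plus optional letters, with whitespace (or the
// string edge) required on both sides.
preg_match_all('~(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)~', $text, $a);

// Pattern B: T, 4 digits, a slash and 4 digits, with the same boundaries.
preg_match_all('~(?<!\S)T(\d{4})/(\d{4})(?!\S)~', $text, $b);

print_r($a[0]); // only 1234AD: 7878 follows "T" and 6767 follows "/"
print_r($b[0]); // only T7878/6767
```

Because (?<!\S) and (?!\S) are lookarounds, the surrounding spaces are checked but never become part of the match.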
Scenario with no specific boundaries
You need to make sure the second pattern is skipped when you parse the string with the first one. Use SKIP-FAIL technique for this:
'~T\d{4}/\d{4}(*SKIP)(*F)|(\d{4})(A|B)?(C|D)?~'
See the regex demo.
If you do not need the capturing groups, you may simplify it to
'~T\d{4}/\d{4}(*SKIP)(*F)|\d{4}[AB]?[CD]?~'
See another demo
Details
T\d{4}/\d{4} - T followed by 4 digits, / and another 4 digits
(*SKIP)(*F) - the text matched so far is discarded and the search for the next match resumes at the end of that text
| - or
\d{4}[AB]?[CD]? - 4 digits, then optionally A or B and then optionally C or D.
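The effect of the SKIP-FAIL alternative can be seen by comparing it with the original pattern A on the sentence from the question. This is a small sketch; $plain and $skip are illustrative names.

```php
<?php
$text = 'Some text T7878/6767 1234AD and some more text';

// Original pattern A: also picks up the digits inside T7878/6767.
preg_match_all('~(\d{4})(A|B)?(C|D)?~', $text, $plain);

// SKIP-FAIL: the first alternative consumes T7878/6767 and then fails
// on purpose, so the engine resumes after it and never looks inside.
preg_match_all('~T\d{4}/\d{4}(*SKIP)(*F)|\d{4}[AB]?[CD]?~', $text, $skip);

print_r($plain[0]); // 7878, 6767, 1234AD
print_r($skip[0]);  // 1234AD
```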

From what you're asking, your current regexes don't really work. (A|B)?(C|D)? will never match AB. So I think you meant [ABCD]
Here's your new regex:
T(\d{4})\/(\d{4}) (\d{4}[ABCD]*)
For the string input:
T7878/6767 1234AB
We get the groups:
Match 1
Full match 0-17 `T7878/6767 1234AB`
Group 1. 1-5 `7878`
Group 2. 6-10 `6767`
Group 3. 11-17 `1234AB`
Regex101

Your syntax is pretty specific, so your regex needs to be just as specific. Get rid of all your capture groups because they are getting in the way. You only need two groups that match your string syntax exactly.
The first group looks for a word boundary followed by T, then 4 digits, then /, then 4 more digits and a word boundary.
The second group matches 4 digits followed by between 0 and 2 of the letters A-D. It is preceded by a negative lookbehind, so it only matches when the 4 digits are at the start of the string or come right after a whitespace character.
(\bT\d{4}\/\d{4}\b)|(?<!\S)(\d{4}[A-D]{0,2})
Preg match all output:
Array
(
[0] => Array
(
[0] => T7878/6767
[1] => 1234AB
)
[1] => Array
(
[0] => T7878/6767
[1] =>
)
[2] => Array
(
[0] =>
[1] => 1234AB
)
)

Variable sized lookahead consume

I'm trying to use a regular expression to parse a varying string using PHP. That string can be, for example:
"twoX // threeY"
or
"twoX /// threeY"
So there is a left-keyword, a divider consisting of 2 or 3 slashes and a right-keyword. These are also the parts I would like to consume separately.
"/((?<left>.+)?)(?=(?<divider>[\/]{2,3}))([\/]{2,3})((?<right>.+)?)/";
When I use this regular expression on the first string, everything gets parsed correctly, so;
left: twoX
divider: //
right: threeY
but when I run this expression on the second string, the left and the divider don't get parsed properly. The result I then get is;
left: twoX /
divider: //
right: threeY
I do use the {2,3} in the regular expression to either select 2 or 3 slashes for the divider. But this somehow doesn't seem to work with the match-all character .
Is there a way to get the regex to parse either 2 or 3 slashes without duplicating the entire sequence?

The (.+)? is a greedy dot pattern: it matches as many chars as possible, with 1 being the minimum, and then backtracks only as far as needed. Since the divider pattern is satisfied by just 2 slashes, the engine stops backtracking as soon as 2 slashes are available, so only 2 slashes are captured as the divider and the first / stays in Group 1.
Use a lazy pattern in the first group:
'~(?<left>.*?)(?<divider>/{2,3})(?<right>.*)~'
          ^^^
See the regex demo. Add ^ and $ anchors around the pattern to match the whole string if necessary.
Note you do not need to repeat the same pattern in the lookahead and the consuming pattern part, it only makes the pattern cumbersome, (?=(?<divider>[\/]{2,3}))([\/]{2,3}) = (?<divider>[\/]{2,3}).
Details
(?<left>.*?) - Group "left" that matches any 0+ chars other than line break chars as few as possible
(?<divider>/{2,3}) - 2 or 3 slashes (no need to escape since ~ is used as a regex delimiter)
(?<right>.*) - Group "right" matching any 0+ chars other than line break chars as many as possible (up to the end of line).
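To see the greedy and lazy behaviour side by side, here is a short sketch. The helper splitOnDivider() is hypothetical and not part of the original answer, and the trim() calls are an added assumption to drop the spaces around the divider.

```php
<?php
// Hypothetical helper: returns the three named parts, or null when the
// pattern does not match. trim() only removes surrounding whitespace.
function splitOnDivider(string $pattern, string $input): ?array
{
    if (preg_match($pattern, $input, $m) !== 1) {
        return null;
    }
    return [
        'left'    => trim($m['left']),
        'divider' => $m['divider'],
        'right'   => trim($m['right']),
    ];
}

$greedy = '~(?<left>.*)(?<divider>/{2,3})(?<right>.*)~';  // .* grabs a slash
$lazy   = '~(?<left>.*?)(?<divider>/{2,3})(?<right>.*)~'; // .*? stops early

$greedyResult = splitOnDivider($greedy, 'twoX /// threeY');
$lazyResult   = splitOnDivider($lazy, 'twoX /// threeY');
$lazyTwo      = splitOnDivider($lazy, 'twoX // threeY');

print_r($greedyResult); // left: "twoX /", divider: "//"
print_r($lazyResult);   // left: "twoX",   divider: "///"
print_r($lazyTwo);      // left: "twoX",   divider: "//"
```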
And a more natural-looking splitting approach, see a PHP demo:
$s = "twoX // threeY";
print_r(preg_split('~\s*(/{2,3})\s*~', $s, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY));
// => Array ( [0] => twoX [1] => // [2] => threeY )
You lose the names, but you may add them at a later step.

The + quantifier is greedy by default, meaning that it will try to match as many characters as possible. To stop the first + from swallowing the first /, make it lazy by adding ? after it: +?.
This will result in the following regex:
((?<left>.+?)?)(?=(?<divider>[\/]{2,3}))([\/]{2,3})((?<right>.+)?)

Regex for matching token wrapped in %

I have user-entered text with potentially mistyped "tokens" I'm trying to find using PHP.
A valid "token" is any number of word characters wrapped in percent signs - so %blah% %blah_moreblah%. Basically I'm looking for tokens where the user may have forgotten to put a leading or trailing '%'. I'm also looking for tokens in the valid format - as at this point in my code, all replaceable tokens have already been replaced.
So, the 3 situations I'm looking for are (to borrow regex syntax): %\w+, %\w+%, or \w+%.
In English, what I'm looking for is "a string that starts with a % and/or ends with a % and contains only word characters".
The regex I have this far is: (%*\w+%*), but you'll notice it matches every single word. I'm stuck on making a match require at least a leading or a trailing %.
Edit: Initially I tried to have all 3 situations found with their own regex. However, I was finding that the regex for finding tokens in the first situation would also find tokens in the second situation, just without the trailing %. For example, /(%\w+)/, when checked against %before %both%, would match %before and %both.

To match tokens enclosed with %, or having % on either side, use
(?=\w*%)%*\w+%*
See another regex demo.
This is your pattern with a positive lookahead added. The (?=\w*%) restricts matches to those where a % appears after zero or more word characters.
Note also that %* matches zero or more percent signs, so it may match %%%word%%. If that is not what you need and you want to match 0 or 1 % on each side, just replace the * with the ? quantifier.
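A minimal sketch of the lookahead pattern in use. The input sentence is an assumed sample containing one valid token and two mistyped ones, and $tokens is an illustrative name.

```php
<?php
// Assumed sample input: one valid token, one missing its trailing %,
// one missing its leading %.
$input = 'Hello this is a %string% with %some_words in it just for demo% purposes.';

// The lookahead requires a % either right here or right after the word
// characters, so plain words are never matched.
preg_match_all('~(?=\w*%)%*\w+%*~', $input, $tokens);

print_r($tokens[0]); // %string%, %some_words, demo%
```

Telling valid tokens from mistyped ones is then a matter of checking whether each match both starts and ends with %.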

Try this:
$input_lines = "Hello this is a %string% with %some_words in it just for demo% purposes.";
preg_match_all("/\s[\w_\-]+%\.?|%[\w_\-]+(%|\s|\.)/", $input_lines, $output_array);
That will output this:
array(
0 => %string%
1 => %some_words
2 => demo%
)
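Note that this will catch the valid cases, as well as the typos you are looking for. The raw matches also include the adjacent whitespace or period that the pattern consumes (for example "%some_words " and " demo%"), so trim them before use.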

Can Someone explain this reg ex to me?

I recently asked a question on formatting a telephone number and got lots of responses. Most of them were great, but there is one I really want to understand because it worked so well. If $phone is the following, how do the other lines work? What are they doing, so I can learn?
$phone = "(407)888-9999";
$phone = preg_replace("~[^0-9]~", "", $phone);
preg_match('~([0-9]{3})([0-9]{3})([0-9]{4})~', $phone, $matches);

Let's break the code into two lines.
preg_replace("~[^0-9]~", "", $phone);
First, we're going to replace matches to a regex with an empty string (in other words, delete matches from the string). The regex is [^0-9] (the ~ on each end is a delimiter). [...] in a regex defines a character class, which tells the regex engine to match one character within the class. Dashes are generally special characters inside a character class, and are used to specify a range (ie. 0-9 means all characters between 0 and 9, inclusive).
You can think of a character class like a shorthand for a big OR condition: ie. [0-9] is a shorthand for 0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9. Note that classes don't have to contain ranges, either -- [aeiou] is a character class that matches a or e or i or o or u (or in other words, any vowel).
When the first character in the class is ^, the class is negated, which means that the regex engine should match any character that isn't in the class. So when you put all that together, the first line removes anything that isn't a digit (a character between 0 and 9) from $phone.
preg_match('~([0-9]{3})([0-9]{3})([0-9]{4})~', $phone, $matches);
The second line tries to match $phone against a second expression, and puts the results into an array called $matches, if a match is made. You will note there are three sets of parentheses; these define capturing groups -- ie. if the pattern as a whole matches, you will end up with three submatches, which in this case will contain the area code, prefix and suffix of the phone number. In general, anything contained in parentheses in a regular expression is capturing (while there are exceptions, they are beyond the scope of this explanation). Groups can be useful for other things too, without wanting the overhead of capturing, so a group can be made non-capturing by prefacing it with ?: (ie. (?:...)).
Each group does a similar thing: [0-9]{3} or [0-9]{4}. As we saw above, [0-9] defines a character class containing the digits between 0 and 9 (as the classes here don't start with ^, these are not negated classes). The {3} or {4} is a repetition operator, which says "match exactly 3 (or 4) of the previous token (or group)". So [0-9]{3} will match exactly three digits in a row, and [0-9]{4} will match exactly four digits in a row. Note that the digits don't have to be all the same (ie. 111), because the character class is evaluated for each repetition (so 123 will match because 1 matches [0-9], then 2 matches [0-9], and then 3 matches [0-9]).
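Putting the two lines together, this sketch shows what each step leaves behind. The $digits and $formatted variables and the output format are assumptions added for illustration; they are not part of the original snippet.

```php
<?php
$phone = "(407)888-9999";

// Step 1: delete every character that is not a digit.
$digits = preg_replace("~[^0-9]~", "", $phone);
echo $digits, "\n"; // 4078889999

// Step 2: split the ten digits into three captured groups.
// $matches[0] is the whole match; [1], [2] and [3] are the groups.
preg_match('~([0-9]{3})([0-9]{3})([0-9]{4})~', $digits, $matches);
print_r($matches);

// Illustrative only: rebuild the number in an assumed display format.
$formatted = "({$matches[1]}) {$matches[2]}-{$matches[3]}";
echo $formatted, "\n"; // (407) 888-9999
```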

In the preg_replace, it looks for anything that is not 0-9 (the ^ inside the [] negates the class, so basically anything that is not a digit) and removes it from the string, since the replacement is "".
For the first section, it pulls out the first 3 digits with ([0-9]{3}): the {3} is the number of characters to match, the items inside the [] are what to match, and since this is inside parentheses () it is stored as a match in the array $matches. The second part pulls out the next 3 digits, and the last part pulls out the last 4 digits from $phone; each is stored in $matches.
The ~ characters are delimiters for the regular expressions.

You know it's a regular expression from the regex tag.
So, you are pattern matching.
The first pattern, [^0-9], is applied by preg_replace before the phone number is matched.
[^0-9] means NOT (^) any one digit, so every non-digit is removed.
The match after that is any 3 digits, followed by any 3 digits, followed by any 4 digits.
It matches only because the () around the area code and the dash have already been stripped. To match the original string directly, without stripping,
I'd do this:
~\(([0-9]{3})\)([0-9]{3})-([0-9]{4})~

"[^0-9]" means everything but numbers from 0 to 9. So basically, first line replace everything but numbers with "" (nothing)
[0-9]{3} means a digit from 0 to 9, 3 times in a row.
So the second line checks whether you have 3 digits, then 3 digits, then 4 digits, and stores what it matched in $matches.

Check this tutorial:
Using Regular Expressions with PHP
http://www.webcheatsheet.com/php/regular_expressions.php

$phone = "(407)888-9999";
$phone = preg_replace("~[^0-9]~", "", $phone);
In PHP you have to delimit a regex pattern with a non-alphanumeric character; "~" is used here.
[^0-9] is the regex pattern used to remove anything from $phone that is not in the 0-9 range. Remember that a leading ^ inside [...] negates the character class.
preg_match('~([0-9]{3})([0-9]{3})([0-9]{4})~', $phone, $matches);
Again, in this line of code you have "~" as the delimiter, and
([0-9]{3}) is the part of the pattern that returns 3 digits from the string (note: {} is used to specify the number or range of repetitions to match). Using ( ) in a pattern results in groups/submatches, each stored in its own element of the output array (check your $matches variable for the result).
