How to capture multiple occurrences of subpattern into one capture?

I want a regular expression that captures multiple occurrences into one group. As an example, imagine the following phrases:
cat | likes her | mat
dog | goes to his | basket
I want to be able to capture each part of the phrase into a fixed position:
array(
0 => cat likes her mat
1 => cat
2 => likes her
3 => mat
)
Obviously using:
$regex = '/(cat|dog)( likes| goes| to| his| her)* (mat|basket)/';
preg_match($regex, "The cat likes her mat", $m);
gives:
array(
0 => cat likes her mat
1 => cat
2 => likes
3 => her
4 => mat
)
But I always want mat/basket in $m[3], regardless of how many words are matched in the middle.
I have tried this:
$regex = '/(cat|dog)(?:( likes| goes| to| his| her)*) (mat|basket)/';
to try and prevent capturing of the multiple subpatterns, but this causes only the first word to be captured i.e.
array(
0 => cat likes her mat
1 => cat
2 => likes
3 => mat
)
Does anyone know how I can capture the whole of the middle part of the phrase (an unknown number of words in length) but still get it into a predictable position in the output?
By the way, I cannot use (cat|dog).*?(mat|basket) because only specified words are allowed in the middle.
The above is just an example; the actual usage has many more options for each of the subpatterns.
Thanks.

Did you try this pattern?
/\b(cat|dog) ((?: ?(?:likes|goes|to|his|her)\b)*) ?(mat|basket)\b/

How about this pattern?
$regex = '/\b(cat|dog)\b((?:\b(?:\s+|likes|goes|to|his|her)\b)*)\b(mat|basket)\b/';
preg_match($regex, "The cat likes her mat", $m);
I have this result:
array (size=4)
0 => string 'cat likes her mat' (length=17)
1 => string 'cat' (length=3)
2 => string ' likes her ' (length=11)
3 => string 'mat' (length=3)
I voted for Casimir's answer; however, his pattern returns false positives on these strings:
cat likesher mat
cat likes her mat
cat mat
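Building on the patterns above, one way to avoid those false positives is to make each middle word carry its own leading space and to wrap the whole repetition in one outer capturing group. This is only a minimal sketch, assuming the word lists from the question; the capturePhrase() helper name is made up for illustration and does not come from either answer.

```php
<?php
// Hypothetical helper (not from the thread): returns
// [full match, animal, middle words, place] or null when nothing matches.
function capturePhrase(string $text): ?array
{
    // Inner (?: ...) groups are non-capturing; only the outer group around
    // the repetition captures, so the middle part always lands in index 2.
    // The * mirrors the question; use + to require at least one middle word.
    $pattern = '/\b(cat|dog)((?: (?:likes|goes|to|his|her))*) (mat|basket)\b/';

    if (!preg_match($pattern, $text, $m)) {
        return null;
    }

    // Drop the leading space that belongs to the first middle word.
    $m[2] = ltrim($m[2]);

    return $m;
}

print_r(capturePhrase('The cat likes her mat'));
print_r(capturePhrase('The dog goes to his basket'));
var_dump(capturePhrase('cat likesher mat')); // words run together: no match
```

Because the repetition sits inside a single outer group, mat/basket stays in index 3 however many middle words are matched, and the middle part can be exploded on spaces afterwards if the individual words are needed.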

Related

Parse strictly formatted text containing multiple entries with no delimiting character

I have a string containing multiple products orders which have been joined together without a delimiter.
I need to parse the input string and convert sets of three substrings into separate rows of data.
I tried splitting the string using split() and strstr() function, but could not generate the desired result.
How can I convert this statement into different columns?
RM is Malaysian Ringgit
From this statement:
"2 x Brew Coffeee Panas: RM7.42 x Tongkat Ali Ais: RM8.6"
Into separate rows:
2 x Brew Coffeee Panas: RM7.4
2 x Tongkat Ali Ais: RM8.6
And these 2 rows into this table in the DB:
Table: Products
Product Name        | Quantity | Total Amount (RM)
Brew Coffeee Panas  | 2        | 7.4
Tongkat Ali Ais     | 2        | 8.6
*Note: the "total amount" substrings will reliably have a numeric value with precision to one decimal place.

You could use regex if your string format is consistent. Here's an expression that could do that:
(\d) x (.+?): RM(\d+\.\d)
Basic usage
$re = '/(\d) x (.+?): RM(\d+\.\d)/';
$str = '2 x Brew Coffeee Panas: RM7.42 x Tongkat Ali Ais: RM8.6';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_export($matches);
Which gives
array (
0 =>
array (
0 => '2 x Brew Coffeee Panas: RM7.4',
1 => '2',
2 => 'Brew Coffeee Panas',
3 => '7.4',
),
1 =>
array (
0 => '2 x Tongkat Ali Ais: RM8.6',
1 => '2',
2 => 'Tongkat Ali Ais',
3 => '8.6',
),
)
Group 0 will always be the full match; after that, the groups are quantity, product and price.

Capture one or more digits
Match the space, x, space
Capture one or more non-colon characters until the first occurring colon
Match the colon, space, then RM
Capture the float value that has a max decimal length of 1 (the OP says in a comment under the question: "it only take one decimal place for the amount")
There are no "lazy quantifiers" in my pattern, so the regex can move most swiftly.
This regex pattern is as Accurate as the sample data and requirement explanation allows, as Efficient as it can be because it only contains greedy quantifiers, as Concise as it can be thanks to the negated character class, and as Readable as the pattern can be made because there are no superfluous characters.
Code:
var_export(
preg_match_all('~(\d+) x ([^:]+): RM(\d+\.\d)~', $string, $m)
? array_slice($m, 1) // omit the fullstring matches
: [] // if there are no matches
);
Output:
array (
0 =>
array (
0 => '2',
1 => '2',
),
1 =>
array (
0 => 'Brew Coffeee Panas',
1 => 'Tongkat Ali Ais',
),
2 =>
array (
0 => '7.4',
1 => '8.6',
),
)
You can add the PREG_SET_ORDER argument to the preg_match_all() call to aid in iterating the matches as rows.
preg_match_all('~(\d+) x ([^:]+): RM(\d+\.\d)~', $string, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
echo '<tr><td>' . implode('</td><td>', array_slice($match, 1)) . '</td></tr>';
}
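To get from the matches to rows shaped like the Products table in the question, the captures can be mapped into associative arrays. The following is a rough sketch rather than a definitive implementation: the parseOrders() function name, the named groups and the array keys are assumptions made for illustration, and the pattern is the same greedy one used above.

```php
<?php
// Hypothetical helper: turns the undelimited order string into rows.
function parseOrders(string $input): array
{
    // Same pattern as above, with named groups added for readability.
    $pattern = '~(?<qty>\d+) x (?<name>[^:]+): RM(?<amount>\d+\.\d)~';

    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

    $rows = [];
    foreach ($matches as $match) {
        $rows[] = [
            'product_name' => $match['name'],
            'quantity'     => (int) $match['qty'],
            'total_amount' => (float) $match['amount'],
        ];
    }

    return $rows;
}

var_export(parseOrders('2 x Brew Coffeee Panas: RM7.42 x Tongkat Ali Ais: RM8.6'));
```

Each row can then be bound to a prepared INSERT statement; the casts keep the quantity and amount numeric instead of leaving them as strings.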

You can use a regex like this:
/(\d+)\sx\s([^:]+):\sRM(\d+\.?\d?)(?=\d|$)/
Explanation:
(\d+) captures one or more digits
\s matches a whitespace character
([^:]+): captures one or more non : characters that come before a : character (you can also use something like [a-zA-Z0-9\s]+): if you know exactly which characters can exist before the : character - in this case lower case and upper case letters, digits 0 through 9 and whitespace characters)
(\d+\.?\d?) captures one or more digits, followed by a . and another digit if they exist
(?=\d|$) is a positive lookahead which matches a digit after the main expression without including it in the result, or the end of the string
You can also add the PREG_SET_ORDER flag to preg_match_all() to group the results:
PREG_SET_ORDER
Orders results so that $matches[0] is an array of first set of matches, $matches[1] is an array of second set of matches, and so on.
Code example:
<?php
$txt = "2 x Brew Coffeee Panas: RM7.42 x Tongkat Ali Ais: RM8.62 x B026 Kopi Hainan Kecil: RM312 x B006 Kopi Hainan Besar: RM19.5";
$pattern = "/(\d+)\sx\s([^:]+):\sRM(\d+\.?\d?)(?=\d|$)/";
if(preg_match_all($pattern, $txt, $matches, PREG_SET_ORDER)) {
print_r($matches);
}
?>
Output:
Array
(
[0] => Array
(
[0] => 2 x Brew Coffeee Panas: RM7.4
[1] => 2
[2] => Brew Coffeee Panas
[3] => 7.4
)
[1] => Array
(
[0] => 2 x Tongkat Ali Ais: RM8.6
[1] => 2
[2] => Tongkat Ali Ais
[3] => 8.6
)
[2] => Array
(
[0] => 2 x B026 Kopi Hainan Kecil: RM31
[1] => 2
[2] => B026 Kopi Hainan Kecil
[3] => 31
)
[3] => Array
(
[0] => 2 x B006 Kopi Hainan Besar: RM19.5
[1] => 2
[2] => B006 Kopi Hainan Besar
[3] => 19.5
)
)

The first thing I would do is perform a simple replacement using preg_replace to insert a delimiter after each price, with the aid of a back-reference to the captured item, based upon the known format of a single decimal place. Anything beyond that single decimal place forms part of the next item - the quantity in this case.
$str="2 x Brew Coffeee Panas: RM7.42 x Tongkat Ali Ais: RM8.625 x Koala Kebabs: RM15.23 x Fried Squirrel Fritters: RM32.4";
# qty price
# 2 7.4
# 2 8.6
# 25 15.2
# 3 32.4
/*
Our RegEx to find the decimal precision,
to split the string apart and the quantity
*/
$pttns=(object)array(
'repchar' => '#(RM\d{1,}\.\d{1})#',
'splitter' => '#(\|)#',
'combo' => '#^((\d{1,}) x)(.*): RM(\d{1,}\.\d{1})$#'
);
# create a new version of the string with our specified delimiter - the PIPE
$str = preg_replace( $pttns->repchar, '$1|', $str );
# split the string into pieces - discard empty items
$a=array_filter( preg_split( $pttns->splitter, $str, -1 ) );
#iterate through matches - find the quantity,item & price
foreach($a as $str){
preg_match($pttns->combo,$str,$matches);
$qty=$matches[2];
$item=$matches[3];
$price=$matches[4];
printf('%s %d %.1f<br />',$item,$qty,$price);
}
Which yields:
Brew Coffeee Panas 2 7.4
Tongkat Ali Ais 2 8.6
Koala Kebabs 25 15.2
Fried Squirrel Fritters 3 32.4

Get all matches with pure regex?

I'm working in PHP and need to parse strings looking like this:
Rake (100) Pot (1000) Players (andy: 10, bob: 20, cindy: 70)
I need to get the rake, pot, and rake contribution per player with names. The number of players is variable. Order is irrelevant so long as I can match player name to rake contribution in a consistent way.
For example I'm looking to get something like this:
Array
(
[0] => Rake (100) Pot (1000) Players (andy: 10, bob: 20, cindy: 70)
[1] => 100
[2] => 1000
[3] => andy
[4] => 10
[5] => bob
[6] => 20
[7] => cindy
[8] => 70
)
I was able to come up with a regex which matches the string, but it only returns the last player-rake contribution pair:
^Rake \(([0-9]+)\) Pot \(([0-9]+)\) Players \((?:([a-z]*): ([0-9]*)(?:, )?)*\)$
Outputs:
Array
(
[0] => Rake (100) Pot (1000) Players (andy: 10, bob: 20, cindy: 70)
[1] => 100
[2] => 1000
[3] => cindy
[4] => 70
)
I've tried using preg_match_all and g modifiers but to no success. I know preg_match_all would be able to get me what I wanted if I ONLY wanted the player-rake contribution pairs but there is data before that I also require.
Obviously I can use explode and parse the data myself but before going down that route I need to know if/how this can be done with pure regex.

You could use the regex below:
(?:^Rake \(([0-9]+)\) Pot \(([0-9]+)\) Players \(|)(\w+):?\s*(\d+)(?=[^()]*\))
The empty alternative after the | at the end of the first non-capturing group lets the regex engine match the remaining players in the string using only the part of the pattern that follows the group.

I would use the following Regex to validate the input string:
^Rake \((?<Rake>\d+)\) Pot \((?<Pot>\d+)\) Players \(((?:\w*: \d*(?:, )?)+)\)$
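And then just use the explode() function on the last capture group to split the players out. Named groups also occupy numeric indexes, so Rake is $matches[1], Pot is $matches[2], and the players are in $matches[3]:
preg_match($regex, $string, $matches);
$players = explode(', ', $matches[3]);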
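The validate-then-split idea can be put together roughly as follows. This is a sketch under assumptions: the parseRakeLine() name and the shape of the returned array are invented for illustration, and a second preg_match_all() call stands in for explode() so that names and contributions are separated in the same step.

```php
<?php
// Hypothetical helper: validates the whole line, then pulls out each player.
function parseRakeLine(string $line): ?array
{
    // Step 1: validate the line and capture rake, pot and the raw player list.
    $outer = '/^Rake \((\d+)\) Pot \((\d+)\) Players \(((?:\w+: \d+(?:, )?)+)\)$/';
    if (!preg_match($outer, $line, $m)) {
        return null;
    }

    // Step 2: a repeated group only keeps its last capture, so the
    // name/contribution pairs are collected with a separate call.
    preg_match_all('/(\w+): (\d+)/', $m[3], $pairs, PREG_SET_ORDER);

    $players = [];
    foreach ($pairs as $pair) {
        $players[] = [$pair[1], (int) $pair[2]];
    }

    return ['rake' => (int) $m[1], 'pot' => (int) $m[2], 'players' => $players];
}

print_r(parseRakeLine('Rake (100) Pot (1000) Players (andy: 10, bob: 20, cindy: 70)'));
```

Keeping the players as a list of [name, contribution] pairs preserves their order and avoids surprises if a name happens to be purely numeric.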

Regex (preg_split): how do I split based on a delimiter, excluding delimiters included in a pair of quotes?

I split this:
1 2 3 4/5/6 "7/8 9" 10
into this:
1
2
3
4
5
6
"7/8 9"
10
with preg_split()
So my question is, how do I split based on a delimiter, excluding delimiters inside a pair of quotes?
I kind of want to avoid capturing the things in quotes first and would ideally like it to be a one liner.

You can use the following.
$text = '1 2 3 4/5/6 "7/8 9" 10';
$results = preg_split('~"[^"]*"(*SKIP)(*F)|[ /]+~', $text);
print_r($results);
Explanation:
On the left side of the alternation operator we match anything in quotation marks and then make the subpattern fail; the backtracking control verbs (*SKIP) and (*F) force the regular expression engine not to retry that substring. The right side of the alternation operator matches one or more space or forward slash characters that are not in quotation marks.
Output
Array
(
[0] => 1
[1] => 2
[2] => 3
[3] => 4
[4] => 5
[5] => 6
[6] => "7/8 9"
[7] => 10
)
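For reuse, the (*SKIP)(*F) pattern above can be wrapped in a small function. This is only a sketch: the splitOutsideQuotes() name and the second sample string are made up for illustration, and it assumes quotes are always balanced and never escaped.

```php
<?php
// Hypothetical wrapper around the (*SKIP)(*F) pattern shown above.
function splitOutsideQuotes(string $text): array
{
    // Left alternative: consume a quoted part, then force a failure and skip
    // past it so its contents are never considered as delimiters.
    // Right alternative: the real delimiters (runs of spaces and slashes).
    return preg_split('~"[^"]*"(*SKIP)(*F)|[ /]+~', $text);
}

print_r(splitOutsideQuotes('1 2 3 4/5/6 "7/8 9" 10'));
print_r(splitOutsideQuotes('a/b "c d" e "f/g"'));
```

The second call checks that a quoted part at the very end of the string is also left intact.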

You can use:
$s = '1 2 3 4/5/6 "7/8 9" 10';
$arr = preg_split('~("[^"]*")|[ /]+~', $s, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
print_r( $arr );
OUTPUT:
Array
(
[0] => 1
[1] => 2
[2] => 3
[3] => 4
[4] => 5
[5] => 6
[6] => "7/8 9"
[7] => 10
)

Another way, with an optional group:
$arr = preg_split('~(?:"[^"]*")?\K[/\s]+~', $s);
The pattern "[^"]*"[/\s]+ matches a quoted part followed by one or more whitespace characters or slashes. But since you don't want to remove quoted parts, you put a \K after the quoted part. The \K removes everything matched to its left from the match result. With this trick, when a quoted part is found, the regex engine returns only the spaces or slashes after it and splits on them.
Since there is not always a quoted part before a space or a slash, you only need to make it optional with a non-capturing group (?:...) and a question mark ?.

Look inside pattern if parent pattern matches and share chars between patterns

I have a string like this:
Tickets order: № 123123123. CED-MSW-RPG-MOW-CEK PODYLOVA/ALEMR 555
423578932 19OCT11 Tickets order: № 123123123. 346257.
CSK-MOW-PRG-MOW-CWQ PODYLOVA/ALEMR 555 45837043 19OCT11
I need to collect all codes that are CEK, MOW, PRG and so on. I tried this pattern firstly:
$pattern = '#[-|\s]([A-Z]{3})#';
As a result I get all my codes (that's OK) and the first 3 chars of the user's surname: "POD" from "PODYLOVA". If I say "after my code there must be a hyphen or a whitespace char" by changing my pattern to this:
$pattern = '#[-|\s]([A-Z]{3})[-|\s]#';
My $matches var has this:
array (
0 =>
array (
0 => ' CED-',
1 => '-RPG-',
2 => '-CEK ',
3 => ' CSK-',
4 => '-PRG-',
5 => '-CWQ ',
),
1 =>
array (
0 => 'CED',
1 => 'RPG',
2 => 'CEK',
3 => 'CSK',
4 => 'PRG',
5 => 'CWQ',
),
)
You can see that my pattern doesn't "share" the hyphen between desired codes.
I see two solutions, but cannot imagine a pattern that will suit:
Make the pattern share the hyphen between codes
Make a more complicated pattern: first collect the text which contains codes ("CED-MSW-RPG-MOW-CEK") and then get all #([A-Z]{3})# inside this match.
It seems that solution #1 is the best in my case, but how should it look?

Try this:
\b([A-Z]{3})\b
HTH

Does this give you what you want?
(?<=-|\s)[A-Z]{3}(?=-|\s)
tested with grep:
kent$ echo "Tickets order: № 123123123. CED-MSW-RPG-MOW-CEK PODYLOVA/ALEMR 555 423578932 19OCT11 Tickets order: № 123123123. 346257. CSK-MOW-PRG-MOW-CWQ PODYLOVA/ALEMR 555 45837043 19OCT11"|grep -Po '(?<=-|\s)[A-Z]{3}(?=-|\s)'
CED
MSW
RPG
MOW
CEK
CSK
MOW
PRG
MOW
CWQ
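The same lookaround idea carries over to PHP with preg_match_all(). Lookbehind and lookahead are zero-width, so the hyphen between two codes is checked by both neighbours without being consumed by either. A minimal sketch, assuming a shortened sample string; the extractCodes() name is hypothetical:

```php
<?php
// Hypothetical helper: collects three-letter codes delimited by - or whitespace.
function extractCodes(string $text): array
{
    // (?<=[-\s]) and (?=[-\s]) only assert the neighbouring character,
    // so adjacent codes can "share" the same hyphen.
    preg_match_all('/(?<=[-\s])[A-Z]{3}(?=[-\s])/', $text, $m);

    return $m[0];
}

$sample = 'Tickets order: 123. CED-MSW-RPG-MOW-CEK PODYLOVA/ALEMR 555 19OCT11';
print_r(extractCodes($sample));
```

Note that a code at the very start or end of the string would not be matched by this version, because both lookarounds require an actual character next to the code.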

regex match between 2 strings

For example I have the text
a1aabca2aa3adefa4a
I want to extract 2 and 3, which sit between abc and def, with a regex, so 1 and 4 should not be included in the result.
I tried this
if(preg_match_all('#abc(?:a(\d)a)+def#is', file_get_contents('test.txt'), $m, PREG_SET_ORDER))
print_r($m);
I get this
Array
(
[0] => Array
(
[0] => abca2aa3adef
[1] => 3
)
)
But I want this
Array
(
[0] => Array
(
[0] => abca2aa3adef
[1] => 2
[2] => 3
)
)
Is this possible with one preg_match_all call? How can I do it?
Thanks

preg_match_all(
'/\d # match a digit
(?=.*def) # only if followed by <anything> + def
(?!.*abc) # and not followed by <anything> + abc
/x',
$subject, $result, PREG_PATTERN_ORDER);
$result = $result[0];
works on your example. It assumes that there is exactly one instance of abc and def per line in your string.
The reason why your attempt didn't work is that your capturing group (\d) that matches the digit is within another, repeated group (?:a(\d)a)+. With every repetition, the result of the capture is overwritten. This is how regular expressions work.
In other words - see what's happening during the match:
Current position    Current part of regex        Capturing group 1
------------------------------------------------------------------
a1a                 no match, advancing...       undefined
abc                 abc                          undefined
a2a                 (?:a(\d)a)                   2
a3a                 (?:a(\d)a) (repeated)        3 (overwrites 2)
def                 def                          3
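The overwriting described above is easy to reproduce, and a two-step alternative avoids it: capture the whole repeated run once, then collect the digits from that capture. This is a sketch of that alternative and not the single-call solution the question asks for; the digitsBetween() name is made up for illustration.

```php
<?php
$subject = 'a1aabca2aa3adefa4a';

// The repeated group keeps only its last capture.
preg_match('/abc(?:a(\d)a)+def/', $subject, $overwritten);
var_dump($overwritten[1]); // string(1) "3"

// Hypothetical helper: two steps instead of one repeated capture.
function digitsBetween(string $subject): array
{
    // Step 1: capture the entire run between abc and def as one group.
    if (!preg_match('/abc((?:a\da)+)def/', $subject, $m)) {
        return [];
    }

    // Step 2: collect every digit inside that run.
    preg_match_all('/\d/', $m[1], $digits);

    return $digits[0];
}

print_r(digitsBetween($subject));
```

Unlike the lookahead version, this does not depend on abc and def appearing only once after each digit, but it does cost a second regex call.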

You ask if it is possible with a single preg_match_all.
Indeed it is.
This code outputs exactly what you want.
<?php
$subject='a1aabca2aa3adefa4a';
$pattern='/abc(?:a(\d)a+(\d)a)def/m';
preg_match_all($pattern, $subject, $all_matches,PREG_OFFSET_CAPTURE | PREG_PATTERN_ORDER);
$res[0]=$all_matches[0][0][0];
$res[1]=$all_matches[1][0][0];
$res[2]=$all_matches[2][0][0];
var_dump($res);
?>
Here is the output:
array
0 => string 'abca2aa3adef' (length=12)
1 => string '2' (length=1)
2 => string '3' (length=1)
