RegEx Named Capturing Groups in PHP


I have the following regex to capture a list of numbers (it will be more complex than this eventually):
$list = '10,9,8,7,6,5,4,3,2,1';
$regex =
<<<REGEX
/(?x)
(?(DEFINE)
(?<number> (\d+) )
(?<list> (?&number)(,(?&number))* )
)
^(?&list)/
REGEX;
$matches = array();
if (preg_match($regex,$list,$matches)==1) {
print_r($matches);
}
Which outputs:
Array ( [0] => 10,9,8,7,6,5,4,3,2,1 )
How do I capture the individual numbers in the list in the $matches array? I don't seem to be able to do it, despite putting a capturing group around the digits (\d+).
EDIT
Just to make it clearer, I want to eventually use recursion, so explode is not ideal:
$match =
<<<REGEX
/(?x)
(?(DEFINE)
(?<number> (\d+) )
(?<member> (?&number)|(?&list) )
(?<list> \( ((?&number)|(?&member))(,(?&member))* \) )
)
^(?&list)/
REGEX;

The purpose of a (?(DEFINE)...) section is only to define named sub-patterns that you can use later, either in the DEFINE section itself or in the main pattern. Since these sub-patterns are not part of the main pattern, they don't capture anything, and a reference such as (?&number) is only a kind of alias for the sub-pattern \d+; it doesn't capture anything either.
Example with the string: 1abcde2
If I use this pattern: /^(?<num>\d).....(?&num)$/ only 1 is captured in the group num; (?&num) doesn't capture anything, it's only an alias for \d. The pattern /^(?<num>\d).....\d$/ produces exactly the same result.
Another point to clarify: with PCRE (the PHP regex engine), a capture group (named or not) can only store one value, even if you repeat it.
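Both points can be checked with a short script. This is only a sketch: the second pattern and the variable names $alias and $repeated are made up for the illustration and are not part of the question.

```php
<?php
// (?&num) re-uses the sub-pattern \d but stores nothing of its own:
// only the first digit ends up in the "num" group.
preg_match('/^(?<num>\d).....(?&num)$/', '1abcde2', $alias);
print_r($alias);

// A repeated capture group keeps a single value: the one matched
// by its final iteration (here the last number of the list).
preg_match('/^(?:(\d+),?)+$/', '10,9,8', $repeated);
print_r($repeated);
```

In the first result the trailing 2 appears nowhere except in the full match; in the second, group 1 holds only the last number.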
The main problem of your approach is that you are trying to do two things at the same time:
you want to check the format of the string.
you want to extract an unknown number of items.
Doing this is only possible in particular situations, but impossible in general.
For example, with a flat list like: $list = '10,9,8,7,6,5,4,3,2,1'; where there are no nested elements, you can use a function like preg_match_all to reuse the same pattern several times in this way:
if (preg_match_all('~\G(\d+)(,|$)~', $list, $matches) && !end($matches[2])) {
// \G ensures that results are contiguous
// you have all the items in $matches[1]
// if the last item of $matches[2] is empty, this means
// that the end of the string is reached and the string
// format is correct
echo '<°)))))))>';
}
Now if you have a nested list like $list = '10,9,(8,(7,6),5),4,(3,2),1'; and you want for example to check the format and to produce a tree structure like:
[ 10, 9, [ 8, [ 7, 6 ], 5 ], 4 , [ 3, 2 ], 1 ]
You can't do it in a single pass. You need one pattern to check the format of the whole string and another pattern to extract the elements (plus a recursive function to apply it).
<<<FORGET_THIS_IMMEDIATELY
As an aside you can do it with eval and strtr, but it's a very dirty and dangerous way:
eval('$result=[' . strtr($list, '()', '[]') . '];');
FORGET_THIS_IMMEDIATELY;
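Leaving eval aside, the two-pass approach described above (one pattern to validate, another to extract, plus a recursive function) might look like the following. This is a minimal sketch: the names LIST_FORMAT, parseList and listToTree are hypothetical, and it assumes items are plain non-negative integers and that empty groups such as () are invalid.

```php
<?php
// Pass 1: a recursive pattern that only validates the format.
const LIST_FORMAT = '~
    (?(DEFINE)
        (?<item> \d+ | \( (?&list) \) )
        (?<list> (?&item) (?: , (?&item) )* )
    )
    ^ (?&list) $
~x';

// Pass 2 helper: walks the token array and builds the nested arrays.
// $i is passed by reference so nested calls advance the shared cursor.
function parseList(array $tokens, int &$i): array
{
    $out = [];
    for ($n = count($tokens); $i < $n; $i++) {
        $token = $tokens[$i];
        if ($token === '(') {
            $i++;                             // step past "("
            $out[] = parseList($tokens, $i);  // returns with $i on the matching ")"
        } elseif ($token === ')') {
            return $out;                      // the caller's loop increment skips ")"
        } elseif ($token !== ',') {
            $out[] = (int) $token;
        }
    }
    return $out;
}

// Returns the tree, or null when the string is not a well-formed list.
function listToTree(string $list): ?array
{
    if (!preg_match(LIST_FORMAT, $list)) {
        return null;
    }
    preg_match_all('~\d+|[(),]~', $list, $m); // tokens: numbers, "(", ")", ","
    $i = 0;
    return parseList($m[0], $i);
}

print_r(listToTree('10,9,(8,(7,6),5),4,(3,2),1'));
```

Because the format is checked first, the extraction step can use a very simple tokenizer and does not need any error handling of its own.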

If you mean to get an array of the comma-delimited numbers, then use explode:
$numbers = explode(',', $matches[0]); // first argument is the delimiter to split on, second is the input string
print_r($numbers);
Output:
Array
(
[0] => 10
[1] => 9
[2] => 8
...
)

For this simple list, this would be enough (if you have to use a regular expression):
$string = '10,9,8,7,6,5,4,3,2,1';
$pattern = '/([\d]+),?/';
preg_match_all($pattern, $string, $matches);
print_r($matches[1]);

Related

Trying to create a regex in PHP that matches patterns inside a pattern

I have seen some regex examples where the string is "Test string: Group1Group2", and using preg_match_all(), matching for patterns of text that exists inside the tags.
However, what I am trying to do is a bit different, where my string is something like this:
"some t3xt../s8fo=123,sij(variable1=123,variable2=743,variable3=535)"
What I want to do is match the sections such as 'variable=123' that exist inside the parenthesis.
What I have so far is this:
if( preg_match_all("/\(([^\)]*?)\)/", $string_value, $matches) )
{
print_r( $matches[1] );
}
But this just captures everything that's inside the parenthesis, and doesn't match anything else.
Edit:
The desired output would be:
"variable1=123"
"variable2=743"
"variable3=535"
The output that I am getting is:
"variable1=123,variable2=743,variable3=535"

You can extract the matches you need with a single call to preg_match_all if the matches do not contain (, ) or ,:
$s = '"some t3xt../s8fo=123,sij(variable1=123,variable2=743,variable3=535)"';
if (preg_match_all('~(?:\G(?!\A),|\()\K[^,]+(?=[^()]*\))~', $s, $matches)) {
print_r($matches[0]);
}
Details:
(?:\G(?!\A),|\() - either end of the preceding successful match and a comma, or a ( char
\K - match reset operator that discards all text matched so far from the current overall match memory buffer
[^,]+ - one or more chars other than a comma (use [^,]* if you expect empty matches, too)
(?=[^()]*\)) - a positive lookahead that requires zero or more chars other than ( and ) and then a ) immediately to the right of the current location.

I would do this:
preg_match("/\(([^\)]+)\)/", $string_value, $matches);
$result = explode(",", $matches[1]);
If your end result is an array of key => value then you can transform it into a query string:
preg_match("/\(([^\)]+)\)/", $string_value, $matches);
parse_str(str_replace(',', '&', $matches[1]), $result);
Which yields:
Array
(
[variable1] => 123
[variable2] => 743
[variable3] => 535
)
Or replace with a newline \n and use parse_ini_string().
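The parse_ini_string() variant could be sketched as follows. The sample string is the one from the question. The fallback to an empty array when nothing matches, or when parsing fails, is an assumption added for the example.

```php
<?php
$string_value = 'some t3xt../s8fo=123,sij(variable1=123,variable2=743,variable3=535)';

$result = [];
if (preg_match('/\(([^)]+)\)/', $string_value, $matches)) {
    // Turn "a=1,b=2" into one "key=value" pair per line, then parse it as INI.
    $parsed = parse_ini_string(str_replace(',', "\n", $matches[1]));
    if ($parsed !== false) {
        $result = $parsed;
    }
}
print_r($result);
```

Like the parse_str() version, this only works when the values themselves contain no commas.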

Matching whole words between commas, or a comma at the beginning, or a comma at the end with Regex

I have a string like this:
page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags
I made this regex that I expect to get the whole tags with:
(?<=\,)(rss-latest|listing-latest-no-category|category-128|page-9000)(?=\,)
I want it to match all the occurrences.
In this case:
page-9000 and rss-latest.
This regex checks whole words between commas just fine but it ignores the first and the last because it's not between commas (obviously).
I've also tried that it checks if it's between commas OR one comma at the beginning OR one comma to the end, however it would give me false positives, as it would match:
category-128
while the string contains:
page-category-128
Any help?

Try using the following pattern:
(?<=,|^)(rss-latest|listing-latest-no-category|category-128|page-9000)(?=,|$)
The only change I have made is to add boundary markers ^ and $ to the lookarounds to also match on the start and end of the input.
Script:
$input = "page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags";
preg_match_all("/(?<=,|^)(rss-latest|listing-latest-no-category|category-128|page-9000)(?=,|$)/", $input, $matches);
print_r($matches[1]);
This prints:
Array
(
[0] => page-9000
[1] => rss-latest
)

Here is a non-regex way using explode and array_intersect:
$arr1 = explode(',', 'page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags');
$arr2 = explode('|', 'rss-latest|listing-latest-no-category|category-128|page-9000');
print_r(array_intersect($arr1, $arr2));
Output:
Array
(
[0] => page-9000
[6] => rss-latest
)

The (?<=\,) and (?=\,) lookarounds require a , on both sides of the match. You also want to match at the start and end of the string, so you need either to say explicitly "a comma or the start/end of the string", or to use double-negating logic: negated character classes inside negative lookarounds.
You may use
(?<![^,])(?:rss-latest|listing-latest-no-category|category-128|page-9000)(?![^,])
Here, (?<![^,]) matches the start of string position or a , and (?![^,]) matches the end of string position or ,.
Now, you do not even need a capturing group, you may get rid of its overhead using a non-capturing group, (?:...). preg_match_all won't have to allocate memory for the submatches and the resulting array will be much cleaner.
PHP demo:
$re = '/(?<![^,])(?:rss-latest|listing-latest-no-category|category-128|page-9000)(?![^,])/m';
$str = 'page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags';
if (preg_match_all($re, $str, $matches)) {
print_r($matches[0]);
}
// => Array ( [0] => page-9000 [1] => rss-latest )
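If the list of wanted tags is not fixed, the same pattern can be built at run time. The helper below is a hypothetical sketch (findTags is not from the answer); it assumes the tags arrive as an array and uses preg_quote() so that any regex metacharacters in a tag are matched literally.

```php
<?php
// Returns the tags from $wanted that occur as whole comma-separated items in $csv.
function findTags(string $csv, array $wanted): array
{
    if ($wanted === []) {
        return []; // an empty alternation would match empty strings
    }
    $alternation = implode('|', array_map(
        function ($tag) {
            return preg_quote($tag, '/');
        },
        $wanted
    ));
    // Same double-negating lookarounds as above: comma or string boundary on each side.
    $re = '/(?<![^,])(?:' . $alternation . ')(?![^,])/';
    preg_match_all($re, $csv, $m);
    return $m[0];
}

$str = 'page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags';
print_r(findTags($str, ['rss-latest', 'listing-latest-no-category', 'category-128', 'page-9000']));
```

Matches are returned in the order they appear in the input string, not in the order of the wanted list.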

How to separate string to number in single word with PHP?

I have the word AK747. I use a regex to detect whether a string (at least 2 chars, e.g. AK) is followed by a number (at least two digits, e.g. 747).
EDIT: (sorry that I wasn't clear on this, guys)
I need to do the above because:
In some cases I need to split the word so that a search matches AK-747. When I search for the string 'AK-747' with the keyword 'AK747', it won't find a match unless I use levenshtein in the database, so I prefer splitting AK747 into AK and 747.
My code:
$strNumMatch = preg_match('/^[a-zA-Z]{2,}[0-9]{2,}$/',
$value, $match);
if(isset($match[0]))
echo $match[0];
How do I split to array ['AK', '747'] for example with preg_split() or any other way?

$input = 'AK-747';
if (preg_match('/^([a-z]{2,})-?([0-9]{2,})$/i', $input, $result)) {
unset($result[0]);
}
print_r($result);
The output:
Array
(
[1] => AK
[2] => 747
)

You may try this:
preg_match('/[0-9]{2,}/', $value, $matches, PREG_OFFSET_CAPTURE);
$position = $matches[0][1];
$letters = substr($value, 0, $position);
$numbers = substr($value, $position);
This way you get the position of the first number and split there.
EDIT:
Starting from your original approach this could look somewhat like this:
$strNumMatch = preg_match('/^([a-zA-Z]{2,})([0-9]{2,})$/', $value, $match, PREG_OFFSET_CAPTURE);
if($strNumMatch){
$position = $match[2][1];
$letters = substr($value, 0, $position);
$numbers = substr($value, $position);
$alternative = $letters.'-'.$numbers;
}

preg_split() is a very sensible and direct call since you desire an indexed array containing the two substrings.
Code: (Demo)
$input = 'AK-747';
var_export(preg_split('/[a-z]{2,}\K-?/i',$input));
Output:
array (
0 => 'AK',
1 => '747',
)
The \K means "restart the fullstring match". Effectively, everything to the left of \K is retained as the first element in the result array, and everything to the right (the optional hyphen) is omitted because it is considered the delimiter.
Code: (Demo)
I process a small battery of inputs to show what can be done and explain after the snippet.
$inputs=['AK747','AK-747','AK-','AK']; // variations as I understand them
foreach($inputs as $input){
echo "$input returns: ";
var_export(preg_split('/[a-z]{2,}\K-?/i',$input,2,PREG_SPLIT_NO_EMPTY));
echo "\n";
}
Output:
AK747 returns: array (
0 => 'AK',
1 => '747',
)
AK-747 returns: array (
0 => 'AK',
1 => '747',
)
AK- returns: array (
0 => 'AK',
)
AK returns: array (
0 => 'AK',
)
preg_split() takes a pattern that matches a variable substring and uses each match as a delimiter. If - were present in every input string then explode('-',$input) would be most appropriate. However, - is optional in this task, so the pattern must allow - to be optional (this is what the ? quantifier does in all of the patterns on this page).
Now, you couldn't just use a pattern like /-?/, that would split the string on every character. To overcome this, you need to tell the regex engine the exact expected location for the optional -. You do this by referencing [a-z]{2,} before the -? (single intended delimiter).
The pattern /[a-z]{2,}-?/i does a fair job of finding the correct location for the optional hyphen, but now the trouble is, the leading letters in the string are included as part of the delimiting substring.
Sometimes, "lookarounds" can be used in regex patterns to match but not consume substrings. A "positive lookbehind" is used to match a preceding substring, however "variable length lookbehinds" are not permitted in php (and most other regex flavors). This is what the invalid pattern would look like: /(?<=[a-z]{2,})-?/i.
The way around this technicality is to "restart the fullstring match" using the \K token (aka a lookbehind alternative) just before the optional hyphen. To correctly target only the intended delimiter, the leading letters must be "matched/consumed" then "discarded" -- that's what \K does.
As for the inclusion of the 3rd and 4th parameter of preg_split()...
I've set the 3rd parameter to 2. This is just like the limit parameter that explode() has. It instructs the function to not make more than 2 output elements. For this case, I could have used NULL or -1 to mean "unlimited", but I could NOT leave the parameter empty -- it must be assigned to allow for the declaration of the 4th parameter.
I've set the 4th parameter to PREG_SPLIT_NO_EMPTY which instructs the function to not generate empty output elements.
Ta-Da!
p.s. a preg_match_all() solution is as easy as using a pipe and two anchors:
$inputs=['AK747','AK-747','AK-','AK']; // variations as I understand them
foreach($inputs as $input){
echo "$input returns: ";
var_export(preg_match_all('/^[a-z]{2,}|\d{2,}$/i',$input,$out)?$out[0]:[]);
echo "\n";
}
// same outputs as above
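To see what the 3rd and 4th parameters of preg_split() change, it may help to compare the same call with and without them on the input 'AK-'. This is a small sketch; the variable names are made up for the example.

```php
<?php
$pattern = '/[a-z]{2,}\K-?/i';

// Default limit and flags: the empty piece after the hyphen is kept.
$withoutFlag = preg_split($pattern, 'AK-');

// Limit of 2 plus PREG_SPLIT_NO_EMPTY: the empty piece is dropped.
$withFlag = preg_split($pattern, 'AK-', 2, PREG_SPLIT_NO_EMPTY);

var_export($withoutFlag);
echo "\n";
var_export($withFlag);
```

The first call returns 'AK' followed by an empty string, the second returns only 'AK', which is the behaviour shown in the outputs above.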

You can make the - optional with ?.
/([A-Za-z]{2,}-?[0-9]{2,})/
https://regex101.com/r/tIgM4F/1

Inject code after X paragraphs but avoiding tables

I would like to inject some code after X paragraphs, and this is pretty easy with PHP.
public function inject($text, $paragraph = 2) {
$exploded = explode("</p>", $text);
if (isset($exploded[$paragraph])) {
$exploded[$paragraph] = '
MYCODE
' . $exploded[$paragraph];
return implode("</p>", $exploded);
}
return $text;
}
But I don't want to inject my code inside a <table>, so how do I avoid this?
Thanks

I'm sometimes a bit crazy, sometimes I go for patterns that are lazy, but this time I'm going for something hazy.
$input = 'test <table><p>wuuut</p><table><p>lolwut</p></table></table> <p>foo bar</p> test1 <p>baz qux</p> test3'; # Some input
$insertAfter = 2; # Insert after N p tags
$code = 'CODE'; # The code we want to insert
$regex = <<<'regex'
~
# let's define something
(?(DEFINE)
(?P<table> # To match nested table tags
<table\b[^>]*>
(?:
(?!</?table\b[^>]*>).
|
(?&table)
)*
</table\s*>
)
(?P<paragraph> # To match nested p tags
<p\b[^>]*>
(?:
(?!</?p\b[^>]*>).
|
(?&paragraph)
)*
</p\s*>
)
)
(?&table)(*SKIP)(*FAIL) # Let's skip table tags
|
(?&paragraph) # And match p tags
~xsi
regex;
$output = preg_replace_callback($regex, function($m)use($insertAfter, $code){
static $counter = 0; # A counter
$counter++;
if($counter === $insertAfter){ # Should I explain?
return $m[0] . $code;
}else{
return $m[0];
}
}, $input);
var_dump($output); # Let's see what we've got

EDIT: It was late last night.
The PREG_SPLIT_DELIM_CAPTURE was neat but I am now adding a better idea (Method 1).
Also improved Method 2 to replace the strstr with a faster substr
METHOD 1: preg_replace_callback with (*SKIP)(*FAIL) (better)
Let's do a direct replace on the text that is certifiably table-free using a callback to your inject function.
Here's a regex to match table-free text:
$regex = "~(?si)(?!<table>).*?(?=<table|</table)|<table.*?</table>(*SKIP)(*FAIL)~";
In short, this either matches text that is a complete non-table or matches a complete table and fails.
Here's your replacement:
$injectedString = preg_replace_callback($regex,
function($m){return inject($m[0]);}, // $m[0] is the table-free chunk that was matched
$data);
Much shorter!
And here's a demo of $regex showing you how it matches elements that don't contain a table.
$text = "<table> to
</table>not a table # 1<table> to
</table>NOT A TABLE # 2<table> to
</table>";
$regex = "~(?si)(?!<table>).*?(?=<table|</table)|<table.*?</table>(*SKIP)(*FAIL)~";
$a = preg_match_all($regex,$text,$m);
print_r($m);
The output: Array ( [0] => Array ( [0] => not a table # 1 [1] => NOT A TABLE # 2 ) )
Of course, if the HTML is not well formed or $data starts in the middle of a table, all bets are off. If that's a problem, let me know and we can work on the regex.
METHOD 2
Here is the first solution that came to mind.
In short, I would look at using preg_split with the PREG_SPLIT_DELIM_CAPTURE flag.
The basic idea is to isolate the tables using a special preg_split, and to perform your injections on the elements that are certifiably table-free.
A. Step 1: split $data using an unusual delimiter: your delimiter will be a full table sequence: from <table to </table>
This is achieved with a delimiter specified by a regex pattern such as (?s)<table.*?</table>
Note that I am not closing <table in case you have a class there.
So you have something like
$tableseparator = preg_split( "~(?s)(<table.*?</table>)~", $data, -1, PREG_SPLIT_DELIM_CAPTURE );
The benefit of this PREG_SPLIT_DELIM_CAPTURE flag is that the whole delimiter, which we capture thanks to the parentheses in the regex pattern, becomes an element in the array, so that we can isolate the tables without losing them. [See demo of this at the bottom.] This way, your string is broken into clean "table-free" and "is-a-table" pieces.
B. Step 2: Iterate over the $tableseparator elements. For each element, do a
if(substr($tableseparator[$i],0,6)=="<table")
If <table is found, leave the element alone (don't inject). If it isn't found, that element is clean, and you can do your inject() magic on it.
C. Step 3: Put the elements of $tableseparator back together (implode just like you do in your inject function).
So you have a two-level explosion and implosion, first with preg_split, second with your explode!
Sorry that I don't have time to code everything in detail, but I'm certain that you can figure it out. :)
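For reference, the three steps could be put together roughly like this. It is a sketch under several assumptions: inject() is rewritten as a plain function with a hypothetical $code parameter, injectOutsideTables() is a made-up name, the injection is done only once (in the first table-free piece that has enough paragraphs), and paragraphs are counted per piece rather than across the whole document. As with the pattern above, nested tables are not handled.

```php
<?php
// The asker's inject(), as a standalone function with the code passed in.
function inject(string $text, int $paragraph = 2, string $code = 'MYCODE'): string
{
    $exploded = explode('</p>', $text);
    if (isset($exploded[$paragraph])) {
        $exploded[$paragraph] = $code . $exploded[$paragraph];
        return implode('</p>', $exploded);
    }
    return $text;
}

function injectOutsideTables(string $data, int $paragraph = 2): string
{
    // Step 1: split on whole tables, keeping the tables as array elements.
    $pieces = preg_split('~(?s)(<table.*?</table>)~', $data, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Step 2: inject only into pieces that are not tables.
    foreach ($pieces as $i => $piece) {
        if (substr($piece, 0, 6) === '<table') {
            continue;
        }
        $injected = inject($piece, $paragraph);
        if ($injected !== $piece) {
            $pieces[$i] = $injected;
            break; // inject once only
        }
    }

    // Step 3: put everything back together.
    return implode('', $pieces);
}

$data = '<p>a</p><table><tr><td><p>x</p><p>y</p><p>z</p></td></tr></table><p>b</p><p>c</p><p>d</p>';
echo injectOutsideTables($data);
```

With this input the paragraphs inside the table are left untouched and the code lands after the second paragraph that follows the table.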
preg_split with PREG_SPLIT_DELIM_CAPTURE demo
Here's a demo of how the preg_split works:
$text = "Hi#There##Oscar####";
$regex = "~(#+)~";
$a = preg_split($regex,$text,-1,PREG_SPLIT_DELIM_CAPTURE);
print_r($a);
The Output: Array ( [0] => Hi [1] => # [2] => There [3] => ## [4] => Oscar [5] => #### [6] => )
See how in this example the delimiters (the # sequences) are preserved? You have surgically isolated them but not lost them, so you can work on the other strings then put everything back together.

Regular expression to parse pipe-delimited data enclosed in double braces

I'm trying to match a string like this:
{{name|arg1|arg2|...|argX}}
with a regular expression
I'm using preg_match with
/{{(\w+)\|(\w+)(?:\|(.+))*}}/
but I get something like this, whenever I use more than two args
Array
(
[0] => {{name|arg1|arg2|arg3|arg4}}
[1] => name
[2] => arg1
[3] => arg2|arg3|arg4
)
The first two items cannot contain spaces, the rest can.
Perhaps I'm working too long on this, but I can't find the error - any help would be greatly appreciated.
Thanks Jan

Don't use regular expressions for this kind of simple task. What you really need is:
$inner = substr($string, 2, -2);
$parts = explode('|', $inner);
# And if you want to make sure the string has opening/closing braces, check the original string:
$length = strlen($string);
assert($string[0] === '{');
assert($string[1] === '{');
assert($string[$length - 1] === '}');
assert($string[$length - 2] === '}');

The problem is here: \|(.+)
Regular expressions, by default, match as many characters as possible. Since . is any character, other instances of | are happily matched too, which is not what you would like.
To prevent this, you should exclude | from the expression, saying "match anything except |", resulting in \|([^\|]+).

Should work for anywhere from 1 to N arguments
<?php
$pattern = "/^\{\{([a-z]+)(?:\}\}$|(?:\|([a-z]+))(?:\|([a-z ]+))*\}\}$)/i";
$tests = array(
"{{name}}" // should pass
, "{{name|argOne}}" // should pass
, "{{name|argOne|arg Two}}" // should pass
, "{{name|argOne|arg Two|arg Three}}" // should pass
, "{{na me}}" // should fail
, "{{name|arg One}}" // should fail
, "{{name|arg One|arg Two}}" // should fail
, "{{name|argOne|arg Two|arg3}}" // should fail
);
foreach ( $tests as $test )
{
if ( preg_match( $pattern, $test, $matches ) )
{
echo $test, ': Matched!<pre>', print_r( $matches, 1 ), '</pre>';
} else {
echo $test, ': Did not match =(<br>';
}
}

Of course you would get something like this :) There is no way for a regular expression to return a dynamic number of captures (in your case, the arguments).
Looking at what you want to do, you should keep the current regular expression and just explode the extra args by '|' and add them to an args array.
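A sketch of that idea: keep a pattern close to the original, capture everything after the second item in one group, and explode it afterwards. The function name parseTemplate and the shape of the returned array are assumptions made for the example. The pattern is anchored and the last group is made optional rather than repeated.

```php
<?php
// Returns ['name' => ..., 'args' => [...]] or null if the string does not match.
function parseTemplate(string $input): ?array
{
    // Groups 1 and 2 use \w+ so the first two items cannot contain spaces;
    // group 3 swallows the rest, pipes included.
    if (!preg_match('/^\{\{(\w+)\|(\w+)(?:\|(.+))?\}\}$/', $input, $m)) {
        return null;
    }
    $args = [$m[2]];
    if (isset($m[3])) {
        $args = array_merge($args, explode('|', $m[3]));
    }
    return ['name' => $m[1], 'args' => $args];
}

print_r(parseTemplate('{{name|arg1|arg2|arg3|arg4}}'));
```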

Indeed, this is from the PCRE manual:
When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration. For example, after (tweedle[dume]{3}\s*)+ has matched "tweedledum tweedledee" the value of the captured substring is "tweedledee". However, if there are nested capturing subpatterns, the corresponding captured values may have been set in previous iterations. For example, after /(a|(b))+/ matches "aba" the value of the second captured substring is "b".
