Capture multiple repetitive group in regex - php

I'm using /{(\w+)\s+((\w+="\w+")\s*)+/ pattern to capture all attributes.
The problem is that it matches the input but doesn't capture the attributes one by one; it only captures the last attribute.
[person name="Jackson" family="Smith"]
or
[car brand="Benz" type="SUV"]
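For reference, the behaviour described above can be reproduced with a minimal sketch. It assumes the leading { in the pattern was meant to be [ (the sample inputs use square brackets). PCRE overwrites a repeated capture group on every iteration, so only the last attribute is left in the group.

```php
<?php
// Assumption: the pattern's leading "{" is swapped for "\[" to fit the sample input.
$input = '[person name="Jackson" family="Smith"]';
preg_match('/\[(\w+)\s+((\w+="\w+")\s*)+/', $input, $m);

// Group 1 holds the tag name; group 3 sits inside the repeated group,
// so it only keeps what the final iteration matched.
var_export($m[1]);
echo "\n";
var_export($m[3]);
```

This is why the answers below either continue from the previous match with \G or run a second expression over each tag.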

The \G (continue) metacharacter is the hero to call upon here.
Code: (PHP Demo) (Regex101 Demo)
$tag = '[person name="Jackson" family="Smith"]';
var_export(preg_match_all('~(?:\G|\[\w+) (\w+)="(\w+)"~', $tag, $out) ? array_combine($out[1], $out[2]) : []);
Output:
array (
'name' => 'Jackson',
'family' => 'Smith',
)
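A small sketch of why the later patterns in this answer write \G(?!^) rather than a bare \G. The input string is hypothetical: it puts an attribute-like fragment at the very start of the text, outside any tag. A bare \G also succeeds at offset 0, so that stray pair gets picked up.

```php
<?php
// Hypothetical input: an attribute-like fragment at offset 0, outside any tag.
$text = ' stray="value" [person name="Jackson"]';

// Bare \G: it also succeeds at the start of the subject, so the stray pair matches.
preg_match_all('~(?:\G|\[\w+) (\w+)="(\w+)"~', $text, $loose);

// \G(?!^): the continue anchor is only allowed after a previous match.
preg_match_all('~(?:\G(?!^)|\[\w+) (\w+)="(\w+)"~', $text, $strict);

var_export($loose[1]);
echo "\n";
var_export($strict[1]);
```

With the single-tag input used above the difference never shows, because that string starts with [ and not with a space.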
If you need to pool the attributes and values with the tag name, only one loop is necessary for this too.
Code: (Demo)
$text = 'some text [person name="Jackson" family="Smith"] text [vehicle brand="Benz" type="SUV" doors="4" seats="7"]';
foreach (preg_match_all('~(?:\G(?!^)|\[(\w+)) (\w+)="(\w+)"~', $text, $out, PREG_SET_ORDER) ? $out : [] as $matches) {
if ($matches[1]) {
$tag = $matches[1]; // cache the tag name for reuse with subsequent attr/val pairs
}
$result[$tag][$matches[2]] = $matches[3];
}
var_export($result);
Output:
array (
'person' =>
array (
'name' => 'Jackson',
'family' => 'Smith',
),
'vehicle' =>
array (
'brand' => 'Benz',
'type' => 'SUV',
'doors' => '4',
'seats' => '7',
),
)
Due to the concerns of @Thefourthbird and @Jan, I have included a lookahead to match the closing square bracket. I have also built in accommodation for the possibility of zero attributes in the tag. If given more time (sorry, don't have more), I could probably refine the following snippet to be slightly cleaner, but I believe I am accurately validating and extracting.
Code: (Demo)
$text = 'some text [person name="Jackson" family="Smith"] text [vehicle brand="Benz" type="SUV" doors="4" seats="7"] and [invalid closed="false" monkeywrench [lonetag] text [single gender="female"]';
foreach (preg_match_all('~\[(\w+)(?=(?: \w+="\w+")*])(]?)|(?:\G(?!^) (\w+)="(\w+)")~', $text, $out, PREG_SET_ORDER) ? $out : [] as $matches) {
if ($matches[2]) {
$result[$matches[1]] = [];
} elseif (!isset($matches[3])) {
$tag = $matches[1];
} else {
$result[$tag][$matches[3]] = $matches[4];
}
}
var_export($result);
Output:
array (
'person' =>
array (
'name' => 'Jackson',
'family' => 'Smith',
),
'vehicle' =>
array (
'brand' => 'Benz',
'type' => 'SUV',
'doors' => '4',
'seats' => '7',
),
'lonetag' =>
array (
),
'single' =>
array (
'gender' => 'female',
),
)

You can try \[\S+ ((?:[^"]+"){2}) ((?:[^"]+"){2})\]
Explanation:
\[ - match [ literally
\S+ - match one or more non-whitespace characters
(?:...) - non-capturing group
[^"]+" - match one or more characters other than ", followed by a "; {2} repeats this pattern twice
\] - match ] literally
The first capturing group will hold your first attribute, and the second will hold the second attribute.

Better use two expressions (or a parser altogether) instead. Consider the following:
<?php
$junk = <<<END
lorem ipsum lorem ipsum
[person name="Jackson" family="Smith"]
lorem ipsum
[car brand="Benz" type="SUV"]
lorem ipsum lorem ipsum
END;
$tag = "~\[(?P<tag>\w+)[^][]*\]~";
$key_values = '~(?P<key>\w+)="(?P<value>[^"]*)"~';
preg_match_all($tag, $junk, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
echo "Name: {$match["tag"]}\n";
preg_match_all($key_values, $match[0], $attributes, PREG_SET_ORDER);
print_r($attributes);
}
?>
Here we have
\[(?P<tag>\w+)[^][]*\]
for likely tags and
(?P<key>\w+)="(?P<value>[^"]*)"
for key/value pairs. The rest is a foreach loop.
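If an associative array is wanted instead of the printed match sets, the same two expressions can be combined roughly like this. It is only a sketch: parse_tags() is a hypothetical helper name, and it assumes tag names are unique (a repeated tag name would overwrite the earlier entry).

```php
<?php
// Hypothetical helper (not part of the answer): returns [tag => [key => value, ...], ...].
function parse_tags(string $text): array
{
    $tagPattern  = '~\[(?P<tag>\w+)[^][]*\]~';
    $pairPattern = '~(?P<key>\w+)="(?P<value>[^"]*)"~';

    $result = [];
    preg_match_all($tagPattern, $text, $tags, PREG_SET_ORDER);
    foreach ($tags as $tag) {
        // Second pass: only look for key/value pairs inside the matched tag.
        preg_match_all($pairPattern, $tag[0], $pairs, PREG_SET_ORDER);
        // array_column() uses each 'key' capture as index and 'value' as element.
        $result[$tag['tag']] = array_column($pairs, 'value', 'key');
    }
    return $result;
}

$parsed = parse_tags('lorem [person name="Jackson" family="Smith"] ipsum [car brand="Benz" type="SUV"]');
var_export($parsed);
```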

Related

PHP regular expression, match the last occurence

I have a php function that splits product names from their color name in woocommerce.
The full string is generally of this form "product name - product color", like for example:
"Boxer Welbar - ligth grey" splits into "Boxer Welbar" and "light grey"
"Longjohn Gari - marine stripe" splits into "Longjohn Gari" and "marine stripe"
But in some cases it can be "Tee-shirt - product color"...and in this case the split doesn't work as I want, because the "-" in Tee-shirt is detected.
How to circumvent this problem? Should I use a "lookahead" statement in the regexp?
function product_name_split($prod_name) {
$currenttitle = strip_tags($prod_name);
$splitted = preg_split("/–|[\p{Pd}\xAD]|(–)/", $currenttitle);
return $splitted;
}
I'd go for a negative lookahead.
Something like this:
-(?!.*-)
This searches for a - that is not followed by any other -.
It only works if the color name never contains a -.
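A quick sketch of that lookahead inside preg_split(), including the case where it goes wrong. The \s* on both sides is my addition so that the surrounding spaces are swallowed by the split, and the second input is a made-up product name with a hyphenated color.

```php
<?php
// Assumption: \s* is added around the hyphen so the parts come back without padding.
$pattern = '/\s*-(?!.*-)\s*/';

// The hyphen in "Tee-shirt" is followed by another hyphen, so it is skipped.
$ok = preg_split($pattern, 'Tee-shirt - product color');

// Hypothetical input: here the LAST hyphen is inside the color name.
$broken = preg_split($pattern, 'Boxer - blue-grey');

var_export($ok);
echo "\n";
var_export($broken);
```

The second call splits inside "blue-grey", which is the limitation described above.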
What about counting space characters that surround a dash?
For example:
function product_name_split($prod_name) {
$currenttitle = strip_tags($prod_name);
$splitted = preg_split("/\s(–|[\p{Pd}\xAD]|(–))\s/", $currenttitle);
return $splitted;
}
This automatically trims spaces from split parts as well.
If you have " - " as the delimiter (note the spaces around the dash), you may simply use explode(...). If not, use
\s*-(?=[^-]+$)\s*
or
\w+-\w+(*SKIP)(*FAIL)|-
with preg_split(), see the demos on regex101.com (#2)
In PHP this could be:
<?php
$strings = ["Tee-shirt - product color", "Boxer Welbar - ligth grey", "Longjohn Gari - marine stripe"];
foreach ($strings as $string) {
print_r(explode(" - ", $string));
}
foreach ($strings as $string) {
print_r(preg_split("~\s*-(?=[^-]+$)\s*~", $string));
}
?>
Both approaches will yield
Array
(
[0] => Tee-shirt
[1] => product color
)
Array
(
[0] => Boxer Welbar
[1] => ligth grey
)
Array
(
[0] => Longjohn Gari
[1] => marine stripe
)
To collect the split items, use array_map(...):
$splitted = array_map( function($item) {return preg_split("~\s*-(?=[^-]+$)\s*~", $item); }, $strings);
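The second pattern mentioned above, \w+-\w+(*SKIP)(*FAIL)|-, is not exercised in the PHP code, so here is a rough sketch of it. The first alternative matches a hyphenated word and then deliberately throws it away, so only hyphens that are not glued between word characters are left as delimiters. The trim() pass is my addition, because this pattern does not consume the spaces around the dash.

```php
<?php
$strings = ['Tee-shirt - product color', 'Boxer Welbar - ligth grey'];

// "word-word" is matched, then (*SKIP)(*FAIL) discards it and resumes after it;
// any other hyphen falls through to the second alternative and becomes the delimiter.
$pattern = '~\w+-\w+(*SKIP)(*FAIL)|-~';

$split = array_map(
    // trim() is an addition: the pattern leaves the surrounding spaces in place.
    fn(string $s): array => array_map('trim', preg_split($pattern, $s)),
    $strings
);

var_export($split);
```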
Your sample inputs convey that the neighboring whitespace around the delimiting hyphen/dash is just as critical as the hyphen/dash itself.
I recommend doing all of the html and special entity decoding before executing your regex -- that's what these other functions are built for and it will make your regex pattern much simpler to read and maintain.
\p{Pd} will match any hyphen/dash. Reinforce the business logic in the code by declaring a maximum of 2 elements to be generated by the split.
As a general rule, I discourage declaring single-use variables.
Code: (Demo)
function product_name_split($prod_name) {
return preg_split(
"/ \p{Pd} /u",
strip_tags(
html_entity_decode(
$prod_name
)
),
2
);
}
$tests = [
'Tee-shirt - product color',
'Boxer Welbar - ligth grey',
'Longjohn Gari - marine stripe',
'En dash – green',
'Entity &ndash; blue',
];
foreach ($tests as $test) {
echo var_export(product_name_split($test), true) . "\n";
}
Output:
array (
0 => 'Tee-shirt',
1 => 'product color',
)
array (
0 => 'Boxer Welbar',
1 => 'ligth grey',
)
array (
0 => 'Longjohn Gari',
1 => 'marine stripe',
)
array (
0 => 'En dash',
1 => 'green',
)
array (
0 => 'Entity',
1 => 'blue',
)
As usual, there are several options for this; here is one of them:
explode — Split a string by a string
end — Set the internal pointer of an array to its last element
$currenttitle = 'Tee-shirt - product color';
$array = explode( '-', $currenttitle );
echo end( $array );
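Note that end() on its own returns the color with a leading space, and the product name is lost once the string is split on every hyphen. A minimal sketch that trims the color and recovers the name with strrpos(), assuming the title always contains at least one hyphen:

```php
<?php
$currenttitle = 'Tee-shirt - product color';

$parts = explode('-', $currenttitle);
// end() returns the last piece including its leading space, hence the trim().
$color = trim(end($parts));

// Assumption: a hyphen is always present, so strrpos() never returns false here.
$name = trim(substr($currenttitle, 0, strrpos($currenttitle, '-')));

echo $name, ' / ', $color, "\n";
```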

PHP - preg_replace on an array not working as expected

I have a PHP array that looks like this..
Array
(
[0] => post: 746
[1] => post: 2
[2] => post: 84
)
I am trying to remove the post: from each item in the array and return one that looks like this...
Array
(
[0] => 746
[1] => 2
[2] => 84
)
I have attempted to use preg_replace like this...
$array = preg_replace('/^post: *([0-9]+)/', $array );
print_r($array);
But this is not working for me, how should I be doing this?
You've missed the second argument of preg_replace, which specifies what should replace the match. Your regex also has a small problem; here is the fixed version:
preg_replace('/^post:\s*([0-9]+)$/', '$1', $array );
Demo: https://3v4l.org/64fO6
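As a sketch of how that call behaves on a whole array: preg_replace() accepts an array as the subject and returns an array, and elements the anchored pattern does not match come back unchanged. The 'page: 9' element is a made-up extra to show that.

```php
<?php
// 'page: 9' is a hypothetical extra element that the pattern should not touch.
$array = ['post: 746', 'post: 2', 'post: 84', 'page: 9'];

// '$1' puts back only the captured digits; non-matching elements pass through as-is.
$clean = preg_replace('/^post:\s*([0-9]+)$/', '$1', $array);

print_r($clean);
```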
You don't have a replacement argument, not even an empty string as a placeholder.
mixed preg_replace ( mixed $pattern , mixed $replacement , mixed $subject)
is the signature you need (there are other arguments, but they are optional).
$array = preg_replace('/post: /', '', $array );
Should do it.
<?php
$array=array("post: 746",
"post: 2",
"post: 84");
$array = preg_replace('/^post: /', '', $array );
print_r($array);
?>
Array
(
[0] => 746
[1] => 2
[2] => 84
)
You could do this without using a regex using array_map and substr to check the prefix and return the string without the prefix:
$items = [
"post: 674",
"post: 2",
"post: 84",
];
$result = array_map(function($x){
$prefix = "post: ";
if (substr($x, 0, strlen($prefix)) == $prefix) {
return substr($x, strlen($prefix));
}
return $x;
}, $items);
print_r($result);
Result:
Array
(
[0] => 674
[1] => 2
[2] => 84
)
There are many ways to do this that don't involve regular expressions, which are really not needed for breaking up a simple string like this.
For example:
<?php
$input = Array( 'post: 746', 'post: 2', 'post: 84');
$output = array_map(function ($n) {
$o = explode(': ', $n);
return (int)$o[1];
}, $input);
var_dump($output);
And here's another one that is probably even faster:
<?php
$input = Array( 'post: 746', 'post: 2', 'post: 84');
$output = array_map(function ($n) {
return (int)substr($n, strpos($n, ':')+1);
}, $input);
var_dump($output);
If you don't need integers in the output just remove the cast to int.
Or just use str_replace, which in many cases like this is a drop in replacement for preg_replace.
<?php
$input = Array( 'post: 746', 'post: 2', 'post: 84');
$output = str_replace('post: ', '', $input);
var_dump($output);
You can use array_map() to iterate the array, then strip out any non-digit characters via filter_var() with FILTER_SANITIZE_NUMBER_INT or trim() with a "character mask" containing the six undesired characters.
You can also let preg_replace() do the iterating for you. Using preg_replace() offers the most brief syntax, but regular expressions are often slower than non-preg_ techniques and it may be overkill for your seemingly simple task.
Codes: (Demo)
$array = ["post: 746", "post: 2", "post: 84"];
// remove all non-integer characters
var_export(array_map(function($v){return filter_var($v, FILTER_SANITIZE_NUMBER_INT);}, $array));
// only necessary if you have elements with non-"post: " AND non-integer substrings
var_export(preg_replace('~^post: ~', '', $array));
// I shuffled the character mask to prove order doesn't matter
var_export(array_map(function($v){return trim($v, ': opst');}, $array));
Output: (from each technique is the same)
array (
0 => '746',
1 => '2',
2 => '84',
)
p.s. If anyone is going to entertain the idea of using explode() to create an array from each element and then store the second element as the new desired string (and I wouldn't go to such trouble), be sure to:
split on ": " (colon, space) or even "post: " (post, colon, space), because splitting on ":" (colon only) forces you to tidy up the second element's leading space, and
use explode()'s 3rd parameter (limit) and set it to 2 because, logically, you don't need more than two elements.
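For completeness, a sketch of that explode() variant with both precautions applied: the delimiter includes the space, and the limit is 2. The ?? fallback is my addition so that an element without the delimiter is returned untouched instead of raising a warning.

```php
<?php
$array = ['post: 746', 'post: 2', 'post: 84'];

$ids = array_map(
    // Limit 2 caps the split at two elements; "?? $v" keeps entries without ": " unchanged.
    fn(string $v): string => explode(': ', $v, 2)[1] ?? $v,
    $array
);

var_export($ids);
```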

Parsing parameters from command line with RegEx and PHP

I have this as an input to my command line interface as parameters to the executable:
-Parameter1=1234 -Parameter2=38518 -param3 "Test \"escaped\"" -param4 10 -param5 0 -param6 "TT" -param7 "Seven" -param8 "secret" "-SuperParam9=4857?--SuperParam10=123"
What I want is to get all of the parameters in a key-value / associative array with PHP like this:
$result = [
'Parameter1' => '1234',
'Parameter2' => '38518',
'param3' => 'Test \"escaped\"',
'param4' => '10',
'param5' => '0',
'param6' => 'TT',
'param7' => 'Seven',
'param8' => 'secret',
'SuperParam9' => '4857',
'SuperParam10' => '123',
];
The problem here lies at the following:
parameter's prefix can be - or --
parameter's glue (value assignment operator) can be either an = sign or a whitespace ' '
some parameters may be inside a quoted block and can have different separators, glues and prefixes, e.g. a ? mark as the separator.
So far, since I'm still learning RegEx, all I have is this:
/(-[a-zA-Z]+)/gui
With which I can get all the parameters starting with an -...
I can go to manually explode the entire thing and parse it manually, but there are way too many contingencies to think about.
You can try this pattern, which uses the branch reset feature (?|...|...) to deal with the different possible formats of the values:
$str = '-Parameter1=1234 -Parameter2=38518 -param3 "Test \"escaped\"" -param4 10 -param5 0 -param6 "TT" -param7 "Seven" -param8 "secret" "-SuperParam9=4857?--SuperParam10=123"';
$pattern = '~ --?(?<key> [^= ]+ ) [ =]
(?|
" (?<value> [^\\\\"]*+ (?s:\\\\.[^\\\\"]*)*+ ) "
|
([^ ?"]*)
)~x';
preg_match_all ($pattern, $str, $matches);
$result = array_combine($matches['key'], $matches['value']);
print_r($result);
In a branch reset group, the capture groups have the same number or the same name in each branch of the alternation.
This means that (?<value> [^\\\\"]*+ (?s:\\\\.[^\\\\"]*)*+ ) is (obviously) the value named capture, but that ([^ ?"]*) is also the value named capture.
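A stripped-down sketch of the branch reset behaviour on its own, with a made-up input and numbered groups instead of named ones. Both alternatives contain one capture group, and both write to group 1, so the result has no empty group 2 to deal with.

```php
<?php
// Hypothetical input: bare words mixed with one double-quoted phrase.
$input = 'foo "bar baz" qux';

// Inside (?|...), each alternative restarts group numbering at 1.
preg_match_all('~(?|"([^"]*)"|(\S+))~', $input, $m);

// $m only has indexes 0 (full matches) and 1 (the shared capture).
var_export($m[1]);
```

Without the branch reset, the quoted value would land in group 1 and the bare value in group 2, and the two columns would have to be merged afterwards.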
You could use
--?
(?P<key>\w+)
(?|
=(?P<value>[^-\s?"]+)
|
\h+"(?P<value>.*?)(?<!\\)"
|
\h+(?P<value>\H+)
)
See a demo on regex101.com.
Which in PHP would be:
<?php
$data = <<<DATA
-Parameter1=1234 -Parameter2=38518 -param3 "Test \"escaped\"" -param4 10 -param5 0 -param6 "TT" -param7 "Seven" -param8 "secret" "-SuperParam9=4857?--SuperParam10=123"
DATA;
$regex = '~
--?
(?P<key>\w+)
(?|
=(?P<value>[^-\s?"]+)
|
\h+"(?P<value>.*?)(?<!\\\\)"
|
\h+(?P<value>\H+)
)~x';
if (preg_match_all($regex, $data, $matches)) {
$result = array_combine($matches['key'], $matches['value']);
print_r($result);
}
?>
This yields
Array
(
[Parameter1] => 1234
[Parameter2] => 38518
[param3] => Test \"escaped\"
[param4] => 10
[param5] => 0
[param6] => TT
[param7] => Seven
[param8] => secret
[SuperParam9] => 4857
[SuperParam10] => 123
)

How to do a Regular Expression exclude comma or periods

I'm having issues with the line of PHP. I need to have it select all letters and numbers following the #, but ignore "," or "." (commas or periods). Currently it's including them and I can't seem to get them to exclude them.
Ex: #3431A or #4561AB (but ignore any , or . behind them)
preg_match_all( apply_filters( "wpht_regex_pattern", '/#(\S+)/u' ), strip_tags($content), $hashtags );
You can try "/#[0-9A-Za-z]+/", if you want to select hashtags only having letters and digits.
You may try "/#[^\s,\.]+/", if you want to grab hashtags starting with # and ending just before a whitespace (or tab), comma or period is encountered.
Below is sample PHP code and result:
$content="I need to have it select all letters and numbers following the #, but ignore ',' or '.' (commas or periods). Ex: #3431A D, #3431AB or #4561AB.";
echo "<h2>Regex-1:</h2>";
preg_match_all( "/#[0-9A-Za-z]+/", $content, $hashtags );
print_r($hashtags);
echo "<h2>Regex-2:</h2>";
preg_match_all( "/#[^\s,\.]+/", $content, $hashtags );
print_r($hashtags);
Result:
Regex-1:
Array ( [0] => Array ( [0] => #3431A [1] => #3431AB [2] => #4561AB ) )
Regex-2:
Array ( [0] => Array ( [0] => #3431A [1] => #3431AB [2] => #4561AB ) )
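The sample text above happens to give identical results for both patterns. A short sketch with a made-up string shows where they diverge: the first pattern stops at any character that is not a letter or digit, while the second only stops at whitespace, a comma or a period.

```php
<?php
// Hypothetical input containing "_" and "^" inside the hashtags.
$content = 'Tags: #frie_nds, #be^ach.';

// Letters and digits only: the match ends at "_" and "^".
preg_match_all('/#[0-9A-Za-z]+/', $content, $strict);

// Anything except whitespace, comma and period: "_" and "^" are kept.
preg_match_all('/#[^\s,.]+/', $content, $loose);

print_r($strict[0]);
print_r($loose[0]);
```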
You are matching \S+, which is one or more of any non-whitespace character. In your question, you said you wanted sequences of numbers and letters. To get only letters and numbers, you need a different pattern.
function testFilter($test) {
$content = $test['test'];
echo "Testing {$content}\n";
preg_match_all( apply_filters( "wpht_regex_pattern", '/#([A-Za-z0-9]+)/u' ), strip_tags($content), $hashtags );
$expect = $test['expect'];
echo " ";
if ( ! empty($expect) ) {
$tmp = implode(',', $hashtags[1]);
if ( $tmp != $expect ) echo "FAIL ";
else echo "PASS ";
}
else {
echo " ";
}
echo 'Hashtags: '. implode(',', $hashtags[1]);
echo PHP_EOL;
}
$contentTest = [
['test' => '#shoes, #friends, #beach', 'expect' => 'shoes,friends,beach'],
['test' => '#shoes, #friends6, #2beach', 'expect' => 'shoes,friends6,2beach'],
['test' => '#shoes, #frie_nds, #be^ach', 'expect' => 'shoes,frie,be'],
['test' => 'blah blah #shoes, #friends, #beach', 'expect' => 'shoes,friends,beach'],
['test' => '#shoes, #friends, #beach,', 'expect' => 'shoes,friends,beach'],
['test' => '#shoes, #friends, #beach,#', 'expect' => 'shoes,friends,beach'],
['test' => '#shoes, #friends, #beach som trailing text', 'expect' => 'shoes,friends,beach'],
['test' => '#3431A, #345ADF', 'expect' => '3431A,345ADF'],
['test' => 'The quick brown #fox gave the #99dogs codes #A00BZ90A #45678blah #0569509 #09XX09', 'expect' => 'fox,99dogs,A00BZ90A,45678blah,0569509,09XX09'],
];
foreach ($contentTest as $t) {
testFilter($t);
}
Output:
Testing #shoes, #friends, #beach
PASS Hashtags: shoes,friends,beach
Testing #shoes, #friends6, #2beach
PASS Hashtags: shoes,friends6,2beach
Testing #shoes, #frie_nds, #be^ach
PASS Hashtags: shoes,frie,be
Testing blah blah #shoes, #friends, #beach
PASS Hashtags: shoes,friends,beach
Testing #shoes, #friends, #beach,
PASS Hashtags: shoes,friends,beach
Testing #shoes, #friends, #beach,#
PASS Hashtags: shoes,friends,beach
Testing #shoes, #friends, #beach som trailing text
PASS Hashtags: shoes,friends,beach
Testing #3431A, #345ADF
PASS Hashtags: 3431A,345ADF
Testing The quick brown #fox gave the #99dogs codes #A00BZ90A #45678blah #0569509 #09XX09
PASS Hashtags: fox,99dogs,A00BZ90A,45678blah,0569509,09XX09

Make unique array from different preg_match_all applied to array of string

I have a logic problem that I don't know how to solve, and I'm quite confused about it.
I have an array composed in this way:
titoli[
1 => 'NFL'
2 => 'Johnny Depp'
3 => 'Institute of Technology'
4 => 'Another text'
]
Now, I need to apply different regexes to that array. How can I do that and end up with a single final array?
So far I've written this:
for($i=0;$i<sizeof($titoli);$i++)
{
if(str_word_count($titoli[$i]) > preg_match_all('/([A-Z][a-zA-Z0-9-]*)([\s][A-Z][a-zA-Z0-9-]*)+/', $titoli[$i]))
{
preg_match('/([A-Z][a-zA-Z0-9-]*)([\s][A-Z][a-zA-Z0-9-]*)+/', $titoli[$i], $result[$i]);
$i++;
}
if(str_word_count($my_array[$i]) > preg_match_all('/^[A-Z][a-z]* [a-z]+ [A-Z][a-z]*$/', $titoli[$x]) && preg_match_all('/^[A-Z][a-z]* [a-z]+ [A-Z][a-z]*$/', $titoli[$i]) > 0) // checks that not every word in the title was written with an initial capital letter
{
preg_match('/^[A-Z][a-z]* [a-z]+ [A-Z][a-z]*$/', $titoli[$x], $result_b[$y], PREG_PATTERN_ORDER);
$y++;
}
}
Well, you could merge the arrays then extract the unique values:
$merged = array_merge($array1, $array2, $array3);
$unique = array_unique($merged);
Where $array1, $array2 and $array3 are the results of the preg_match_all functions.
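A minimal sketch of that merge with placeholder data. The three arrays are hypothetical stand-ins for whatever the separate regex passes collected. Since array_unique() keeps the original keys, array_values() is added here to reindex the result.

```php
<?php
// Hypothetical results of three separate regex passes over the titles.
$array1 = ['NFL', 'Johnny Depp'];
$array2 = ['Johnny Depp', 'Institute of Technology'];
$array3 = ['NFL'];

$merged = array_merge($array1, $array2, $array3);
// array_unique() keeps the first occurrence of each value; array_values() reindexes from 0.
$unique = array_values(array_unique($merged));

print_r($unique);
```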
I am going to post an answer, but I have very little confidence that it will give you what you are looking for. I tried to construct a method that mirrors your intent -- but I could be dead wrong.
Input:
$titoli=[
1 => 'NFL',
2 => 'Johnny Depp',
3 => 'Institute of Technology',
4 => 'Another text'
];
Method:
foreach($titoli as $t){
if($t==strtoupper($t)){ // every letter is uppercase
$result['acronyms'][]=$t;
}elseif($t==ucwords(strtolower($t))){ // every word starts with an uppercase letter
$result['names'][]=$t;
}else{ // at least one word begins with a lowercase letter
$result['other'][]=$t;
}
}
var_export($result);
Output:
array (
'acronyms' =>
array (
0 => 'NFL',
),
'names' =>
array (
0 => 'Johnny Depp',
),
'other' =>
array (
0 => 'Institute of Technology',
1 => 'Another text',
),
)
