Regular expression to partially extract PHP code (an array definition)

I have PHP code (an array definition) stored in a string like this:
$code=' array(
0 => "a",
"a" => $GlobalScopeVar,
"b" => array("nested"=>array(1,2,3)),
"c" => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },
); ';
Is there a regular expression to extract this array? I mean, I want something like:
$array=array(
0 => '"a"',
'a' => '$GlobalScopeVar',
'b' => 'array("nested"=>array(1,2,3))',
'c' => 'function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }',
);
PS: I did some research trying to find a regular expression, but found nothing.
PS2: Gods of Stack Overflow, let me put a bounty on this now and I will offer 400 :3
PS3: This will be used in an internal app, where I need to extract an array from some PHP file so it can be 'processed' in parts. I try to explain with this: codepad.org/td6LVVme

Regex
So here's the MEGA regex I came up with:
\s* # white spaces
########################## KEYS START ##########################
(?: # We\'ll use this to make keys optional
(?P<keys> # named group: keys
\d+ # match digits
| # or
"(?(?=\\\\")..|[^"])*" # match string between "", works even 4 escaped ones "hello \" world"
| # or
\'(?(?=\\\\\')..|[^\'])*\' # match string between \'\', same as above :p
| # or
\$\w+(?:\[(?:[^[\]]|(?R))*\])* # match variables $_POST, $var, $var["foo"], $var["foo"]["bar"], $foo[$bar["fail"]]
) # close group: keys
########################## KEYS END ##########################
\s* # white spaces
=> # match =>
)? # make keys optional
\s* # white spaces
########################## VALUES START ##########################
(?P<values> # named group: values
\d+ # match digits
| # or
"(?(?=\\\\")..|[^"])*" # match string between "", works even 4 escaped ones "hello \" world"
| # or
\'(?(?=\\\\\')..|[^\'])*\' # match string between \'\', same as above :p
| # or
\$\w+(?:\[(?:[^[\]]|(?R))*\])* # match variables $_POST, $var, $var["foo"], $var["foo"]["bar"], $foo[$bar["fail"]]
| # or
array\s*\((?:[^()]|(?R))*\) # match an array()
| # or
\[(?:[^[\]]|(?R))*\] # match an array, new PHP array syntax: [1, 3, 5] is the same as array(1,3,5)
| # or
(?:function\s+)?\w+\s* # match functions: helloWorld, function name
(?:\((?:[^()]|(?R))*\)) # match function parameters (wut), (), (array(1,2,4))
(?:(?:\s*use\s*\((?:[^()]|(?R))*\)\s*)? # match use(&$var), use($foo, $bar) (optionally)
\{(?:[^{}]|(?R))*\} # match { whatever}
)?;? # match ; (optionally)
) # close group: values
########################## VALUES END ##########################
\s* # white spaces
I've added some comments. Note that you need to use 3 modifiers:
x : lets me write comments
s : makes dots match newlines
i : matches case-insensitively
PHP
$code='array(0 => "a", 123 => 123, $_POST["hello"][\'world\'] => array("is", "actually", "An array !"), 1234, \'got problem ?\',
"a" => $GlobalScopeVar, $test_further => function test($noway){echo "this works too !!!";}, "yellow" => "blue",
"b" => array("nested"=>array(1,2,3), "nested"=>array(1,2,3),"nested"=>array(1,2,3)), "c" => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
"bug", "fixed", "mwahahahaa" => "Yeaaaah"
);'; // Sample data
$code = preg_replace('#(^\s*array\s*\(\s*)|(\s*\)\s*;?\s*$)#s', '', $code); // Just to get rid of array( at the beginning, and ); at the end
preg_match_all('~
\s* # white spaces
########################## KEYS START ##########################
(?: # We\'ll use this to make keys optional
(?P<keys> # named group: keys
\d+ # match digits
| # or
"(?(?=\\\\")..|[^"])*" # match string between "", works even 4 escaped ones "hello \" world"
| # or
\'(?(?=\\\\\')..|[^\'])*\' # match string between \'\', same as above :p
| # or
\$\w+(?:\[(?:[^[\]]|(?R))*\])* # match variables $_POST, $var, $var["foo"], $var["foo"]["bar"], $foo[$bar["fail"]]
) # close group: keys
########################## KEYS END ##########################
\s* # white spaces
=> # match =>
)? # make keys optional
\s* # white spaces
########################## VALUES START ##########################
(?P<values> # named group: values
\d+ # match digits
| # or
"(?(?=\\\\")..|[^"])*" # match string between "", works even 4 escaped ones "hello \" world"
| # or
\'(?(?=\\\\\')..|[^\'])*\' # match string between \'\', same as above :p
| # or
\$\w+(?:\[(?:[^[\]]|(?R))*\])* # match variables $_POST, $var, $var["foo"], $var["foo"]["bar"], $foo[$bar["fail"]]
| # or
array\s*\((?:[^()]|(?R))*\) # match an array()
| # or
\[(?:[^[\]]|(?R))*\] # match an array, new PHP array syntax: [1, 3, 5] is the same as array(1,3,5)
| # or
(?:function\s+)?\w+\s* # match functions: helloWorld, function name
(?:\((?:[^()]|(?R))*\)) # match function parameters (wut), (), (array(1,2,4))
(?:(?:\s*use\s*\((?:[^()]|(?R))*\)\s*)? # match use(&$var), use($foo, $bar) (optionally)
\{(?:[^{}]|(?R))*\} # match { whatever}
)?;? # match ; (optionally)
) # close group: values
########################## VALUES END ##########################
\s* # white spaces
~xsi', $code, $m); // Matching :p
print_r($m['keys']); // Print keys
print_r($m['values']); // Print values
// Since some keys may be empty in case you didn't specify them in the array, let's fill them up !
foreach($m['keys'] as $index => &$key){
if($key === ''){
$key = 'made_up_index_'.$index;
}
}
$results = array_combine($m['keys'], $m['values']);
print_r($results); // printing results
Output
Array
(
[0] => 0
[1] => 123
[2] => $_POST["hello"]['world']
[3] =>
[4] =>
[5] => "a"
[6] => $test_further
[7] => "yellow"
[8] => "b"
[9] => "c"
[10] =>
[11] =>
[12] => "mwahahahaa"
[13] => "this is"
)
Array
(
[0] => "a"
[1] => 123
[2] => array("is", "actually", "An array !")
[3] => 1234
[4] => 'got problem ?'
[5] => $GlobalScopeVar
[6] => function test($noway){echo "this works too !!!";}
[7] => "blue"
[8] => array("nested"=>array(1,2,3), "nested"=>array(1,2,3),"nested"=>array(1,2,3))
[9] => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
[10] => "bug"
[11] => "fixed"
[12] => "Yeaaaah"
[13] => "a test"
)
Array
(
[0] => "a"
[123] => 123
[$_POST["hello"]['world']] => array("is", "actually", "An array !")
[made_up_index_3] => 1234
[made_up_index_4] => 'got problem ?'
["a"] => $GlobalScopeVar
[$test_further] => function test($noway){echo "this works too !!!";}
["yellow"] => "blue"
["b"] => array("nested"=>array(1,2,3), "nested"=>array(1,2,3),"nested"=>array(1,2,3))
["c"] => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
[made_up_index_10] => "bug"
[made_up_index_11] => "fixed"
["mwahahahaa"] => "Yeaaaah"
["this is"] => "a test"
)
Known bug (fixed)
$code='array("aaa", "sdsd" => "dsdsd");'; // fail
$code='array(\'aaa\', \'sdsd\' => "dsdsd");'; // fail
$code='array("aaa", \'sdsd\' => "dsdsd");'; // succeed
// Which means, if a value with no keys is followed
// by key => value and they are using the same quotation
// then it will fail (first value gets merged with the key)
Credits
Goes to Bart Kiers for his recursive pattern to match nested brackets.
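For readers unfamiliar with the recursive construct used throughout the big pattern, here is a minimal sketch of it in isolation. The sample string is hypothetical, and the possessive quantifier (++) is an addition for efficiency that the pattern above does not use.

```php
<?php
// Sample input (hypothetical): nested array() calls followed by unrelated text.
$subject = 'array(1, array(2, 3), 4) tail';

// \(            an opening parenthesis
// (?: ... )*    any number of: a run of non-parenthesis characters,
//               or the whole pattern again via (?R), i.e. a nested (...)
// \)            the matching closing parenthesis
$pattern = '/\((?:[^()]++|(?R))*\)/';

preg_match($pattern, $subject, $m);
$balanced = $m[0];
echo $balanced, "\n"; // (1, array(2, 3), 4)
```

Note that this counts every parenthesis, including ones inside string literals, which is one reason the full pattern matches quoted strings separately.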
Advice
You should probably go with a parser, since regexes are fragile. @bwoebi has done a great job in his answer.

Even though you asked for a regex, this can also be done in pure PHP. token_get_all is the key function here. For a regex, check out @HamZa's answer.
The advantage here is that it is more dynamic than a regex. A regex has a static pattern, while with token_get_all you can decide what to do after every single token. It even escapes single quotes and backslashes where necessary, which a regex wouldn't do.
Also, with a regex, even a commented one, it is hard to picture what it is supposed to do; PHP code is much easier to understand when you read it.
$code = ' array(
0 => "a",
"a" => $GlobalScopeVar,
"b" => array("nested"=>array(1,2,3)),
"c" => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },
"string_literal",
12345
); ';
$token = token_get_all("<?php ".$code);
$newcode = "";
$i = 0;
while (++$i < count($token)) { // enter into array; then start.
if (is_array($token[$i]))
$newcode .= $token[$i][1];
else
$newcode .= $token[$i];
if ($token[$i] == "(") {
$ending = ")";
break;
}
if ($token[$i] == "[") {
$ending = "]";
break;
}
}
// init variables
$escape = 0;
$wait_for_non_whitespace = 0;
$parenthesis_count = 0;
$entry = "";
// main loop
while (++$i < count($token)) {
// don't match commas in func($a, $b)
if ($token[$i] == "(" || $token[$i] == "{") // ( -> normal parenthesis; { -> closures
$parenthesis_count++;
if ($token[$i] == ")" || $token[$i] == "}")
$parenthesis_count--;
// begin new string after T_DOUBLE_ARROW
if (!$escape && $wait_for_non_whitespace && (!is_array($token[$i]) || $token[$i][0] != T_WHITESPACE)) {
$escape = 1;
$wait_for_non_whitespace = 0;
$entry .= "'";
}
// here is a T_DOUBLE_ARROW, there will be a string after this
if (is_array($token[$i]) && $token[$i][0] == T_DOUBLE_ARROW && !$escape) {
$wait_for_non_whitespace = 1;
}
// entry ended: comma reached
if (!$parenthesis_count && $token[$i] == "," || ($parenthesis_count == -1 && $token[$i] == ")" && $ending == ")") || ($ending == "]" && $token[$i] == "]")) {
// go back to the first non-whitespace
$whitespaces = "";
if ($parenthesis_count == -1 || ($ending == "]" && $token[$i] == "]")) {
$cut_at = strlen($entry);
while ($cut_at && ord($entry[--$cut_at]) <= 0x20); // 0x20 == " "
$whitespaces = substr($entry, $cut_at + 1, strlen($entry));
$entry = substr($entry, 0, $cut_at + 1);
}
// $escape == true means: there was somewhere a T_DOUBLE_ARROW
if ($escape) {
$escape = 0;
$newcode .= $entry."'";
} else {
$newcode .= "'".addcslashes($entry, "'\\")."'";
}
$newcode .= $whitespaces.($parenthesis_count?")":(($ending == "]" && $token[$i] == "]")?"]":","));
// reset
$entry = "";
} else {
// add actual token to $entry
if (is_array($token[$i])) {
$addChar = $token[$i][1];
} else {
$addChar = $token[$i];
}
if ($entry == "" && $token[$i][0] == T_WHITESPACE) {
$newcode .= $addChar;
} else {
$entry .= $escape?str_replace(array("'", "\\"), array("\\'", "\\\\"), $addChar):$addChar;
}
}
}
//append remaining chars like whitespaces or ;
$newcode .= $entry;
print $newcode;
Demo at: http://3v4l.org/qe4Q1
Should output:
array(
0 => '"a"',
"a" => '$GlobalScopeVar',
"b" => 'array("nested"=>array(1,2,3))',
"c" => 'function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }',
'"string_literal"',
'12345'
)
To get the array's data, you can run print_r(eval("return $newcode;")); which prints the entries of the array:
Array
(
[0] => "a"
[a] => $GlobalScopeVar
[b] => array("nested"=>array(1,2,3))
[c] => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
[1] => "string_literal"
[2] => 12345
)
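The per-token decision making described in this answer can also be reduced to a much smaller sketch. This is not the code above, only an illustration under assumptions: the sample string and the variable names ($entries, $depth, $key, $current) are made up, and it only splits the outer array into [key, value] source-text pairs by tracking bracket depth.

```php
<?php
// Hypothetical sample; "x, y" checks that commas inside strings are ignored.
$code = 'array(0 => "a", "b" => array("nested" => array(1, 2, 3)), '
      . '"c" => function() use (&$VAR) { return isset($VAR) ? "x, y" : "z"; },)';

$entries = [];   // list of [key or null, value] pairs, both as source text
$depth   = 0;    // bracket nesting; 1 means directly inside the outer array
$key     = null;
$current = '';

foreach (token_get_all('<?php ' . $code . ';') as $token) {
    $isPunct = !is_array($token);   // one-character tokens come back as plain strings
    $text    = $isPunct ? $token : $token[1];

    $opens  = $isPunct
        ? in_array($text, ['(', '[', '{'], true)
        : in_array($token[0], [T_CURLY_OPEN, T_DOLLAR_OPEN_CURLY_BRACES], true);
    $closes = $isPunct && in_array($text, [')', ']', '}'], true);

    if ($opens) {
        $depth++;
        if ($depth === 1) {
            continue;               // the outer array's own opening bracket
        }
    }
    if ($closes) {
        $depth--;
        if ($depth === 0) {
            break;                  // the outer array's closing bracket
        }
    }
    if ($depth === 0) {
        continue;                   // open tag and the "array" keyword
    }
    if ($depth === 1 && $isPunct && $text === ',') {
        if (trim($current) !== '') {
            $entries[] = [$key, trim($current)];
        }
        $key = null;
        $current = '';
        continue;
    }
    if ($depth === 1 && !$isPunct && $token[0] === T_DOUBLE_ARROW && $key === null) {
        $key = trim($current);      // text collected so far was the key
        $current = '';
        continue;
    }
    $current .= $text;
}
if (trim($current) !== '') {
    $entries[] = [$key, trim($current)];   // last entry without a trailing comma
}
print_r($entries);
```

Because strings arrive as single tokens, commas and brackets inside them never reach the depth counter. One known gap in this sketch: an unkeyed arrow function (fn($x) => $x) at the top level would be mistaken for a key => value pair.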

The clean way to do this is obviously to use the tokenizer (but keep in mind that the tokenizer alone doesn't solve the problem).
For the challenge, I propose a regex approach.
The idea is not to describe the full PHP syntax, but rather to describe it in a negative way (in other words, I describe only the basic PHP structures needed to obtain the result). The advantage of this basic description is that it can deal with objects more complex than functions, strings, integers or booleans. The result is a more flexible pattern that can handle, for example, multi-line and single-line comments and heredoc/nowdoc syntaxes:
<pre><?php
$code=' array(
0 => "a",
"a" => $GlobalScopeVar,
"b" => array("nested"=>array(1,2,3)),
"c" => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },
); ';
$pattern = <<<'EOD'
~
# elements
(?(DEFINE)
# comments
(?<comMulti> /\* .*? (?:\*/|\z) ) # multiline comment
(?<comInlin> (?://|\#) \N* $ ) # inline comment
(?<comments> \g<comMulti> | \g<comInlin> )
# strings
(?<strDQ> " (?>[^"\\]+|\\.)* ") # double quote string
(?<strSQ> ' (?>[^'\\]+|\\.)* ') # single quote string
(?<strHND> <<<(["']?)([a-zA-Z]\w*)\g{-2} (?>\R \N*)*? \R \g{-1} ;? (?=\R|$) ) # heredoc and nowdoc syntax
(?<string> \g<strDQ> | \g<strSQ> | \g<strHND> )
# brackets
(?<braCrl> { (?> \g<nobracket> | \g<brackets> )* } )
(?<braRnd> \( (?> \g<nobracket> | \g<brackets> )* \) )
(?<braSqr> \[ (?> \g<nobracket> | \g<brackets> )* ] )
(?<brackets> \g<braCrl> | \g<braRnd> | \g<braSqr> )
# nobracket: content between brackets except other brackets
(?<nobracket> (?> [^][)(}{"'</\#]+ | \g<comments> | / | \g<string> | <+ )+ )
# ignored elements
(?<s> \s+ | \g<comments> )
)
# array components
(?(DEFINE)
# key
(?<key> [0-9]+ | \g<string> )
# value
(?<value> (?> [^][)(}{"'</\#,\s]+ | \g<s> | / | \g<string> | <+ | \g<brackets> )+? (?=\g<s>*[,)]) )
)
(?J)
(?: \G (?!\A)(?<!\)) | array \g<s>* \( ) \g<s>* \K
(?: (?<key> \g<key> ) \g<s>* => \g<s>* )? (?<value> \g<value> ) \g<s>* (?:,|,?\g<s>*(?<stop> \) ))
~xsm
EOD;
if (preg_match_all($pattern, $code, $m, PREG_SET_ORDER)) {
foreach($m as $v) {
echo "\n<strong>Whole match:</strong> " . $v[0]
. "\n<strong>Key</strong>:\t" . $v['key']
. "\n<strong>Value</strong>:\t" . $v['value'] . "\n";
if (isset($v['stop']))
echo "\n<strong>done</strong>\n\n";
}
}

Here is what you asked for, very compact.
Please let me know if you'd like any tweaks.
THE CODE (you can run this straight in php)
$code=' array(
0 => "a",
"a" => $GlobalScopeVar,
"b" => array("nested"=>array(1,2,3)),
"c" => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },
); ';
$regex = "~(?xm)
^[\s'\"]*([^'\"\s]+)['\"\s]*
=>\s*+
(.*?)\s*,?\s*$~";
if(preg_match_all($regex,$code,$matches,PREG_SET_ORDER)) {
$array=array();
foreach($matches as $match) {
$array[$match[1]] = $match[2];
}
echo "<pre>";
print_r($array);
echo "</pre>";
} // END IF
THE OUTPUT
Array
(
[0] => "a"
[a] => $GlobalScopeVar
[b] => array("nested"=>array(1,2,3))
[c] => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
)
$array contains your array.
You like?
Please let me know if you have any questions or require tweaks. :)

Just for this situation:
$code=' array(
0=>"a",
"a"=>$GlobalScopeVar,
"b"=>array("nested"=>array(1,2,3)),
"c"=>function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },
); ';
preg_match_all('#\s*(.*?)\s*=>\s*(.*?)\s*,?\s*$#m', $code, $m);
$array = array_combine($m[1], $m[2]);
print_r($array);
Output:
Array
(
[0] => "a"
["a"] => $GlobalScopeVar
["b"] => array("nested"=>array(1,2,3))
["c"] => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
)

Related

Find Logic from String using Regex

I want to extract the individual logic clauses from an input string. For example:
Input:
pageType == "static" OR pageType == "item" AND pageRef == "index"
How do I get the clauses like this:
0 => pageType == "static"
1 => pageType == "item"
2 => pageRef == "index"
Each clause must be complete, exactly as it was entered.
I tried this:
$input = 'pageType == "static" OR pageType == "item" AND pageRef == "index"';
preg_match_all('/(.+)(?!AND|OR)(.+)/s', $input, $loX);
var_dump($loX);
but the array just shows:
0 => pageType == "static" OR pageType == "item" AND pageRef == "index"
1 => pageType == "static" OR pageType == "item" AND pageRef == "index
2 => "
Please help me, thanks ^_^
One option is to make use of the \G anchor and a capture group:
\G(\w+\h*==\h*"[^"]*")(?:\h+(?:OR|AND)\h+|$)
The pattern matches:
\G Get continuous matches asserting the position at the end of the previous match from the start of the string
(\w+\h*==\h*"[^"]*") Match 1+ word characters == and the value between double quotes
(?: Non capture group for the alternatives
\h+(?:OR|AND)\h+ Match either OR or AND between spaces
| Or
$ Assert the end of the string
) Close the group
$re = '/\G(\w+\h*==\h*"[^"]*")(?:\h+(?:OR|AND)\h+|$)/';
$str = 'pageType == "static" OR pageType == "item" AND pageRef == "index"';
preg_match_all($re, $str, $matches);
print_r($matches[1]);
Output
Array
(
[0] => pageType == "static"
[1] => pageType == "item"
[2] => pageRef == "index"
)
Another option to get the results is to split on the AND or OR surrounded by spaces:
$result = preg_split('/\h+(?:OR|AND)\h+/', $str);
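If the operators themselves are also needed, the same split can keep them. The following is a sketch under that assumption; PREG_SPLIT_DELIM_CAPTURE is a standard preg_split flag, while the $clauses and $operators names are made up for illustration.

```php
<?php
$str = 'pageType == "static" OR pageType == "item" AND pageRef == "index"';

// The capture group around OR|AND makes preg_split return the operators
// between the clauses: clause, operator, clause, operator, clause.
$parts = preg_split('/\h+(OR|AND)\h+/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);

$clauses   = [];
$operators = [];
foreach ($parts as $i => $part) {
    if ($i % 2 === 0) {
        $clauses[] = $part;     // even positions hold the clauses
    } else {
        $operators[] = $part;   // odd positions hold OR / AND
    }
}
print_r($clauses);
print_r($operators);
```

Like the plain split, this would also break on an OR or AND surrounded by spaces inside a quoted value.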

Get content in parentheses following right after string using regex in php

I have a PHP file as a string. I am looking for places where certain functions are called, and I want to extract the arguments passed to the function.
I need to match the following cases:
some_function_name("abc123", ['key' => 'value'])
some_function_name("abc123", array("key" => 'value'))
So far I have this, but it breaks as soon as I have any nesting conditions:
(function_name)\(([^()]+)\)
$text = "test test test test some_function_name('abc123', ['key' => 'value']) sdohjsh dsfkjh spkdo sdfopmsdfohp some_function_name('abc123', array('key' => 'value'))";
preg_match_all('/\w+\(.*?\)(\)|!*)/', $text, $matches);
var_dump($matches[0]);
Is this the desired result you want?
$text = "blah some_function_name('abc123', ['key' => 'value']) blah some_function_name('abc123', array('key' => 'value')) blah";
preg_match_all('/\w+\(.+?(?:array\(.+?\)|\[.+?\])\)/', $text, $matches);
var_dump($matches);
Output:
array(1) {
[0]=>
array(2) {
[0]=>
string(48) "some_function_name('abc123', ['key' => 'value'])"
[1]=>
string(53) "some_function_name('abc123', array('key' => 'value'))"
}
}
Explanation:
\w+ # 1 or more word character (i.e. [a-zA-Z0-9_])
\( # opening parenthesis
.+? # 1 or more any character, not greedy
(?: # non capture group
array\(.+?\) # array(, 1 or more any character, )
| # OR
\[.+?\] # [, 1 or more any character, ]
) # end group
\) # closing parenthesis
I managed to solve it using the following pattern:
((\'.*?\'|\".*?\")(\s*,\s*.*?)*?\);?
Thanks everyone for your suggestions!
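For the nesting problem raised in the question, a recursive subpattern is another option. This is a sketch and not one of the patterns posted above; some_function_name is the placeholder name from the question, and $arguments is a made-up variable name.

```php
<?php
$text = "test some_function_name('abc123', ['key' => 'value']) foo "
      . "some_function_name('abc123', array('key' => 'value')) bar";

// Group 1 matches one balanced (...) block: runs of non-parenthesis
// characters, or a nested block matched by recursing into group 1 via (?1).
$pattern = '/\bsome_function_name(\((?:[^()]++|(?1))*\))/';

preg_match_all($pattern, $text, $matches);

// Strip the outer parentheses to keep only the argument list.
$arguments = array_map(function ($block) {
    return substr($block, 1, -1);
}, $matches[1]);

print_r($arguments);
```

This handles any depth of nested parentheses, but an unbalanced parenthesis inside a string argument would still confuse it.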

Extracting javascript object from html using regex & php

I am trying to extract a specific JavaScript object from a page containing the usual HTML markup.
I have tried to use a regex, but I can't seem to get it to parse the HTML correctly when the HTML contains a line break.
An example can be seen here: https://regex101.com/r/b8zN8u/2
The HTML I am trying to extract looks like this:
<script>
DATA.tracking.user = {
age: "19",
name: "John doe"
}
</script>
Using the following regex: DATA.tracking.user=(.*?)}
<?php
$re = '/DATA.tracking.user = (.*?)\}/m';
$str = '<script>
DATA.tracking.user = { age: "19", name: "John doe" }
</script>';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
If I parse DATA.tracking.user = { age: "19", name: "John doe" } without any line breaks, then it works fine, but if I try to parse:
DATA.tracking.user = {
age: "19",
name: "John doe"
}
It does not like dealing with the line breaks.
Any help would be greatly appreciated.
Thanks.
You will need to specify whitespaces (\s) in your pattern in order to parse the javascript code containing linebreaks.
For example, if you use the following code:
<?php
$re = '/DATA.tracking.user = \{\s*.*\s*.*\s*\}/';
$str = '<script>
DATA.tracking.user = {
age: "19",
name: "John doe"
}
</script>';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
print_r($matches[0]);
?>
You will get the following output:
Array
(
[0] => DATA.tracking.user = {
age: "19",
name: "John doe"
}
)
The simple solution to your problem is to use the s pattern modifier to command the . (any character) to also match newline characters -- which it does not by default.
And you should:
escape your literal dots.
write the \{ outside of your capture group.
omit the m pattern modifier because you aren't using anchors.
...BUT...
If this was my task and I was going to be processing the data from the extracted string, I would probably start breaking up the components at extraction-time with the power of \G.
Code:
$htmls[] = <<<HTML
DATA.tracking.user = { age: "19", name: "John doe", int: 55 } // This works
HTML;
$htmls[] = <<<HTML
DATA.tracking.user = {
age: "20",
name: "Jane Doe",
int: 49
} // This does not works
HTML;
foreach ($htmls as $html) {
var_export(preg_match_all('~(?:\G(?!^),|DATA\.tracking\.user = \{)\s+([^:]+): (\d+|"[^"]*")~', $html, $out, PREG_SET_ORDER) ? $out : []);
echo "\n --- \n";
}
Output:
array (
0 =>
array (
0 => 'DATA.tracking.user = { age: "19"',
1 => 'age',
2 => '"19"',
),
1 =>
array (
0 => ', name: "John doe"',
1 => 'name',
2 => '"John doe"',
),
2 =>
array (
0 => ', int: 55',
1 => 'int',
2 => '55',
),
)
---
array (
0 =>
array (
0 => 'DATA.tracking.user = {
age: "20"',
1 => 'age',
2 => '"20"',
),
1 =>
array (
0 => ',
name: "Jane Doe"',
1 => 'name',
2 => '"Jane Doe"',
),
2 =>
array (
0 => ',
int: 49',
1 => 'int',
2 => '49',
),
)
---
Now you can simply iterate the matches and work with [1] (the keys) and [2] (the values). This is a basic solution, that can be further tailored to suit your project data. Admittedly, this doesn't account for values that contain an escaped double-quote. Adding this feature would be no trouble. Accounting for more complex value types may be more of a challenge.
You need to add the 's' modifier to the end of your regex - otherwise, "." does not include newlines. See this:
s (PCRE_DOTALL)
If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
So basically change your regex to be:
'/DATA.tracking.user = (.*?)\}/ms'
Also, you should escape your other dots (otherwise you will also match "DATAYtrackingzZuser"). So...
'/DATA\.tracking\.user = (.*?)\}/ms'
I'd also add in the open curly bracket and not enforce the single space around the equal sign, so:
'/DATA\.tracking\.user\s*=\s*\{(.*?)\}/ms'
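As a quick check, here is that last pattern run against the multi-line sample from the question. It is only a sketch; the $body variable name is made up.

```php
<?php
$str = '<script>
DATA.tracking.user = {
age: "19",
name: "John doe"
}
</script>';

// The s modifier lets . match newlines, so (.*?) can span several lines.
$re = '/DATA\.tracking\.user\s*=\s*\{(.*?)\}/ms';

$body = null;
if (preg_match($re, $str, $m)) {
    $body = trim($m[1]);   // the text between { and }, without surrounding whitespace
    echo $body, "\n";
}
```

Because (.*?) stops at the first closing brace, this only works while the object contains no nested braces.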
Since you seem to be scraping/reading the page anyway (so you have a local copy), you can simply replace all the newline characters in the HTML page with whitespace characters, then it should work perfectly without even changing your script.
Refer to this for the ascii values:
https://www.techonthenet.com/ascii/chart.php

Parsing parameters from command line with RegEx and PHP

I have this as an input to my command line interface as parameters to the executable:
-Parameter1=1234 -Parameter2=38518 -param3 "Test \"escaped\"" -param4 10 -param5 0 -param6 "TT" -param7 "Seven" -param8 "secret" "-SuperParam9=4857?--SuperParam10=123"
What I want is to get all of the parameters in a key-value / associative array with PHP, like this:
$result = [
'Parameter1' => '1234',
'Parameter2' => '38518',
'param3' => 'Test \"escaped\"',
'param4' => '10',
'param5' => '0',
'param6' => 'TT',
'param7' => 'Seven',
'param8' => 'secret',
'SuperParam9' => '4857',
'SuperParam10' => '123',
];
The problem here lies in the following:
the parameter prefix can be - or --
the parameter glue (value assignment operator) can be either an = sign or a whitespace ' '
some parameters may be inside a quoted block and can also use different separators, glues and prefixes, e.g. a ? mark as the separator.
So far, since I'm really bad with regex and still learning it, I have this:
/(-[a-zA-Z]+)/gui
With which I can get all the parameters starting with a -...
I could explode the entire thing and parse it manually, but there are way too many contingencies to think about.
You can try this, which uses the branch reset feature (?|...|...) to deal with the different possible formats of the values:
$str = '-Parameter1=1234 -Parameter2=38518 -param3 "Test \"escaped\"" -param4 10 -param5 0 -param6 "TT" -param7 "Seven" -param8 "secret" "-SuperParam9=4857?--SuperParam10=123"';
$pattern = '~ --?(?<key> [^= ]+ ) [ =]
(?|
" (?<value> [^\\\\"]*+ (?s:\\\\.[^\\\\"]*)*+ ) "
|
([^ ?"]*)
)~x';
preg_match_all ($pattern, $str, $matches);
$result = array_combine($matches['key'], $matches['value']);
print_r($result);
In a branch reset group, the capture groups have the same number or the same name in each branch of the alternation.
This means that (?<value> [^\\\\"]*+ (?s:\\\\.[^\\\\"]*)*+ ) is (obviously) the value named capture, but that ([^ ?"]*) is also the value named capture.
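The effect of the branch reset group on group numbering can be seen in a small sketch. The two mini-patterns below are hypothetical and much simpler than the one above; they match either a double-quoted value or a bare word.

```php
<?php
// Hypothetical mini-patterns: a double-quoted value or a bare word.
$plain = '/(?:"([^"]*)"|(\S+))/';   // ordinary group: the alternatives are groups 1 and 2
$reset = '/(?|"([^"]*)"|(\S+))/';   // branch reset: both alternatives are group 1

preg_match($plain, 'secret', $withoutReset);
preg_match($reset, 'secret', $withReset);
preg_match($reset, '"Seven"', $quoted);

print_r($withoutReset); // the value lands in index 2; index 1 is an empty string
print_r($withReset);    // the value lands in index 1
print_r($quoted);       // the unquoted value lands in index 1 as well
```

With the branch reset, the calling code can always read the same group no matter which alternative matched.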
You could use
--?
(?P<key>\w+)
(?|
=(?P<value>[^-\s?"]+)
|
\h+"(?P<value>.*?)(?<!\\)"
|
\h+(?P<value>\H+)
)
See a demo on regex101.com.
Which in PHP would be:
<?php
$data = <<<DATA
-Parameter1=1234 -Parameter2=38518 -param3 "Test \"escaped\"" -param4 10 -param5 0 -param6 "TT" -param7 "Seven" -param8 "secret" "-SuperParam9=4857?--SuperParam10=123"
DATA;
$regex = '~
--?
(?P<key>\w+)
(?|
=(?P<value>[^-\s?"]+)
|
\h+"(?P<value>.*?)(?<!\\\\)"
|
\h+(?P<value>\H+)
)~x';
if (preg_match_all($regex, $data, $matches)) {
$result = array_combine($matches['key'], $matches['value']);
print_r($result);
}
?>
This yields
Array
(
[Parameter1] => 1234
[Parameter2] => 38518
[param3] => Test \"escaped\"
[param4] => 10
[param5] => 0
[param6] => TT
[param7] => Seven
[param8] => secret
[SuperParam9] => 4857
[SuperParam10] => 123
)

Regular expression to extract annotation parameter

I want to extract parameters like those that appear in an annotation.
This is what I have so far:
$str = "(action=bla,arg=2,test=15,op)";
preg_match_all('/([^\(,]+)=([^,\)]*)/', $str, $m);
$data = array_combine($m[1], $m[2]);
var_dump($data);
This gives the following output:
array (size=3)
'action' => string 'bla' (length=3)
'arg' => string '2' (length=1)
'test' => string '15' (length=2)
This ignores op (but I want it to have a null or empty value).
I also want to improve this so it can extract these as well:
(action='val',abc) in this case the value inside the single quotes is assigned to action
(action="val",abc) same as above, but extracting the value between double quotes
(action=[123,11,23]) now action will contain the array 123,11,23 (this also needs to be extracted with or without quotation marks)
I don't need a complete solution (if you can provide one, you are most welcome), but I need at least the first two.
EDIT
(edit as per discussion with r3mus)
The output should be like:
array (size=4)
'action' => string 'bla' (length=3)
'arg' => string '2' (length=1)
'test' => string '15' (length=2)
'op' => NULL
Edit:
This ended up being a lot more complex than just a simple regex. It ended up looking (first pass) like this:
function validate($str)
{
if (preg_match('/=\[(.*)\]/', $str, $m))
{
$newstr = preg_replace("/,/", "+", $m[1]);
$str = preg_replace("/".$m[1]."/", $newstr, $str);
}
preg_match('/\((.*)\)/', $str, $m);
$array = explode(",", $m[1]);
$output = array();
foreach ($array as $value)
{
$pair = explode("=", $value);
if (preg_match('/\[(.*)\]/', $pair[1]))
$pair[1] = explode("+", $pair[1]);
$output[$pair[0]] = $pair[1];
}
if (!isset($output['op']))
return $output;
else
return false;
}
print_r(validate("(action=[123,11,23],arg=2,test=15)"));
Old stuff that wasn't adequate:
How about:
([^\(,]+)=(\[.*\]|['"]?(\w*)['"]?)
Working example/sandbox: http://regex101.com/r/bZ8qE6
Or if you need to capture only the array within the []:
([^\(,]+)=(\[(.*)\]|['"]?(\w*)['"]?)
I know it's answered but you could do this which I think is what you wanted:
$str = '(action=bla,arg=2,test=15,op)';
preg_match_all('/([^=,()]+)(?:=([^,)]+))?/', $str, $m);
$data = array_combine($m[1], $m[2]);
echo '<pre>' . print_r($data, true) . '</pre>';
OUTPUTS
Array
(
[action] => bla
[arg] => 2
[test] => 15
[op] =>
)
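If op should be NULL rather than an empty string, as the question's edit asks, one possible variation is to use PREG_SET_ORDER. This is a sketch that relies on the fact that, in set order and without extra flags, a trailing group that did not take part in the match is left out of the match array.

```php
<?php
$str = '(action=bla,arg=2,test=15,op)';

// Same pattern as above; PREG_SET_ORDER gives one array per parameter.
preg_match_all('/([^=,()]+)(?:=([^,)]+))?/', $str, $matches, PREG_SET_ORDER);

$data = [];
foreach ($matches as $match) {
    // $match[2] is absent when the parameter had no "=value" part.
    $data[$match[1]] = isset($match[2]) ? $match[2] : null;
}
var_dump($data);
```

Here action, arg and test keep their string values, while op maps to NULL.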
You can use this code:
<pre><?php
$subject = '(action=bla,arg=2,test=15,op, arg2=[1,2,3],arg3 = "to\\"t,o\\\\", '
. 'arg4 = \'titi\',arg5=) blah=312';
$pattern = <<<'LOD'
~
(?: \(\s* | \G(?<!^) ) # a parenthesis or contiguous to a precedent match
(?<param> \w+ )
(?: \s* = \s*
(?| (?<value> \[ [^]]* ] ) # array
| "(?<value> (?> [^"\\]++ | \\{2} | \\. )* )" # double quotes
| '(?<value> (?> [^'\\]++ | \\{2} | \\. )* )' # single quotes
| (?<value> [^\s,)]++ ) # other value
)? # the value can be empty
)? # the parameter can have no value
\s*
(?:
, \s* # a comma
| # OR
(?= (?<control> \) ) ) # followed by the closing parenthesis
)
~xs
LOD;
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
foreach($matches as $match) {
printf("<br>%s\t%s", $match['param'], isset($match['value']) ? $match['value'] : ''); // 'value' is absent for a bare parameter such as op
if (isset($match['control'])) echo '<br><br>#closing parenthesis#';
}
?></pre>
