Split log data with php preg_match - php

My task is to analyze log files with a PHP script.
I'm going to use a regex to split log records for further analysis.
Log records look like the following:
param1=val1;param2=val2;param3=val3;[int1Param1=int1Val1;int1Param2=int1Val2;][int2Param1=int2Val1;int2Param2=int2Val2;][int3Param1=int3Val1;int3Param2=int3Val2;]param4=val4;
So I have a set of parameters and values to analyze, and I have no problem with this part.
My concern is the "session data" inside the series of square brackets between param3 and param4. The issue is that I have no idea how many records I'll have in this part (there can be 0 or more of them).
I'm identifying this part with the following regex:
(\[[^\]\[]+\])*
It perfectly identifies the complete string between "param3=val3;" and "param4=val4;" and returns it as element "0" of preg_match's $matches array. What I also need is to get each bracketed group as an array element, for further analysis of its content, but $matches contains only 2 elements: "0" - the whole string; "1" - the last bracketed group.
Any ideas?
Thanks Dennis.

You can use preg_match_all on the string like so:
preg_match_all("/\[[^][]+\]/", $log, $results);
print_r($results);
This results in:
Array
(
[0] => Array
(
[0] => [int1Param1=int1Val1;int1Param2=int1Val2;]
[1] => [int2Param1=int2Val1;int2Param2=int2Val2;]
[2] => [int3Param1=int3Val1;int3Param2=int3Val2;]
)
)
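If the content of each group is needed as key/value pairs, a follow-up step could look like the sketch below. It assumes every pair has the form key=value; and that values never contain ";" or "="; the variable names ($sessions, $pairs) are made up for illustration.

```php
<?php
$log = 'param1=val1;param2=val2;param3=val3;[int1Param1=int1Val1;int1Param2=int1Val2;][int2Param1=int2Val1;int2Param2=int2Val2;][int3Param1=int3Val1;int3Param2=int3Val2;]param4=val4;';

// Capture the content of each bracketed group, without the brackets.
preg_match_all('/\[([^][]+)\]/', $log, $m);

$sessions = [];
foreach ($m[1] as $group) {
    $pairs = [];
    // Each group looks like "k1=v1;k2=v2;", so drop the trailing ";" before splitting.
    foreach (explode(';', rtrim($group, ';')) as $pair) {
        [$key, $value] = explode('=', $pair, 2);
        $pairs[$key] = $value;
    }
    $sessions[] = $pairs;
}

print_r($sessions);
```

When there are no bracketed groups at all, preg_match_all finds nothing and $sessions simply stays an empty array.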

What you can do:
$pattern = '~(?:(?<new>\[)|\G(?!^))(?<key>[^]=]++)=(?<val>[^][;]++);~';
$subject = 'param1=val1;param2=val2;param3=val3;[int1Param1=int1Val1;int1Param2=int1Val2;][int2Param1=int2Val1;int2Param2=int2Val2;][int3Param1=int3Val1;int3Param2=int3Val2;]param4=val4;';
$result = [];
if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
    $i = 0;
    foreach ($matches as $match) {
        // A captured "[" marks the start of a new bracketed group.
        if ($match['new']) $i++;
        $result[$i][$match['key']] = $match['val'];
    }
    print_r($result);
}
pattern explanation:
~ # pattern delimiter
(?: # open a non-capturing group
(?<new>\[) # the named group "new" contains a possible "[". It's useful
# to know when a new content in square brackets begins.
| # or
\G(?!^) # a match (that can't be at the start of the string)
# contiguous (\G) to a precedent match
) # close the non-capturing group
(?<key>[^]=]++) # named group "key"
=
(?<val>[^][;]++) # named group "val"
;
~
The alternation in the non-capturing group describes two possibilities. The first is [, which matches the first key/value pair inside square brackets. The second branch, which is forced to be contiguous to a preceding match, can then succeed for the second (and later) pairs of the same group.
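As a sketch of how this could be wrapped for reuse, the same pattern can sit inside a function that also handles records without any bracketed part. The function name parseSessions is hypothetical, and the groups are numbered from 0 here rather than from 1.

```php
<?php
// Hypothetical helper: returns one key/value array per bracketed group.
function parseSessions(string $record): array
{
    $pattern = '~(?:(?<new>\[)|\G(?!^))(?<key>[^]=]++)=(?<val>[^][;]++);~';
    $result = [];
    $i = -1;
    if (preg_match_all($pattern, $record, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            // The "new" group is non-empty only when the match starts with "[".
            if ($match['new'] !== '') {
                $i++;
            }
            $result[$i][$match['key']] = $match['val'];
        }
    }
    return $result;
}

print_r(parseSessions('a=1;[x=1;y=2;][z=3;]b=2;'));
```

Pairs outside the brackets (a=1; and b=2; above) are skipped: they neither start with [ nor directly follow a previous match, so neither branch of the alternation can succeed there.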

Related

Matching whole words between commas, or a comma at the beginning, or a comma at the end with Regex

I have a string like this:
page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags
I made this regex that I expect to get the whole tags with:
(?<=\,)(rss-latest|listing-latest-no-category|category-128|page-9000)(?=\,)
I want it to match all the occurrences.
In this case:
page-9000 and rss-latest.
This regex checks whole words between commas just fine but it ignores the first and the last because it's not between commas (obviously).
I've also tried making it check whether the tag is between commas OR has a comma at the beginning OR has a comma at the end; however, that gives me false positives, as it would match:
category-128
while the string contains:
page-category-128
Any help?
Try using the following pattern:
(?<=,|^)(rss-latest|listing-latest-no-category|category-128|page-9000)(?=,|$)
The only change I have made is to add boundary markers ^ and $ to the lookarounds to also match on the start and end of the input.
Script:
$input = "page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags";
preg_match_all("/(?<=,|^)(rss-latest|listing-latest-no-category|category-128|page-9000)(?=,|$)/", $input, $matches);
print_r($matches[1]);
This prints:
Array
(
[0] => page-9000
[1] => rss-latest
)
Here is a non-regex way using explode and array_intersect:
$arr1 = explode(',', 'page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags');
$arr2 = explode('|', 'rss-latest|listing-latest-no-category|category-128|page-9000');
print_r(array_intersect($arr1, $arr2));
Output:
Array
(
[0] => page-9000
[6] => rss-latest
)
The (?<=\,) and (?=,) require the presence of , on both sides of the matching pattern. You want to match also at the start/end of string, and this is where you need to either explicitly tell to match either , or start/end of string or use double-negating logic with negated character classes inside negative lookarounds.
You may use
(?<![^,])(?:rss-latest|listing-latest-no-category|category-128|page-9000)(?![^,])
Here, (?<![^,]) matches the start of string position or a , and (?![^,]) matches the end of string position or ,.
Now, you do not even need a capturing group, you may get rid of its overhead using a non-capturing group, (?:...). preg_match_all won't have to allocate memory for the submatches and the resulting array will be much cleaner.
PHP demo:
$re = '/(?<![^,])(?:rss-latest|listing-latest-no-category|category-128|page-9000)(?![^,])/m';
$str = 'page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags';
if (preg_match_all($re, $str, $matches)) {
print_r($matches[0]);
}
// => Array ( [0] => page-9000 [1] => rss-latest )
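If the list of tags is not fixed, the alternation could be built at run time. The following is only a sketch: the function name findTags is made up, and it assumes the tags themselves never contain commas.

```php
<?php
// Hypothetical helper: returns the wanted tags that occur as whole items in a comma-separated string.
function findTags(string $csv, array $wanted): array
{
    if ($wanted === []) {
        return []; // an empty alternation would match the empty string everywhere
    }
    // Escape each tag so regex metacharacters in it are treated literally.
    $alternatives = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '/');
    }, $wanted));
    $re = '/(?<![^,])(?:' . $alternatives . ')(?![^,])/';
    preg_match_all($re, $csv, $m);
    return $m[0];
}

$tags = findTags(
    'page-9000,page-template,page-category-128,rss-latest',
    ['rss-latest', 'category-128', 'page-9000']
);
print_r($tags);
```

Here category-128 is not reported, because in the input it only appears inside page-category-128, where it is preceded by a character other than a comma.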

PHP Regex to interpret a string as a command line attributes/options

Let's say I have a string like:
"Insert Post -title Some PostTitle -category 2 -date-posted 2013-02:02 10:10:10"
What I've been trying to do is convert this string into actions. The string is very readable, and what I'm trying to achieve is to make posting a little easier instead of navigating to new pages every time. I'm okay with how the actions are going to work, but I've had many failed attempts to process the string the way I want. I simply want the values after the attributes (options) to be put into an array, or just extracted so I can deal with them the way I want.
The string above should give me an array of keys => values, e.g.
$Processed = [
'title'=> 'Some PostTitle',
'category'=> '2',
....
];
Getting processed data like this is what I'm looking for.
I've been trying to write a regex for this, but with no luck.
For example, this:
/\-(\w*)\=?(.+)?/
should be close enough to what I want.
Note the spaces in the title and dates, and that some values can contain dashes as well. Maybe I can also add a list of allowed attributes:
$AllowedOptions = ['-title','-category',...];
I'm just not good at this and would like your help!
Much appreciated!
You can use this lookahead based regex to match your name-value pairs:
/-(\S+)\h+(.*?(?=\h+-|$))/
RegEx Breakup:
- # match a literal hyphen
(\S+) # match 1 or more of any non-whitespace char and capture it as group #1
\h+ # match 1 or more of any horizontal whitespace char
( # capture group #2 start
.*? # match 0 or more of any char (non-greedy)
(?=\h+-|$) # lookahead to assert next char is 1+ space and - or it is end of line
) # capture group #2 end
PHP Code:
$str = 'Insert Post -title Some PostTitle -category 2 -date-posted 2013-02:02 10:10:10';
if (preg_match_all('/-(\S+)\h+(.*?(?=\h+-|$))/', $str, $m)) {
$output = array_combine ( $m[1], $m[2] );
print_r($output);
}
Output:
Array
(
[title] => Some PostTitle
[category] => 2
[date-posted] => 2013-02:02 10:10:10
)
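The question also mentions a list of allowed attributes. One possible way to apply such a list after matching is sketched below; the -bogus option in the input and the $allowed and $options names are made up for illustration.

```php
<?php
$str = 'Insert Post -title Some PostTitle -category 2 -date-posted 2013-02:02 10:10:10 -bogus x';
$allowed = ['title', 'category', 'date-posted']; // option names without the leading hyphen

$options = [];
if (preg_match_all('/-(\S+)\h+(.*?(?=\h+-|$))/', $str, $m)) {
    // Pair names with values, then keep only the names listed in $allowed.
    $options = array_intersect_key(array_combine($m[1], $m[2]), array_flip($allowed));
}

print_r($options);
```

The hyphen inside 2013-02:02 does not end the value, because the lookahead only stops at a hyphen that is preceded by whitespace.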

RegEx Named Capturing Groups in PHP

I have the following regex to capture a list of numbers (it will be more complex than this eventually):
$list = '10,9,8,7,6,5,4,3,2,1';
$regex =
<<<REGEX
/(?x)
(?(DEFINE)
(?<number> (\d+) )
(?<list> (?&number)(,(?&number))* )
)
^(?&list)/
REGEX;
$matches = array();
if (preg_match($regex,$list,$matches)==1) {
print_r($matches);
}
Which outputs:
Array ( [0] => 10,9,8,7,6,5,4,3,2,1 )
How do I capture the individual numbers in the list in the $matches array? I don't seem to be able to do it, despite putting a capturing group around the digits (\d+).
EDIT
Just to make it clearer, I want to eventually use recursion, so explode is not ideal:
$match =
<<<REGEX
/(?x)
(?(DEFINE)
(?<number> (\d+) )
(?<member> (?&number)|(?&list) )
(?<list> \( ((?&number)|(?&member))(,(?&member))* \) )
)
^(?&list)/
REGEX;
The purpose of a (?(DEFINE)...) section is only to define named sub-patterns you can use later in the define section itself or in the main pattern. Since these sub-patterns are not defined in the main pattern, they don't capture anything, and a reference (?&number) is only a kind of alias for the sub-pattern \d+ and doesn't capture anything either.
Example with the string: 1abcde2
If I use this pattern: /^(?<num>\d).....(?&num)$/, only 1 is captured in the group num; (?&num) doesn't capture anything, it's only an alias for \d. The pattern /^(?<num>\d).....\d$/ produces exactly the same result.
Another point to clarify: with PCRE (the PHP regex engine), a capture group (named or not) can only store one value, even if you repeat it.
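A small sketch of that last point, using a shortened list; the variable names are only for illustration:

```php
<?php
$list = '10,9,8';

// The group (\d+) is repeated three times, but only its last value survives.
preg_match('/^(?:(\d+),?)+$/', $list, $m);
$lastOnly = $m[1];

// preg_match_all restarts the pattern after each match, so every number is kept.
preg_match_all('/\d+/', $list, $all);
$everyNumber = $all[0];

var_dump($lastOnly);
print_r($everyNumber);
```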
The main problem of your approach is that you are trying to do two things at the same time:
you want to check the format of the string.
you want to extract an unknown number of items.
Doing this is only possible in particular situations, but impossible in general.
For example, with a flat list like: $list = '10,9,8,7,6,5,4,3,2,1'; where there are no nested elements, you can use a function like preg_match_all to reuse the same pattern several times in this way:
if (preg_match_all('~\G(\d+)(,|$)~', $list, $matches) && !end($matches[2])) {
// \G ensures that results are contiguous
// you have all the items in $matches[1]
// if the last item of $matches[2] is empty, this means
// that the end of the string is reached and the string
// format is correct
echo '<°)))))))>';
}
Now if you have a nested list like $list = '10,9,(8,(7,6),5),4,(3,2),1'; and you want for example to check the format and to produce a tree structure like:
[ 10, 9, [ 8, [ 7, 6 ], 5 ], 4 , [ 3, 2 ], 1 ]
You can't do it with a single pass. You need one pattern to check the whole string format and another pattern to extract elements (and a recursive function to use it).
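A rough sketch of such a two-pass approach is shown below. Note that it swaps the recursive function for an explicit stack in the second pass, and that the function name parseNestedList and both patterns are assumptions made for illustration, not the only way to do it.

```php
<?php
// Hypothetical helper: returns the tree for a valid list, or null if the format is wrong.
function parseNestedList(string $list): ?array
{
    // Pass 1: check the whole string. An item is a number or a parenthesized list of items.
    $valid = '~^(?<item>\d+|\((?&item)(?:,(?&item))*\))(?:,(?&item))*$~';
    if (!preg_match($valid, $list)) {
        return null;
    }

    // Pass 2: extract the tokens and rebuild the nesting with a stack of open lists.
    preg_match_all('~\d+|[()]~', $list, $tokens);
    $stack = [[]];
    foreach ($tokens[0] as $token) {
        if ($token === '(') {
            $stack[] = [];                            // open a new sub-list
        } elseif ($token === ')') {
            $finished = array_pop($stack);            // close the current sub-list
            $stack[count($stack) - 1][] = $finished;  // and attach it to its parent
        } else {
            $stack[count($stack) - 1][] = (int) $token;
        }
    }
    return $stack[0];
}

print_r(parseNestedList('10,9,(8,(7,6),5),4,(3,2),1'));
```

Because the first pass has already confirmed that the parentheses are balanced, the second pass does not need any error handling of its own.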
<<<FORGET_THIS_IMMEDIATELY
As an aside you can do it with eval and strtr, but it's a very dirty and dangerous way:
eval('$result=[' . strtr($list, '()', '[]') . '];');
FORGET_THIS_IMMEDIATELY;
If you mean to get an array of the comma delimited numbers, then explode:
$numbers = explode(',', $matches[0]); //first parameter is your delimiter what the string will be split up by. And the second parameter is the initial string
print_r($numbers);
output:
Array(
[0] => 10,
[1] => 9,
[2] => 8,
etc
For this simple list, this would be enough (if you have to use a regular expression):
$string = '10,9,8,7,6,5,4,3,2,1';
$pattern = '/([\d]+),?/';
preg_match_all($pattern, $string, $matches);
print_r($matches[1]);

"Optional" substring matching with regex

I am writing a regular expression in PHP that will need to extract data from strings that look like:
Naujasis Salemas, Šiaurės Dakota
Jungtinės Valstijos (Centras, Šiaurės Dakota)
I would like to extract:
Naujasis Salemas
Centras
For the first case, I have written [^-]*(?=,), which works quite well. I would like to modify the expression so that if there are parentheses ( and ), it searches between those parentheses and then extracts everything before the comma.
Is it possible to do something like this with just 1 expression? If so, how can I make it search within parentheses if they exist?
A conditional might help you here:
$stra = 'Naujasis Salemas, Šiaurės Dakota';
$strb = 'Jungtinės Valstijos (Centras, Šiaurės Dakota)';
$regex = '
/^ # Anchor at start of string.
(?(?=.*\(.+,.*\)) # Condition to check for: presence of text in parenthesis.
.*\(([^,]+) # If condition matches, match inside parenthesis to first comma.
| ([^,]+) # Else match start of string to first comma.
)
/x
';
preg_match($regex, $stra, $matches) and print_r($matches);
/*
Array
(
[0] => Naujasis Salemas
[1] =>
[2] => Naujasis Salemas
)
*/
preg_match($regex, $strb, $matches) and print_r($matches);
/*
Array
(
[0] => Jungtinės Valstijos (Centras
[1] => Centras
)
*/
Note that the index in $matches changes slightly above, but you might be able to work around that using named subpatterns.
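Another way to keep the index stable, sketched here as an alternative to named subpatterns, is a branch-reset group (?|...), in which every alternative numbers its capture groups from the same starting point. The rewritten pattern below replaces the conditional with a lookahead inside the first alternative; it is an assumption-laden variant, not the pattern from above.

```php
<?php
$stra = 'Naujasis Salemas, Šiaurės Dakota';
$strb = 'Jungtinės Valstijos (Centras, Šiaurės Dakota)';

// (?| ... | ... ) : both alternatives store their capture in group 1.
$regex = '/^(?|(?=.*\(.+,.*\)).*\(([^,]+)|([^,]+))/';

preg_match($regex, $stra, $ma);
preg_match($regex, $strb, $mb);

echo $ma[1], "\n"; // text before the first comma
echo $mb[1], "\n"; // text between "(" and the first comma
```

With this variant the wanted text is always in index 1, whichever alternative matched.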
I think this one could do it:
[^-(]+(?=,)
This is the same regex as yours, but it doesn't allow an opening parenthesis in the matched string. It will still match on the first subject, and on the second it will match just after the opening parenthesis.
Try it here: http://ideone.com/Crhzz
You could use
[^(),]+(?=,)
That would match any text except commas or parentheses, followed by a comma.

Getting contents of square brackets with regex, including nested ones

Is there any way to have this:
[one[two]][three]
And extract this with a regex?
Array (
[0] => one[two]
[1] => two
[2] => three
For PHP you can use recursion in regular expressions that nearly gives you what you want:
$s = 'abc [one[two]][three] def';
$matches = array();
preg_match_all('/\[(?:[^][]|(?R))*\]/', $s, $matches);
print_r($matches);
Result:
Array
(
[0] => Array
(
[0] => [one[two]]
[1] => [three]
)
)
For something more advanced than this, it's probably best not to use regular expressions.
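That said, for this particular input the nested contents can still be reached by calling the same recursive pattern again on each captured content. The sketch below assumes balanced brackets; the function name bracketContents is made up.

```php
<?php
// Hypothetical helper: lists the content of every bracket pair, outer pairs before the pairs nested in them.
function bracketContents(string $s): array
{
    $out = [];
    // Group 1 is the content of each outermost pair; (?R) lets the pattern match nested pairs.
    preg_match_all('/\[((?:[^][]|(?R))*)\]/', $s, $m);
    foreach ($m[1] as $inner) {
        $out[] = $inner;
        // Look inside the content for further pairs.
        foreach (bracketContents($inner) as $nested) {
            $out[] = $nested;
        }
    }
    return $out;
}

print_r(bracketContents('[one[two]][three]'));
```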
You can apply the regex in a loop, for example:
1. Match all \[([^\[\]]*)\] (the innermost bracket pairs).
2. For each match, replace \x01 with [ and \x02 with ] and output the result.
3. Replace all of \[([^\[\]]*)\] with \x01$1\x02 (warning: assumes \x01 and \x02 are not used by the string).
4. Repeat from step 1 until there's no match.
But I'd write a string scanner for this problem :).
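The loop described above could be sketched in PHP roughly as follows. The character class here excludes both [ and ] so that only innermost pairs match on each round, and, as warned, the sketch assumes the bytes \x01 and \x02 never occur in the input.

```php
<?php
$s = '[one[two]][three]';
$inner = '/\[([^][]*)\]/'; // a bracket pair with no brackets inside it

$out = [];
while (preg_match_all($inner, $s, $m)) {
    foreach ($m[1] as $content) {
        // Restore the brackets that earlier rounds replaced with placeholders.
        $out[] = strtr($content, "\x01\x02", '[]');
    }
    // Hide the pairs just processed so the next round sees their parents as innermost.
    $s = preg_replace($inner, "\x01$1\x02", $s);
}

print_r($out);
```

The results come out innermost first (two, three, one[two]), so the order differs from the one asked for in the question.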
#!/usr/bin/perl
use Data::Dumper;
@a = ();
$re = qr/\[((?:[^][]|(??{$re}))*)\](?{push@a,$^N})/;
'[one[two]][three]' =~ /$re*/;
print Dumper \@a;
# $VAR1 = [
# 'two',
# 'one[two]',
# 'three'
# ];
Not exactly what you asked for, but it's kinda doable with (ir)regular expression extensions. (Perl 5.10's (?PARNO) can replace the usage of (??{CODE}).)
In Perl 5.10 regex, you can use named backtracking and a recursive subroutine to do that:
#!/usr/bin/perl
$re = qr /
( # start capture buffer 1
\[ # match an opening brace
( # capture buffer 2
(?: # match one of:
(?> # don't backtrack over the inside of this group
[^\[\]]+ # one or more non braces
) # end non backtracking group
| # ... or ...
(?1) # recurse to bracket 1 and try it again
)* # 0 or more times.
) # end buffer 2
\] # match a closing brace
) # end capture buffer one
/x;
print "\n\n";
sub strip {
my ($str) = @_;
while ($str=~/$re/g) {
$match=$1; $striped=$2;
print "$striped\n";
strip($striped) if $striped=~/\[/;
return $striped;
}
}
$str="[one[two]][three][[four]five][[[six]seven]eight]";
print "start=$str\n";
while ($str=~/$re/g) {
strip($1) ;
}
Output:
start=[one[two]][three][[four]five][[[six]seven]eight]
one[two]
two
three
[four]five
four
[[six]seven]eight
[six]seven
six
