Regular expression for matching between text - php

I have a file, which contains automatically generated statistical data from apache http logs.
I'm really struggling on how to match lines between 2 sections of text. This is a portion of the stat file I have:
jpg 6476 224523785 0 0
Unknown 31200 248731421 0 0
gif 197 408771 0 0
END_FILETYPES
# OS ID - Hits
BEGIN_OS 12
linuxandroid 1034
winlong 752
winxp 1320
win2008 204250
END_OS
# Browser ID - Hits
BEGIN_BROWSER 79
mnuxandroid 1034
winlong 752
winxp 1320
What I'm trying to do, is write a regex which will only search between the tags BEGIN_OS 12 and END_OS.
I want to create a PHP array that contains the OS and the hits, for example (I know the actual array won't actually be exactly like this, but as long as I have this data in it):
array(
[0] => array(
[0] => linuxandroid
[1] => winlong
[2] => winxp
[3] => win2008
)
[1] => array(
[0] => 1034
[1] => 752
[2] => 1320
[3] => 204250
)
)
I've been trying for a good couple of hours now with gskinner regex tester to test regular expressions, but regex is far from my strong point.
I would post what I've got so far, but I've tried loads, and the closest one I've got is:
^[BEGIN_OS\s12]+([a-zA-Z0-9]+)\s([0-9]+)
which is pathetically awful!
Any help would be appreciated, even if it's an 'It can't be done'.

A regular expression may not be the best tool for this job. You can use a regex to get the required substring and then do the further processing with PHP's string manipulation functions.
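$string = preg_replace('/^.*BEGIN_OS \d+\s*(.*?)\s*END_OS.*/s', '$1', $text);
$result = array(); // initialise before filling it in the loop
// split on any newline sequence, so the file's line endings don't have to match PHP_EOL
foreach (preg_split('/\R/', $string) as $line) {
    list($key, $value) = explode(' ', $line);
    $result[$key] = $value;
}
print_r($result);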
Should give you the following output:
Array
(
[linuxandroid] => 1034
[winlong] => 752
[winxp] => 1320
[win2008] => 204250
)

You might try something like:
/BEGIN_OS 12\s(?:([\w\d]+)\s([\d]+\s))*END_OS/gm
You'll still have to parse the match for your results. You may also simplify it with something like:
/BEGIN_OS 12([\s\S]*)END_OS/gm
And then just parse the first group (the text between them) and split on '\n' then ' ' to get the parts you desire.
Edit
Regexes with comments:
/BEGIN_OS 12 // Match "BEGIN_OS 12" exactly
\s // Match a whitespace character after
(?: // Begin a non-capturing group
([\w\d]+) // Match one or more word characters (\w already includes digits)
\s // Match a whitespace character
([\d]+\s) // Match one or more digits followed by a whitespace character
)* // End non-capturing group, repeat group 0 or more times
END_OS // Match "END_OS" exactly
/gm // global search (g) and multiline (m)
And the simple version:
/BEGIN_OS 12 // Match "BEGIN_OS 12" exactly
( // Begin group
[\s\S]* // Match any whitespace/non-whitespace character (works like the '.' but also matches newlines)
) // End group
END_OS // Match "END_OS" exactly
/gm // global search (g) and multiline (m)
Secondary Edit
Your attempt:
^[BEGIN_OS\s12]+([a-zA-Z0-9]+)\s([0-9]+)
Won't give you the results you expect. If you break it apart:
^ // Match the start of a line, without 'm' this means the beginning of the string.
[BEGIN_OS\s12]+ // This means: match any one of the characters [B, E, G, I, N, _, O, S, \s, 1, 2],
// one or more times. While this matches "BEGIN_OS 12",
// it also matches any other line that contains a combination of those
// characters, or just a line of whitespace (thanks to \s).
([a-zA-Z0-9]+) // This should match the part you expect, but potentially not with the previous rules in place.
\s
([0-9]+) // This is the same as [\d]+ or \d+ but should match what you expect (again, potentially not with the first rule)
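For PHP specifically, the "capture the block, then parse it" idea above could be sketched as follows. Note that PHP's preg_ functions do not accept the g modifier; preg_match_all() does the global search instead. The sample $text and the variable names $names and $hits are assumptions made for illustration, and the second pattern assumes each data line is "name, whitespace, number".

```php
<?php
// Sample input modelled on the stat file in the question (assumed layout).
$text = "gif 197 408771 0 0\nEND_FILETYPES\n# OS ID - Hits\nBEGIN_OS 12\n"
      . "linuxandroid 1034\nwinlong 752\nwinxp 1320\nwin2008 204250\nEND_OS\n"
      . "# Browser ID - Hits\nBEGIN_BROWSER 79\nwinxp 1320\n";

$names = [];
$hits  = [];

// Step 1: capture everything between the BEGIN_OS line and the END_OS line.
// m: ^ matches at the start of each line; s: the dot also matches newlines.
if (preg_match('/^BEGIN_OS \d+\R(.*?)^END_OS/ms', $text, $block)) {
    // Step 2: pull "name hits" pairs out of the captured block only.
    preg_match_all('/^(\S+)\h+(\d+)\r?$/m', $block[1], $m);
    $names = $m[1];
    $hits  = $m[2];
}

print_r([$names, $hits]);
```

Keeping the two steps separate means the pair-matching pattern never sees lines outside the OS section, so it does not need to know about BEGIN_BROWSER or any other block.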

Related

Regex for find value between curly braces which have pipe separator

$str = ({max_w} * {max_h} * {key|value}) / {key_1|value}
I have the above formula, I want to match the value with curly braces and which has a pipe separator. Right now the issue is it's giving me the values which have not pipe separator. I am new in regex so not have much idea about that. I tried below one
preg_match_all("^\{(|.*?|)\}^",$str, PREG_PATTERN_ORDER);
It gives below output
Array
(
[0] => key|value
[1] => max_w
[2] => max_h
[3] => key_1|value
)
Expected output
Array
(
[0] => key|value
[1] => key_1|value
)
Not sure about PHP. Here's the general regex that will do this.
{([^{}]*\|[^{}]*)}
You can use
(?<={)[^}]*\|[^}]*(?=})
For the given string the two matches are shown by the pointy characters:
({max_w} * {max_h} * {key|value}) / {key_1|value}
^^^^^^^^^ ^^^^^^^^^^^
(?<={) is a positive lookbehind. Arguably, the positive lookahead (?=}) is not needed if it is known that all braces appear in matching, non-overlapping pairs.
The pattern \{(|.*?|)\} has 2 alternations | that can be omitted, as the alternatives on the left and right of them are empty and so not really useful.
That leaves \{(.*?)} where the . can match any char including a pipe char, and therefore does not ensure that a pipe is present in between.
You can use a pattern that does not cross a curly brace or a pipe char, to match a single pipe in between.
{\K[^{}|]*\|[^{}|]*(?=})
{ Match opening {
\K Forget what is matched until now
[^{}|]* Match any char except the listed
\| Match a | char
[^{}|]* Match any char except the listed
(?=}) Assert a closing } to the right
$str = "({max_w} * {max_h} * {key|value}) / {key_1|value}";
$pattern = "/{\K[^{}|]*\|[^{}|]*(?=})/";
preg_match_all($pattern, $str, $matches);
print_r($matches[0]);
Output
Array
(
[0] => key|value
[1] => key_1|value
)
Or using a capture group:
{([^{}|]*\|[^{}|]*)}
Regex demo
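A minimal PHP sketch of the capture-group variant, for comparison with the \K version above. The braces are escaped here only for readability; the values end up in $matches[1] rather than $matches[0].

```php
<?php
$str = "({max_w} * {max_h} * {key|value}) / {key_1|value}";

// Group 1 holds the text between the braces, which must contain exactly one pipe.
preg_match_all('/\{([^{}|]*\|[^{}|]*)\}/', $str, $matches);

// $matches[0] has the full matches including braces; $matches[1] has the contents.
print_r($matches[1]);
```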

PHP Regex to interpret a string as a command line attributes/options

Let's say I have a string of
"Insert Post -title Some PostTitle -category 2 -date-posted 2013-02:02 10:10:10"
What I've been trying to do is convert this string into actions. The string is very readable, and what I'm trying to achieve is to make posting a little easier instead of navigating to new pages every time. I'm okay with how the actions are going to work, but I've had many failed attempts to process the string the way I want. I simply want the values after the attributes (options) to be put into an array, or just extracted so I can deal with them the way I want.
The string above should give me an array of keys => values, e.g.
$Processed = [
'title'=> 'Some PostTitle',
'category'=> '2',
....
];
Getting processed data like this is what I'm looking for.
I've been trying to write a regex for this, but with no luck.
For example this:
/\-(\w*)\=?(.+)?/
That should be close enough to what I want.
Note the spaces in the title and date, and that some values can have dashes as well. Maybe I can add a list of allowed attributes:
$AllowedOptions = ['-title','-category',...];
I'm just not good at this and would like your help!
Appreciated!
You can use this lookahead based regex to match your name-value pairs:
/-(\S+)\h+(.*?(?=\h+-|$))/
RegEx Breakup:
- # match a literal hyphen
(\S+) # match 1 or more of any non-whitespace char and capture it as group #1
\h+ # match 1 or more of any horizontal whitespace char
( # capture group #2 start
.*? # match 0 or more of any char (non-greedy)
(?=\h+-|$) # lookahead to assert next char is 1+ space and - or it is end of line
) # capture group #2 end
PHP Code:
$str = 'Insert Post -title Some PostTitle -category 2 -date-posted 2013-02:02 10:10:10';
if (preg_match_all('/-(\S+)\h+(.*?(?=\h+-|$))/', $str, $m)) {
$output = array_combine ( $m[1], $m[2] );
print_r($output);
}
Output:
Array
(
[title] => Some PostTitle
[category] => 2
[date-posted] => 2013-02:02 10:10:10
)
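The question also mentions a list of allowed options and values that may contain dashes. One possible variation, offered only as a sketch: build the option names into the pattern so that a value ends only at the next known option. The $allowed list and the sample title "Some Post-Title" are assumptions for illustration.

```php
<?php
$str = 'Insert Post -title Some Post-Title -category 2 -date-posted 2013-02:02 10:10:10';

// Hypothetical whitelist of option names (without the leading hyphen).
$allowed = ['title', 'category', 'date-posted'];

// Escape each name and join them into one alternation.
$names = implode('|', array_map(function ($name) {
    return preg_quote($name, '/');
}, $allowed));

// A value runs until the next " -<allowed name> " or the end of the string.
$pattern = '/(?<!\S)-(' . $names . ')\h+(.*?)(?=\h+-(?:' . $names . ')\h|\h*$)/';

$output = [];
if (preg_match_all($pattern, $str, $m)) {
    $output = array_combine($m[1], $m[2]);
}
print_r($output);
```

Because the lookahead only stops at a whitelisted name, a hyphen inside a value (as in "Post-Title" or the date) does not end the value early.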

Matching all characters except spaces in regex

Right now I have a regex, and I want to change one part of the regex.
(.{3,}?) ~
^---This part of the code, where it says, (any characters that are 3 or more in length, and matches up to the nearest space), I want to change it to (any characters, except spaces , that are 3 or more in length, and matches up to the nearest space). How would I say that in regex?
$text = "my name is to habert";
$regex = "~(?:my name is |my name\\\'s |i am |i\\\'m |it is |it\\\'s |call me )?(.{3,}?) ~i";
preg_match($regex, $text, $match);
print_r($match);
Result:
Array ( [0] => my name [1] => my name )
Need Result:
Array ( [0] => name [1] => name )
Gravedigger here... Since this question does not have an answer yet, I'll post mine.
(\S{3,}) will work for your needs
Regex Explanation:
( Open capture group
\S Everything but whitespaces (same as [^\s], you can use [^ ] too, but the latter works only for spaces.)
{3,} Must contain three or more characters
) Close capture group
Test it here!
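Dropped into the pattern from the question, the change might look like this. It is only a sketch: the pattern is written in single quotes here so that the backslashes need less escaping.

```php
<?php
$text = "my name is to habert";

// Same optional prefixes as in the question, but the captured word is now
// three or more non-space characters followed by a space.
$regex = '~(?:my name is |my name\'s |i am |i\'m |it is |it\'s |call me )?(\S{3,}) ~i';

preg_match($regex, $text, $match);
print_r($match);
```

Here the prefix "my name is " cannot be used, because the word after it ("to") is too short, so the engine falls back to the first word of three or more characters that is followed by a space. Note that $match[0] still includes the trailing space; $match[1] is the word on its own.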

Split string on non-alphanumeric characters and on positions between digits and non-digits

I'm trying to split a string by non-alphanumeric delimiting characters AND between alternations of digits and non-digits. The end result should be a flat array consisting of alphabetic strings and numeric strings.
I'm working in PHP, and would like to use REGEX.
Examples:
ES-3810/24MX should become ['ES', '3810', '24', 'MX']
CISCO1538M should become ['CISCO' , '1538', 'M']
The input can start with either a DIGIT sequence or an ALPHA sequence.
The separators can be non-ALPHA and non-DIGIT chars, as well as a change from a DIGIT sequence to an ALPHA sequence, and vice versa.
The function to match all occurrences of a regex is preg_match_all(), which fills a multidimensional array of results. The regex is very simple: any digit ([0-9]) one or more times (+), or (|) any letter ([[:upper:][:lower:]]) one or more times (+). The POSIX classes [:upper:] and [:lower:] cover all uppercase and lowercase letters; a range such as [A-z] would also admit the punctuation characters that sit between Z and a in ASCII.
The textarea and php tags are included for convenience, so you can drop this into your php file and see the results.
<textarea style="width:400px; height:400px;">
<?php
foreach( array(
"ES-3810/24MX",
"CISCO1538M",
"123ABC-ThatsHowEasy"
) as $string ){
// get all matches into an array
preg_match_all("/[0-9]+|[[:upper:][:lower:]]+/",$string,$matches);
// it is the 0th match that you are interested in...
print_r( $matches[0] );
}
?>
</textarea>
Which outputs in the textarea:
Array
(
[0] => ES
[1] => 3810
[2] => 24
[3] => MX
)
Array
(
[0] => CISCO
[1] => 1538
[2] => M
)
Array
(
[0] => 123
[1] => ABC
[2] => ThatsHowEasy
)
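One caveat on letter ranges, shown as a small sketch: [A-z] is not equivalent to [A-Za-z], because the ASCII range from A to z also contains the characters [ \ ] ^ _ and the backtick. The sample string is made up for illustration.

```php
<?php
$sample = 'AB_cd^ef';

// [A-z] spans ASCII 65..122, so "_" and "^" count as matches too.
preg_match_all('/[A-z]+/', $sample, $wide);

// [A-Za-z] matches letters only.
preg_match_all('/[A-Za-z]+/', $sample, $strict);

print_r($wide[0]);   // one run: the whole string
print_r($strict[0]); // three runs of letters
```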
$str = "ES-3810/24MX35 123 TEST 34/TEST";
$str = preg_replace(array("#[^A-Z0-9]+#i","#\s+#","#([A-Z])([0-9])#i","#([0-9])([A-Z])#i"),array(" "," ","$1 $2","$1 $2"),$str);
echo $str;
$data = explode(" ",$str);
print_r($data);
I could not think of a cleaner way.
The most direct preg_ function to produce the desired flat output array is preg_split().
Because it doesn't matter what combination of alphanumeric characters are on either side of a sequence of non-alphanumeric characters, you can greedily split on non-alphanumeric substrings without "looking around".
After that preliminary obstacle is dealt with, then split on the zero-length positions between a digit and a non-digit OR between a non-digit and a digit.
/ #starting delimiter
[^a-z\d]+ #match one or more non-alphanumeric characters
| #OR
\d\K(?=\D) #match a number, then forget it, then lookahead for a non-number
| #OR
\D\K(?=\d) #match a non-number, then forget it, then lookahead for a number
/ #ending delimiter
i #case-insensitive flag
Code: (Demo)
var_export(
preg_split('/[^a-z\d]+|\d\K(?=\D)|\D\K(?=\d)/i', $string, 0, PREG_SPLIT_NO_EMPTY)
);
preg_match_all() isn't a silly technique, but it doesn't return the array, it returns the number of matches and generates a reference variable containing a two dimensional array of which the first element needs to be accessed. Admittedly, the pattern is shorter and easier to follow. (Demo)
var_export(
preg_match_all('/[a-z]+|\d+/i', $string, $m) ? $m[0] : []
);
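The snippets above leave $string undefined. A self-contained sketch that runs both techniques over the two example inputs from the question (the $inputs array and the result variable names are assumptions for illustration):

```php
<?php
$inputs = ['ES-3810/24MX', 'CISCO1538M'];

$bySplit = [];
$byMatch = [];
foreach ($inputs as $string) {
    // Split on runs of non-alphanumerics, or at digit/non-digit boundaries.
    $bySplit[$string] = preg_split(
        '/[^a-z\d]+|\d\K(?=\D)|\D\K(?=\d)/i',
        $string,
        0,
        PREG_SPLIT_NO_EMPTY
    );

    // Alternative: match the runs of letters or digits directly.
    $byMatch[$string] = preg_match_all('/[a-z]+|\d+/i', $string, $m) ? $m[0] : [];
}

var_export($bySplit);
```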

How to get all captures of subgroup matches with preg_match_all()? [duplicate]

This question already has answers here:
Get repeated matches with preg_match_all()
(6 answers)
Closed 4 years ago.
Update/Note:
I think what I'm probably looking for is to get the captures of a group in PHP.
Referenced: PCRE regular expressions using named pattern subroutines.
(Read carefully:)
I have a string that contains a variable number of segments (simplified):
$subject = 'AA BB DD '; // could be 'AA BB DD CC EE ' as well
I would like now to match the segments and return them via the matches array:
$pattern = '/^(([a-z]+) )+$/i';
$result = preg_match_all($pattern, $subject, $matches);
This will only return the last match for the capture group 2: DD.
Is there a way that I can retrieve all subpattern captures (AA, BB, DD) with one regex execution? Isn't preg_match_all suitable for this?
This question is a generalization.
Both the $subject and $pattern are simplified. Naturally, such a general list of AA, BB, ... is much easier to extract with other functions (e.g. explode) or with a variation of the $pattern.
But I'm specifically asking how to return all of the subgroup matches with the preg_...-family of functions.
For a real life case imagine you have multiple (nested) level of a variant amount of subpattern matches.
Example
This is an example in pseudo code to describe a bit of the background. Imagine the following:
Regular definitions of tokens:
CHARS := [a-z]+
PUNCT := [.,!?]
WS := [ ]
$subject gets tokenized based on these. The tokenization is stored inside an array of tokens (type, offset, ...).
That array is then transformed into a string, containing one character per token:
CHARS -> "c"
PUNCT -> "p"
WS -> "s"
So that it's now possible to run regular expressions based on tokens (and not character classes etc.) on the token stream string index. E.g.
regex: (cs)?cp
to express one or more group of chars followed by a punctuation.
As I now can express self-defined tokens as regex, the next step was to build the grammar. This is only an example, this is sort of ABNF style:
words = word | (word space)+ word
word = CHARS+
space = WS
punctuation = PUNCT
If I now compile the grammar for words into a (token) regex I would like to have naturally all subgroup matches of each word.
words = (CHARS+) | ( (CHARS+) WS )+ (CHARS+) # words resolved to tokens
words = (c+)|((c+)s)+c+ # words resolved to regex
I could code until this point. Then I ran into the problem that the sub-group matches did only contain their last match.
So I have the option to either create an automaton for the grammar on my own (which I would like to avoid, to keep the grammar expressions generic) or to somehow make preg_match work for me so I can spare myself that.
That's basically all. Probably now it's understandable why I simplified the question.
Related:
pcrepattern man page
Get repeated matches with preg_match_all()
Similar thread: Get repeated matches with preg_match_all()
Check the chosen answer plus mine might be useful I will duplicate there:
From http://www.php.net/manual/en/regexp.reference.repetition.php :
When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration.
I personally gave up and am going to do this in 2 steps.
EDIT:
I see that in the other thread someone claimed that a lookbehind method is able to do it.
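The two-step approach mentioned above might be sketched like this, using the simplified $subject and $pattern from the question. The variable names $whole and $segments are assumptions for illustration.

```php
<?php
$subject = 'AA BB DD ';

// Step 1: validate the overall shape. The repeated group 2 keeps only
// the substring matched by its final iteration.
$ok = preg_match('/^(([a-z]+) )+$/i', $subject, $whole);

// Step 2: run a second, unanchored pattern over the validated text
// to collect every segment.
$segments = [];
if ($ok) {
    preg_match_all('/[a-z]+/i', $whole[0], $segments);
}

print_r($whole[2]);    // last iteration only
print_r($segments[0]); // all segments
```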
Try this:
preg_match_all("'[^ ]+'i",$text,$n);
$n[0] will contain an array of all non-space character groups in the text.
Edit: with subgroups:
preg_match_all("'([^ ]+)'i",$text,$n);
Now $n[1] will contain the subgroup matches, that are exactly the same as $n[0]. This is pointless actually.
Edit2: nested subgroups example:
$test = "Hello I'm Joe! Hi I'm Jane!";
preg_match_all("/(H(ello|i)) I'm (.*?)!/i",$test,$n);
And the result:
Array
(
[0] => Array
(
[0] => Hello I'm Joe!
[1] => Hi I'm Jane!
)
[1] => Array
(
[0] => Hello
[1] => Hi
)
[2] => Array
(
[0] => ello
[1] => i
)
[3] => Array
(
[0] => Joe
[1] => Jane
)
)
Is there a way that I can retrieve all matches (AA, BB, DD) with one regex execution? Isn't preg_match_all not suitable for this?
Your current regex seems to be for a preg_match() call. Try this instead:
$pattern = '/[a-z]+/i';
$result = preg_match_all($pattern, $subject, $matches);
Per comments, the ruby regex I mentioned:
sentence = %r{
(?<subject> cat | dog ){0}
(?<verb> eats | drinks ){0}
(?<object> water | bones ){0}
(?<adjective> big | smelly ){0}
(?<obj_adj> (\g<adjective>\s)? ){0}
The\s\g<obj_adj>\g<subject>\s\g<verb>\s\g<obj_adj>\g<object>
}x
md = sentence.match("The cat drinks water");
md = sentence.match("The big dog eats smelly bones");
But I think you'll need a lexer/parser/tokenizer to do the same kind of thing in PHP. :-|
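For what it's worth, PCRE (and therefore PHP's preg_ functions) has a comparable feature: groups declared inside (?(DEFINE)...) and called with (?&name). A rough sketch of the same sentence grammar follows; the ^ and $ anchors are an addition for illustration. It does not solve the original problem, though: values captured inside a subroutine call are not kept after the call returns, so the individual words still cannot be read back from the match array this way.

```php
<?php
// x modifier: whitespace inside the pattern is ignored.
$sentence = '/
    (?(DEFINE)
        (?<subject>   cat   | dog    )
        (?<verb>      eats  | drinks )
        (?<object>    water | bones  )
        (?<adjective> big   | smelly )
        (?<obj_adj>   (?:(?&adjective)\s)? )
    )
    ^The\s(?&obj_adj)(?&subject)\s(?&verb)\s(?&obj_adj)(?&object)$
/x';

$first  = preg_match($sentence, 'The cat drinks water');
$second = preg_match($sentence, 'The big dog eats smelly bones');
$third  = preg_match($sentence, 'The cat barks water');

var_dump($first, $second, $third);
```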
You can't extract the subpatterns because the way you wrote your regex returns only one match (using ^ and $ at the same time, and + on the main pattern).
If you write it this way, you'll see that your subgroups are correctly there:
$pattern = '/(([a-z]+) )/i';
(this still has an unnecessary set of parentheses, I just left it there for illustration)
Edit
I didn't realize what you had originally asked for. Here is the new solution:
$result = preg_match_all('/[a-z]+/i', $subject, $matches);
$resultArr = ($result) ? $matches[0] : array();
How about:
$str = 'AA BB CC';
$arr = preg_split('/\s+/', $str);
print_r($arr);
output:
(
[0] => AA
[1] => BB
[2] => CC
)
I may have misunderstood what you're describing. Are you just looking for a pattern for groups of letters with whitespace between?
// any subject containing words:
$subject = 'AfdfdfdA BdfdfdB DdD';
$subject = 'AA BB CC';
$subject = 'Af df dfdA Bdf dfdB DdD';
$pattern = '/(([a-z]+)\s)+[a-z]+/i';
$result = preg_match_all($pattern, $subject, $matches);
print_r($matches);
echo "<br/>";
print_r($matches[0]); // this matches $subject
echo "<br/>".$result;
Yes, you're right: the solution is to use preg_match_all(). It searches globally, restarting after each match, so don't anchor the pattern with ^ (start) and $ (end); that way preg_match_all() puts every match it finds into the array.
Each new pair of parentheses adds a new array holding that group's matches.
Use ? for optional matches.
You can separate different groups of patterns with parentheses () so that each group is found and added to its own array (which lets you count the matches, or categorize them from the returned array).
Clarification required
Let me try to understand your question, so that my answer matches what you ask.
Your $subject is not a good example of what you're looking for?
Would you like the preg_match search to split what you provided in $subject into 4 categories: words, characters, punctuation and white space? And what about numbers?
Would you also like the returned matches to include the offsets of the matches?
Does $subject = 'aa.bb cc.dd EE FFF,GG'; better fit a real-life example?
I will take your basic example in $subject and make it work to give you exactly what you asked for.
So can you edit your $subject so that it better fits all the cases that you want to match?
Original '/^(([a-z]+) )+$/i';
Keep me posted,
you can test your regexes here http://www.spaweditor.com/scripts/regex/index.php
Partial answer
/([a-z])([a-z]+)/i
AA BB DD CD
Array
(
[0] => Array
(
[0] => AA
[1] => BB
[2] => DD
[3] => CD
)
[1] => Array
(
[0] => A
[1] => B
[2] => D
[3] => C
)
[2] => Array
(
[0] => A
[1] => B
[2] => D
[3] => D
)
)
