There are quite a few questions on SO asking how to parse a regex pattern and output all possible matches to that pattern. For some reason, though, every single one of them I can find (1, 2, 3, 4, 5, 6, 7, probably more) is either for Java or some variety of C (and just one for JavaScript), and I currently need to do this in PHP.
I’ve Googled to my heart’s (dis)content, but whatever I do, pretty much the only thing that Google gives me is links to the docs for preg_match() and pages about how to use regex, which is the opposite of what I want here.
My regex patterns are all very simple and guaranteed to be finite; the only syntax used is:
[] for character classes
() for subgroups (capturing not required)
| (pipe) for alternative matches within subgroups
? for zero-or-one matches
So an example might be [ct]hun(k|der)(s|ed|ing)? to match all possible forms of the verbs chunk, thunk, chunder and thunder, for a total of sixteen permutations.
Ideally, there’d be a library or tool for PHP which will iterate through (finite) regex patterns and output all possible matches, all ready to go. Does anyone know if such a library/tool already exists?
If not, what is an optimised way to approach making one? This answer for JavaScript is the closest I’ve been able to find to something I should be able to adapt, but unfortunately I just can’t wrap my head around how it actually works, which makes adapting it more tricky. Plus there may well be better ways of doing it in PHP anyway. Some logical pointers as to how the task would best be broken down would be greatly appreciated.
Edit: Since apparently it wasn’t clear how this would look in practice, I am looking for something that will allow this type of input:
$possibleMatches = parseRegexPattern('[ct]hun(k|der)(s|ed|ing)?');
– and printing $possibleMatches should then give something like this (the order of the elements is not important in my case):
Array
(
[0] => chunk
[1] => thunk
[2] => chunks
[3] => thunks
[4] => chunked
[5] => thunked
[6] => chunking
[7] => thunking
[8] => chunder
[9] => thunder
[10] => chunders
[11] => thunders
[12] => chundered
[13] => thundered
[14] => chundering
[15] => thundering
)
Method
You need to strip out the variable patterns; you can use preg_match_all to do this
preg_match_all("/(\[\w+\]|\([\w|]+\))/", '[ct]hun(k|der)(s|ed|ing)?', $matches);
/* Regex:
/(\[\w+\]|\([\w|]+\))/
/ : Pattern delimiter
( : Start of capture group
\[\w+\] : Character class pattern
| : OR operator
\([\w|]+\) : Capture group pattern
) : End of capture group
/ : Pattern delimiter
*/
You can then expand the capture groups to letters or words (depending on type)
$array = str_split($cleanString, 1); // For a character class
$array = explode("|", $cleanString); // For a capture group
Recursively work your way through each $array
Code
function printMatches($pattern, $array, $matchPattern)
{
$currentArray = array_shift($array);
foreach ($currentArray as $option) {
$patternModified = preg_replace($matchPattern, $option, $pattern, 1);
if (!count($array)) {
echo $patternModified, PHP_EOL;
} else {
printMatches($patternModified, $array, $matchPattern);
}
}
}
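function prepOptions($matches)
{
    $possibilities = [];
    foreach ($matches as $match) {
        // Strip the brackets and the optional marker, leaving only the options
        $cleanString = preg_replace("/[\[\]\(\)\?]/", "", $match);
        if ($match[0] === "[") {
            $array = str_split($cleanString, 1);   // Character class: one option per character
        } elseif ($match[0] === "(") {
            $array = explode("|", $cleanString);   // Group: one option per alternative
        }
        if ($match[-1] === "?") {
            $array[] = "";                         // Optional: "nothing" is also an option
        }
        $possibilities[] = $array;
    }
    return $possibilities;
}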
$regex = '[ct]hun(k|der)(s|ed|ing)?';
$matchPattern = "/(\[\w+\]|\([\w|]+\))\??/";
preg_match_all($matchPattern, $regex, $matches);
printMatches(
$regex,
prepOptions($matches[0]),
$matchPattern
);
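The question asked for a function that returns the matches rather than printing them. Here is a minimal sketch of the same idea that does that: it finds the variable parts, splits out the literal text between them, and builds up every combination. The name parseRegexPattern comes from the question; everything else is my own illustration, and it assumes non-nested groups only.

```php
<?php
// Sketch only: supports [..] classes and non-nested (..|..) groups,
// each optionally followed by "?". Not a general regex parser.
function parseRegexPattern(string $pattern): array
{
    $token = '/(?:\[\w+\]|\([\w|]+\))\??/';
    preg_match_all($token, $pattern, $matches);   // the variable parts
    $literals = preg_split($token, $pattern);     // the fixed text around them

    $results = [$literals[0]];
    foreach ($matches[0] as $i => $match) {
        // Drop the trailing "?" (if any) and the surrounding brackets
        $inner = substr(rtrim($match, '?'), 1, -1);
        $options = $match[0] === '['
            ? str_split($inner)        // character class: one option per character
            : explode('|', $inner);    // group: one option per alternative
        if (substr($match, -1) === '?') {
            $options[] = '';           // optional: "nothing" is also an option
        }
        // Append every option (plus the literal text that follows) to every result so far
        $next = [];
        foreach ($results as $prefix) {
            foreach ($options as $option) {
                $next[] = $prefix . $option . $literals[$i + 1];
            }
        }
        $results = $next;
    }
    return $results;
}

$possibleMatches = parseRegexPattern('[ct]hun(k|der)(s|ed|ing)?');
print_r($possibleMatches);
```

Building the list iteratively avoids re-running preg_replace on the pattern for every option, but the order of the results differs from the example array in the question.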
Additional functionality
Expanding nested groups
In practice, you would run this before the preg_match_all() call.
$regex = 'This happen(s|ed) to (be(come)?|hav(e|ing)) test case 1?';
echo preg_replace_callback("/(\(|\|)(\w+)(?:\(([\w\|]+)\)\??)/", function($array){
$output = explode("|", $array[3]);
if ($array[0][-1] === "?") {
$output[] = "";
}
foreach ($output as &$option) {
$option = $array[2] . $option;
}
return $array[1] . implode("|", $output);
}, $regex), PHP_EOL;
Output:
This happen(s|ed) to (become|be|have|having) test case 1?
Matching single letters
The bones of this would be to update the regex:
$matchPattern = "/(?:(\[\w+\]|\([\w|]+\))\??|(\w\?))/";
and add an else to the prepOptions function:
} else {
$array = [$cleanString];
}
Full working example
function printMatches($pattern, $array, $matchPattern)
{
$currentArray = array_shift($array);
foreach ($currentArray as $option) {
$patternModified = preg_replace($matchPattern, $option, $pattern, 1);
if (!count($array)) {
echo $patternModified, PHP_EOL;
} else {
printMatches($patternModified, $array, $matchPattern);
}
}
}
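function prepOptions($matches)
{
    $possibilities = [];
    foreach ($matches as $match) {
        // Strip the brackets and the optional marker, leaving only the options
        $cleanString = preg_replace("/[\[\]\(\)\?]/", "", $match);
        if ($match[0] === "[") {
            $array = str_split($cleanString, 1);   // Character class: one option per character
        } elseif ($match[0] === "(") {
            $array = explode("|", $cleanString);   // Group: one option per alternative
        } else {
            $array = [$cleanString];               // Single optional character
        }
        if ($match[-1] === "?") {
            $array[] = "";                         // Optional: "nothing" is also an option
        }
        $possibilities[] = $array;
    }
    return $possibilities;
}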
$regex = 'This happen(s|ed) to (be(come)?|hav(e|ing)) test case 1?';
$matchPattern = "/(?:(\[\w+\]|\([\w|]+\))\??|(\w\?))/";
$regex = preg_replace_callback("/(\(|\|)(\w+)(?:\(([\w\|]+)\)\??)/", function($array){
$output = explode("|", $array[3]);
if ($array[0][-1] === "?") {
$output[] = "";
}
foreach ($output as &$option) {
$option = $array[2] . $option;
}
return $array[1] . implode("|", $output);
}, $regex);
preg_match_all($matchPattern, $regex, $matches);
printMatches(
$regex,
prepOptions($matches[0]),
$matchPattern
);
Output:
This happens to become test case 1
This happens to become test case
This happens to be test case 1
This happens to be test case
This happens to have test case 1
This happens to have test case
This happens to having test case 1
This happens to having test case
This happened to become test case 1
This happened to become test case
This happened to be test case 1
This happened to be test case
This happened to have test case 1
This happened to have test case
This happened to having test case 1
This happened to having test case
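If the patterns can nest more than one level deep, an alternative is to parse the pattern properly instead of rewriting it with regexes. Below is a rough recursive-descent sketch, not a drop-in replacement for the code above. The function names are made up for illustration, and it assumes a well-formed pattern that uses only literals, [], (), | and ?.

```php
<?php
// Sketch of a recursive-descent expander (hypothetical function names).
// Assumes a well-formed pattern; there is no error handling for missing brackets.

// alternation := sequence ('|' sequence)*
function expandAlternation(string $p, int &$i): array
{
    $results = expandSequence($p, $i);
    while ($i < strlen($p) && $p[$i] === '|') {
        $i++; // skip '|'
        $results = array_merge($results, expandSequence($p, $i));
    }
    return $results;
}

// sequence := (atom '?'?)*  with  atom := '(' alternation ')' | '[' chars ']' | char
function expandSequence(string $p, int &$i): array
{
    $results = [''];
    while ($i < strlen($p) && $p[$i] !== '|' && $p[$i] !== ')') {
        if ($p[$i] === '(') {
            $i++; // skip '('
            $options = expandAlternation($p, $i);
            $i++; // skip ')'
        } elseif ($p[$i] === '[') {
            $end = strpos($p, ']', $i);
            $options = str_split(substr($p, $i + 1, $end - $i - 1));
            $i = $end + 1;
        } else {
            $options = [$p[$i]]; // plain literal character
            $i++;
        }
        if ($i < strlen($p) && $p[$i] === '?') {
            $options[] = ''; // optional atom: "nothing" is also an option
            $i++;
        }
        // Combine every result so far with every option for this atom
        $next = [];
        foreach ($results as $prefix) {
            foreach ($options as $option) {
                $next[] = $prefix . $option;
            }
        }
        $results = $next;
    }
    return $results;
}

function expandPattern(string $pattern): array
{
    $i = 0;
    return array_values(array_unique(expandAlternation($pattern, $i)));
}

print_r(expandPattern('This happen(s|ed) to (be(come)?|hav(e|ing)) test case 1?'));
```

Because each function consumes its own part of the pattern and recurses for groups, nesting depth is no longer a special case.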
Related
I have tags in a html file like this, placed throughout;
*|SUBJECT|*
*|SUBJECT|*
*|IFNOT:ARCHIVE_PAGE|*
*|ARCHIVE|*
*|END:IF|*
*|FACEBOOK:PROFILEURL|*
*|TWITTER:PROFILEURL|*
*|FORWARD|*
*|IF:REWARDS*
*|REWARDS|*
*|END:IF|*
Using this PHP function and regex I can get the results of all the tags
preg_match_all("/\*\|(.*?)\|\*/", $this->template, $elements);
$this->elements["Tags"] = $elements[0];
$this->elements["TagNames"] = $elements[1];
What I want is to find a way to capture the IF:(TAG) and IFNOT:(TAG) statements as well, along with their content.
What I have so far is
regex => /\*\|IF(([A-Z{0-3}]):([A-Z_]+))\|\*(.*?)\*\|END:IF\|\*|\*\|(.*?)\|\*/g
But it only catches the tags themselves as a whole. Can anyone point me in the right direction or help me out?
As I mentioned in the comments, your approach is too simplistic. I can get you started using the method I use for these things. It's more of a tokenizer/lexer/parser methodology.
That sounds big and scary, but it actually makes it simpler.
<?php
function parse($subject, $tokens)
{
$types = array_keys($tokens);
$patterns = [];
$lexer_stream = [];
$result = false;
foreach ($tokens as $k=>$v){
$patterns[] = "(?P<$k>$v)";
}
$pattern = "/".implode('|', $patterns)."/i";
if (preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE)) {
//print_r($matches);
foreach ($matches[0] as $key => $value) {
$match = [];
foreach ($types as $type) {
$match = $matches[$type][$key];
if (is_array($match) && $match[1] != -1) {
break;
}
}
$tok = [
'content' => $match[0],
'type' => $type,
'offset' => $match[1]
];
$lexer_stream[] = $tok;
}
$result = parseTokens( $lexer_stream );
}
return $result;
}
function parseTokens( array &$lexer_stream ){
$result = [];
$mode = 'none';
while($current = current($lexer_stream)){
$content = $current['content'];
$type = $current['type'];
switch($type){
case 'T_WHITESPACE':
next($lexer_stream);
break;
case 'T_TAG_START':
$mode = 'start';
next($lexer_stream);
break;
case 'T_WORD':
if($mode == 'start') echo "Tag $content\n";
if($mode == 'ifnot') echo "IfNot $content\n";
next($lexer_stream);
break;
case 'T_TAG_END':
$mode = 'none';
next($lexer_stream);
break;
case 'T_IFNOT':
$mode = 'ifnot';
next($lexer_stream);
break;
case 'T_EOF': return;
case 'T_UNKNOWN':
default:
print_r($current);
trigger_error("Unknown token $type value $content", E_USER_ERROR);
}
}
if( !$current ) return;
print_r($current);
trigger_error("Unclosed item $mode for $type value $content", E_USER_ERROR);
}
$subject = '*|SUBJECT|*
*|SUBJECT|*
*|IFNOT:ARCHIVE_PAGE|*
*|ARCHIVE|*
*|END:IF|*
*|FACEBOOK:PROFILEURL|*
*|TWITTER:PROFILEURL|*
*|FORWARD|*
*|IF:REWARDS*
*|REWARDS|*
*|END:IF|*';
$tokens = [
'T_WHITESPACE' => '[\r\n\s\t]+',
'T_TAG_START' => '\*\|',
'T_TAG_END' => '\|\*',
'T_IF' => 'IF:',
'T_IFNOT' => 'IFNOT:',
'T_ENDIF' => 'END:IF',
'T_WORD' => '\w+',
'T_EOF' => '\Z',
'T_UNKNOWN' => '.+?'
];
parse($subject,$tokens);
Running this outputs:
Tag SUBJECT
Tag SUBJECT
IfNot ARCHIVE_PAGE
Tag ARCHIVE
Array
(
[content] => END:IF
[type] => T_ENDIF
[offset] => 69
)
<br />
<b>Fatal error</b>: Unknown token T_ENDIF value END:IF in <b>[...][...]</b> on line <b>67</b><br />
The error is because I only worked it out to the End if tag (have to leave something for you to do).
You can find a parser that I wrote with this approach for another question on my GitHub:
https://github.com/ArtisticPhoenix/MISC/blob/master/JasonDecoder.php
It should give you some ideas on how to handle nested array-like structures, etc.
The basic idea is that you can add one tag at a time, and then do the parsing for that one tag. You can do as much or as little error checking as you like for mode types in the wrong place, and so on. It just gives everything a nice structure to work from. Essentially, it works the same basic way your method works, using preg_match_all and a string of regexes. The main difference is that it builds the full regex from an array, and then, using the array keys and named capture groups (and a bit of code magic), it lets you reference them in a more intuitive way. It also uses the PREG_OFFSET_CAPTURE flag, which I found to be faster than the other flags.
One note is that the order of the tags is important: if you put the T_UNKNOWN tag first, it matches everything, so it won't go to the tags below it. Therefore, the more specific the match, the higher in the list it should be. For example, you could do a tag like this
'T_IFNOT' => '\*\|IFNOT:',
Instead of the one I have, but it would likely have to go before the:
'T_TAG_START' => '\*\|',
Because that tag will match it first.
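To make the ordering point concrete, here is a small sketch that builds the combined pattern the same way parse() does and reports which token type wins at the start of a string. The helper name firstTokenType and the two token lists are made up for this illustration.

```php
<?php
// Hypothetical helper: returns the name of the token that matches first.
function firstTokenType(string $subject, array $tokens): string
{
    $parts = [];
    foreach ($tokens as $name => $regex) {
        $parts[] = "(?P<$name>$regex)"; // one named group per token
    }
    $pattern = '/' . implode('|', $parts) . '/';
    preg_match($pattern, $subject, $match, PREG_OFFSET_CAPTURE);
    foreach ($tokens as $name => $regex) {
        // Groups that did not take part are either missing or have offset -1
        if (isset($match[$name]) && $match[$name][1] !== -1) {
            return $name;
        }
    }
    return 'none';
}

$specificFirst = [
    'T_IFNOT'     => '\*\|IFNOT:',
    'T_TAG_START' => '\*\|',
    'T_WORD'      => '\w+',
];
$genericFirst = [
    'T_TAG_START' => '\*\|',
    'T_IFNOT'     => '\*\|IFNOT:',
    'T_WORD'      => '\w+',
];

echo firstTokenType('*|IFNOT:ARCHIVE_PAGE|*', $specificFirst), PHP_EOL;
echo firstTokenType('*|IFNOT:ARCHIVE_PAGE|*', $genericFirst), PHP_EOL;
```

With the specific token listed first, the IFNOT opener is recognised as a single token. With the generic one first, the alternation stops at the plain tag start and the rest is left for later tokens.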
Also don't forget to put next($lexer_stream); or it will be an infinite loop. It's necessary to use while and next to control the array pointer when in nested structures like arrays.
Good Luck and happy parsing!
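For comparison, if the conditional blocks are never nested, a single regex may be enough. Some things to note about the pattern in the question: [A-Z{0-3}] is a character class that matches one character, PHP has no /g modifier (preg_match_all already finds every match), and . does not match newlines unless the s modifier is set. The following is only a sketch under those assumptions. The function name and the sample template are mine, and the sample uses a correctly closed *|IF:REWARDS|* tag.

```php
<?php
// Hypothetical helper: collects IF / IFNOT blocks and the content up to the next END:IF.
// Assumes blocks are not nested inside each other.
function findConditionalBlocks(string $template): array
{
    // s modifier: let "." match newlines so the content can span several lines
    $pattern = '/\*\|(IF|IFNOT):([A-Z_]+)\|\*(.*?)\*\|END:IF\|\*/s';
    preg_match_all($pattern, $template, $matches, PREG_SET_ORDER);

    $blocks = [];
    foreach ($matches as $m) {
        $blocks[] = [
            'type'    => $m[1],        // "IF" or "IFNOT"
            'tag'     => $m[2],        // e.g. "ARCHIVE_PAGE"
            'content' => trim($m[3]),  // everything between the opener and END:IF
        ];
    }
    return $blocks;
}

$template = "*|SUBJECT|*\n"
    . "*|IFNOT:ARCHIVE_PAGE|*\n*|ARCHIVE|*\n*|END:IF|*\n"
    . "*|IF:REWARDS|*\n*|REWARDS|*\n*|END:IF|*";

print_r(findConditionalBlocks($template));
```

Once blocks can nest, the lazy (.*?) will pair an opener with the wrong END:IF, which is where the tokenizer approach above pays off.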
How would I go about ordering 1 array into 2 arrays depending on what each part of the array starts with using preg_match() ?
I know how to do this using 1 expression but I don't know how to use 2 expressions.
So far I have done this (don't ask why I'm not using strpos() - I need to use regex):
$gen = array(
'F1',
'BBC450',
'BBC566',
'F2',
'F31',
'SOMETHING123',
'SOMETHING456'
);
$f = array();
$bbc = array();
foreach($gen as $part) {
if(preg_match('/^F/', $part)) {
// Add to F array
array_push($f, $part);
} else if(preg_match('/^BBC/', $part)) {
// Add to BBC array
array_push($bbc, $part);
} else {
// Not F or BBC
}
}
So my question is: is it possible to do this using 1 preg_match() function?
Please ignore the SOMETHING part in the array, it's to show that using just one if else statement wouldn't solve this.
Thanks.
You can use an alternation along with the third argument to preg_match, which receives the part of the string that matched.
// Inside the foreach loop:
preg_match('/^(?:F|BBC)/', $part, $match);
// $match is an array; index 0 holds the matched prefix (and is missing if nothing matched)
switch ($match[0] ?? '') {
    case 'F':
        $f[] = $part;
        break;
    case 'BBC':
        $bbc[] = $part;
        break;
    default:
        // Not F or BBC
}
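As a sketch of how the alternation idea could be wrapped up for any number of prefixes, something like the following should work. The helper name groupByPrefix is made up. It assumes no prefix is the beginning of another one (otherwise the one listed first wins).

```php
<?php
// Hypothetical helper: groups strings by whichever listed prefix they start with.
function groupByPrefix(array $items, array $prefixes): array
{
    // Escape each prefix and join them into one alternation, e.g. /^(?:F|BBC)/
    $escaped = array_map(function ($prefix) {
        return preg_quote($prefix, '/');
    }, $prefixes);
    $pattern = '/^(?:' . implode('|', $escaped) . ')/';

    $groups = array_fill_keys($prefixes, []);
    foreach ($items as $item) {
        // One preg_match per item; $match[0] tells us which prefix matched
        if (preg_match($pattern, $item, $match)) {
            $groups[$match[0]][] = $item;
        }
    }
    return $groups;
}

$gen = ['F1', 'BBC450', 'BBC566', 'F2', 'F31', 'SOMETHING123', 'SOMETHING456'];
$groups = groupByPrefix($gen, ['F', 'BBC']);
print_r($groups);
```

Items that match none of the prefixes are simply skipped, which corresponds to the empty default branch above.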
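It is even possible without any loop, switch, or anything else (which is faster and more efficient than the accepted answer's solution).
<?php
// Join with "\n" so that $ (in multiline mode) matches cleanly at the end of every line
preg_match_all("/(?:(^F.*$)|(^BBC.*$))/m", implode("\n", $gen), $matches);
// Each group also holds an empty string for every line the other group matched, so filter those out
$f = array_values(array_filter($matches[1], 'strlen'));
$bbc = array_values(array_filter($matches[2], 'strlen'));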
You can find an interactive explanation of the regular expression at regex101.com which I created for you.
The (not desired) strpos approach is nearly five times faster.
<?php
$c = count($gen);
$f = $bbc = array();
for ($i = 0; $i < $c; ++$i) {
if (strpos($gen[$i], "F") === 0) {
$f[] = $gen[$i];
}
elseif (strpos($gen[$i], "BBC") === 0) {
$bbc[] = $gen[$i];
}
}
Regular expressions are nice, but they are no silver bullet for everything.
Suppose I would like to expand a string by replacing placeholders from a dictionary. The replacement strings can also contain placeholders:
$pattern = "#a# #b#";
$dict = array("a" => "foo", "b" => "bar #c#", "c" => "baz");
while($match_count = preg_match_all('/#([^#]+)#/', $pattern, $matches)) {
for($i=0; $i<$match_count; $i++) {
$key = $matches[1][$i];
if(!isset($dict[$key])) { throw new Exception("'$key' not found!"); }
$pattern = str_replace($matches[0][$i], $dict[$key], $pattern);
}
}
echo $pattern;
This works fine, as long as there are no circular replacement patterns, for example "c" => "#b#". Then the program will be thrown into an endless loop until the memory is exhausted.
Is there an easy way to detect such patterns? I'm looking for a solution where the distance between the replacements can be arbitrarily long, eg. a->b->c->d->f->a
Ideally, the solution would also happen while in the loop and not with a separate analysis.
Single character keys
This is quite easy if the keys are single characters: simply check if a string at the value side contains a character that is a key.
foreach ($your_array as $key => $value) {
    foreach (str_split($value) as $ch) {
        if (array_key_exists($ch, $your_array)) {
            # Problem, cycle is possible
        }
    }
}
# We're fine
Now even if there is a cycle, that does not mean it is triggered for every string (for instance, the empty string triggers no patterns and thus no cycles). In that case you can incorporate the check into your replacement loop: if a rule fires a second time, there is a problem, because the earlier firing created the occasion for this one, and so that occasion will be created over and over again.
String keys
In case the keys are strings as well, this is probably the Post Correspondence Problem which is undecidable...
Thanks to the comment by georg and this post, I came up with a solution that converts the patterns into a graph and uses topological sort to check for circular replacement.
Here is my solution:
$dict = array("a" => "foo", "b" => "bar #c#", "c" => "baz #b#");
# Store incoming and outgoing "connections" for each key => pattern replacement
$nodes = array();
foreach($dict as $patternName => $pattern) {
if (!isset($nodes[$patternName])) {
$nodes[$patternName] = array("in" => array(), "out" => array());
}
$match_count = preg_match_all('/#([^#]+)#/', $pattern, $matches);
for ($i=0; $i<$match_count; $i++) {
$key = $matches[1][$i];
if (!isset($dict[$key])) { throw new Exception("'$key' not found!"); }
if (!isset($nodes[$key])) {
$nodes[$key] = array("in" => array(), "out" => array());
}
$nodes[$key]["in"][] = $patternName;
$nodes[$patternName]["out"][] = $key;
}
}
# collect leaf nodes (no incoming connections)
$leafNodes = array();
foreach ($nodes as $key => $connections) {
if (empty($connections["in"])) {
$leafNodes[] = $key;
}
}
# Remove leaf nodes until none are left
while (!empty($leafNodes)) {
$nodeID = array_shift($leafNodes);
foreach ($nodes[$nodeID]["out"] as $outNode) {
$nodes[$outNode]['in'] = array_diff($nodes[$outNode]['in'], array($nodeID));
if (empty($nodes[$outNode]['in'])) {
$leafNodes[] = $outNode;
}
}
$nodes[$nodeID]['out'] = array();
}
# Check for non-leaf nodes. If any are left, there is a circular pattern
foreach ($nodes as $key => $node) {
if (!empty($node["in"]) || !empty($node["out"]) ) {
throw new Exception("Circular replacement pattern for '$key'!");
}
}
# Now we can safely do replacement
$pattern = "#a# #b#";
while ($match_count = preg_match_all('/#([^#]+)#/', $pattern, $matches)) {
    for ($i = 0; $i < $match_count; $i++) {
        $key = $matches[1][$i];
        $pattern = str_replace($matches[0][$i], $dict[$key], $pattern);
    }
}
echo $pattern;
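The question also asked for detection while the replacement is running. One way to sketch that is to expand recursively and remember which keys are currently being expanded: meeting one of them again means the chain has looped back on itself, however long it is. The function name expandPlaceholders and the $active parameter are my own; this is an illustration rather than a tested library routine.

```php
<?php
// Hypothetical helper: expands #key# placeholders and detects cycles during expansion.
// $active holds the chain of keys currently being expanded.
function expandPlaceholders(string $text, array $dict, array $active = []): string
{
    return preg_replace_callback('/#([^#]+)#/', function (array $m) use ($dict, $active) {
        $key = $m[1];
        if (!isset($dict[$key])) {
            throw new Exception("'$key' not found!");
        }
        if (in_array($key, $active, true)) {
            // We are already in the middle of expanding this key: circular reference
            $chain = implode(' -> ', array_merge($active, [$key]));
            throw new Exception("Circular replacement pattern: $chain");
        }
        // Expand the replacement text itself, with this key added to the chain
        return expandPlaceholders($dict[$key], $dict, array_merge($active, [$key]));
    }, $text);
}

$dict = array("a" => "foo", "b" => "bar #c#", "c" => "baz");
echo expandPlaceholders("#a# #b#", $dict), PHP_EOL;

try {
    expandPlaceholders("#a#", array("a" => "#b#", "b" => "#c#", "c" => "#a#"));
} catch (Exception $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

Because the chain is tracked per branch, using the same key twice in different places (such as "#a# #a#") is not reported as a cycle.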
With PHP if you have a string which may or may not have spaces after the dot, such as:
"1. one 2.too 3. free 4. for 5.five "
What function can you use to create an array as follows:
array(1 => "one", 2 => "too", 3 => "free", 4 => "for", 5 => "five")
with the key being the list item number (e.g the array above has no 0)
I presume a regular expression is needed and perhaps use of preg_split or similar? I'm terrible at regular expressions so any help would be greatly appreciated.
What about:
$str = "1. one 2.too 3. free 4. for 5.five ";
$arr = preg_split('/\d+\./', $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($arr);
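Note that preg_split on its own drops the numbers and keeps the surrounding spaces. If the item numbers are needed as keys, a possible sketch is to capture the number and the text together with preg_match_all. The function name is made up, and it assumes the item text never contains something that looks like "digits followed by a dot".

```php
<?php
// Hypothetical helper: turns "1. one 2.too" into [1 => "one", 2 => "too"].
function parseNumberedList(string $str): array
{
    $items = [];
    // (\d+)\.      the item number and its dot
    // \s*(.*?)\s*  the item text, without surrounding whitespace
    // (?=\d+\.|$)  stop right before the next "number." or at the end of the string
    preg_match_all('/(\d+)\.\s*(.*?)\s*(?=\d+\.|$)/', $str, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $items[(int) $m[1]] = $m[2];
    }
    return $items;
}

print_r(parseNumberedList("1. one 2.too 3. free 4. for 5.five "));
```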
I got a quick hack and it seems to be working fine for me
$string = "1. one 2.too 3. free 4. for 5.five ";
$text_only = preg_replace("/[^A-Z,a-z]/",".",$string);
$num_only = preg_replace("/[^0-9]/",".",$string);
$explode_nums = explode('.',$num_only);
$explode_text = explode('.',$text_only);
foreach($explode_text as $key => $value)
{
if($value !== '' && $value !== ' ')
{
$text_array[] = $value;
}
}
foreach($explode_nums as $key => $value)
{
if($value !== '' && $value !== ' ')
{
$num_array[] = $value;
}
}
foreach($num_array as $key => $value)
{
$new_array[$value] = $text_array[$key];
}
print_r($new_array);
Test it out and let me know if it works fine.
Two days ago I started working on a code parser and I'm stuck.
How can I split a string by commas that are not inside brackets, let me show you what I mean:
I have this string to parse:
one, two, three, (four, (five, six), (ten)), seven
I would like to get this result:
array(
"one";
"two";
"three";
"(four, (five, six), (ten))";
"seven"
)
but instead I get:
array(
"one";
"two";
"three";
"(four";
"(five";
"six)";
"(ten))";
"seven"
)
How can I do this in PHP RegEx.
Thank you in advance !
You can do that more easily, as long as the parentheses are not nested:
preg_match_all('/[^(,\s]+|\([^)]+\)/', $str, $matches);
But it would be better if you use a real parser. Maybe something like this:
$str = 'one, two, three, (four, (five, six), (ten)), seven';
$buffer = '';
$stack = array();
$depth = 0;
$len = strlen($str);
for ($i=0; $i<$len; $i++) {
$char = $str[$i];
switch ($char) {
case '(':
$depth++;
break;
case ',':
if (!$depth) {
if ($buffer !== '') {
$stack[] = $buffer;
$buffer = '';
}
continue 2;
}
break;
case ' ':
if (!$depth) {
continue 2;
}
break;
case ')':
if ($depth) {
$depth--;
} else {
$stack[] = $buffer.$char;
$buffer = '';
continue 2;
}
break;
}
$buffer .= $char;
}
if ($buffer !== '') {
$stack[] = $buffer;
}
var_dump($stack);
Hm... OK already marked as answered, but since you asked for an easy solution I will try nevertheless:
$test = "one, two, three, , , ,(four, five, six), seven, (eight, nine)";
$split = "/([(].*?[)])|(\w)+/";
preg_match_all($split, $test, $out);
print_r($out[0]);
Output
Array
(
[0] => one
[1] => two
[2] => three
[3] => (four, five, six)
[4] => seven
[5] => (eight, nine)
)
You can't, directly. You'd need, at minimum, variable-width lookbehind, and last I knew PHP's PCRE only has fixed-width lookbehind.
My first recommendation would be to first extract parenthesized expressions from the string. I don't know anything about your actual problem, though, so I don't know if that will be feasible.
I can't think of a way to do it using a single regex, but it's quite easy to hack together something that works:
function process($data)
{
$entries = array();
$filteredData = $data;
if (preg_match_all("/\(([^)]*)\)/", $data, $matches)) {
$entries = $matches[0];
$filteredData = preg_replace("/\(([^)]*)\)/", "-placeholder-", $data);
}
$arr = array_map("trim", explode(",", $filteredData));
if (!$entries) {
return $arr;
}
$j = 0;
foreach ($arr as $i => $entry) {
if ($entry != "-placeholder-") {
continue;
}
$arr[$i] = $entries[$j];
$j++;
}
return $arr;
}
If you invoke it like this:
$data = "one, two, three, (four, five, six), seven, (eight, nine)";
print_r(process($data));
It outputs:
Array
(
[0] => one
[1] => two
[2] => three
[3] => (four, five, six)
[4] => seven
[5] => (eight, nine)
)
Clumsy, but it does the job...
<?php
function split_by_commas($string) {
preg_match_all("/\(.+?\)/", $string, $result);
$problem_children = $result[0];
$i = 0;
$temp = array();
foreach ($problem_children as $submatch) {
$marker = '__'.$i++.'__';
$temp[$marker] = $submatch;
$string = str_replace($submatch, $marker, $string);
}
$result = explode(",", $string);
foreach ($result as $key => $item) {
$item = trim($item);
$result[$key] = isset($temp[$item])?$temp[$item]:$item;
}
return $result;
}
$test = "one, two, three, (four, five, six), seven, (eight, nine), ten";
print_r(split_by_commas($test));
?>
I feel that it's worth noting that you should always avoid regular expressions when you possibly can. To that end, you should know that for PHP 5.3+ you could use str_getcsv(). However, if you're working with files (or file streams), such as CSV files, then the function fgetcsv() might be what you need, and it's been available since PHP 4.
Lastly, I'm surprised nobody used preg_split(), or did it not work as needed?
Maybe a bit late but I've made a solution without regex which also supports nesting inside brackets. Anyone let me know what you guys think:
$str = "Some text, Some other text with ((95,3%) MSC)";
$arr = explode(",", $str);
$parts = [];
$currentPart = "";
$bracketsOpened = 0;
foreach ($arr as $part) {
    // Put back the comma that explode() removed while we are still inside brackets
    $currentPart .= ($bracketsOpened > 0 ? ',' : '') . $part;
    // Count every bracket in this piece, not just whether one is present
    $bracketsOpened += substr_count($part, "(") - substr_count($part, ")");
    if ($bracketsOpened <= 0) {
        $parts[] = trim($currentPart);
        $currentPart = '';
        $bracketsOpened = 0;
    }
}
print_r($parts);
Gives me the output:
Array
(
[0] => Some text
[1] => Some other text with ((95,3%) MSC)
)
I am afraid that it could be very difficult to parse nested brackets like
one, two, (three, (four, five))
only with RegExp.
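That said, PCRE does support recursive subpatterns, so nested brackets are not out of reach for a regex in PHP. Here is a sketch that matches each top-level item instead of splitting on the commas. The function name is mine, and it assumes the brackets are balanced.

```php
<?php
// Hypothetical helper: splits on commas that are not inside (possibly nested) brackets.
function splitTopLevelCommas(string $str): array
{
    // An item is a run of:
    //   [^,()]++                   anything that is not a comma or a bracket, or
    //   (\((?:[^()]++|(?1))*\))    a balanced bracket group; (?1) recurses into group 1
    $pattern = '/(?:[^,()]++|(\((?:[^()]++|(?1))*\)))+/';
    preg_match_all($pattern, $str, $matches);
    // Trim the whitespace around each item and drop items that end up empty
    return array_values(array_filter(array_map('trim', $matches[0]), 'strlen'));
}

print_r(splitTopLevelCommas('one, two, three, (four, (five, six), (ten)), seven'));
```

For anything more involved than this, the character-by-character parser shown earlier is probably easier to maintain.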