I have this in a function that is supposed to replace every parenthesized sequence with its contents, so (abc) becomes abc wherever it appears, even recursively, because parentheses can be nested.
$return = preg_replace_callback(
'|(\((.+)\))+|',
function ($matches) {
return $matches[2];
},
$s
);
When the above regex is fed the string "a(bcdefghijkl(mno)p)q" as input, it returns "ap)onm(lkjihgfedcbq". This shows the regex matched only once. What can I do to make it keep matching inside matches it has already made, so that it produces "abcdefghijklmnopq"?
To match balanced parenthetical substrings you may use the well-known \((?:[^()]++|(?R))*\) pattern (described in Matching Balanced Constructs) inside a preg_replace_callback call, where the match value can be further manipulated. Here that just means removing all ( and ) symbols from the match, which is easy to do even without a regex:
$re = '/\((?:[^()]++|(?R))*\)/';
$str = 'a(bcdefghijkl(mno)p)q((('; // Added three ( at the end
$result = preg_replace_callback($re, function($m) {
return str_replace(array('(',')'), '', $m[0]);
}, $str);
echo $result; // => abcdefghijklmnopq(((
See the PHP demo
To get overlapping matches, you need to use a known technique: capturing inside a positive lookahead. You won't be able to perform two operations at once (replacing and matching), but you can run the matching first and then replace:
$re = '/(?=(\((?:[^()]++|(?1))*\)))/';
$str = 'a(bcdefghijkl(mno)p)q(((';
preg_match_all($re, $str, $m);
print_r($m[1]);
// => Array ( [0] => (bcdefghijkl(mno)p) [1] => (mno) )
See the PHP demo.
Try this one,
preg_match('/\((?:[^\(\)]*+|(?0))*\)/', $str )
https://regex101.com/r/NsQSla/1
It will match everything inside of the ( ) as long as they are matched pairs.
Example
(abc) (abc (abc))
will have the following matches
Match 1
Full match 0-5 `(abc)`
Match 2
Full match 6-17 `(abc (abc))`
It is slightly unclear exactly what the postcondition of the algorithm is supposed to be. It seems to me that you are wanting to strip out matching pairs of ( ). The assumption here is that unmatched parentheses are left alone (otherwise you just strip out all of the ('s and )'s).
So I guess this means the input string a(bcdefghijkl(mno)p)q becomes abcdefghijklmnopq but the input string a(bcdefghijkl(mno)pq becomes a(bcdefghijklmnopq. Likewise an input string (a)) would become a).
It may be possible to do this using PCRE, since it does provide some non-regular features, but I'm doubtful about it. The language of the input strings is not regular; it's context-free. What @ArtisticPhoenix's answer does is match complete pairs of matched parentheses. What it does not do is match all nested pairs. This nested matching is inherently non-regular in my humble understanding of language theory.
I suggest writing a parser to strip out the matching pairs of parentheses. It gets a little wordy having to account for productions that fail to match:
<?php
// Parse the punctuator sub-expression (i.e. anything within ( ... ) ).
function parse_punc(array $tokens,&$iter) {
if (!isset($tokens[$iter])) {
// The input ended right after an unmatched '(': keep it.
return '(';
}
$inner = parse_punc_seq($tokens,$iter);
if (!isset($tokens[$iter]) || $tokens[$iter] != ')') {
// Leave unmatched open parentheses alone.
$inner = "($inner";
}
$iter += 1;
return $inner;
}
// Parse a sequence (inside punctuators).
function parse_punc_seq(array $tokens,&$iter) {
if (!isset($tokens[$iter])) {
return;
}
$tok = $tokens[$iter];
if ($tok == ')') {
return;
}
$iter += 1;
if ($tok == '(') {
$tok = parse_punc($tokens,$iter);
}
$tok .= parse_punc_seq($tokens,$iter);
return $tok;
}
// Parse a sequence (outside punctuators).
function parse_seq(array $tokens,&$iter) {
if (!isset($tokens[$iter])) {
return;
}
$tok = $tokens[$iter++];
if ($tok == '(') {
$tok = parse_punc($tokens,$iter);
}
$tok .= parse_seq($tokens,$iter);
return $tok;
}
// Wrapper for parser.
function parse(array $tokens) {
$iter = 0;
return strval(parse_seq($tokens,$iter));
}
// Grab input from stdin and run it through the parser.
$str = trim(stream_get_contents(STDIN));
$tokens = preg_split('/([\(\)])/',$str,-1,PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
var_dump(parse($tokens));
I know this is a lot more code than a regex one-liner but it does solve the problem as I understand it. I'd be interested to know if anyone can solve this problem with a regular expression.
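For comparison, here is a minimal sketch of a regex-based alternative. It is not the recursive-pattern approach from the other answers but a swapped-in technique: strip the innermost pairs first and repeat until nothing changes. The function name strip_matched_parens is made up for illustration, and the sketch assumes the same postcondition as above (unmatched parentheses are left alone).

```php
<?php
// Hypothetical helper: removes every matched pair of parentheses,
// innermost pairs first, and leaves unmatched ( or ) untouched.
function strip_matched_parens(string $s): string {
    do {
        // [^()]* guarantees the pair contains no other parenthesis,
        // so only innermost pairs are replaced on each pass.
        $s = preg_replace('/\(([^()]*)\)/', '$1', $s, -1, $count);
    } while ($count > 0);
    return $s;
}

echo strip_matched_parens('a(bcdefghijkl(mno)p)q'), "\n";
```

Each pass peels off one nesting level, so the loop runs once per level of nesting plus one final pass that finds nothing.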
I am still playing around with matching words for a project.
Let's assume that I have a given string, say maxmuster. Then I want to mark the parts of a random word, say maxs, that also occur in maxmuster, with the letters in the same order.
I will give some examples and then describe what I have already tried. Let's keep the string maxmuster. The part after the arrow is what the regex should match (ideally in PHP, but Python, Bash, JavaScript, ... would also be fine):
maxs -> max
Mymaxmuis -> maxmu
Lemu -> mu
muster -> muster
Of course single letters such as m, u, ... will be matched as well. I know that and am going to fix it later. However, I thought the solution should not be so difficult, so I tried to divide the word into groups like this:
/(maxmuster)?|(maxmuste)?|(maxmust)?|(maxmus)?|(maxmu)?|(maxm)?|(max)?|(ma)?|(m)?/gui
But then I of course forgot the other combinations, like:
(axmuster), (xmus) and so on. Do I really have to list them all, or is there a simple regex trick to solve this?
Thank you very much
Sounds like you need string intersection. If you don't mind non regex idea, have a look in Wikibooks Algorithm Implementation/Strings/Longest common substring PHP section.
foreach(["maxs", "Mymaxmuis", "Lemu", "muster"] AS $str)
echo get_longest_common_subsequence($str, "maxmuster") . "\n";
max
maxmu
mu
muster
See this PHP demo at tio.run (caseless comparison).
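As a rough sketch of the non-regex idea, the classic dynamic-programming approach to the longest common substring could look like this. The function name longest_common_substring is my own, and this is not the Wikibooks code itself; the caseless comparison is an assumption taken from the demo note above.

```php
<?php
// Hypothetical helper: longest run of characters that occurs in both strings
// (case-insensitive). The result is cut from $a, so it keeps $a's casing.
function longest_common_substring(string $a, string $b): string {
    $la = strtolower($a);
    $lb = strtolower($b);
    $best = 0;   // length of the longest common run found so far
    $end = 0;    // position in $a just past that run
    $prev = array_fill(0, strlen($b) + 1, 0);
    for ($i = 1; $i <= strlen($a); $i++) {
        $cur = array_fill(0, strlen($b) + 1, 0);
        for ($j = 1; $j <= strlen($b); $j++) {
            if ($la[$i - 1] === $lb[$j - 1]) {
                // Extend the common run that ended at the previous characters.
                $cur[$j] = $prev[$j - 1] + 1;
                if ($cur[$j] > $best) {
                    $best = $cur[$j];
                    $end = $i;
                }
            }
        }
        $prev = $cur;
    }
    return substr($a, $end - $best, $best);
}

foreach (["maxs", "Mymaxmuis", "Lemu", "muster"] as $str) {
    echo longest_common_substring($str, "maxmuster"), "\n";
}
```

It needs time proportional to the product of the two lengths, which is fine for short words like these.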
If you need a regex idea, I would join both strings with space and use a pattern like this demo.
(?=(\w+)(?=\w* \w*?\1))\w
At each position before a word character in the first string, it captures inside a lookahead the longest substring that also occurs in the second string. In PHP, the matches of the first group then need to be sorted by length, and the longest one is returned. See the PHP demo at tio.run.
function get_longest_common_subsequence($w1="", $w2="")
{
$test_str = preg_quote($w1,'/')." ".preg_quote($w2,'/');
if(preg_match_all('/(?=(\w+)(?=\w* \w*?\1))\w/i', $test_str, $out) > 0)
{
usort($out[1], function($a, $b) { return strlen($b) - strlen($a); });
return $out[1][0];
}
}
TL;DR
Using Regular Expressions:
longestSubstring(['Mymaxmuis', 'axmuis', 'muster'], buildRegexFrom('maxmuster'));
Full snippet
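Using the regex below you can match substrings of the string maxmuster. It only checks adjacent letter pairs, so because m occurs twice in maxmuster it can also accept a run such as xma that is not a true substring: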
(?|((?:
m(?=a)
|(?<=m)a
|a(?=x)
|(?<=a)x
|x(?=m)
|(?<=x)m
|m(?=u)
|(?<=m)u
|u(?=s)
|(?<=u)s
|s(?=t)
|(?<=s)t
|t(?=e)
|(?<=t)e
|e(?=r)
|(?<=e)r
)+)|([maxmuster]))
Live demo
You have to cook such a regex from a word like maxmuster so you need a function to call it:
function buildRegexFrom(string $word): string {
    // Split word to letters
    $letters = str_split($word);
    $regex = [];
    // Create one pair of alternatives for each adjacent pair of letters
    foreach ($letters as $key => $letter) {
        // The last letter has no successor, so it adds no alternatives
        if (!isset($letters[$key + 1])) {
            continue;
        }
        $next = $letters[$key + 1];
        $regex[] = "$letter(?=$next)|(?<=$letter)$next";
    }
    // Return whole cooked pattern
    return "~(?|((?>".implode('|', $regex).")+)|([$word]))~i";
}
To return longest match you need to sort results according to matches length from longest to shortest. It means writing another piece of code for it:
function longestSubstring(array $array, string $regex): array {
    $substrings = [];
    foreach ($array as $value) {
        preg_match_all($regex, $value, $matches);
        usort($matches[1], function($a, $b) {
            return strlen($b) <=> strlen($a);
        });
        // Store the longest match, or an empty string if nothing matched
        $substrings[] = $matches[1][0] ?? '';
    }
    return $substrings;
}
Putting all things together:
print_r(longestSubstring(['Mymaxmuis', 'axmuis', 'muster'], buildRegexFrom('maxmuster')));
Outputs:
Array
(
[0] => maxmu
[1] => axmu
[2] => muster
)
PHP live demo
Here is my take on this problem using regex.
<?php
$subject="maxmuster";
$str="Lemu";
$comb=str_split($subject); // Split into single characters.
$len=strlen($subject);
for ($i=2; $i<=$len; $i++){
// Stop while the substring of length $i still fits inside $subject.
for($start=0; $start+$i<=$len; $start++){
array_push($comb, substr($subject, $start, $i));
}
}
echo "Matches are:\n";
for($i=0; $i<sizeof($comb); $i++){
$pattern = "/".$comb[$i]."/";
if(preg_match($pattern,$str, $matches)){
print_r($matches);
};
}
?>
And here is an Ideone Demo.
I need a regular expression to check that every opening brace has a matching closing brace.
Suppose
([(([[(([]))]]))]) -- this one will return true. but
[](()()[[]])[][[([]) --- this one will return false
for this, I've tried below:-
function check($str) {
$output = "";
$pattern = "/(\{[^}]*)([^{]*\})/im";
preg_match($pattern, $str, $match);
print_r($match[0]);
}
assert(check("[](()()[[]])[][[([])") === FALSE);
any help please...
The easiest way to do this (in my opinion) would be to implement a stack data structure and pass through your string. Essentially something like so:
Traverse the string left to right
If you find an opening parenthesis, add it to the stack
else (you find a closing parenthesis) make sure that the top most item in the stack is the same type of parenthesis (so make sure that if you found a }, the top most item in the stack is {). This should help scenarios where you have something like so: ({)}. If it matches, pop from the stack.
If you repeat the above operation throughout the entire string, you should end up with an empty stack. This would mean that you have managed to match all open parenthesis with a close parenthesis.
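The steps above can be sketched roughly as follows. The function name is_balanced is made up for illustration, and ignoring non-bracket characters is my assumption, since the question only shows strings made of brackets.

```php
<?php
// Hypothetical helper: true if every bracket is closed by the right partner.
function is_balanced(string $str): bool {
    // Maps each closing bracket to the opening bracket it must match.
    $pairs = [')' => '(', ']' => '[', '}' => '{'];
    $stack = [];
    foreach (str_split($str) as $ch) {
        if (in_array($ch, $pairs, true)) {
            // Opening bracket: push it onto the stack.
            $stack[] = $ch;
        } elseif (isset($pairs[$ch])) {
            // Closing bracket: the top of the stack must be its partner.
            // array_pop() returns null on an empty stack, which also fails.
            if (array_pop($stack) !== $pairs[$ch]) {
                return false;
            }
        }
        // Any other character is ignored (assumption).
    }
    // Leftover opening brackets mean something was never closed.
    return $stack === [];
}

var_dump(is_balanced('([(([[(([]))]]))])'));
var_dump(is_balanced('[](()()[[]])[][[([])'));
```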
You can use this:
$pattern = '~^(\((?1)*\)|\[(?1)*]|{(?1)*})+$~';
(?1) is a reference to the subpattern (not the matched content) in the capture group 1. Since I put it in the capture group 1 itself, I obtain a recursion.
I added anchors for the start ^ and the end $ of the string to be sure to check all the string.
Note: If you need to check a string that contains not only brackets, you can replace each (?1)* with:
(?>[^][}{)(]++|(?1))*
Note 2: If you want an empty string to return true, you must replace the last quantifier + with *.
Working example:
function check($str, $display = false) {
if (preg_match('~^(\((?1)*\)|\[(?1)*]|{(?1)*})+$~', $str, $match)) {
if ($display) echo $match[0];
return true;
}
elseif (preg_last_error() == PREG_RECURSION_LIMIT_ERROR) {
if ($display) echo "The recursion limit has been reached\n";
return -1;
} else return false;
}
assert(check(')))') === false);
check('{[[()()]]}', true);
var_dump(check('{[[()()]]}'));
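For the variant from the first note, a small sketch might look like this. The function name check_with_text is hypothetical. As far as I can tell, the replacement only allows other characters inside the brackets, so a string with text outside the outermost brackets still fails.

```php
<?php
// Hypothetical helper: same recursion as above, but every (?1)* is replaced by
// (?>[^][}{)(]++|(?1))* so that non-bracket characters may appear inside brackets.
function check_with_text(string $str): bool {
    $inner = '(?>[^][}{)(]++|(?1))*';
    $pattern = '~^(\(' . $inner . '\)|\[' . $inner . ']|{' . $inner . '})+$~';
    return preg_match($pattern, $str) === 1;
}

var_dump(check_with_text('(a[b]{c})'));
var_dump(check_with_text('(a[b)]'));
```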
I have a problem that I have no idea how to solve, and I'm not sure whether regular expressions are the best way.
My idea is to get the name, variables and content of the functions in a file.
This is my regular expression:
preg_match_all('/function (.*?)\((.*?)\)(.*?)\{(.*?)\}/s',$content,$funcs,PREG_SET_ORDER);
And I have this testfile:
function testfunc($text)
{
if ($text)
{
return 1;
}
return 0;
}
Of course, this captures everything only up to the "}" before return 0;
Is there a way to get everything in the function so find the right "}".
Contrary to many beliefs PHP (PCRE) has something called Recursive Pattern Regex that lets you find matching nested brackets. Consider this code:
$str = <<<'EOF'
function testfunc($text) {
if ($text) {
return 1;
}
return 0;
}
EOF;
if ( preg_match('/ \{ ( (?: [^{}]* | (?0) )+ ) \} /x', $str, $m) )
echo $m[0];
OUTPUT:
{
if ($text) {
return 1;
}
return 0;
}
UPDATE: To capture function name and arguments as well try this code:
$str = <<<'EOF'
function testfunc($text) {
if ($text) {
return 1;
}
return 0;
}
EOF;
if ( preg_match('/ (function [^{]+ ) ( \{ (?: [^{}]* | (?-1) )* \} ) /x', $str, $m) )
print_r ($m);
OUTPUT
Array
(
[0] => function testfunc($text) {
if ($text) {
return 1;
}
return 0;
}
[1] => function testfunc($text)
[2] => {
if ($text) {
return 1;
}
return 0;
}
)
Working Online Demo: http://ideone.com/duQw9c
Regular expressions are not the best tool for that job. Parsers are.
No doubt you can use regexp callbacks to eventually manage what you intend, but this would be ungodly obfuscated and fragile.
A parser can easily do the same job. Better still, if you are planning on parsing PHP with PHP, you can use the Zend parser that does the job for you.
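As a hedged illustration of that idea: PHP exposes its lexer (not the full parser) through token_get_all(), and counting brace tokens is already enough to find the right closing }. The sketch below uses a made-up function name, extract_functions, and assumes that only named functions with a body are of interest; closures and body-less declarations are skipped.

```php
<?php
// Hypothetical helper: maps each named function to the source of its body.
// Assumption: closures and body-less declarations are simply skipped.
function extract_functions(string $code): array {
    $functions = [];
    $state = 'idle';   // idle -> name -> body_start -> body
    $name = '';
    $body = '';
    $depth = 0;
    foreach (token_get_all($code) as $token) {
        $id = is_array($token) ? $token[0] : null;
        $text = is_array($token) ? $token[1] : $token;
        if ($state === 'idle') {
            if ($id === T_FUNCTION) {
                $state = 'name';
            }
        } elseif ($state === 'name') {
            if ($id === T_STRING) {
                $name = $text;
                $state = 'body_start';
            } elseif ($token === '(' || $token === ';') {
                $state = 'idle';   // closure or "use function": nothing to record
            }
        } elseif ($state === 'body_start') {
            if ($token === '{') {
                $body = '{';
                $depth = 1;
                $state = 'body';
            } elseif ($token === ';') {
                $state = 'idle';   // abstract or interface method: no body
            }
        } else {
            $body .= $text;
            // "{$x}" and "${x}" inside strings open with their own token types
            // but close with a plain '}', so they count as openers too.
            if ($token === '{' || $id === T_CURLY_OPEN || $id === T_DOLLAR_OPEN_CURLY_BRACES) {
                $depth++;
            } elseif ($token === '}') {
                $depth--;
                if ($depth === 0) {
                    $functions[$name] = $body;
                    $state = 'idle';
                }
            }
        }
    }
    return $functions;
}
```

Because braces inside string literals and comments arrive as part of larger tokens, they are not counted, which is exactly where a plain regex tends to go wrong. The input must include the opening <?php tag, otherwise the lexer treats everything as inline HTML.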
Not in general, (you can of course define a regex for two levels deep parsing that would be something like function (.*)\((.*)\)(.*)\{([^}]*(\{[^}]*\})*)\} but since you can nest such structures arbitrarily deep, you will eventually run out of regex :D ). One needs a context free grammar to do this.
You can generate such grammar parsers for instance with Yacc, Bison, Gppg,...
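Also note that the ? in .*? does not mean "optional": .* already matches zero or more characters and .+ one or more. The trailing ? only makes the quantifier lazy, so it matches as few characters as possible.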
Is there a way to get everything in the function so find the right "}".
Short Answer: no.
Long Answer:
This cannot be handled with a single expression. { and } can also appear inside a method body, making it hard to find the correct closing }. You would need to process (iteratively or recursively) all pairs of {} and manually sort out all pairs that have a "method name" in front of them.
This, however, isn't simple either, because you need to exclude all the statements that look like a function but are valid inside the method body.
I don't think that regex is the way to go for such a task. Even if you managed to create all the required regex patterns, performance would be worse compared to any dedicated parser.
Ok this is really difficult to explain in English, so I'll just give an example.
I am going to have strings in the following format:
key-value;key1-value;key2-...
and I need to extract the data to be an array
array('key'=>'value','key1'=>'value1', ... )
I was planning to use regexp to achieve (most of) this functionality, and wrote this regular expression:
/^(\w+)-([^-;]+)(?:;(\w+)-([^-;]+))*;?$/
to work with preg_match and this code:
for ($l = count($matches),$i = 1;$i<$l;$i+=2) {
$parameters[$matches[$i]] = $matches[$i+1];
}
However the regexp obviously returns only 4 backreferences - first and last key-value pairs of the input string. Is there a way around this? I know I can use regex just to test the correctness of the string and use PHP's explode in loops with perfect results, but I'm really curious whether it's possible with regular expressions.
In short, I need to capture an arbitrary number of these key-value; pairs in a string by means of regular expressions.
You can use a lookahead to validate the input while you extract the matches:
/\G(?=(?:\w++-[^;-]++;?)++$)(\w++)-([^;-]++);?/
(?=(?:\w++-[^;-]++;?)++$) is the validation part. If the input is invalid, matching will fail immediately, but the lookahead still gets evaluated every time the regex is applied. In order to keep it (along with the rest of the regex) in sync with the key-value pairs, I used \G to anchor each match to the spot where the previous match ended.
This way, if the lookahead succeeds the first time, it's guaranteed to succeed every subsequent time. Obviously it's not as efficient as it could be, but that probably won't be a problem--only your testing can tell for sure.
If the lookahead fails, preg_match_all() will return zero. If it succeeds, the matches will be returned in an array of arrays: one for the full key-value pairs, one for the keys, one for the values.
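A minimal sketch of how the pattern above could be used. The wrapper name parse_pairs and the choice to return null for invalid input are my own assumptions; PREG_SET_ORDER is used here only because it makes the loop simpler.

```php
<?php
// Hypothetical wrapper around the \G + lookahead pattern from above.
// Returns the key => value array, or null if the input is not valid.
function parse_pairs(string $input): ?array {
    $re = '/\G(?=(?:\w++-[^;-]++;?)++$)(\w++)-([^;-]++);?/';
    // preg_match_all() returns 0 for invalid input (or false on a PCRE error).
    if (!preg_match_all($re, $input, $matches, PREG_SET_ORDER)) {
        return null;
    }
    $parameters = [];
    foreach ($matches as $m) {
        $parameters[$m[1]] = $m[2];   // $m[1] is the key, $m[2] the value
    }
    return $parameters;
}

print_r(parse_pairs('key-value;key1-value1;key2-value2'));
var_dump(parse_pairs('key-value;bad'));
```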
Regex is a powerful tool, but sometimes it's not the best approach.
$string = "key-value;key1-value";
$s = explode(";",$string);
foreach($s as $k){
$e = explode("-",$k);
$array[$e[0]]=$e[1];
}
print_r($array);
Use preg_match_all() instead. Maybe something like:
$matches = $parameters = array();
$input = 'key-value;key1-value1;key2-value2;key123-value123;';
preg_match_all("/(\w+)-([^-;]+)/", $input, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$parameters[$match[1]] = $match[2];
}
print_r($parameters);
EDIT:
to first validate if the input string conforms to the pattern, then just use:
if (preg_match("/^((\w+)-([^-;]+);)+$/", $input) > 0) {
/* do the preg_match_all stuff */
}
EDIT2: the final semicolon is optional
if (preg_match("/^(\w+-[^-;]+;)*\w+-[^-;]+$/", $input) > 0) {
/* do the preg_match_all stuff */
}
No. Newer matches overwrite older matches. Perhaps the limit argument of explode() would be helpful when exploding.
what about this solution:
$samples = array(
"good" => "key-value;key1-value;key2-value;key5-value;key-value;",
"bad1" => "key-value-value;key1-value;key2-value;key5-value;key-value;",
"bad2" => "key;key1-value;key2-value;key5-value;key-value;",
"bad3" => "k%ey;key1-value;key2-value;key5-value;key-value;"
);
foreach($samples as $name => $value) {
if (preg_match("/^(\w+-\w+;)+$/", $value)) {
printf("'%s' matches\n", $name);
} else {
printf("'%s' not matches\n", $name);
}
}
I don't think you can do both validation and extraction of data with one single regexp, as you need anchors (^ and $) for validation and preg_match_all() for the data, but if you use anchors with preg_match_all() it will only return the last set matched.
Suppose I have the following two strings containing regular expressions. How do I coalesce them? More specifically, I want to have the two expressions as alternatives.
$a = '# /[a-z] #i';
$b = '/ Moo /x';
$c = preg_magic_coalesce('|', $a, $b);
// Desired result should be equivalent to:
// '/ \/[a-zA-Z] |Moo/'
Of course, doing this as string operations isn't practical because it would involve parsing the expressions, constructing syntax trees, coalescing the trees and then outputting another regular expression equivalent to the tree. I'm completely happy without this last step. Unfortunately, PHP doesn't have a RegExp class (or does it?).
Is there any way to achieve this? Incidentally, does any other language offer a way? Isn't this a pretty normal scenario? Guess not. :-(
Alternatively, is there a way to check efficiently if either of the two expressions matches, and which one matches earlier (and if they match at the same position, which match is longer)? This is what I'm doing at the moment. Unfortunately, I do this on long strings, very often, for more than two patterns. The result is slow (and yes, this is definitely the bottleneck).
EDIT:
I should have been more specific – sorry. $a and $b are variables, their content is outside of my control! Otherwise, I would just coalesce them manually. Therefore, I can't make any assumptions about the delimiters or regex modifiers used. Notice, for example, that my first expression uses the i modifier (ignore casing) while the second uses x (extended syntax). Therefore, I can't just concatenate the two because the second expression does not ignore casing and the first doesn't use the extended syntax (and any whitespace therein is significant!).
I see that porneL actually described a bunch of this, but this handles most of the problem. It cancels modifiers set in previous sub-expressions (which the other answer missed) and sets modifiers as specified in each sub-expression. It also handles non-slash delimiters (I could not find a specification of what characters are allowed here so I used ., you may want to narrow further).
One weakness is it doesn't handle back-references within expressions. My biggest concern with that is the limitations of back-references themselves. I'll leave that as an exercise to the reader/questioner.
// Pass as many expressions as you'd like
function preg_magic_coalesce() {
$active_modifiers = array();
$expression = '/(?:';
$sub_expressions = array();
foreach(func_get_args() as $arg) {
// Determine modifiers from sub-expression
if(preg_match('/^(.)(.*)\1([eimsuxADJSUX]+)$/', $arg, $matches)) {
$modifiers = preg_split('//', $matches[3]);
if($modifiers[0] == '') {
array_shift($modifiers);
}
if($modifiers[(count($modifiers) - 1)] == '') {
array_pop($modifiers);
}
$cancel_modifiers = $active_modifiers;
foreach($cancel_modifiers as $key => $modifier) {
if(in_array($modifier, $modifiers)) {
unset($cancel_modifiers[$key]);
}
}
$active_modifiers = $modifiers;
} elseif(preg_match('/(.)(.*)\1$/', $arg)) {
$cancel_modifiers = $active_modifiers;
$active_modifiers = array();
}
// If expression has modifiers, include them in sub-expression
$sub_modifier = '(?';
$sub_modifier .= implode('', $active_modifiers);
// Cancel modifiers from preceding sub-expression
if(count($cancel_modifiers) > 0) {
$sub_modifier .= '-' . implode('-', $cancel_modifiers);
}
$sub_modifier .= ')';
$sub_expression = preg_replace('/^(.)(.*)\1[eimsuxADJSUX]*$/', $sub_modifier . '$2', $arg);
// Properly escape slashes
$sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);
$sub_expressions[] = $sub_expression;
}
// Join expressions
$expression .= implode('|', $sub_expressions);
$expression .= ')/';
return $expression;
}
Edit: I've rewritten this (because I'm OCD) and ended up with:
function preg_magic_coalesce($expressions = array(), $global_modifier = '') {
if(!preg_match('/^((?:-?[eimsuxADJSUX])+)$/', $global_modifier)) {
$global_modifier = '';
}
$expression = '/(?:';
$sub_expressions = array();
foreach($expressions as $sub_expression) {
$active_modifiers = array();
// Determine modifiers from sub-expression
if(preg_match('/^(.)(.*)\1((?:-?[eimsuxADJSUX])+)$/', $sub_expression, $matches)) {
$active_modifiers = preg_split('/(-?[eimsuxADJSUX])/',
$matches[3], -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
}
// If expression has modifiers, include them in sub-expression
if(count($active_modifiers) > 0) {
$replacement = '(?';
$replacement .= implode('', $active_modifiers);
$replacement .= ':$2)';
} else {
$replacement = '$2';
}
$sub_expression = preg_replace('/^(.)(.*)\1(?:(?:-?[eimsuxADJSUX])*)$/',
$replacement, $sub_expression);
// Properly escape slashes if another delimiter was used
$sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);
$sub_expressions[] = $sub_expression;
}
// Join expressions
$expression .= implode('|', $sub_expressions);
$expression .= ')/' . $global_modifier;
return $expression;
}
It now uses (?modifiers:sub-expression) rather than (?modifiers)sub-expression|(?cancel-modifiers)sub-expression but I've noticed that both have some weird modifier side-effects. For instance, in both cases if a sub-expression has a /u modifier, it will fail to match (but if you pass 'u' as the second argument of the new function, that will match just fine).
Strip delimiters and flags from each. This regex should do it:
/^(.)(.*)\1([imsxeADSUXJu]*)$/
Join expressions together. You'll need non-capturing parenthesis to inject flags:
"(?$flags1:$regexp1)|(?$flags2:$regexp2)"
If there are any back references, count capturing parenthesis and update back references accordingly (e.g. properly joined /(.)x\1/ and /(.)y\1/ is /(.)x\1|(.)y\2/ ).
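The first two steps can be sketched like this. The function name merge_alternatives is made up, and the sketch deliberately handles only the flags i, m, s and x and ignores back-references, so the third step is not covered. The delimiter handling also assumes the pattern body contains no already-escaped slashes.

```php
<?php
// Hypothetical helper: merges patterns as alternatives, keeping flags local.
// Assumptions: only the flags i, m, s, x; no back-references in the patterns.
function merge_alternatives(string ...$patterns): string {
    $parts = [];
    foreach ($patterns as $pattern) {
        // Step 1: strip the delimiter and the trailing flags.
        if (!preg_match('/^(.)(.*)\1([imsx]*)$/s', $pattern, $m)) {
            throw new InvalidArgumentException("Unsupported pattern: $pattern");
        }
        [, $delim, $body, $flags] = $m;
        if ($delim !== '/') {
            // Un-escape the old delimiter, then escape the new one.
            $body = str_replace("\\$delim", $delim, $body);
            $body = str_replace('/', '\\/', $body);
        }
        // Step 2: a non-capturing group carries the flags inline.
        $parts[] = $flags === '' ? "(?:$body)" : "(?$flags:$body)";
    }
    return '/' . implode('|', $parts) . '/';
}

echo merge_alternatives('# /[a-z] #i', '/ Moo /x'), "\n";
```

With the two patterns from the question, the i flag stays confined to the first alternative and the x flag to the second, so neither leaks into the other.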
EDIT
I’ve rewritten the code! It now contains the changes that are listed as follows. Additionally, I've done extensive tests (which I won’t post here because they’re too many) to look for errors. So far, I haven’t found any.
The function is now split into two parts: There’s a separate function preg_strip which takes a regular expression and returns an array containing the bare expression (without delimiters) and an array of modifiers. This might come in handy (it already has, in fact; this is why I made this change).
The code now correctly handles back-references. This was necessary for my purpose after all. It wasn’t difficult to add, the regular expression used to capture the back-references just looks weird (and may actually be extremely inefficient, it looks NP-hard to me – but that’s only an intuition and only applies in weird edge cases). By the way, does anyone know a better way of checking for an uneven number of matches than my way? Negative lookbehinds won't work here because they only accept fixed-length strings instead of regular expressions. However, I need the regex here to test whether the preceding backslash is actually escaped itself.
Additionally, I don’t know how good PHP is at caching anonymous create_function use. Performance-wise, this might not be the best solution but it seems good enough.
I’ve fixed a bug in the sanity check.
I’ve removed the cancellation of obsolete modifiers since my tests show that it isn't necessary.
By the way, this code is one of the core components of a syntax highlighter for various languages that I’m working on in PHP since I’m not satisfied with the alternatives listed elsewhere.
Thanks!
porneL, eyelidlessness, amazing work! Many, many thanks. I had actually given up.
I've built upon your solution and I'd like to share it here. I didn't implement re-numbering back-references since this isn't relevant in my case (I think …). Perhaps this will become necessary later, though.
Some Questions …
One thing, @eyelidlessness: Why do you feel the necessity to cancel old modifiers? As far as I see it, this isn't necessary since the modifiers are only applied locally anyway.
Ah yes, one other thing. Your escaping of the delimiter seems overly complicated. Care to explain why you think this is needed? I believe my version should work as well but I could be very wrong.
Also, I've changed the signature of your function to match my needs. I also think that my version is more generally useful. Again, I might be wrong.
BTW, you should now realize the importance of real names on SO. ;-) I can't give you real credit in the code. :-/
The Code
Anyway, I'd like to share my result so far because I can't believe that nobody else ever needs something like that. The code seems to work very well. Extensive tests are yet to be done, though. Please comment!
And without further ado …
/**
* Merges several regular expressions into one, using the indicated 'glue'.
*
* This function takes care of individual modifiers so it's safe to use
* <em>different</em> modifiers on the individual expressions. The order of
* sub-matches is preserved as well. Numbered back-references are adapted to
* the new overall sub-match count. This means that it's safe to use numbered
* back-references in the individual expressions!
* If {@link $names} is given, the individual expressions are captured in
* named sub-matches using the contents of that array as names.
* Matching pair-delimiters (e.g. <code>"{…}"</code>) are currently
* <strong>not</strong> supported.
*
* The function assumes that all regular expressions are well-formed.
* Behaviour is undefined if they aren't.
*
* This function was created after a {@link https://stackoverflow.com/questions/244959/
* StackOverflow discussion}. Much of it was written or thought of by
* “porneL” and “eyelidlessness”. Many thanks to both of them.
*
* @param string $glue A string to insert between the individual expressions.
* This should usually be either the empty string, indicating
* concatenation, or the pipe (<code>|</code>), indicating alternation.
* Notice that this string might have to be escaped since it is treated
* like a normal character in a regular expression (i.e. <code>/</code>)
* will end the expression and result in an invalid output.
* @param array $expressions The expressions to merge. The expressions may
* have arbitrary different delimiters and modifiers.
* @param array $names Optional. This is either an empty array or an array of
* strings of the same length as {@link $expressions}. In that case,
* the strings of this array are used to create named sub-matches for the
* expressions.
* @return string A string representing a regular expression equivalent to the
* merged expressions. Returns <code>FALSE</code> if an error occurred.
*/
function preg_merge($glue, array $expressions, array $names = array()) {
// … then, a miracle occurs.
// Sanity check …
$use_names = ($names !== null and count($names) !== 0);
if (
$use_names and count($names) !== count($expressions) or
!is_string($glue)
)
return false;
$result = array();
// For keeping track of the names for sub-matches.
$names_count = 0;
// For keeping track of *all* captures to re-adjust backreferences.
$capture_count = 0;
foreach ($expressions as $expression) {
if ($use_names)
$name = str_replace(' ', '_', $names[$names_count++]);
// Get delimiters and modifiers:
$stripped = preg_strip($expression);
if ($stripped === false)
return false;
list($sub_expr, $modifiers) = $stripped;
// Re-adjust backreferences:
// We assume that the expression is correct and therefore don't check
// for matching parentheses.
$number_of_captures = preg_match_all('/\([^?]|\(\?[^:]/', $sub_expr, $_);
if ($number_of_captures === false)
return false;
if ($number_of_captures > 0) {
// NB: This looks NP-hard. Consider replacing.
$backref_expr = '/
( # Only match when not escaped:
[^\\\\] # guarantee an even number of backslashes
(\\\\*?)\\2 # (twice n, preceded by something else).
)
\\\\ (\d) # Backslash followed by a digit.
/x';
$sub_expr = preg_replace_callback(
$backref_expr,
// A closure replaces create_function(), which was removed in PHP 8.0.
function ($m) use ($capture_count) {
return $m[1] . '\\' . ((int)$m[3] + $capture_count);
},
$sub_expr
);
$capture_count += $number_of_captures;
}
// Last, construct the new sub-match:
$modifiers = implode('', $modifiers);
$sub_modifiers = "(?$modifiers)";
if ($sub_modifiers === '(?)')
$sub_modifiers = '';
$sub_name = $use_names ? "?<$name>" : '?:';
$new_expr = "($sub_name$sub_modifiers$sub_expr)";
$result[] = $new_expr;
}
return '/' . implode($glue, $result) . '/';
}
/**
* Strips a regular expression string of its delimiters and modifiers.
* Additionally, normalize the delimiters (i.e. reformat the pattern so that
* it could have used '/' as delimiter).
*
* @param string $expression The regular expression string to strip.
* @return array An array whose first entry is the expression itself, the
* second an array of modifiers. If the argument is not a valid regular
* expression, returns <code>FALSE</code>.
*
*/
function preg_strip($expression) {
if (preg_match('/^(.)(.*)\\1([imsxeADSUXJu]*)$/s', $expression, $matches) !== 1)
return false;
$delim = $matches[1];
$sub_expr = $matches[2];
if ($delim !== '/') {
// Replace occurrences by the escaped delimiter by its unescaped
// version and escape new delimiter.
$sub_expr = str_replace("\\$delim", $delim, $sub_expr);
$sub_expr = str_replace('/', '\\/', $sub_expr);
}
$modifiers = $matches[3] === '' ? array() : str_split(trim($matches[3]));
return array($sub_expr, $modifiers);
}
PS: I've made this posting community wiki editable. You know what this means …!
I'm pretty sure it's not possible to just put regexps together like that in any language - they could have incompatible modifiers.
I'd probably just put them in an array and loop through them, or combine them by hand.
Edit: If you're doing them one at a time as described in your edit, you may be able to run the second one on a substring (from the start up to the earliest match). That might help things.
function preg_magic_coalasce($split, $re1, $re2) {
$re1 = rtrim($re1, "\/#is");
$re2 = ltrim($re2, "\/#");
return $re1.$split.$re2;
}
You could do it the alternative way like this:
$a = '# /[a-z] #i';
$b = '/ Moo /x';
$a_matched = preg_match($a, $text, $a_matches, PREG_OFFSET_CAPTURE);
$b_matched = preg_match($b, $text, $b_matches, PREG_OFFSET_CAPTURE);
if ($a_matched && $b_matched) {
// With PREG_OFFSET_CAPTURE, index 0 holds [matched text, byte offset] of the whole match.
list($a_text, $a_pos) = $a_matches[0];
list($b_text, $b_pos) = $b_matches[0];
if ($a_pos == $b_pos) {
if (strlen($a_text) == strlen($b_text)) {
// $a and $b matched the exact same string
} else if (strlen($a_text) > strlen($b_text)) {
// $a and $b started matching at the same spot but $a is longer
} else {
// $a and $b started matching at the same spot but $b is longer
}
} else if ($a_pos < $b_pos) {
// $a matched first
} else {
// $b matched first
}
} else if ($a_matched) {
// $a matched, $b didn't
} else if ($b_matched) {
// $b matched, $a didn't
} else {
// neither one matched
}