Coalescing regular expressions in PHP


Suppose I have the following two strings containing regular expressions. How do I coalesce them? More specifically, I want to have the two expressions as alternatives.
$a = '# /[a-z] #i';
$b = '/ Moo /x';
$c = preg_magic_coalesce('|', $a, $b);
// Desired result should be equivalent to:
// '/ \/[a-zA-Z] |Moo/'
Of course, doing this as string operations isn't practical because it would involve parsing the expressions, constructing syntax trees, coalescing the trees and then outputting another regular expression equivalent to the tree. I'm completely happy without this last step. Unfortunately, PHP doesn't have a RegExp class (or does it?).
Is there any way to achieve this? Incidentally, does any other language offer a way? Isn't this a pretty normal scenario? Guess not. :-(
Alternatively, is there a way to check efficiently if either of the two expressions matches, and which one matches earlier (and if they match at the same position, which match is longer)? This is what I'm doing at the moment. Unfortunately, I do this on long strings, very often, for more than two patterns. The result is slow (and yes, this is definitely the bottleneck).
EDIT:
I should have been more specific – sorry. $a and $b are variables, their content is outside of my control! Otherwise, I would just coalesce them manually. Therefore, I can't make any assumptions about the delimiters or regex modifiers used. Notice, for example, that my first expression uses the i modifier (ignore casing) while the second uses x (extended syntax). Therefore, I can't just concatenate the two because the second expression does not ignore casing and the first doesn't use the extended syntax (and any whitespace therein is significant!).

I see that porneL actually described a bunch of this, but this handles most of the problem. It cancels modifiers set in previous sub-expressions (which the other answer missed) and sets modifiers as specified in each sub-expression. It also handles non-slash delimiters (I could not find a specification of what characters are allowed here so I used ., you may want to narrow further).
One weakness is it doesn't handle back-references within expressions. My biggest concern with that is the limitations of back-references themselves. I'll leave that as an exercise to the reader/questioner.
// Pass as many expressions as you'd like
function preg_magic_coalesce() {
$active_modifiers = array();
$expression = '/(?:';
$sub_expressions = array();
foreach(func_get_args() as $arg) {
// Determine modifiers from sub-expression
if(preg_match('/^(.)(.*)\1([eimsuxADJSUX]+)$/', $arg, $matches)) {
$modifiers = preg_split('//', $matches[3]);
if($modifiers[0] == '') {
array_shift($modifiers);
}
if($modifiers[(count($modifiers) - 1)] == '') {
array_pop($modifiers);
}
$cancel_modifiers = $active_modifiers;
foreach($cancel_modifiers as $key => $modifier) {
if(in_array($modifier, $modifiers)) {
unset($cancel_modifiers[$key]);
}
}
$active_modifiers = $modifiers;
} elseif(preg_match('/(.)(.*)\1$/', $arg)) {
$cancel_modifiers = $active_modifiers;
$active_modifiers = array();
}
// If expression has modifiers, include them in sub-expression
$sub_modifier = '(?';
$sub_modifier .= implode('', $active_modifiers);
// Cancel modifiers from preceding sub-expression
if(count($cancel_modifiers) > 0) {
$sub_modifier .= '-' . implode('', $cancel_modifiers);
}
$sub_modifier .= ')';
$sub_expression = preg_replace('/^(.)(.*)\1[eimsuxADJSUX]*$/', $sub_modifier . '$2', $arg);
// Properly escape slashes
$sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);
$sub_expressions[] = $sub_expression;
}
// Join expressions
$expression .= implode('|', $sub_expressions);
$expression .= ')/';
return $expression;
}
Edit: I've rewritten this (because I'm OCD) and ended up with:
function preg_magic_coalesce($expressions = array(), $global_modifier = '') {
if(!preg_match('/^((?:-?[eimsuxADJSUX])+)$/', $global_modifier)) {
$global_modifier = '';
}
$expression = '/(?:';
$sub_expressions = array();
foreach($expressions as $sub_expression) {
$active_modifiers = array();
// Determine modifiers from sub-expression
if(preg_match('/^(.)(.*)\1((?:-?[eimsuxADJSUX])+)$/', $sub_expression, $matches)) {
$active_modifiers = preg_split('/(-?[eimsuxADJSUX])/',
$matches[3], -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
}
// If expression has modifiers, include them in sub-expression
if(count($active_modifiers) > 0) {
$replacement = '(?';
$replacement .= implode('', $active_modifiers);
$replacement .= ':$2)';
} else {
$replacement = '$2';
}
$sub_expression = preg_replace('/^(.)(.*)\1(?:(?:-?[eimsuxADJSUX])*)$/',
$replacement, $sub_expression);
// Properly escape slashes if another delimiter was used
$sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);
$sub_expressions[] = $sub_expression;
}
// Join expressions
$expression .= implode('|', $sub_expressions);
$expression .= ')/' . $global_modifier;
return $expression;
}
It now uses (?modifiers:sub-expression) rather than (?modifiers)sub-expression|(?cancel-modifiers)sub-expression but I've noticed that both have some weird modifier side-effects. For instance, in both cases if a sub-expression has a /u modifier, it will fail to match (but if you pass 'u' as the second argument of the new function, that will match just fine).

Strip delimiters and flags from each. This regex should do it:
/^(.)(.*)\1([imsxeADSUXJu]*)$/
Join expressions together. You'll need non-capturing parentheses to inject flags:
"(?$flags1:$regexp1)|(?$flags2:$regexp2)"
If there are any back-references, count the capturing parentheses and update the back-references accordingly (e.g. properly joined /(.)x\1/ and /(.)y\1/ is /(.)x\1|(.)y\2/ ).
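The renumbering step can be sketched roughly as follows. This is only an illustration under stated assumptions: shift_backrefs is a hypothetical helper name, only single-digit back-references are handled, and the capture count is a naive one that assumes the pattern contains no escaped parentheses or named groups.

```php
<?php
// Hypothetical helper: shift every numeric back-reference (\1 to \9) in a bare
// pattern by $offset, skipping digits that follow an escaped backslash.
function shift_backrefs(string $pattern, int $offset): string {
    return preg_replace_callback(
        '/(?<!\\\\)((?:\\\\\\\\)*)\\\\(\d)/',
        function (array $m) use ($offset) {
            return $m[1] . '\\' . ((int)$m[2] + $offset);
        },
        $pattern
    );
}

$first  = '(.)x\1';
$second = '(.)y\1';

// Naive capture count (assumption): every "(" that is not followed by "?".
$offset = preg_match_all('/\((?!\?)/', $first);

$joined = '/' . $first . '|' . shift_backrefs($second, $offset) . '/';
echo $joined, "\n";
```

With these two inputs the second expression's \1 is moved past the one capture group of the first expression.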

EDIT
I’ve rewritten the code! It now contains the changes that are listed as follows. Additionally, I've done extensive tests (which I won’t post here because they’re too many) to look for errors. So far, I haven’t found any.
The function is now split into two parts: There’s a separate function preg_strip which takes a regular expression and returns an array containing the bare expression (without delimiters) and an array of modifiers. This might come in handy (it already has, in fact; this is why I made this change).
The code now correctly handles back-references. This was necessary for my purpose after all. It wasn’t difficult to add, the regular expression used to capture the back-references just looks weird (and may actually be extremely inefficient, it looks NP-hard to me – but that’s only an intuition and only applies in weird edge cases). By the way, does anyone know a better way of checking for an uneven number of matches than my way? Negative lookbehinds won't work here because they only accept fixed-length strings instead of regular expressions. However, I need the regex here to test whether the preceding backslash is actually escaped itself.
Additionally, I don’t know how good PHP is at caching anonymous create_function use. Performance-wise, this might not be the best solution but it seems good enough.
I’ve fixed a bug in the sanity check.
I’ve removed the cancellation of obsolete modifiers since my tests show that it isn't necessary.
By the way, this code is one of the core components of a syntax highlighter for various languages that I’m working on in PHP since I’m not satisfied with the alternatives listed elsewhere.
Thanks!
porneL, eyelidlessness, amazing work! Many, many thanks. I had actually given up.
I've built upon your solution and I'd like to share it here. I didn't implement re-numbering back-references since this isn't relevant in my case (I think …). Perhaps this will become necessary later, though.
Some Questions …
One thing, @eyelidlessness: Why do you feel the necessity to cancel old modifiers? As far as I see it, this isn't necessary since the modifiers are only applied locally anyway.
Ah yes, one other thing. Your escaping of the delimiter seems overly complicated. Care to explain why you think this is needed? I believe my version should work as well but I could be very wrong.
Also, I've changed the signature of your function to match my needs. I also think that my version is more generally useful. Again, I might be wrong.
BTW, you should now realize the importance of real names on SO. ;-) I can't give you real credit in the code. :-/
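The point about modifiers being applied locally can be checked with a small experiment. The patterns below are made up for the demonstration. They show that an inline option set inside a group ends with that group, whereas an option set at the top level of an alternation carries into the later branches.

```php
<?php
// The "i" option is set inside the group and stops applying when the group closes.
$local = '/(?:(?i)a)b/';
$mixed = preg_match($local, 'Ab'); // "A" is matched case-insensitively, "b" is not
$upper = preg_match($local, 'AB'); // fails: "B" is outside the group

// Set at the top level, the option also applies to the branch after "|".
$leaky = preg_match('/(?i)a|b/', 'B');

var_dump($mixed, $upper, $leaky);
```

This is consistent with wrapping each sub-expression in its own group, as the code below does, so that no cancellation is needed.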
The Code
Anyway, I'd like to share my result so far because I can't believe that nobody else ever needs something like that. The code seems to work very well. Extensive tests are yet to be done, though. Please comment!
And without further ado …
/**
* Merges several regular expressions into one, using the indicated 'glue'.
*
* This function takes care of individual modifiers so it's safe to use
* <em>different</em> modifiers on the individual expressions. The order of
* sub-matches is preserved as well. Numbered back-references are adapted to
* the new overall sub-match count. This means that it's safe to use numbered
* back-references in the individual expressions!
* If {@link $names} is given, the individual expressions are captured in
* named sub-matches using the contents of that array as names.
* Matching pair-delimiters (e.g. <code>"{…}"</code>) are currently
* <strong>not</strong> supported.
*
* The function assumes that all regular expressions are well-formed.
* Behaviour is undefined if they aren't.
*
* This function was created after a {@link https://stackoverflow.com/questions/244959/
* StackOverflow discussion}. Much of it was written or thought of by
* “porneL” and “eyelidlessness”. Many thanks to both of them.
*
* @param string $glue A string to insert between the individual expressions.
* This should usually be either the empty string, indicating
* concatenation, or the pipe (<code>|</code>), indicating alternation.
* Notice that this string might have to be escaped since it is treated
* like a normal character in a regular expression (i.e. <code>/</code>
* will end the expression and result in an invalid output).
* @param array $expressions The expressions to merge. The expressions may
* have arbitrary different delimiters and modifiers.
* @param array $names Optional. This is either an empty array or an array of
* strings of the same length as {@link $expressions}. In that case,
* the strings of this array are used to create named sub-matches for the
* expressions.
* @return string A string representing a regular expression equivalent to the
* merged expressions. Returns <code>FALSE</code> if an error occurred.
*/
function preg_merge($glue, array $expressions, array $names = array()) {
// … then, a miracle occurs.
// Sanity check …
$use_names = ($names !== null and count($names) !== 0);
if (
$use_names and count($names) !== count($expressions) or
!is_string($glue)
)
return false;
$result = array();
// For keeping track of the names for sub-matches.
$names_count = 0;
// For keeping track of *all* captures to re-adjust backreferences.
$capture_count = 0;
foreach ($expressions as $expression) {
if ($use_names)
$name = str_replace(' ', '_', $names[$names_count++]);
// Get delimiters and modifiers:
$stripped = preg_strip($expression);
if ($stripped === false)
return false;
list($sub_expr, $modifiers) = $stripped;
// Re-adjust backreferences:
// We assume that the expression is correct and therefore don't check
// for matching parentheses.
$number_of_captures = preg_match_all('/\([^?]|\(\?[^:]/', $sub_expr, $_);
if ($number_of_captures === false)
return false;
if ($number_of_captures > 0) {
// NB: This looks NP-hard. Consider replacing.
$backref_expr = '/
( # Only match when not escaped:
[^\\\\] # guarantee an even number of backslashes
(\\\\*?)\\2 # (twice n, preceded by something else).
)
\\\\ (\d) # Backslash followed by a digit.
/x';
// A closure is used here because create_function() was removed in PHP 8.0.
$sub_expr = preg_replace_callback(
$backref_expr,
function ($m) use ($capture_count) {
return $m[1] . '\\' . ((int)$m[3] + $capture_count);
},
$sub_expr
);
$capture_count += $number_of_captures;
}
// Last, construct the new sub-match:
$modifiers = implode('', $modifiers);
$sub_modifiers = "(?$modifiers)";
if ($sub_modifiers === '(?)')
$sub_modifiers = '';
$sub_name = $use_names ? "?<$name>" : '?:';
$new_expr = "($sub_name$sub_modifiers$sub_expr)";
$result[] = $new_expr;
}
return '/' . implode($glue, $result) . '/';
}
/**
* Strips a regular expression string off its delimiters and modifiers.
* Additionally, normalize the delimiters (i.e. reformat the pattern so that
* it could have used '/' as delimiter).
*
* @param string $expression The regular expression string to strip.
* @return array An array whose first entry is the expression itself, the
* second an array of modifiers. If the argument is not a valid regular
* expression, returns <code>FALSE</code>.
*
*/
function preg_strip($expression) {
if (preg_match('/^(.)(.*)\\1([imsxeADSUXJu]*)$/s', $expression, $matches) !== 1)
return false;
$delim = $matches[1];
$sub_expr = $matches[2];
if ($delim !== '/') {
// Replace occurrences of the escaped delimiter with its unescaped
// version and escape the new delimiter.
$sub_expr = str_replace("\\$delim", $delim, $sub_expr);
$sub_expr = str_replace('/', '\\/', $sub_expr);
}
$modifiers = $matches[3] === '' ? array() : str_split(trim($matches[3]));
return array($sub_expr, $modifiers);
}
PS: I've made this posting community wiki editable. You know what this means …!

I'm pretty sure it's not possible to just put regexps together like that in any language - they could have incompatible modifiers.
I'd probably just put them in an array and loop through them, or combine them by hand.
Edit: If you're doing them one at a time as described in your edit, you may be able to run the second one on a substring (from the start up to the earliest match). That might help things.

function preg_magic_coalesce($split, $re1, $re2) {
$re1 = rtrim($re1, "\/#is");
$re2 = ltrim($re2, "\/#");
return $re1.$split.$re2;
}

You could do it the alternative way like this:
$a = '# /[a-z] #i';
$b = '/ Moo /x';
// PREG_OFFSET_CAPTURE turns each match into array(text, byte offset).
$a_matched = preg_match($a, $text, $a_matches, PREG_OFFSET_CAPTURE);
$b_matched = preg_match($b, $text, $b_matches, PREG_OFFSET_CAPTURE);
if ($a_matched && $b_matched) {
// Index 0 is the whole match; neither pattern defines a capture group.
list($a_text, $a_pos) = $a_matches[0];
list($b_text, $b_pos) = $b_matches[0];
if ($a_pos == $b_pos) {
if (strlen($a_text) == strlen($b_text)) {
// $a and $b matched the exact same string
} else if (strlen($a_text) > strlen($b_text)) {
// $a and $b started matching at the same spot but $a is longer
} else {
// $a and $b started matching at the same spot but $b is longer
}
} else if ($a_pos < $b_pos) {
// $a matched first
} else {
// $b matched first
}
} else if ($a_matched) {
// $a matched, $b didn't
} else if ($b_matched) {
// $b matched, $a didn't
} else {
// neither one matched
}
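For more than two patterns, the same comparison can be folded into a loop. The following is a minimal sketch, not a tuned implementation: first_match is a hypothetical helper name and the sample text is invented. It still runs every pattern once, so it does not remove the cost the question complains about.

```php
<?php
// Hypothetical helper: returns array(pattern index, offset, matched text) for the
// pattern that matches earliest in $text (the longer match wins a tie), or null.
function first_match(array $patterns, string $text): ?array {
    $best = null;
    foreach ($patterns as $i => $pattern) {
        if (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE) !== 1) {
            continue; // no match (or a pattern error): skip this pattern
        }
        list($matched, $pos) = $m[0];
        if ($best === null
            || $pos < $best[1]
            || ($pos === $best[1] && strlen($matched) > strlen($best[2]))) {
            $best = array($i, $pos, $matched);
        }
    }
    return $best;
}

$patterns = array('# /[a-z] #i', '/ Moo /x');
$best = first_match($patterns, 'say Moo /Q then');
print_r($best);
```

Here the second pattern wins because "Moo" starts at offset 4, before the first pattern's match at offset 7.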

Related

PHP extract comparison operator

I was asked on an interview what would be the fastest way to extract the comparison operator between two statements.
For example, in rate>=4 the comparison operator is '>='. The function should be able to extract '>', '<', '!=', '=', '<=' and '>='.
The function must return the comparison operator.
This is what I wrote, and they marked it as wrong.
function extractcomp($str)
{
$temp = [];
$matches = array('>','<','!','=');
foreach($matches as $match)
{
if(strpos($str,$match)!== false)
{
$temp[] = $match;
}
}
return implode('',$temp);
}
Does anyone have a better way?

You can read the string character by character: once you hit the first occurrence of an operator character, you can determine what the next character is going to be, i.e.:
$ops = ['>','<','!','='];
$str = "rate!=4";
foreach($ops as $op)
{
if(($c1 = strpos($str, $op)) !== false)
{
$c2 = $str[$c1++] . (($str[$c1] == $ops[3]) ? $str[$c1] : "");
break;
}
}
echo $c2;
So if the first character found is ">", the second one can only be "=" or absent. You get the index of the first character, increment it, and check whether the next character is "=". Then return the value. The loop runs until it finds the first occurrence, then breaks.
EDIT:
here's another solution:
$str = "rate!=4";
$arr = array_intersect(str_split($str), ['>','<','=','!']);
echo implode('', $arr);
Not as fast as the loop, but it definitely cuts down on code bloat.
There's always a better way to optimize the code.

Unless they have some monkeywrenching strings to throw at this custom function, I recommend trim() with a ranged character mask. Something like echo trim('rate>=4',"A..Za..z0..9"); would work for your sample input in roughly half the time.
Code:
function extractcomp($str){
return trim($str,"A..Za..z0..9");
}
echo extractcomp("rate>=4");
Regarding regex, better efficiency in terms of step count with preg_match() would be to use a character class to match the operators.
Assuming only valid operators will be used, you can use /[><!=]+/ or, if you want to tighten up the length, /[><!=]{1,3}/
Just 8 steps on your sample input string.
This is less strict than Andreas' | based pattern, but takes fewer steps.
It depends on how strict the pattern must be. My pattern will match !==.
If you want to improve your loop method, write a break after you have matched the entire comparison operator.
Actually, you are looping the operators. That would have been their issue (or one of them). Your method will not match ==. I'm not sure if that is a possible comparison (it is not in your list).
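For comparison, a stricter alternation-based pattern could look like the sketch below. The function name extract_operator is hypothetical, and the operator list is taken from the question (so == is deliberately left out). Two-character operators are listed first so that ">=" is preferred over ">".

```php
<?php
// Hypothetical helper: returns the first comparison operator found in $str,
// or null when the string contains none of the listed operators.
function extract_operator(string $str): ?string {
    if (preg_match('/>=|<=|!=|[<>=]/', $str, $m) === 1) {
        return $m[0];
    }
    return null;
}

echo extract_operator('rate>=4'), "\n";
```

Unlike the character-class version, this will not return sequences such as !== as a single operator.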

PHP regex pattern match recursive

I have this in a function that is supposed to replace any parenthesised sequence with its contents, so that (abc) becomes abc anywhere it appears, even recursively, because parentheses can be nested.
$return = preg_replace_callback(
'|(\((.+)\))+|',
function ($matches) {
return $matches[2];
},
$s
);
When the above regex is fed the string "a(bcdefghijkl(mno)p)q" as input, it returns "ap)onm(lkjihgfedcbq". This shows the regex is matched only once. What can I do to make it continue matching inside matches it has already made and produce "abcdefghijklmnopq"?

To match balanced parenthetical substrings you may use the well-known \((?:[^()]++|(?R))*\) pattern (described in Matching Balanced Constructs) inside a preg_replace_callback call, where the match value can be further manipulated (just remove all ( and ) symbols from the match, which is easy to do even without a regex):
$re = '/\((?:[^()]++|(?R))*\)/';
$str = 'a(bcdefghijkl(mno)p)q((('; // Added three ( at the end
$result = preg_replace_callback($re, function($m) {
return str_replace(array('(',')'), '', $m[0]);
}, $str);
echo $result; // => abcdefghijklmnopq(((
To get overlapping matches, you need to use a known technique, capturing inside a positive lookahead. You won't be able to perform two operations at once (replacing and matching), but you can run the matching first and then replace:
$re = '/(?=(\((?:[^()]++|(?1))*\)))/';
$str = 'a(bcdefghijkl(mno)p)q(((';
preg_match_all($re, $str, $m);
print_r($m[1]);
// => Array ( [0] => (bcdefghijkl(mno)p) [1] => (mno) )

Try this one,
preg_match('/\((?:[^\(\)]*+|(?0))*\)/', $str )
https://regex101.com/r/NsQSla/1
It will match everything inside of the ( ) as long as they are matched pairs.
Example
(abc) (abc (abc))
will have the following matches
Match 1
Full match 0-5 `(abc)`
Match 2
Full match 6-17 `(abc (abc))`

It is slightly unclear exactly what the postcondition of the algorithm is supposed to be. It seems to me that you are wanting to strip out matching pairs of ( ). The assumption here is that unmatched parentheses are left alone (otherwise you just strip out all of the ('s and )'s).
So I guess this means the input string a(bcdefghijkl(mno)p)q becomes abcdefghijklmnopq but the input string a(bcdefghijkl(mno)pq becomes a(bcdefghijklmnopq. Likewise an input string (a)) would become a).
It may be possible to do this using PCRE since it does provide some non-regular features but I'm doubtful about it. The language of the input strings is not regular; it's context-free. What @ArtisticPhoenix's answer does is match complete pairs of matched parentheses. What it does not do is match all nested pairs. This nested matching is inherently non-regular in my humble understanding of language theory.
I suggest writing a parser to strip out the matching pairs of parentheses. It gets a little wordy having to account for productions that fail to match:
<?php
// Parse the punctuator sub-expression (i.e. anything within ( ... ) ).
function parse_punc(array $tokens,&$iter) {
if (!isset($tokens[$iter])) {
return;
}
$inner = parse_punc_seq($tokens,$iter);
if (!isset($tokens[$iter]) || $tokens[$iter] != ')') {
// Leave unmatched open parentheses alone.
$inner = "($inner";
}
$iter += 1;
return $inner;
}
// Parse a sequence (inside punctuators).
function parse_punc_seq(array $tokens,&$iter) {
if (!isset($tokens[$iter])) {
return;
}
$tok = $tokens[$iter];
if ($tok == ')') {
return;
}
$iter += 1;
if ($tok == '(') {
$tok = parse_punc($tokens,$iter);
}
$tok .= parse_punc_seq($tokens,$iter);
return $tok;
}
// Parse a sequence (outside punctuators).
function parse_seq(array $tokens,&$iter) {
if (!isset($tokens[$iter])) {
return;
}
$tok = $tokens[$iter++];
if ($tok == '(') {
$tok = parse_punc($tokens,$iter);
}
$tok .= parse_seq($tokens,$iter);
return $tok;
}
// Wrapper for parser.
function parse(array $tokens) {
$iter = 0;
return strval(parse_seq($tokens,$iter));
}
// Grab input from stdin and run it through the parser.
$str = trim(stream_get_contents(STDIN));
$tokens = preg_split('/([\(\)])/',$str,-1,PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
var_dump(parse($tokens));
I know this is a lot more code than a regex one-liner but it does solve the problem as I understand it. I'd be interested to know if anyone can solve this problem with a regular expression.
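One way a regular expression can get there is to apply it repeatedly rather than once. The sketch below is a different technique from the parser above, under the same reading of the problem (matched pairs are removed, unmatched parentheses are left alone); strip_matched_parens is a hypothetical name. Each pass removes the innermost pairs, and the loop stops when a pass replaces nothing.

```php
<?php
// Hypothetical helper: strips matched pairs of parentheses, innermost first.
function strip_matched_parens(string $s): string {
    do {
        // "(" ... ")" with no parentheses in between is an innermost pair.
        $s = preg_replace('/\(([^()]*)\)/', '$1', $s, -1, $count);
    } while ($count > 0);
    return $s;
}

echo strip_matched_parens('a(bcdefghijkl(mno)p)q'), "\n";
```

The number of passes is bounded by the nesting depth, so deeply nested input costs one scan per level.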

very large php string magically turns into array

I am getting an "Array to string conversion" error in PHP.
I am using the "variable" (which should be a string) as the third parameter to str_replace. So in summary (a very simplified version of what's going on):
$str = "very long string";
str_replace("tag", $some_other_array, $str);
$str is throwing the error, and I have been trying to fix it all day. The things I have tried are:
if(is_array($str)) die("its somehow an array");
serialize($str); //inserted this before str_replace call.
I have spent all day on it, and no, it's not something stupid like variables the wrong way around; it is something bizarre. I have even dumped it to a file and it's a string.
My hypothesis:
The string is too long for PHP to deal with, so it turns into an array.
The $str value in this case is nested and called recursively, the general flow could be explained like this:
--code
//pass by reference
function the_function ($something, &$OFFENDING_VAR, $something_else) {
while(preg_match($something, $OFFENDING_VAR)) {
$OFFENDING_VAR = str_replace($x, $y, $OFFENDING_VAR); // this is the error
}
}
So it may be something strange due to str_replace, but that would mean that at some point str_replace would have to return an array.
Please help me work this out, it's very confusing and I have wasted a day on it.
---- ORIGINAL FUNCTION CODE -----
//This function gets called with multiple different "Target Variables" Target is the subject
//line, from and body of the email filled with << tags >> so the str_replace function knows
//where to replace them
function perform_replacements($replacements, &$target, $clean = TRUE,
$start_tag = '<<', $end_tag = '>>', $max_substitutions = 5) {
# Construct separate tag and replacement value arrays for use in the substitution loop.
$tags = array();
$replacement_values = array();
foreach ($replacements as $tag_text => $replacement_value) {
$tags[] = $start_tag . $tag_text . $end_tag;
$replacement_values[] = $replacement_value;
}
# TODO: this badly needs refactoring
# TODO: auto upgrade <<foo>> to <<foo_html>> if foo_html exists and acting on html template
# Construct a regular expression for use in scanning for tags.
$tag_match = '/' . preg_quote($start_tag) . '\w+' . preg_quote($end_tag) . '/';
# Perform the substitution until all valid tags are replaced, or the maximum substitutions
# limit is reached.
$substitution_count = 0;
while (preg_match ($tag_match, $target) && ($substitution_count++ < $max_substitutions)) {
$target = serialize($target);
$temp = str_replace($tags,
$replacement_values,
$target); //This is the line that is failing.
unset($target);
$target = $temp;
}
if ($clean) {
# Clean up any unused search values.
$target = preg_replace($tag_match, '', $target);
}
}

How do you know $str is the problem and not $some_other_array?
From the manual:
If search and replace are arrays, then str_replace() takes a value
from each array and uses them to search and replace on subject. If
replace has fewer values than search, then an empty string is used for
the rest of replacement values. If search is an array and replace is a
string, then this replacement string is used for every value of
search. The converse would not make sense, though.
The second parameter can only be an array if the first one is as well.
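The combinations the manual describes can be tried out directly. The tags and text below are invented for the demonstration and are not taken from the question's data.

```php
<?php
$subject = 'Hi <<name>>, your order <<id>> shipped.';
$tags = array('<<name>>', '<<id>>');

// Array search + array replace: values are paired up by position.
$paired = str_replace($tags, array('Ann', '42'), $subject);

// Array search + string replace: the same string is used for every tag.
$blanked = str_replace($tags, '?', $subject);

// Fewer replacements than searches: the leftover tags become empty strings.
$short = str_replace($tags, array('Ann'), $subject);

echo $paired, "\n", $blanked, "\n", $short, "\n";
```

One possible cause of the error in the question (an assumption, not a diagnosis) is that an element of the replacement array is itself an array; checking each value with is_array() before the call would confirm or rule that out.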

Replacing variables in a string

I am working on a multilingual website in PHP, and in my language files I often have strings that contain multiple variables which will later be filled in to complete the sentences.
Currently I am placing {VAR_NAME} in the string and manually replacing each occurrence with its matching value when used.
So basically :
{X} created a thread on {Y}
becomes :
Dany created a thread on Stack Overflow
I have already thought of sprintf, but I find it inconvenient because it depends on the order of the variables, which can change from one language to another.
And I have already checked How replace variable in string with value in php? and for now I basically use this method.
But I am interested in knowing if there is a built-in (or maybe not) convenient way in PHP to do that, considering that I already have variables named exactly X and Y as in the previous example, more like $$ for a variable variable.
So instead of doing str_replace on the string I would maybe call a function like so:
$X = 'Dany';
$Y = 'Stack Overflow';
$lang['example'] = '{X} created a thread on {Y}';
echo parse($lang['example']);
would also print out :
Dany created a thread on Stack Overflow
Thanks!
Edit
The strings serve as templates and can be used multiple times with different inputs.
So basically doing "{$X} ... {$Y}" won't do the trick because I will lose the template, and the string will be initialized with the starting values of $X and $Y, which aren't yet determined.

I'm going to add an answer here because none of the current answers really cut the mustard in my view. I'll dive straight in and show you the code I would use to do this:
function parse(
/* string */ $subject,
array $variables,
/* string */ $escapeChar = '#',
/* string */ $errPlaceholder = null
) {
$esc = preg_quote($escapeChar, '/'); // also escape the "/" delimiter used below
$expr = "/
$esc$esc(?=$esc*+{)
| $esc{
| {(\w+)}
/x";
$callback = function($match) use($variables, $escapeChar, $errPlaceholder) {
switch ($match[0]) {
case $escapeChar . $escapeChar:
return $escapeChar;
case $escapeChar . '{':
return '{';
default:
if (isset($variables[$match[1]])) {
return $variables[$match[1]];
}
return isset($errPlaceholder) ? $errPlaceholder : $match[0];
}
};
return preg_replace_callback($expr, $callback, $subject);
}
What does that do?
In a nutshell:
Create a regular expression using the specified escape character that will match one of three sequences (more on that below)
Feed that into preg_replace_callback(), where the callback handles two of those sequences exactly and treats everything else as a replacement operation.
Return the resulting string
The regex
The regex matches any one of these three sequences:
Two occurrences of the escape character, followed by zero or more occurrences of the escape character, followed by an opening curly brace. Only the first two occurrences of the escape character are consumed. This is replaced by a single occurrence of the escape character.
A single occurrence of the escape character followed by an opening curly brace. This is replaced by a literal open curly brace.
An opening curly brace, followed by one or more perl word characters (alpha-numerics and the underscore character) followed by a closing curly brace. This is treated as a placeholder and a lookup is performed for the name between the braces in the $variables array, if it is found then return the replacement value, if not then return the value of $errPlaceholder - by default this is null, which is treated as a special case and the original placeholder is returned (i.e. the string is not modified).
Why is it better?
To understand why it's better, let's look at the replacement approaches taken by other answers. With one exception (the only failing of which is compatibility with PHP<5.4 and slightly non-obvious behaviour), these fall into two categories:
strtr() - This provides no mechanism for handling an escape character. What if your input string needs a literal {X} in it? strtr() does not account for this, and it would be substituted for the value $X.
str_replace() - this suffers from the same issue as strtr(), and another problem as well. When you call str_replace() with an array argument for the search/replace arguments, it behaves as if you had called it multiple times - one for each of the array of replacement pairs. This means that if one of your replacement strings contains a value that appears later in the search array, you will end up substituting that as well.
To demonstrate this issue with str_replace(), consider the following code:
$pairs = array('A' => 'B', 'B' => 'C');
echo str_replace(array_keys($pairs), array_values($pairs), 'AB');
Now, you'd probably expect the output here to be BC but it will actually be CC - this is because the first iteration replaced A with B, and in the second iteration the subject string was BB - so both of these occurrences of B were replaced with C.
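To see the chaining in isolation, the same pairs can be run through both functions. This is only a side-by-side illustration: strtr() with an array works on the original string in a single pass, so it does not re-replace its own output, but it still has no escape mechanism as noted above.

```php
<?php
$pairs = array('A' => 'B', 'B' => 'C');

// Sequential: A becomes B first, then every B (old and new) becomes C.
$sequential = str_replace(array_keys($pairs), array_values($pairs), 'AB');

// Single pass: each character of the original string is replaced at most once.
$singlePass = strtr('AB', $pairs);

echo $sequential, ' ', $singlePass, "\n";
```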
This issue also betrays a performance consideration that might not be immediately obvious - because each pair is handled separately, the cost grows with the number of replacement pairs: for each pair the entire string is searched and that single replacement operation handled. If you had a very large subject string and a lot of replacement pairs, that's a sizeable operation going on under the bonnet.
Arguably this performance consideration is a non-issue - you would need a very large string and a lot of replacement pairs before you got a meaningful slowdown, but it's still worth remembering. It's also worth remembering that regex has performance penalties of its own, so in general this consideration shouldn't be included in the decision-making process.
Instead we use preg_replace_callback(). This visits any given part of the string looking for matches exactly once, within the bounds of the supplied regular expression. I add this qualifier because if you write an expression that causes catastrophic backtracking then it will be considerably more than once, but in this case that shouldn't be a problem (to help avoid this I made the only repetition in the expression possessive).
We use preg_replace_callback() instead of preg_replace() to allow us to apply custom logic while looking for the replacement string.
What this allows you to do
The original example from the question
$X = 'Dany';
$Y = 'Stack Overflow';
$lang['example'] = '{X} created a thread on {Y}';
echo parse($lang['example']);
This becomes:
$pairs = array(
'X' => 'Dany',
'Y' => 'Stack Overflow',
);
$lang['example'] = '{X} created a thread on {Y}';
echo parse($lang['example'], $pairs);
// Dany created a thread on Stack Overflow
Something more advanced
Now let's say we have:
$lang['example'] = '{X} created a thread on {Y} and it contained {X}';
// Dany created a thread on Stack Overflow and it contained Dany
...and we want the second {X} to appear literally in the resulting string. Using the default escape character of #, we would change it to:
$lang['example'] = '{X} created a thread on {Y} and it contained #{X}';
// Dany created a thread on Stack Overflow and it contained {X}
OK, looks good so far. But what if that # was supposed to be a literal?
$lang['example'] = '{X} created a thread on {Y} and it contained ##{X}';
// Dany created a thread on Stack Overflow and it contained #Dany
Note that the regular expression has been designed to only pay attention to escape sequences that immediately precede an opening curly brace. This means that you don't need to escape the escape character unless it appears immediately in front of a placeholder.
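For readers who want to experiment with the escaping rules shown in these examples, here is a minimal sketch that reproduces them. It is not the parse() function from this answer: parse_sketch() and its pattern are assumptions written only to match the behaviour described, and it assumes a non-empty escape sequence and placeholder names made of word characters.

```php
<?php
// Sketch only: parse_sketch() is a hypothetical stand-in, not the answer's parse().
function parse_sketch($template, array $pairs, $escape = '#')
{
    $esc = preg_quote($escape, '/');
    // Group 1: any run of escape sequences directly before "{" (possessive).
    // Group 2: the placeholder name.
    $pattern = '/((?:' . $esc . ')*+)\{(\w++)\}/';

    return preg_replace_callback($pattern, function ($m) use ($pairs, $escape) {
        $count = (int) (strlen($m[1]) / strlen($escape));
        // Every pair of escapes collapses to one literal escape.
        $prefix = str_repeat($escape, (int) floor($count / 2));
        $placeholder = '{' . $m[2] . '}';
        if ($count % 2 === 1 || !array_key_exists($m[2], $pairs)) {
            // Odd number of escapes (or unknown name): keep the placeholder literally.
            return $prefix . $placeholder;
        }
        return $prefix . $pairs[$m[2]];
    }, $template);
}

$pairs = array('X' => 'Dany', 'Y' => 'Stack Overflow');
echo parse_sketch('{X} on {Y} and it contained #{X}', $pairs), "\n";
echo parse_sketch('{X} on {Y} and it contained ##{X}', $pairs), "\n";
```

Because the escape run is only captured when it sits directly in front of an opening brace, a # anywhere else in the template passes through unchanged.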
A note about the use of an array as an argument
Your original code sample uses variables named the same way as the placeholders in the string. Mine uses an array with named keys. There are two very good reasons for this:
Clarity and security - it's much easier to see what will end up being substituted, and you don't risk accidentally substituting variables you don't want to be exposed. It wouldn't be much good if someone could simply feed in {dbPass} and see your database password, now would it?
Scope - it's not possible to import variables from the calling scope unless the caller is the global scope. This makes the function useless if called from another function, and importing data from another scope is very bad practice.
If you really want to use named variables from the current scope (and I do not recommend this due to the aforementioned security issues) you can pass the result of a call to get_defined_vars() to the second argument.
A note about choosing an escape character
You'll notice I chose # as the default escape character. You can use any character (or sequence of characters, it can be more than one) by passing it to the third argument - and you may be tempted to use \ since that's what many languages use, but hold on before you do that.
The reason you don't want to use \ is because many languages use it as their own escape character, which means that when you want to specify your escape character in, say, a PHP string literal, you run into this problem:
$lang['example'] = '\\{X}'; // results in {X}
$lang['example'] = '\\\{X}'; // results in \Dany
$lang['example'] = '\\\\{X}'; // results in \Dany
It can lead to a readability nightmare, and some non-obvious behaviour with complex patterns. Pick an escape character that is not used by any other language involved (for example, if you are using this technique to generate fragments of HTML, don't use & as an escape character either).
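The confusion comes from PHP collapsing backslashes before the template parser ever sees them. A quick check, using only the literals from the example above, shows which of them are actually the same string:

```php
<?php
// In single-quoted PHP strings only \\ and \' are escape sequences.
var_dump('\\{X}' === '\{X}');     // bool(true): both are the 4 characters \{X}
var_dump('\\\{X}' === '\\\\{X}'); // bool(true): both are the 5 characters \\{X}
var_dump(strlen('\\{X}'));        // int(4)
var_dump(strlen('\\\\{X}'));      // int(5)
```

So three or four backslashes in the source both hand the parser two backslashes, which is why the last two lines of the example give the same result.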
To sum up
What you are doing has edge-cases. To solve the problem properly, you need to use a tool capable of handling those edge-cases - and when it comes to string manipulation, the tool for the job is most often regex.

Here's a portable solution, using variable variables. Yay!
$string = "I need to replace {X} and {Y}";
$X = 'something';
$Y = 'something else';
preg_match_all('/\{(.*?)\}/', $string, $matches);
foreach ($matches[1] as $value)
{
$string = str_replace('{'.$value.'}', ${$value}, $string);
}
First you set up your string and your replacements. Then you run a regular expression to collect the placeholders: $matches[0] holds the full matches including the braces, and $matches[1] holds just the names between them. Finally, you loop over those names and replace each placeholder with the variable of the same name, using variable variables. Lovely!
Just thought I'd update this with another option even though you've marked it as correct. You don't have to use variable variables, and an array can be used in its place.
$map = array(
'X' => 'something',
'Y' => 'something else'
);
preg_match_all('/\{(.*?)\}/', $string, $matches);
foreach ($matches[1] as $value)
{
$string = str_replace('{'.$value.'}', $map[$value], $string);
}
That would allow you to create a function with the following signature:
public function parse($string, $map); // Probably what I'd do tbh
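As a rough sketch of what such a function could look like, here is the loop from above wrapped up as a plain function. The name parse_with_map() is hypothetical, and skipping placeholders that have no entry in the map is an assumption, not something the original loop does.

```php
<?php
// Hypothetical helper name; only the ($string, $map) signature comes from the text above.
function parse_with_map($string, array $map)
{
    preg_match_all('/\{(.*?)\}/', $string, $matches);
    foreach (array_unique($matches[1]) as $name) {
        // Leave placeholders without a mapping untouched instead of raising a notice.
        if (array_key_exists($name, $map)) {
            $string = str_replace('{' . $name . '}', $map[$name], $string);
        }
    }
    return $string;
}

echo parse_with_map('I need to replace {X} and {Y}', array(
    'X' => 'something',
    'Y' => 'something else',
)), "\n"; // I need to replace something and something else
```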
Another option thanks to toolmakersteve in the comments does away with the need for a loop and uses strtr, but requires minor additions to the variables and single quotes instead of double quotes:
$string = 'I need to replace {$X} and {$Y}';
$map = array(
'{$X}' => 'something',
'{$Y}' => 'something else'
);
$string = strtr($string, $map);

If you're running PHP 5.4 or later and you care about being able to use PHP's built-in variable interpolation in the string, you can use the bindTo() method of Closure like so:
// Strings use interpolation, but have to return themselves from an anon func
$strings = [
'en' => [
'message_sent' => function() { return "You just sent a message to $this->recipient that said: $this->message."; }
],
'es' => [
'message_sent' => function() { return "Acabas de enviar un mensaje a $this->recipient que dijo: $this->message."; }
]
];
class LocalizationScope {
private $data;
public function __construct($data) {
$this->data = $data;
}
public function __get($param) {
if(isset($this->data[$param])) {
return $this->data[$param];
}
return '';
}
}
// Bind the string anon func to an object of the array data passed in and invoke (returns string)
function localize($stringCb, $data) {
return $stringCb->bindTo(new LocalizationScope($data))->__invoke();
}
// Demo
foreach($strings as $str) {
var_dump(localize($str['message_sent'], array(
'recipient' => 'Jeff Atwood',
'message' => 'The project should be done in 6 to 8 weeks.'
)));
}
//string(93) "You just sent a message to Jeff Atwood that said: The project should be done in 6 to 8 weeks."
//string(95) "Acabas de enviar un mensaje a Jeff Atwood que dijo: The project should be done in 6 to 8 weeks."
Perhaps, it feels a bit hacky, and I don't particularly like using $this in this instance. But you do get the added benefit of relying on PHP's variable interpolation (which allows you to do things like escaping, that are difficult to achieve with regex).
EDIT: Added LocalizationScope, which adds another benefit: no warnings if localization anonymous functions try to access data that was not provided.

strtr is probably a better choice for this kind of thing, because it replaces the longest keys first:
$data = array(
'X' => 'Dany',
'Y' => 'Stack Overflow',
);
$repls = array();
foreach($data as $key => $value)
$repls['{' . $key . '}'] = $value;
$result = strtr($text, $repls);
(think of situations where you have keys like XX and X)
And if you don't want to use an array and instead expose all variables from the current scope:
$data = get_defined_vars();
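The XX and X case is easy to try out. The following sketch uses bare keys without braces (an assumption made purely so the keys overlap) and compares strtr with str_replace on the same input:

```php
<?php
$repls = array('X' => 'Dany', 'XX' => 'Stack Overflow');
$text = 'XX and X';

// strtr() tries the longest key first and never rescans text it has already replaced.
$viaStrtr = strtr($text, $repls);
echo $viaStrtr, "\n"; // Stack Overflow and Dany

// str_replace() applies the pairs in array order, so X is replaced inside XX first.
$viaStrReplace = str_replace(array_keys($repls), array_values($repls), $text);
echo $viaStrReplace, "\n"; // DanyDany and Dany
```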

If your only issue with sprintf is the order of the arguments you can use argument swapping.
From the doc (http://php.net/manual/en/function.sprintf.php):
$format = 'The %2$s contains %1$d monkeys';
echo sprintf($format, $num, $location);
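A self-contained version of that snippet, with sample values filled in as assumptions, shows that the argument list stays the same while each format string picks its own order:

```php
<?php
$num = 5;           // sample value
$location = 'tree'; // sample value

// Same arguments, different placeholder order in each format string.
$first = sprintf('There are %1$d monkeys in the %2$s', $num, $location);
$second = sprintf('The %2$s contains %1$d monkeys', $num, $location);

echo $first, "\n";  // There are 5 monkeys in the tree
echo $second, "\n"; // The tree contains 5 monkeys
```

Note the single quotes around the format strings: inside double quotes PHP would try to interpolate $s and $d as variables.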

gettext is a widely used universal localization system that does exactly what you want.
There are libraries for most programming languages and PHP has a built-in engine.
It is driven by po-files, simple text based format, for which there are many editors around and it is compatible with sprintf syntax.
It even has some functions to deal with things like complicated plurals that some languages have.
Here are some examples of what it does. Note that _() is an alias for gettext():
echo _('Hello world'); // will output hello world in the current selected language
echo sprintf(_("%s has created a thread on %s"), $name, $site); // translates the string, and hands it over to sprintf()
echo sprintf(_('%2$s has created a thread on %1$s'), $site, $name); // same as above, but with changed order of parameters (single quotes, so PHP doesn't try to interpolate $s)
If you have more than a handful of strings, you should definitely use an existing engine, rather than writing your own one.
Adding a new language is just a matter of translating a list of strings and most professional translation tools can work with this file format, too.
Check Wikipedia and the PHP documentation for a basic overview on how this works:
http://en.wikipedia.org/wiki/Gettext
http://de.php.net/gettext
Google finds heaps of documentation and your favourite software repository will most likely have a handful of tools for managing po-files.
Some that I have used are:
poedit: Very light and simple. Good if you don't have too much stuff to translate and don't want to spend time thinking about how that stuff works.
Virtaal: A bit more complex and has a bit of a learning curve, but also some nice features that make your life easier. Good if you need to translate a lot.
GlotPress is a web application (from the wordpress people) that allows collaborative editing of the translation database files.

Why not use str_replace then, if you want to use it as a template?
echo str_replace(array('{X}', '{Y}'), array($X, $Y), $lang['example']);
Repeat this wherever you need to output a template.
str_replace was built for this in the first place.

How about defining the "variable" parts as an array with keys corresponding to the placeholders in your string?
$string = "{X} created a thread on {Y}";
$values = array(
'X' => "Danny",
'Y' => "Stack Overflow",
);
echo str_replace(
array_map(function($v) { return '{'.$v.'}'; }, array_keys($values)),
array_values($values),
$string
);

Why can't you just use the template string within a function?
function threadTemplate($x, $y) {
return "{$x} created a thread on {$y}";
}
echo threadTemplate($foo, $bar);

Simple:
$X = 'Dany';
$Y = 'Stack Overflow';
$lang['example'] = "{$X} created a thread on {$Y}";
Hence:
echo $lang['example'];
Will output:
Dany created a thread on Stack Overflow
As you requested.
UPDATE:
As per the OP's comments about making the solution more portable:
Have a class do the parsing for you each time:
class MyParser {
function parse($x, $y) {
return "{$x} created a thread on {$y}";
}
}
That way, if the following occurs:
$X = 3;
$Y = 4;
$a = new MyParser();
$lang['example'] = $a->parse($X, $Y);
echo $lang['example'];
Which will return:
3 created a thread on 4
And, double checking:
$X = 'Steve';
$Y = 10.9;
$lang['example'] = $a->parse($X, $Y);
Will print:
Steve created a thread on 10.9
As desired.
UPDATE 2:
As per the OP's comments about improving portability:
class MyParser {
function parse($vstr) {
return "{$vstr}";
}
}
$a = new MyParser();
$X = 3;
$Y = 4;
$vstr = "{$X} created a thread on {$Y}";
$lang['example'] = $a->parse($vstr);
echo $lang['example'];
Will output the results cited previously.

Try
$lang['example'] = "$X created a thread on $Y";
EDIT: Based on latest info
Maybe you need to look at the sprintf() function
Then you could have your template string defined as this
$template_string = '%s created a thread on %s';
$X = 'Fred';
$Y = 'Sunday';
echo sprintf( $template_string, $X, $Y );
$template_string does not change but later in your code when you have assigned different values to $X and $Y you can still use the echo sprintf( $template_string, $X, $Y );
See PHP Manual

Just throwing in another solution, using associative arrays. This loops through the associative array and either replaces each placeholder or leaves it blank.
example:
$list = array();
$list['X'] = 'Dany';
$list['Y'] = 'Stack Overflow';
$str = '{X} created a thread on {Y}';
$newstring = textReplaceContent($str,$list);
function textReplaceContent($contents, $list) {
// each() was removed in PHP 8; foreach visits the same key/value pairs
foreach ($list as $key => $val) {
$key = "{" . $key . "}";
if ($val) {
$contents = str_replace($key, $val, $contents);
} else {
$contents = str_replace($key, "", $contents);
}
}
$final = preg_replace('/\[\w+\]/', '', $contents);
return ($final);
}

Regex to match specific string not enclosed by another, different specific string

I need a regex to match a string not enclosed by another different, specific string. For instance, in the following situation it would split the content into two groups: 1) The content before the second {Switch} and 2) The content after the second {Switch}. It wouldn't match the first {Switch} because it is enclosed by {my_string}'s. The string will always look like shown below (i.e. {my_string}any content here{/my_string})
Some more
{my_string}
Random content
{Switch} //This {Switch} may or may not be here, but should be ignored if it is present
More random content
{/my_string}
Content here too
{Switch}
More content
So far I've gotten what is below which I know isn't very close at all:
(.*?)\{Switch\}(.*?)
I'm just not sure how to use the [^] (not operator) with a specific string versus different characters.

It really seems you're trying to use a regular expression to parse a grammar - something that regular expressions are really bad at doing. You might be better off writing a parser to break down your string into the tokens that build it, and then processing that tree.
Perhaps something like http://drupal.org/project/grammar_parser might help.

Try this simple function:
function find_content($doc) {
$temp = $doc;
preg_match_all('~{my_string}.*?{/my_string}~is', $temp, $x);
$i = 0;
while (isset($x[0][$i])) {
$temp = str_replace($x[0][$i], "{REPL:$i}", $temp);
$i++;
}
$res = explode('{Switch}', $temp);
foreach ($res as &$part)
foreach($x[0] as $id=>$content)
$part = str_replace("{REPL:$id}", $content, $part);
return $res;
}
Use it this way
$content_parts = find_content($doc); // $doc is your input document
print_r($content_parts);
Output (your example)
Array
(
[0] => Some more
{my_string}
Random content
{Switch} //This {Switch} may or may not be here, but should be ignored if it is present
More random content
{/my_string}
Content here too
[1] =>
More content
)
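The same split can also be attempted with a single pattern instead of the placeholder swap. The sketch below is a different technique from the function above: it relies on PCRE's (*SKIP)(*F) backtracking verbs, and split_on_outer_switch() is a hypothetical name. It assumes {my_string} blocks are not nested.

```php
<?php
// Hypothetical helper; relies on PCRE's (*SKIP)(*F) backtracking verbs.
function split_on_outer_switch($doc)
{
    // The first alternative consumes a whole {my_string}...{/my_string} block and then
    // forces a failure, so any {Switch} inside it is never considered as a delimiter.
    $pattern = '~\{my_string\}.*?\{/my_string\}(*SKIP)(*F)|\{Switch\}~s';
    return preg_split($pattern, $doc);
}

$doc = "Some more\n{my_string}\nRandom content\n{Switch}\nMore random content\n{/my_string}\nContent here too\n{Switch}\nMore content";
print_r(split_on_outer_switch($doc));
```

For the sample document this yields two parts: everything up to the second {Switch}, with the enclosed one left intact, and the text after it.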

You can try positive lookahead and lookbehind assertions (http://www.regular-expressions.info/lookaround.html)
It might look something like this:
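$content = 'string of text before some random content switch text some more random content string of text after';
// Quote the literals for use inside a /-delimited pattern (matching is case-sensitive)
$before = preg_quote('string of text before', '/');
$switch = preg_quote('switch text', '/');
$after = preg_quote('string of text after', '/');
// Lazy groups, and the switch text is required here, so the first group stops at it instead of swallowing it
if( preg_match('/(?<=' . $before . ')(.*?)' . $switch . '(.*?)(?=' . $after . ')/', $content, $matches) ) {
// $matches[1] == ' some random content '
// $matches[2] == ' some more random content '
}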

$regex = '~(?:(?!\{my_string\})(.*?))(\{Switch\})(?:(.*?)(?!\{my_string\}))~';
/* if "my_string" and "Switch" aren't wrapped by "{" and "}" just remove "\{" and "\}" */
$yourNewString = preg_replace($regex,"$1",$yourOriginalString);
This might work. Can't test it now, but I'll update later!
I don't know if this is what you're looking for, but to negate more than one character, the regex syntax is:
(?!yourString)
and it is called "negative lookahead assertion".
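A short illustration of how a negative lookahead behaves, using made-up sample strings. The second pattern shows a common way to get "anything except this whole string", which is roughly what [^...] does for single characters:

```php
<?php
// "Switch" only matches when it is not immediately followed by "ed".
var_dump(preg_match('/Switch(?!ed)/', 'Switch on')); // int(1)
var_dump(preg_match('/Switch(?!ed)/', 'Switched'));  // int(0)

// To consume text up to a whole string rather than a single character,
// test the lookahead before every character consumed.
preg_match('/^(?:(?!\{Switch\}).)*/s', 'before{Switch}after', $m);
var_dump($m[0]); // string(6) "before"
```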
/Edit:
This should work and return true:
$stringMatchesYourRulesBoolean = preg_match('~(.*?)('.$my_string.')(.*?)(?<!'.$my_string.') ?('.$switch.') ?(?!'.$my_string.')(.*?)('.$my_string.')(.*?)~',$yourString);

Have a look at PHP PEG. It is a little parser written in PHP. You can write your own grammar and parse it. It's going to be very simple in your case.
The grammar syntax and the way of parsing is all explained in the README.md
Extracts from the readme:
token* - Token is optionally repeated
token+ - Token is repeated at least one
token? - Token is optionally present
Tokens may be :
- bare-words, which are recursive matchers - references to token rules defined elsewhere in the grammar,
- literals, surrounded by `"` or `'` quote pairs. No escaping support is provided in literals.
- regexs, surrounded by `/` pairs.
- expressions - single words (match \w+)
Sample grammar: (file EqualRepeat.peg.inc)
class EqualRepeat extends Packrat {
/* Any number of a followed by the same number of b and the same number of c characters
* aabbcc - good
* aaabbbccc - good
* aabbc - bad
* aabbacc - bad
*/
/*Parser:Grammar1
A: "a" A? "b"
B: "b" B? "c"
T: !"b"
X: &(A !"b") "a"+ B !("a" | "b" | "c")
*/
}
