I am working on a multilingual website in PHP, and my language files often contain strings with multiple variables that are filled in later to complete the sentences.
Currently I am placing {VAR_NAME} in the string and manually replacing each occurrence with its matching value when it is used.
So basically :
{X} created a thread on {Y}
becomes :
Dany created a thread on Stack Overflow
I have already thought of sprintf, but I find it inconvenient because it depends on the order of the variables, which can change from one language to another.
I have also checked How replace variable in string with value in php? and for now I basically use this method.
But I am interested in knowing whether there is a convenient way (built-in or not) to do that in PHP, considering that I already have variables named exactly X and Y as in the previous example; something more like $$ for a variable variable.
So instead of doing str_replace on the string, I would maybe call a function like so:
$X = 'Dany';
$Y = 'Stack Overflow';
$lang['example'] = '{X} created a thread on {Y}';
echo parse($lang['example']);
would also print out :
Dany created a thread on Stack Overflow
Thanks!
Edit
The strings serve as templates and can be used multiple times with different inputs.
So basically doing "{$X} ... {$Y}" won't do the trick because I would lose the template: the string would be initialized with the starting values of $X and $Y, which aren't yet determined.
I'm going to add an answer here because none of the current answers really cut the mustard in my view. I'll dive straight in and show you the code I would use to do this:
function parse(
/* string */ $subject,
array $variables,
/* string */ $escapeChar = '#',
/* string */ $errPlaceholder = null
) {
// Pass the delimiter too, so an escape character containing / cannot break the pattern
$esc = preg_quote($escapeChar, '/');
$expr = "/
$esc$esc(?=$esc*+{)
| $esc{
| {(\w+)}
/x";
$callback = function($match) use($variables, $escapeChar, $errPlaceholder) {
switch ($match[0]) {
case $escapeChar . $escapeChar:
return $escapeChar;
case $escapeChar . '{':
return '{';
default:
if (isset($variables[$match[1]])) {
return $variables[$match[1]];
}
return isset($errPlaceholder) ? $errPlaceholder : $match[0];
}
};
return preg_replace_callback($expr, $callback, $subject);
}
What does that do?
In a nutshell:
Create a regular expression using the specified escape character that will match one of three sequences (more on that below)
Feed that into preg_replace_callback(), where the callback handles two of those sequences exactly and treats everything else as a replacement operation.
Return the resulting string
The regex
The regex matches any one of these three sequences:
Two occurrences of the escape character, followed by zero or more occurrences of the escape character, followed by an opening curly brace. Only the first two occurrences of the escape character are consumed. This is replaced by a single occurrence of the escape character.
A single occurrence of the escape character followed by an opening curly brace. This is replaced by a literal open curly brace.
An opening curly brace, followed by one or more perl word characters (alpha-numerics and the underscore character) followed by a closing curly brace. This is treated as a placeholder and a lookup is performed for the name between the braces in the $variables array, if it is found then return the replacement value, if not then return the value of $errPlaceholder - by default this is null, which is treated as a special case and the original placeholder is returned (i.e. the string is not modified).
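The three sequences can be observed in isolation. The following is a minimal sketch, assuming # as the escape character; the hard-coded pattern and the sample subject are illustrative and not part of the function above.

```php
<?php
// Stand-alone version of the pattern with '#' hard-coded as the escape
// character (written without the x flag, every metacharacter escaped).
$pattern = '/\#\#(?=\#*+\{)|\#\{|\{(\w+)\}/';

// Sample subject, chosen only for illustration.
$subject = 'a ##{X} b #{Y} c {Z} d # e';

preg_match_all($pattern, $subject, $m);
$tokens = $m[0];

// The lone '#' near the end is not matched: it does not precede a brace.
print_r($tokens); // ##, {X}, #{, {Z}
```

Each element of $tokens is what the callback would receive as $match[0], which is why the switch can tell the two escape sequences apart from a placeholder.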
Why is it better?
To understand why it's better, let's look at the replacement approaches taken by other answers. With one exception (the only failings of which are incompatibility with PHP < 5.4 and slightly non-obvious behaviour), these fall into two categories:
strtr() - This provides no mechanism for handling an escape character. What if your input string needs a literal {X} in it? strtr() does not account for this, and it would be substituted for the value $X.
str_replace() - this suffers from the same issue as strtr(), and another problem as well. When you call str_replace() with an array argument for the search/replace arguments, it behaves as if you had called it multiple times - one for each of the array of replacement pairs. This means that if one of your replacement strings contains a value that appears later in the search array, you will end up substituting that as well.
To demonstrate this issue with str_replace(), consider the following code:
$pairs = array('A' => 'B', 'B' => 'C');
echo str_replace(array_keys($pairs), array_values($pairs), 'AB');
Now, you'd probably expect the output here to be BC but it will actually be CC (demo) - this is because the first iteration replaced A with B, and in the second iteration the subject string was BB - so both of these occurrences of B were replaced with C.
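For comparison, here is a small sketch that runs the same pairs through both functions. It only shows the re-scanning difference; the variable names are illustrative.

```php
<?php
$pairs = array('A' => 'B', 'B' => 'C');

// str_replace() applies the pairs one after another, so the B produced by
// the first pair is replaced again by the second pair.
$sequential = str_replace(array_keys($pairs), array_values($pairs), 'AB');

// strtr() with an array works through the subject once and never
// re-examines text it has already substituted.
$singlePass = strtr('AB', $pairs);

echo $sequential, "\n"; // CC
echo $singlePass, "\n"; // BC
```

So the chained-replacement problem is specific to str_replace(); the missing escape mechanism applies to both.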
This issue also betrays a performance consideration that might not be immediately obvious: because each pair is handled separately, the number of passes over the subject grows linearly with the number of replacement pairs. For each pair, the entire string is searched and that single replacement is applied. If you had a very large subject string and a lot of replacement pairs, that's a sizeable operation going on under the bonnet.
Arguably this performance consideration is a non-issue - you would need a very large string and a lot of replacement pairs before you got a meaningful slowdown, but it's still worth remembering. It's also worth remembering that regex has performance penalties of its own, so in general this consideration shouldn't be included in the decision-making process.
Instead we use preg_replace_callback(). This visits any given part of the string looking for matches exactly once, within the bounds of the supplied regular expression. I add this qualifier because if you write an expression that causes catastrophic backtracking then it will be considerably more than once, but in this case that shouldn't be a problem (to help avoid this I made the only repetition in the expression possessive).
We use preg_replace_callback() instead of preg_replace() to allow us to apply custom logic while looking for the replacement string.
What this allows you to do
The original example from the question
$X = 'Dany';
$Y = 'Stack Overflow';
$lang['example'] = '{X} created a thread on {Y}';
echo parse($lang['example']);
This becomes:
$pairs = array(
    'X' => 'Dany',
    'Y' => 'Stack Overflow',
);
$lang['example'] = '{X} created a thread on {Y}';
echo parse($lang['example'], $pairs);
// Dany created a thread on Stack Overflow
Something more advanced
Now let's say we have:
$lang['example'] = '{X} created a thread on {Y} and it contained {X}';
// Dany created a thread on Stack Overflow and it contained Dany
...and we want the second {X} to appear literally in the resulting string. Using the default escape character of #, we would change it to:
$lang['example'] = '{X} created a thread on {Y} and it contained #{X}';
// Dany created a thread on Stack Overflow and it contained {X}
OK, looks good so far. But what if that # was supposed to be a literal?
$lang['example'] = '{X} created a thread on {Y} and it contained ##{X}';
// Dany created a thread on Stack Overflow and it contained #Dany
Note that the regular expression has been designed to only pay attention to escape sequences that immediately precede an opening curly brace. This means that you don't need to escape the escape character unless it appears immediately in front of a placeholder.
A note about the use of an array as an argument
Your original code sample uses variables named the same way as the placeholders in the string. Mine uses an array with named keys. There are two very good reasons for this:
Clarity and security - it's much easier to see what will end up being substituted, and you don't risk accidentally substituting variables you don't want to be exposed. It wouldn't be much good if someone could simply feed in {dbPass} and see your database password, now would it?
Scope - it's not possible to import variables from the calling scope unless the caller is the global scope. This makes the function useless if called from another function, and importing data from another scope is very bad practice.
If you really want to use named variables from the current scope (and I do not recommend this due to the aforementioned security issues) you can pass the result of a call to get_defined_vars() to the second argument.
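If you do go down that route, one way to limit the exposure is to filter the result against a whitelist first. This is a sketch only; the helper name, the $dbPass variable and the whitelist are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns only the whitelisted variables of its own scope.
function collect_placeholders()
{
    $X = 'Dany';
    $Y = 'Stack Overflow';
    $dbPass = 'secret'; // must never reach a template

    // Keep only the names that templates are allowed to reference.
    $allowed = array('X', 'Y');
    return array_intersect_key(get_defined_vars(), array_flip($allowed));
}

$pairs = collect_placeholders();
print_r($pairs); // contains X and Y only
```

The filtered array can then be passed as the second argument in place of the raw get_defined_vars() result.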
A note about choosing an escape character
You'll notice I chose # as the default escape character. You can use any character (or sequence of characters, it can be more than one) by passing it to the third argument - and you may be tempted to use \ since that's what many languages use, but hold on before you do that.
The reason you don't want to use \ is because many languages use it as their own escape character, which means that when you want to specify your escape character in, say, a PHP string literal, you run into this problem:
$lang['example'] = '\\{X}'; // results in {X}
$lang['example'] = '\\\{X}'; // results in \Dany
$lang['example'] = '\\\\{X}'; // results in \Dany
It can lead to a readability nightmare, and some non-obvious behaviour with complex patterns. Pick an escape character that is not used by any other language involved (for example, if you are using this technique to generate fragments of HTML, don't use & as an escape character either).
To sum up
What you are doing has edge-cases. To solve the problem properly, you need to use a tool capable of handling those edge-cases - and when it comes to string manipulation, the tool for the job is most often regex.
Here's a portable solution, using variable variables. Yay!
$string = "I need to replace {X} and {Y}";
$X = 'something';
$Y = 'something else';
preg_match_all('/\{(.*?)\}/', $string, $matches);
foreach ($matches[1] as $value)
{
$string = str_replace('{'.$value.'}', ${$value}, $string);
}
First you set up your string, and your replacements. Then, you perform a regular expression to get an array of matches (strings within { and }, including those brackets). Finally, you loop around these and replace those with the variables you created above, using variable variables. Lovely!
Just thought I'd update this with another option even though you've marked it as correct. You don't have to use variable variables, and an array can be used in its place.
$map = array(
'X' => 'something',
'Y' => 'something else'
);
preg_match_all('/\{(.*?)\}/', $string, $matches);
foreach ($matches[1] as $value)
{
$string = str_replace('{'.$value.'}', $map[$value], $string);
}
That would allow you to create a function with the following signature:
public function parse($string, $map); // Probably what I'd do tbh
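A possible body for such a function, written as a plain function so it runs on its own. This is a sketch under assumptions: the name parse_with_map is hypothetical, placeholders are restricted to word characters (\w+) rather than the looser .*? used above, and unknown placeholders are left untouched.

```php
<?php
// Hypothetical name; a plain function rather than a method.
function parse_with_map($string, array $map)
{
    return preg_replace_callback(
        '/\{(\w+)\}/',
        function ($match) use ($map) {
            // Leave unknown placeholders untouched.
            return array_key_exists($match[1], $map) ? $map[$match[1]] : $match[0];
        },
        $string
    );
}

echo parse_with_map('I need to replace {X} and {Y} but not {Z}', array(
    'X' => 'something',
    'Y' => 'something else',
));
// I need to replace something and something else but not {Z}
```

Using a callback means the string is walked once, and a replacement value that happens to contain {Y} is not substituted again.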
Another option thanks to toolmakersteve in the comments does away with the need for a loop and uses strtr, but requires minor additions to the variables and single quotes instead of double quotes:
$string = 'I need to replace {$X} and {$Y}';
$map = array(
'{$X}' => 'something',
'{$Y}' => 'something else'
);
$string = strtr($string, $map);
If you're running 5.4 and you care about being able to use PHP's builtin variable interpolation in the string, you can use the bindTo() method of Closure like so:
// Strings use interpolation, but have to return themselves from an anon func
$strings = [
'en' => [
'message_sent' => function() { return "You just sent a message to $this->recipient that said: $this->message."; }
],
'es' => [
'message_sent' => function() { return "Acabas de enviar un mensaje a $this->recipient que dijo: $this->message."; }
]
];
class LocalizationScope {
private $data;
public function __construct($data) {
$this->data = $data;
}
public function __get($param) {
if(isset($this->data[$param])) {
return $this->data[$param];
}
return '';
}
}
// Bind the string anon func to an object of the array data passed in and invoke (returns string)
function localize($stringCb, $data) {
return $stringCb->bindTo(new LocalizationScope($data))->__invoke();
}
// Demo
foreach($strings as $str) {
var_dump(localize($str['message_sent'], array(
'recipient' => 'Jeff Atwood',
'message' => 'The project should be done in 6 to 8 weeks.'
)));
}
//string(93) "You just sent a message to Jeff Atwood that said: The project should be done in 6 to 8 weeks."
//string(95) "Acabas de enviar un mensaje a Jeff Atwood que dijo: The project should be done in 6 to 8 weeks."
(Codepad Demo)
Perhaps, it feels a bit hacky, and I don't particularly like using $this in this instance. But you do get the added benefit of relying on PHP's variable interpolation (which allows you to do things like escaping, that are difficult to achieve with regex).
EDIT: Added LocalizationScope, which adds another benefit: no warnings if localization anonymous functions try to access data that was not provided.
strtr is probably a better choice for this kind of thing, because it replaces the longest keys first:
$data = array(
    'X' => 'Dany',
    'Y' => 'Stack Overflow',
);
$repls = array();
// Wrap each key in braces so that it matches the {KEY} placeholders.
foreach ($data as $key => $value)
    $repls['{' . $key . '}'] = $value;
$result = strtr($text, $repls);
(think of situations where you have keys like XX and X)
And if you don't want to use an array and would rather expose all variables from the current scope:
$data = get_defined_vars();
If your only issue with sprintf is the order of the arguments you can use argument swapping.
From the doc (http://php.net/manual/en/function.sprintf.php):
$format = 'The %2$s contains %1$d monkeys';
echo sprintf($format, $num, $location);
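Applied to the example from the question, argument swapping could look like the sketch below. The two templates are made-up stand-ins for translations with different word order.

```php
<?php
$name = 'Dany';
$site = 'Stack Overflow';

// Two hypothetical translations of the same message; the second one
// mentions the site first, so it refers to the arguments by position.
$templates = array(
    'a' => '%1$s created a thread on %2$s',
    'b' => 'On %2$s, a thread was created by %1$s',
);

foreach ($templates as $template) {
    echo sprintf($template, $name, $site), "\n";
}
// Dany created a thread on Stack Overflow
// On Stack Overflow, a thread was created by Dany
```

The call site stays identical for every language; only the template decides where each argument ends up. Note the single quotes, which keep PHP from treating $s as a variable.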
gettext is a widely used universal localization system that does exactly what you want.
There are libraries for most programming languages and PHP has a built-in engine.
It is driven by po-files, simple text based format, for which there are many editors around and it is compatible with sprintf syntax.
It even has some functions to deal with things like complicated plurals that some languages have.
Here are some examples of what it does. Note that _() is an alias for gettext():
echo _('Hello world'); // will output hello world in the current selected language
echo sprintf(_("%s has created a thread on %s"), $name, $site); // translates the string, and hands it over to sprintf()
echo sprintf(_('%2$s has created a thread on %1$s'), $site, $name); // same as above, but with changed order of parameters; single quotes stop PHP from interpolating $s
If you have more than a handful of strings, you should definitely use an existing engine, rather than writing your own one.
Adding a new language is just a matter of translating a list of strings and most professional translation tools can work with this file format, too.
Check Wikipedia and the PHP documentation for a basic overview on how this works:
http://en.wikipedia.org/wiki/Gettext
http://de.php.net/gettext
Google finds heaps of documentation and your favourite software repository will most likely have a handful of tools for managing po-files.
Some that I have used are:
poedit: Very light and simple. Good if you don't have too much stuff to translate and don't want to spend time thinking about how that stuff works.
Virtaal: A bit more complex and has a bit of a learning curve, but also some nice features that make your life easier. Good if you need to translate a lot.
GlotPress is a web application (from the wordpress people) that allows collaborative editing of the translation database files.
Why not use str_replace then? If you want it as template.
echo str_replace(array('{X}', '{Y}'), array($X, $Y), $lang['example']);
for every occurrence of this that you need
str_replace was built for this in the first place.
How about defining the "variable" parts as an array with keys corresponding to the placeholders in your string?
$string = "{X} created a thread on {Y}";
$values = array(
'X' => "Danny",
'Y' => "Stack Overflow",
);
echo str_replace(
array_map(function($v) { return '{'.$v.'}'; }, array_keys($values)),
array_values($values),
$string
);
Why can't you just use the template string within a function?
function threadTemplate($x, $y) {
return "{$x} created a thread on {$y}";
}
echo threadTemplate($foo, $bar);
Simple:
$X = 'Dany';
$Y = 'Stack Overflow';
$lang['example'] = "{$X} created a thread on {$Y}";
Hence:
echo $lang['example'];
Will output:
Dany created a thread on Stack Overflow
As you requested.
UPDATE:
As per the OP's comments about making the solution more portable:
Have a class do the parsing for you each time:
class MyParser {
    // Takes the two values as parameters so they exist inside the method
    function parse($x, $y) {
        return "{$x} created a thread on {$y}";
    }
}
That way, if the following occurs:
$X = 3;
$Y = 4;
$a = new MyParser();
$lang['example'] = $a->parse($X, $Y);
echo $lang['example'];
Which will return:
3 created a thread on 4;
And, double checking:
$X = 'Steve';
$Y = 10.9;
$lang['example'] = $a->parse($X, $Y);
Will print:
Steve created a thread on 10.9;
As desired.
UPDATE 2:
As per the OP's comments about improving portability:
class MyParser {
function parse($vstr) {
return "{$vstr}";
}
}
$a = new MyParser();
$X = 3;
$Y = 4;
$vstr = "{$X} created a thread on {$Y}";
$a = new MyParser();
$lang['example'] = $a->parse($vstr);
echo $lang['example'];
Will output the results cited previously.
Try
$lang['example'] = "$X created a thread on $Y";
EDIT: Based on latest info
Maybe you need to look at the sprintf() function
Then you could have your template string defined as this
$template_string = '%s created a thread on %s';
$X = 'Fred';
$Y = 'Sunday';
echo sprintf( $template_string, $X, $Y );
$template_string does not change but later in your code when you have assigned different values to $X and $Y you can still use the echo sprintf( $template_string, $X, $Y );
See PHP Manual
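To illustrate the reuse, here is a small sketch that feeds several sets of values through the same template. The rows are invented for the example, and vsprintf() is used only because the values arrive as arrays.

```php
<?php
$template_string = '%s created a thread on %s';

// Hypothetical rows of values, e.g. assigned at different points in the code.
$rows = array(
    array('Fred', 'Sunday'),
    array('Dany', 'Stack Overflow'),
);

$lines = array();
foreach ($rows as $row) {
    // vsprintf() is sprintf() taking its arguments as an array.
    $lines[] = vsprintf($template_string, $row);
}
print_r($lines);
```

$template_string is never modified, so it can be used as often as needed.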
Just throwing in another solution using associative arrays. This loops through the associative array and either replaces the placeholder or leaves it blank.
example:
$list = array();
$list['X'] = 'Dany';
$list['Y'] = 'Stack Overflow';
$str = '{X} created a thread on {Y}';
$newstring = textReplaceContent($str,$list);
function textReplaceContent($contents, $list) {
    // foreach replaces the original each() loop; each() was removed in PHP 8.0
    foreach ($list as $key => $val) {
        $key = "{" . $key . "}";
        // Empty values blank out the placeholder
        $contents = str_replace($key, $val ? $val : "", $contents);
    }
    $final = preg_replace('/\[\w+\]/', '', $contents);
    return ($final);
}
Related
I am getting an "Array to string conversion" error in PHP.
I am using the "variable" (that should be a string) as the third parameter to str_replace. So in summary (very simplified version of what's going on):
$str = "very long string";
str_replace("tag", $some_other_array, $str);
$str is throwing the error, and I have been trying to fix it all day, the thing I have tried is:
if(is_array($str)) die("its somehow an array");
serialize($str); //inserted this before str_replace call.
I have spent all day on it, and no, it's not something stupid like variables the wrong way around; it is something bizarre. I have even dumped it to a file and it's a string.
My hypothesis:
The string is too long and php can't deal with it, turns into an array.
The $str value in this case is nested and called recursively, the general flow could be explained like this:
--code
//pass by reference
function the_function ($something, &$OFFENDING_VAR, $something_else) {
while(preg_match($something, $OFFENDING_VAR)) {
$OFFENDING_VAR = str_replace($x, $y, $OFFENDING_VAR); // this is the error
}
}
So it may be something strange due to str_replace, but that would mean that at some point str_replace would have to return an array.
Please help me work this out, it's very confusing and I have wasted a day on it.
---- ORIGINAL FUNCTION CODE -----
//This function gets called with multiple different "Target Variables" Target is the subject
//line, from and body of the email filled with << tags >> so the str_replace function knows
//where to replace them
function perform_replacements($replacements, &$target, $clean = TRUE,
$start_tag = '<<', $end_tag = '>>', $max_substitutions = 5) {
# Construct separate tag and replacement value arrays for use in the substitution loop.
$tags = array();
$replacement_values = array();
foreach ($replacements as $tag_text => $replacement_value) {
$tags[] = $start_tag . $tag_text . $end_tag;
$replacement_values[] = $replacement_value;
}
# TODO: this badly needs refactoring
# TODO: auto upgrade <<foo>> to <<foo_html>> if foo_html exists and acting on html template
# Construct a regular expression for use in scanning for tags.
$tag_match = '/' . preg_quote($start_tag) . '\w+' . preg_quote($end_tag) . '/';
# Perform the substitution until all valid tags are replaced, or the maximum substitutions
# limit is reached.
$substitution_count = 0;
while (preg_match ($tag_match, $target) && ($substitution_count++ < $max_substitutions)) {
$target = serialize($target);
$temp = str_replace($tags,
$replacement_values,
$target); //This is the line that is failing.
unset($target);
$target = $temp;
}
if ($clean) {
# Clean up any unused search values.
$target = preg_replace($tag_match, '', $target);
}
}
How do you know $str is the problem and not $some_other_array?
From the manual:
If search and replace are arrays, then str_replace() takes a value
from each array and uses them to search and replace on subject. If
replace has fewer values than search, then an empty string is used for
the rest of replacement values. If search is an array and replace is a
string, then this replacement string is used for every value of
search. The converse would not make sense, though.
The second parameter can only be an array if the first one is as well.
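The rule can be turned into an explicit check. The wrapper below is a sketch with a hypothetical name (safe_replace); it simply refuses the one unsupported combination before delegating to str_replace().

```php
<?php
// Hypothetical wrapper that rejects the one unsupported combination
// (string search, array replace) before calling str_replace().
function safe_replace($search, $replace, $subject)
{
    if (is_array($replace) && !is_array($search)) {
        throw new InvalidArgumentException(
            'replace may only be an array when search is an array'
        );
    }
    return str_replace($search, $replace, $subject);
}

// Both arrays: values are paired up by position.
echo safe_replace(array('<<a>>', '<<b>>'), array('1', '2'), '<<a>> and <<b>>'), "\n";
// 1 and 2

// Array search, string replace: the same string is used for every search value.
echo safe_replace(array('<<a>>', '<<b>>'), '-', '<<a>> and <<b>>'), "\n";
// - and -
```

With a check like this, a call such as the simplified one in the question fails loudly at the call site instead of surfacing as a conversion message.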
Trying to replace a string, but it seems to only match the first occurrence, and if I have another occurrence it doesn't match anything, so I think I need to add some sort of end delimiter?
My code:
$mappings = array(
'fname' => $prospect->forename,
'lname' => $prospect->surname,
'cname' => $prospect->company,
);
foreach($mappings as $key => $mapping) if(empty($mapping)) $mappings[$key] = '$2';
$match = '~{(.*)}(.*?){/.*}$~ise';
$source = 'Hello {fname}Default{/fname} {lname}Last{/lname}';
// $source = 'Hello {fname}Default{/fname}';
$text = preg_replace($match, '$mappings["$1"]', $source);
So if I use the $source that's commented, it matches fine, but if I use the one currently in the code above where there's 2 matches, it doesn't match anything and I get an error:
Message: Undefined index: fname}Default{/fname} {lname
Filename: schedule.php(62) : regexp code
So am I right in saying I need to provide an end delimiter or something?
Thanks,
Christian
Apparently your regexp matches fname}Default{/fname} {lname instead of Default.
As I mentioned here use {(.*?)} instead of {(.*)}.
{ has special meaning in regexps so you should escape it \\{.
And I recommend using preg_replace_callback instead of the e modifier (you get more flow control and syntax highlighting, and it becomes impossible to trick your program into executing malicious code).
Last mistake you're making is not checking whether the requested index exists. :)
My solution would be:
<?php
class A { // Of course with better class name :)
public $mappings = array(
'fname' => 'Tested'
);
public function callback( $match)
{
if( isset( $this->mappings[$match[1]])){
return $this->mappings[$match[1]];
}
return $match[2];
}
}
$a = new A();
$match = '~\\{([^}]+)\\}(.*?)\\{/\\1\\}~is';
$source = 'Hello {fname}Default{/fname} {lname}Last{/lname}';
echo preg_replace_callback( $match, array($a, 'callback'), $source);
This results in:
[vyktor#grepfruit tmp]$ php stack.php
Hello Tested Last
Your regular expression is anchored to the end of the string, so your closing {/whatever} must be the last thing in your string. Also, since your opening and closing tags are simply .*, there's nothing in there to make sure they match up. What you want is to make sure that your closing tag matches your opening one - using a backreference like {(.+)}(.*?){/\1} will make sure they're the same.
I'm sure there's other gotchas in there - if you have control over the format of strings you're working with (IE - you're rolling your own templating language), I'd seriously consider moving to a simpler, easier to match format. Since you're not 'saving' the default values, having enclosing tags provides you with no added value but makes the parsing more complicated. Just using $VARNAME would work just as well and be easier to match (\$[A-Z]+), without involving back-references or having to explicitly state you're using non-greedy matching.
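A minimal sketch of that simpler format, assuming upper-case names and a made-up $mappings array. Unknown names are replaced with an empty string here, which is one possible choice rather than a requirement.

```php
<?php
// Hypothetical data; the $VARNAME syntax follows the suggestion above.
$mappings = array('FNAME' => 'Tested');

// Single quotes keep PHP from interpolating $FNAME and $LNAME here.
$source = 'Hello $FNAME $LNAME';

$text = preg_replace_callback(
    '/\$([A-Z]+)/',
    function ($match) use ($mappings) {
        // Unknown names are replaced with an empty string in this sketch.
        return isset($mappings[$match[1]]) ? $mappings[$match[1]] : '';
    },
    $source
);

echo $text; // "Hello Tested " (note the trailing space)
```

There is no closing tag to pair up, so the pattern needs neither a backreference nor non-greedy matching.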
This is for an osCommerce contribution called
("Automatically add multiple products with attribute to cart from external source")
This existing code uses sscanf to 'explode' a string that represents a
- product ID,
- a productOption,
- and quantity:
sscanf('28{8}17[1]', '%d{%d}%d[%f]',
$productID, // 28
$productOptionID, $optionValueID, //{8}17 <--- Product Options!!!
$productQuantity //[1]
);
This works great if there is only 1 'set' of Product Options (e.g. {8}17).
But this procedure needs to be adapted so that it can handle multiple Product Options, and put them into an array, e.g.:
'28{8}17{7}15{9}19[1]' //array(8=>17, 7=>15, 9=>19)
OR
'28{8}17{7}15[1]' //array(8=>17, 7=>15)
OR
'28{8}17[1]' //array(8=>17)
Thanks in advance. (I'm a pascal programmer)
You should not try to do complex recursive parses with one sscanf. Stick it in a loop. Something like:
<?php
$str = "28{8}17{7}15{9}19[1]";
#$str = "28{8}17{7}15[1]";
#$str = "28{8}17[1]";
sscanf($str,"%d%s",$prod,$rest);
printf("Got prod %d\n", $prod);
while (sscanf($rest,"{%d}%d%s",$opt,$id,$rest))
{
printf("opt=%d id=%d\n",$opt,$id);
}
sscanf($rest,"[%d]",$quantity);
printf("Got qty %d\n",$quantity);
?>
Maybe regular expressions may be interesting
$a = '28{8}17{7}15{9}19[1]';
$matches = null;
preg_match_all('~\\{[0-9]{1,3}\\}[0-9]{1,3}~', $a, $matches);
To get the other things
$id = (int) $a; // ;)
$quantity = substr($a, strrpos($a, '[')+ 1, -1);
Following the comment, a small update:
$a = '28{8}17{7}15{9}19[1]';
$matches = null;
preg_match_all('~\\{([0-9]{1,3})\\}([0-9]{1,3})~', $a, $matches, PREG_SET_ORDER);
$result = array();
foreach ($matches as $entry) {
$result[$entry[1]] = $entry[2];
}
sscanf() is not the ideal tool for this task because it doesn't handle recurring patterns and I don't see any real benefit in type casting or formatting the matched subexpressions.
If this was purely a text extraction task (in other words your incoming data was guaranteed to be perfectly formatted and valid), then I could have recommended a cute solution that used strtr() and parse_str() to quickly generate a completely associative multi-dimensional output array.
However, when you commented "with sscanf I had an infinite loop if there is a missing bracket in the string (because it looks for open and closing {}s). Or if I leave out a value. But with your regex solution, if I drop a bracket or leave out a value", then this means that validation is an integral component of this process.
For that reason, I'll recommend a regex pattern that both validates the string and breaks the string into its meaningful parts. There are several logical aspects to the pattern but the hero here is the \G metacharacter that allows the pattern to "continue" matching where the pattern last finished matching in the string. This way we have an array of continuous fullstring matches to pull data from when creating your desired multidimensional output.
The pattern ^\d+(?=.+\[\d+]$)|\G(?!^)(?:{\K\d+}\d+|\[\K\d+(?=]$)) in preg_match_all() generates the following type of output in the fullstring element ([0]):
[id], [option0, option1, ...](optional), [quantity]
The first branch in the pattern (^\d+(?=.+\[\d+]$)) validates that the string starts with the id number and ends with a number wrapped in square brackets, representing the quantity.
The second branch begins with the "continue" metacharacter and contains two logical branches itself. The first matches an option expression (and forgets the leading { thanks to \K) and the second matches the number in the quantity expression (\d+, so that multi-digit quantities are matched as well).
To create the associative array of options, target the "middle" elements (if there are any), then split the strings on the lingering } and assign these values as key-value pairs.
This is a direct solution because it only uses one preg_ call and it does an excellent job of validating and parsing the variable length data.
Code: (Demo with a battery of test cases)
if (!preg_match_all('~^\d+(?=.+\[\d+]$)|\G(?!^)(?:{\K\d+}\d+|\[\K\d+(?=]$))~', $test, $m)) {
echo "invalid input";
} else {
var_export(
[
'id' => array_shift($m[0]),
'quantity' => array_pop($m[0]),
'options' => array_reduce(
$m[0],
function($result, $string) {
[$key, $result[$key]] = explode('}', $string, 2);
return $result;
},
[]
)
]
);
}
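The same idea wrapped in a reusable function, as a sketch. The function name is hypothetical, the braces and brackets are escaped explicitly, the quantity branch uses \d+ on the assumption that multi-digit quantities should be accepted, and two extra checks reject inputs where the chain of matches stops before the quantity.

```php
<?php
// Hypothetical wrapper name; returns null for input that does not parse.
function parse_cart_string($input)
{
    $pattern = '~^\d+(?=.+\[\d+\]$)|\G(?!^)(?:\{\K\d+\}\d+|\[\K\d+(?=\]$))~';
    if (!preg_match_all($pattern, $input, $m)) {
        return null;
    }
    $parts = $m[0];
    if (count($parts) < 2) {
        return null; // id without a quantity
    }
    $id = array_shift($parts);
    $quantity = array_pop($parts);
    if (!ctype_digit($quantity)) {
        return null; // matching stopped before reaching the quantity
    }
    $options = array();
    foreach ($parts as $pair) {
        // Each remaining element looks like "8}17".
        list($key, $value) = explode('}', $pair, 2);
        $options[$key] = $value;
    }
    return array('id' => $id, 'quantity' => $quantity, 'options' => $options);
}

var_export(parse_cart_string('28{8}17{7}15{9}19[12]'));
```

All values come back as strings; cast them with (int) or (float) where numbers are needed.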
join_strings(string $glue, string $var, string $var2 [, string $...]);
I am looking for something that would behave similar to the function I conceptualized above. A functional example would be:
$title = "Mr.";
$fname = "Jonathan";
$lname = "Sampson";
print join_strings(" ", $title, $fname, $lname); // Mr. Jonathan Sampson
After giving the documentation a quick look-over, I didn't see anything that does this. The closest I can think of is implode(), which operates on arrays - so I would have to first add the strings into an array, and then implode.
Is there already a method that exists to accomplish this, or would I need to author one from scratch?
Note: I'm familiar with concatenation (.), and building-concatenation (.=). I'm not wanting to do that (that would take place within the function). My intentions are to write the $glue variable only once. Not several times with each concatenation.
You can use join or implode; both do the same thing.
As you say, it needs to be an array, which is not difficult:
join($glue, array($var1, $var2, $var3));
You can use func_get_args() to make implode() (or its alias join()) bend to your will:
function join_strings($glue) {
$args = func_get_args();
array_shift($args);
return implode($glue, $args);
}
As the documentation for func_get_args() notes, however:
Returns an array in which each element is a copy of the corresponding member of the current user-defined function's argument list.
So you still end up making an array out of the arguments and then passing it on, except now you're letting PHP take care of that for you.
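On PHP 5.6 or later, the same thing can be written with a variadic parameter, which removes the need for func_get_args() and array_shift(). A short sketch:

```php
<?php
// Variadic version (the ... syntax requires PHP 5.6 or later).
function join_strings($glue, ...$parts)
{
    // $parts is already an array of every argument after $glue.
    return implode($glue, $parts);
}

echo join_strings(' ', 'Mr.', 'Jonathan', 'Sampson'); // Mr. Jonathan Sampson
```

PHP still builds an array behind the scenes, so this is a convenience rather than a different mechanism.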
Do you have a more convincing example than the one in your question to justify not simply using implode() directly?
The only thing you're doing now is saving yourself the trouble to type array() around the variables, but that's actually shorter than the _strings you appended to the function name.
Try this:
function join_strings($glue, $arg){
    $args = func_get_args();
    $result = "";
    $argcount = count($args);
    // Start at 1 to skip $glue itself
    for($i = 1; $i < $argcount; $i++){
        $result .= $args[$i];
        // Add the glue after every element except the last
        if($i + 1 != $argcount){
            $result .= $glue;
        }
    }
    return $result;
}
EDIT: Improved function thanks to comment suggestion
$a="hi";
$b = " world";
echo $a.$b;
While I agree with the accepted answer, see this article that says implode is faster than join.
You'll have to use the dot "." operator.
This comes from abstract algebra theory, the strings being a monoid with respect to the concatenation operator, and the fact that when dealing with algebraic structures the dot is usually used as a generic operator (the other being "+", which you may have seen used in other languages).
For quick reference, being a monoid implies associativity of the operation, and that there exists an identity element (the empty string, in our case).
Suppose I have the following two strings containing regular expressions. How do I coalesce them? More specifically, I want to have the two expressions as alternatives.
$a = '# /[a-z] #i';
$b = '/ Moo /x';
$c = preg_magic_coalesce('|', $a, $b);
// Desired result should be equivalent to:
// '/ \/[a-zA-Z] |Moo/'
Of course, doing this as string operations isn't practical because it would involve parsing the expressions, constructing syntax trees, coalescing the trees and then outputting another regular expression equivalent to the tree. I'm completely happy without this last step. Unfortunately, PHP doesn't have a RegExp class (or does it?).
Is there any way to achieve this? Incidentally, does any other language offer a way? Isn't this a pretty normal scenario? Guess not. :-(
Alternatively, is there a way to check efficiently if either of the two expressions matches, and which one matches earlier (and if they match at the same position, which match is longer)? This is what I'm doing at the moment. Unfortunately, I do this on long strings, very often, for more than two patterns. The result is slow (and yes, this is definitely the bottleneck).
EDIT:
I should have been more specific – sorry. $a and $b are variables; their content is outside of my control! Otherwise, I would just coalesce them manually. Therefore, I can't make any assumptions about the delimiters or regex modifiers used. Notice, for example, that my first expression uses the i modifier (ignore casing) while the second uses x (extended syntax). Therefore, I can't just concatenate the two because the second expression does not ignore casing and the first doesn't use the extended syntax (and any whitespace therein is significant!).
I see that porneL actually described a bunch of this, but this handles most of the problem. It cancels modifiers set in previous sub-expressions (which the other answer missed) and sets modifiers as specified in each sub-expression. It also handles non-slash delimiters (I could not find a specification of what characters are allowed here so I used ., you may want to narrow further).
One weakness is it doesn't handle back-references within expressions. My biggest concern with that is the limitations of back-references themselves. I'll leave that as an exercise to the reader/questioner.
// Pass as many expressions as you'd like
function preg_magic_coalesce() {
$active_modifiers = array();
$expression = '/(?:';
$sub_expressions = array();
foreach(func_get_args() as $arg) {
// Determine modifiers from sub-expression
if(preg_match('/^(.)(.*)\1([eimsuxADJSUX]+)$/', $arg, $matches)) {
$modifiers = preg_split('//', $matches[3]);
if($modifiers[0] == '') {
array_shift($modifiers);
}
if($modifiers[(count($modifiers) - 1)] == '') {
array_pop($modifiers);
}
$cancel_modifiers = $active_modifiers;
foreach($cancel_modifiers as $key => $modifier) {
if(in_array($modifier, $modifiers)) {
unset($cancel_modifiers[$key]);
}
}
$active_modifiers = $modifiers;
} elseif(preg_match('/(.)(.*)\1$/', $arg)) {
$cancel_modifiers = $active_modifiers;
$active_modifiers = array();
}
// If expression has modifiers, include them in sub-expression
$sub_modifier = '(?';
$sub_modifier .= implode('', $active_modifiers);
// Cancel modifiers from preceding sub-expression
if(count($cancel_modifiers) > 0) {
$sub_modifier .= '-' . implode('-', $cancel_modifiers);
}
$sub_modifier .= ')';
$sub_expression = preg_replace('/^(.)(.*)\1[eimsuxADJSUX]*$/', $sub_modifier . '$2', $arg);
// Properly escape slashes
$sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);
$sub_expressions[] = $sub_expression;
}
// Join expressions
$expression .= implode('|', $sub_expressions);
$expression .= ')/';
return $expression;
}
Edit: I've rewritten this (because I'm OCD) and ended up with:
function preg_magic_coalesce($expressions = array(), $global_modifier = '') {
if(!preg_match('/^((?:-?[eimsuxADJSUX])+)$/', $global_modifier)) {
$global_modifier = '';
}
$expression = '/(?:';
$sub_expressions = array();
foreach($expressions as $sub_expression) {
$active_modifiers = array();
// Determine modifiers from sub-expression
if(preg_match('/^(.)(.*)\1((?:-?[eimsuxADJSUX])+)$/', $sub_expression, $matches)) {
$active_modifiers = preg_split('/(-?[eimsuxADJSUX])/',
$matches[3], -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
}
// If expression has modifiers, include them in sub-expression
if(count($active_modifiers) > 0) {
$replacement = '(?';
$replacement .= implode('', $active_modifiers);
$replacement .= ':$2)';
} else {
$replacement = '$2';
}
$sub_expression = preg_replace('/^(.)(.*)\1(?:(?:-?[eimsuxADJSUX])*)$/',
$replacement, $sub_expression);
// Properly escape slashes if another delimiter was used
$sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);
$sub_expressions[] = $sub_expression;
}
// Join expressions
$expression .= implode('|', $sub_expressions);
$expression .= ')/' . $global_modifier;
return $expression;
}
It now uses (?modifiers:sub-expression) rather than (?modifiers)sub-expression|(?cancel-modifiers)sub-expression but I've noticed that both have some weird modifier side-effects. For instance, in both cases if a sub-expression has a /u modifier, it will fail to match (but if you pass 'u' as the second argument of the new function, that will match just fine).
Strip delimiters and flags from each. This regex should do it:
/^(.)(.*)\1([imsxeADSUXJu]*)$/
Join expressions together. You'll need non-capturing parenthesis to inject flags:
"(?$flags1:$regexp1)|(?$flags2:$regexp2)"
If there are any back references, count capturing parenthesis and update back references accordingly (e.g. properly joined /(.)x\1/ and /(.)y\1/ is /(.)x\1|(.)y\2/ ).
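The first two steps above can be sketched roughly as follows. The helper names split_pattern and alternate_patterns are invented for this example, and the sketch makes two simplifying assumptions: both inputs use '/' as their delimiter, and only the i, m, s and x flags are carried over, since those can be set inline. Back-reference renumbering (the third step) is not attempted here.

```php
<?php
// Hypothetical helper: split '/body/flags' into array(body, flags).
// Returns false if the string does not look like a delimited pattern.
function split_pattern($regex) {
    if (preg_match('/^(.)(.*)\1([imsxeADSUXJu]*)$/s', $regex, $m) !== 1) {
        return false;
    }
    return array($m[2], $m[3]);
}

// Hypothetical helper: build an alternation of two patterns.
// Assumes both inputs are delimited with '/'.
function alternate_patterns($first, $second) {
    $parts = array();
    foreach (array($first, $second) as $regex) {
        $split = split_pattern($regex);
        if ($split === false) {
            return false;
        }
        list($body, $flags) = $split;
        // Keep only flags that can be applied to a group inline.
        $inline = preg_replace('/[^imsx]/', '', $flags);
        // With no flags this degrades to a plain non-capturing group (?:...).
        $parts[] = "(?$inline:$body)";
    }
    return '/' . implode('|', $parts) . '/';
}

$merged = alternate_patterns('/foo/i', '/ b a r /x');
var_dump(preg_match($merged, 'FOO')); // int(1): the i flag applies to the first branch
var_dump(preg_match($merged, 'BAR')); // int(0): the second branch is case-sensitive
```

Scoping each flag set to its own group is what keeps the i of the first pattern from leaking into the second one.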
EDIT
I’ve rewritten the code! It now contains the changes that are listed as follows. Additionally, I've done extensive tests (which I won’t post here because they’re too many) to look for errors. So far, I haven’t found any.
The function is now split into two parts: There’s a separate function preg_strip which takes a regular expression and returns an array containing the bare expression (without delimiters) and an array of modifiers. This might come in handy (it already has, in fact; this is why I made this change).
The code now correctly handles back-references. This was necessary for my purpose after all. It wasn’t difficult to add, the regular expression used to capture the back-references just looks weird (and may actually be extremely inefficient, it looks NP-hard to me – but that’s only an intuition and only applies in weird edge cases). By the way, does anyone know a better way of checking for an odd number of matches than my way? Negative lookbehinds won't work here because they only accept fixed-length strings instead of regular expressions. However, I need the regex here to test whether the preceding backslash is actually escaped itself.
Additionally, I don’t know how good PHP is at caching anonymous create_function use. Performance-wise, this might not be the best solution but it seems good enough.
I’ve fixed a bug in the sanity check.
I’ve removed the cancellation of obsolete modifiers since my tests show that it isn't necessary.
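On the create_function point: create_function() was deprecated in PHP 7.2 and removed in PHP 8.0, so on current versions the callback has to be a closure. The following is one possible sketch of the back-reference shifting step written that way. The helper name shift_backrefs is made up, and the pattern is a different one from the code below: it uses a fixed-length negative lookbehind for a single non-backslash position followed by pairs of backslashes, which is one way to test for an odd number of backslashes before the digit. Like the original, it only handles single-digit references.

```php
<?php
// Hypothetical helper: add $offset to every unescaped single-digit
// back-reference (\1 .. \9) in $sub_expr.
function shift_backrefs($sub_expr, $offset) {
    // Regex after PHP string unescaping: /((?<!\\)(?:\\\\)*)\\(\d)/
    //   (?<!\\)    not preceded by a backslash
    //   (?:\\\\)*  any number of escaped backslash pairs
    //   \\(\d)     the backslash that starts the reference, then its digit
    $backref_expr = '/((?<!\\\\)(?:\\\\\\\\)*)\\\\(\d)/';
    return preg_replace_callback(
        $backref_expr,
        function ($m) use ($offset) {
            return $m[1] . '\\' . ((int)$m[2] + $offset);
        },
        $sub_expr
    );
}

echo shift_backrefs('(.)x\1', 2), "\n"; // prints "(.)x\3"
```

An escaped backslash followed by a digit (two backslashes, then 1) is left alone, because the digit is then a literal and not a back-reference.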
By the way, this code is one of the core components of a syntax highlighter for various languages that I’m working on in PHP since I’m not satisfied with the alternatives listed elsewhere.
Thanks!
porneL, eyelidlessness, amazing work! Many, many thanks. I had actually given up.
I've built upon your solution and I'd like to share it here. I didn't implement re-numbering back-references since this isn't relevant in my case (I think …). Perhaps this will become necessary later, though.
Some Questions …
One thing, @eyelidlessness: Why do you feel the necessity to cancel old modifiers? As far as I see it, this isn't necessary since the modifiers are only applied locally anyway.
Ah yes, one other thing. Your escaping of the delimiter seems overly complicated. Care to explain why you think this is needed? I believe my version should work as well but I could be very wrong.
Also, I've changed the signature of your function to match my needs. I also think that my version is more generally useful. Again, I might be wrong.
BTW, you should now realize the importance of real names on SO. ;-) I can't give you real credit in the code. :-/
The Code
Anyway, I'd like to share my result so far because I can't believe that nobody else ever needs something like that. The code seems to work very well. Extensive tests are yet to be done, though. Please comment!
And without further ado …
/**
* Merges several regular expressions into one, using the indicated 'glue'.
*
* This function takes care of individual modifiers so it's safe to use
* <em>different</em> modifiers on the individual expressions. The order of
* sub-matches is preserved as well. Numbered back-references are adapted to
* the new overall sub-match count. This means that it's safe to use numbered
* back-references in the individual expressions!
* If {@link $names} is given, the individual expressions are captured in
* named sub-matches using the contents of that array as names.
* Matching pair-delimiters (e.g. <code>"{…}"</code>) are currently
* <strong>not</strong> supported.
*
* The function assumes that all regular expressions are well-formed.
* Behaviour is undefined if they aren't.
*
* This function was created after a {@link https://stackoverflow.com/questions/244959/
* StackOverflow discussion}. Much of it was written or thought of by
* “porneL” and “eyelidlessness”. Many thanks to both of them.
*
* @param string $glue A string to insert between the individual expressions.
* This should usually be either the empty string, indicating
* concatenation, or the pipe (<code>|</code>), indicating alternation.
* Notice that this string might have to be escaped since it is treated
* like a normal character in a regular expression (i.e. <code>/</code>
* will end the expression and result in an invalid output).
* @param array $expressions The expressions to merge. The expressions may
* have arbitrary different delimiters and modifiers.
* @param array $names Optional. This is either an empty array or an array of
* strings of the same length as {@link $expressions}. In that case,
* the strings of this array are used to create named sub-matches for the
* expressions.
* @return string A string representing a regular expression equivalent to the
* merged expressions. Returns <code>FALSE</code> if an error occurred.
*/
function preg_merge($glue, array $expressions, array $names = array()) {
// … then, a miracle occurs.
// Sanity check …
$use_names = ($names !== null and count($names) !== 0);
if (
$use_names and count($names) !== count($expressions) or
!is_string($glue)
)
return false;
$result = array();
// For keeping track of the names for sub-matches.
$names_count = 0;
// For keeping track of *all* captures to re-adjust backreferences.
$capture_count = 0;
foreach ($expressions as $expression) {
if ($use_names)
$name = str_replace(' ', '_', $names[$names_count++]);
// Get delimiters and modifiers:
$stripped = preg_strip($expression);
if ($stripped === false)
return false;
list($sub_expr, $modifiers) = $stripped;
// Re-adjust backreferences:
// We assume that the expression is correct and therefore don't check
// for matching parentheses.
$number_of_captures = preg_match_all('/\([^?]|\(\?[^:]/', $sub_expr, $_);
if ($number_of_captures === false)
return false;
if ($number_of_captures > 0) {
// NB: This looks NP-hard. Consider replacing.
$backref_expr = '/
( # Only match when not escaped:
[^\\\\] # guarantee an even number of backslashes
(\\\\*?)\\2 # (twice n, preceded by something else).
)
\\\\ (\d) # Backslash followed by a digit.
/x';
$sub_expr = preg_replace_callback(
$backref_expr,
create_function(
'$m',
'return $m[1] . "\\\\" . ((int)$m[3] + ' . $capture_count . ');'
),
$sub_expr
);
$capture_count += $number_of_captures;
}
// Last, construct the new sub-match:
$modifiers = implode('', $modifiers);
$sub_modifiers = "(?$modifiers)";
if ($sub_modifiers === '(?)')
$sub_modifiers = '';
$sub_name = $use_names ? "?<$name>" : '?:';
$new_expr = "($sub_name$sub_modifiers$sub_expr)";
$result[] = $new_expr;
}
return '/' . implode($glue, $result) . '/';
}
/**
* Strips a regular expression string of its delimiters and modifiers.
* Additionally, normalizes the delimiters (i.e. reformats the pattern so that
* it could have used '/' as delimiter).
*
* @param string $expression The regular expression string to strip.
* @return array An array whose first entry is the expression itself, the
* second an array of modifiers. If the argument is not a valid regular
* expression, returns <code>FALSE</code>.
*
*/
function preg_strip($expression) {
if (preg_match('/^(.)(.*)\\1([imsxeADSUXJu]*)$/s', $expression, $matches) !== 1)
return false;
$delim = $matches[1];
$sub_expr = $matches[2];
if ($delim !== '/') {
// Replace occurrences of the escaped delimiter with its unescaped
// version and escape the new delimiter.
$sub_expr = str_replace("\\$delim", $delim, $sub_expr);
$sub_expr = str_replace('/', '\\/', $sub_expr);
}
$modifiers = $matches[3] === '' ? array() : str_split(trim($matches[3]));
return array($sub_expr, $modifiers);
}
PS: I've made this posting community wiki editable. You know what this means …!
I'm pretty sure it's not possible to just put regexps together like that in any language - they could have incompatible modifiers.
I'd probably just put them in an array and loop through them, or combine them by hand.
Edit: If you're doing them one at a time as described in your edit, you may be able to run the second one on a substring (from the start up to the earliest match). That might help things.
function preg_magic_coalesce($split, $re1, $re2) {
$re1 = rtrim($re1, "\/#is");
$re2 = ltrim($re2, "\/#");
return $re1.$split.$re2;
}
You could do it the alternative way like this:
$a = '# /[a-z] #i';
$b = '/ Moo /x';
// $text is the subject string being searched
$a_matched = preg_match($a, $text, $a_matches, PREG_OFFSET_CAPTURE);
$b_matched = preg_match($b, $text, $b_matches, PREG_OFFSET_CAPTURE);
if ($a_matched && $b_matched) {
// With PREG_OFFSET_CAPTURE, index 0 holds array(matched text, byte offset)
list($a_text, $a_pos) = $a_matches[0];
list($b_text, $b_pos) = $b_matches[0];
if ($a_pos == $b_pos) {
if (strlen($a_text) == strlen($b_text)) {
// $a and $b matched the exact same string
} else if (strlen($a_text) > strlen($b_text)) {
// $a and $b started matching at the same spot but $a is longer
} else {
// $a and $b started matching at the same spot but $b is longer
}
} else if ($a_pos < $b_pos) {
// $a matched first
} else {
// $b matched first
}
} else if ($a_matched) {
// $a matched, $b didn't
} else if ($b_matched) {
// $b matched, $a didn't
} else {
// neither one matched
}
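The same comparison can be generalised to any number of patterns. The sketch below is an illustration only: the function name earliest_match and its return shape are assumptions, not part of the answer above. It runs each pattern separately with PREG_OFFSET_CAPTURE and keeps the match with the smallest offset, preferring the longer match on a tie.

```php
<?php
// Hypothetical helper: find which of several patterns matches earliest in $text.
// Ties on position are broken in favour of the longer match.
// Returns array(pattern index, byte offset, matched text), or null if none match.
function earliest_match(array $patterns, $text) {
    $best = null;
    foreach ($patterns as $index => $pattern) {
        if (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE) !== 1) {
            continue; // no match (or a pattern error)
        }
        list($matched, $offset) = $m[0];
        if ($best === null
            || $offset < $best[1]
            || ($offset === $best[1] && strlen($matched) > strlen($best[2]))
        ) {
            $best = array($index, $offset, $matched);
        }
    }
    return $best;
}

$a = '# /[a-z] #i';
$b = '/ Moo /x';
// $b matches "Moo" at offset 4, before $a matches " /x " at offset 7.
print_r(earliest_match(array($a, $b), 'Say Moo /x now'));
```

This still runs one preg_match per pattern, so it does not remove the cost the question complains about; it only makes the position and length comparison reliable.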