Regex to match specific string not enclosed by another, different specific string - php

I need a regex to match a string not enclosed by another different, specific string. For instance, in the following situation it would split the content into two groups: 1) The content before the second {Switch} and 2) The content after the second {Switch}. It wouldn't match the first {Switch} because it is enclosed by {my_string}'s. The string will always look like shown below (i.e. {my_string}any content here{/my_string})
Some more
{my_string}
Random content
{Switch} //This {Switch} may or may not be here, but should be ignored if it is present
More random content
{/my_string}
Content here too
{Switch}
More content
So far I've gotten what is below which I know isn't very close at all:
(.*?)\{Switch\}(.*?)
I'm just not sure how to use the [^] (not operator) with a specific string versus different characters.

It really seems you're trying to use a regular expression to parse a grammar - something that regular expressions are really bad at doing. You might be better off writing a parser to break down your string into the tokens that build it, and then processing that tree.
Perhaps something like http://drupal.org/project/grammar_parser might help.
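As a rough illustration of the tokenising idea (this is not the linked Drupal module), a minimal sketch could split the input into marker tokens and text tokens and keep a depth counter. The function name `split_on_top_level_switch` is made up for this example, and it assumes the three markers appear literally, as in the question.

```php
<?php
// Hypothetical helper: split $doc on {Switch} markers that are NOT inside
// a {my_string}...{/my_string} block.
function split_on_top_level_switch(string $doc): array
{
    // Tokenise: each marker becomes its own token, the text in between stays as-is.
    $tokens = preg_split(
        '~(\{my_string\}|\{/my_string\}|\{Switch\})~',
        $doc,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    $parts = [''];
    $depth = 0; // greater than 0 while inside {my_string}...{/my_string}
    foreach ($tokens as $token) {
        if ($token === '{my_string}') {
            $depth++;
        } elseif ($token === '{/my_string}' && $depth > 0) {
            $depth--;
        } elseif ($token === '{Switch}' && $depth === 0) {
            $parts[] = ''; // start a new part and drop the marker itself
            continue;
        }
        // Everything else (including an enclosed {Switch}) is kept verbatim.
        $parts[count($parts) - 1] .= $token;
    }
    return $parts;
}
```

Because the depth counter is explicit, the same loop would also cope with nested {my_string} blocks, which a single regular expression handles poorly.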

Try this simple function:
function find_content($doc) {
$temp = $doc;
preg_match_all('~{my_string}.*?{/my_string}~is', $temp, $x);
$i = 0;
while (isset($x[0][$i])) {
$temp = str_replace($x[0][$i], "{REPL:$i}", $temp);
$i++;
}
$res = explode('{Switch}', $temp);
foreach ($res as &$part)
foreach($x[0] as $id=>$content)
$part = str_replace("{REPL:$id}", $content, $part);
return $res;
}
Use it this way
$content_parts = find_content($doc); // $doc is your input document
print_r($content_parts);
Output (your example)
Array
(
[0] => Some more
{my_string}
Random content
{Switch} //This {Switch} may or may not be here, but should be ignored if it is present
More random content
{/my_string}
Content here too
[1] =>
More content
)

You can try positive lookahead and lookbehind assertions (http://www.regular-expressions.info/lookaround.html)
It might look something like this:
$content = 'string of text before some random content switch text some more random content string of text after';
$before = preg_quote('string of text before', '/');
$switch = preg_quote('switch text', '/');
$after = preg_quote('string of text after', '/');
if( preg_match('/(?<=' . $before . ')(.*?)' . $switch . '(.*)(?=' . $after . ')/', $content, $matches) ) {
// $matches[1] == ' some random content '
// $matches[2] == ' some more random content '
}

$regex = '~(?:(?!\{my_string\})(.*?))(\{Switch\})(?:(.*?)(?!\{my_string\}))~';
/* if "my_string" and "Switch" aren't wrapped by "{" and "}" just remove "\{" and "\}" */
$yourNewString = preg_replace($regex,"$1",$yourOriginalString);
This might work. I can't test it now, but I'll update later!
I don't know if this is what you're looking for, but to negate more than one character, the regex syntax is:
(?!yourString)
and it is called "negative lookahead assertion".
/Edit:
This should work and return true:
$stringMatchesYourRulesBoolean = preg_match('~(.*?)('.$my_string.')(.*?)(?<!'.$my_string.') ?('.$switch.') ?(?!'.$my_string.')(.*?)('.$my_string.')(.*?)~',$yourString);
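A lookahead on its own cannot tell whether a given {Switch} sits between a {my_string} pair. One alternative, which is not what the answer above uses, is PCRE's `(*SKIP)(*FAIL)` backtracking verbs: match the whole enclosed block first, throw that match away, and let only the remaining {Switch} occurrences count. A minimal sketch, assuming the markers are the literal strings from the question and that blocks are not nested:

```php
<?php
// First alternative: consume a whole {my_string}...{/my_string} block, then
// (*SKIP)(*FAIL) rejects it and resumes scanning after the block.
// Second alternative: a {Switch} that was not swallowed by the first one.
$pattern = '~\{my_string\}.*?\{/my_string\}(*SKIP)(*FAIL)|\{Switch\}~s';

$doc = "Some more\n{my_string}\nRandom content\n{Switch}\nMore random content\n"
     . "{/my_string}\nContent here too\n{Switch}\nMore content";

// Split only on the {Switch} that lies outside the enclosed block.
$parts = preg_split($pattern, $doc);
print_r($parts);
```

The enclosed {Switch} stays inside the first element of `$parts`, and the second element holds the text after the outer {Switch}.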

Have a look at PHP PEG. It is a little parser written in PHP. You can write your own grammar and parse it. It's going to be very simple in your case.
The grammar syntax and the way of parsing is all explained in the README.md
Extracts from the readme:
token* - Token is optionally repeated
token+ - Token is repeated at least one
token? - Token is optionally present
Tokens may be :
- bare-words, which are recursive matchers - references to token rules defined elsewhere in the grammar,
- literals, surrounded by `"` or `'` quote pairs. No escaping support is provided in literals.
- regexs, surrounded by `/` pairs.
- expressions - single words (match \w+)
Sample grammar: (file EqualRepeat.peg.inc)
class EqualRepeat extends Packrat {
/* Any number of a followed by the same number of b and the same number of c characters
* aabbcc - good
* aaabbbccc - good
* aabbc - bad
* aabbacc - bad
*/
/*Parser:Grammar1
A: "a" A? "b"
B: "b" B? "c"
T: !"b"
X: &(A !"b") "a"+ B !("a" | "b" | "c")
*/
}

Related

Erasing C comments with preg_replace

I need to erase all comments in $string which contains data from some C file.
The thing I need to replace looks like this:
something before that shouldnt be replaced
/*
* some text in between with / or * on many lines
*/
something after that shouldnt be replaced
and the result should look like this:
something before that shouldnt be replaced
something after that shouldnt be replaced
I have tried many regular expressions, but none of them works the way I need.
Here are some latest ones:
$string = preg_replace("/\/\*(.*?)\*\//u", "", $string);
and
$string = preg_replace("/\/\*[^\*\/]*\*\//u", "", $string);
Note: the text is in UTF-8, the string can contain multibyte characters.
You would also want to add the s modifier to tell the regex that .* should include newlines. I always think of s to mean "treat the input text as a single line"
So something like this should work:
$string = preg_replace("/\\/\\*(.*?)\\*\\//us", "", $string);
Example: http://codepad.viper-7.com/XVo9Tp
Edit: Added extra escape slashes to the regex as Brandin suggested because he is right.
I don't think a regexp is a good fit here. What about writing a very small parser to remove the comments? I haven't written PHP for a long time, so I will only sketch the idea as a simple algorithm. I haven't tested this; it is just meant to convey the idea:
buf = new String() // holds the source code without comments
pos = 0
while(string[pos] != EOF) {
    if(string[pos] == '/' && string[pos + 1] == '*') {
        pos += 2; // skip the opening /*
        while(string[pos] != EOF)
        {
            if(string[pos] == '*' && string[pos + 1] == '/') {
                pos += 2; // skip the closing */
                break;
            }
            pos++;
        }
        continue; // re-check the current position: it may be EOF or another comment
    }
    buf += string[pos++]; // copy everything that is outside a comment
}
where:
string is the C source code
buf is a dynamically allocated string that expands as needed
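For reference, the same scanning idea could look roughly like this in PHP. This is a sketch only: the function name `strip_block_comments` is invented here, and it deliberately ignores string literals and //-comments, so a /* inside a C string would still be treated as a comment start. Scanning byte by byte is safe for UTF-8 input because `/` and `*` never occur inside a multibyte sequence.

```php
<?php
// Hypothetical helper: remove /* ... */ blocks with a hand-written scanner.
function strip_block_comments(string $source): string
{
    $buf = '';               // the source code without comments
    $pos = 0;
    $len = strlen($source);
    while ($pos < $len) {
        if ($source[$pos] === '/' && $pos + 1 < $len && $source[$pos + 1] === '*') {
            $pos += 2;       // skip the opening /*
            while ($pos < $len) {
                if ($source[$pos] === '*' && $pos + 1 < $len && $source[$pos + 1] === '/') {
                    $pos += 2; // skip the closing */
                    break;
                }
                $pos++;
            }
            continue;        // an unterminated comment simply runs to the end
        }
        $buf .= $source[$pos++];
    }
    return $buf;
}

echo strip_block_comments("before\n/*\n * some / text *\n */\nafter");
```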
It is very hard to do this perfectly without ending up writing a full C parser.
Consider the following, for example:
// Not using /*-style comment here.
// This line has an odd number of " characters.
while (1) {
printf("Wheee!
(*\/*)
\\// - I'm an ant!
");
/* This is a multiline comment with a // in, and
// an odd number of " characters. */
}
So, from the above, we can see that our problems include:
Multiline comment sequences should be ignored within double quotes, unless those double quotes are themselves part of a comment.
Single-line comment sequences can be contained in double-quoted strings and in multiline comments.
Here's one possibility to address some of those issues, but far from perfect.
// Remove "-strings, //-comments and /*block-comments*/, then restore "-strings.
// Based on regex by mauke of Efnet's #regex.
$file = preg_replace('{("[^"]*")|//[^\n]*|(/\*.*?\*/)}s', '\1', $file);
try this:
$string = preg_replace("#\/\*\n?(.*)\*\/\n?#ms", "", $string);
Use # as the regexp delimiter, and replace the u modifier with the right ones: m (PCRE_MULTILINE) and s (PCRE_DOTALL).
Reference: http://php.net/manual/en/reference.pcre.pattern.modifiers.php
It is important to note that my regexp does not find more than one "comment block"... Use of "dot match all" is generally not a good idea.
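The caveat about multiple comment blocks comes from the greedy `.*`: with the s modifier it runs from the first `/*` to the last `*/` in the input. A small sketch with a simplified pattern and a made-up input string shows the difference a lazy quantifier makes:

```php
<?php
$src = "a /* one */ b /* two */ c";

// Greedy: one match spanning both comments, so " b " is lost as well.
$greedy = preg_replace('#/\*.*\*/#s', '', $src);

// Lazy: each comment is matched separately.
$lazy = preg_replace('#/\*.*?\*/#s', '', $src);

var_dump($greedy); // string(4) "a  c"
var_dump($lazy);   // string(7) "a  b  c"
```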

Get String of first Argument of an function-call

I want to search within PHP files for a particular function call. The reason is that I want to generate .MO files for the gettext extension, so I first need to create a .PO file that contains all the needed text strings.
I can already find a lot of the texts, but there are some problems.
Here is my regex to find the first argument of a function call:
/\_\([\'|\"]{1}(.+?[^\\\])[\'|\"]{1}[,]{0,1}.*?\)+/si
I need to find function-calls with the following patterns:
_("text");
_("text %s", 3);
_('text');
The text could contain escaped quotes. My problem is actually that I need to know whether an apostrophe or a normal quote was used for the call.
If I have the call
_('"text"');
then the problem is that I get the text
"text
without the ending quote.
Does anybody have an idea how I can get my regex to work?
I would use PHP's tokenizer for this kind of stuff, not regular expressions:
$funcName = '_';
$tokens = token_get_all(file_get_contents('path/to/your/script.php'));
$strings = array();
foreach($tokens as $index => $token){
if(!is_array($token))
continue;
if($token[0] === T_CONSTANT_ENCAPSED_STRING){
if(!isset($tokens[$index - 2]) || ($tokens[$index - 1] !== "("))
continue;
list($id, $text, $line) = $tokens[$index - 2];
// this is your string (substr drops quotes around it)
if(($id === T_STRING) && ($text === $funcName))
$strings[] = substr($token[1], 1, -1);
}
}
var_dump($strings);
Raw regex:
_\((?|'((?:[^'\\]|\\.)*)'|"((?:[^"\\]|\\.)*)")
Delimited regex:
~_\((?|'((?:[^'\\]|\\.)*)'|"((?:[^"\\]|\\.)*)")~
The result is in capturing group 1. I used the branch reset pattern (?|pattern) so that the capturing group number is reset for each alternating branch separated by |.
Inside the branch reset (?|'((?:[^'\\]|\\.)*)'|"((?:[^"\\]|\\.)*)") are two patterns:
'((?:[^'\\]|\\.)*)': Match and capture content inside single quoted string, which consists of either non-quote-non-backslash or escaped sequence. Actually, I am a bit careless here, since (raw) new line character is considered part of the string. I don't think the specification would allow this, but if the input contains valid code, then there should be no problem.
"((?:[^"\\]|\\.)*)": Same as above, but for double quoted string.
Note that I don't consume the rest of the arguments to the function.
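To try the delimited pattern from PHP without fighting quote and backslash escaping, one option is to put it in a nowdoc, which takes the text literally. The following is a sketch with a made-up sample source; the captured text is returned raw, so escape sequences such as `\'` are still present in the result.

```php
<?php
// Nowdoc: no escaping is applied, so the regex can be pasted as-is.
$pattern = <<<'REGEX'
~_\((?|'((?:[^'\\]|\\.)*)'|"((?:[^"\\]|\\.)*)")~
REGEX;

// Sample calls, including the problematic _('"text"') from the question.
$source = <<<'SRC'
_("text");
_("text %s", 3);
_('text');
_('"text"');
_('it\'s');
SRC;

preg_match_all($pattern, $source, $m);
print_r($m[1]); // group 1 holds the first argument, whichever quote was used
```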

PHP - preg_replace not matching multiple occurrences

Trying to replace a string, but it seems to only match the first occurrence, and if I have another occurrence it doesn't match anything, so I think I need to add some sort of end delimiter?
My code:
$mappings = array(
'fname' => $prospect->forename,
'lname' => $prospect->surname,
'cname' => $prospect->company,
);
foreach($mappings as $key => $mapping) if(empty($mapping)) $mappings[$key] = '$2';
$match = '~{(.*)}(.*?){/.*}$~ise';
$source = 'Hello {fname}Default{/fname} {lname}Last{/lname}';
// $source = 'Hello {fname}Default{/fname}';
$text = preg_replace($match, '$mappings["$1"]', $source);
So if I use the $source that's commented, it matches fine, but if I use the one currently in the code above where there's 2 matches, it doesn't match anything and I get an error:
Message: Undefined index: fname}Default{/fname} {lname
Filename: schedule.php(62) : regexp code
So am I right in saying I need to provide an end delimiter or something?
Thanks,
Christian
Apparently your regexp matches fname}Default{/fname} {lname instead of Default.
As I mentioned here use {(.*?)} instead of {(.*)}.
{ has special meaning in regexps so you should escape it \\{.
And I recommend using preg_replace_callback instead of the e modifier (you get more flow control and syntax highlighting, and it becomes impossible to trick your program into executing malicious code).
Last mistake you're making is not checking whether the requested index exists. :)
My solution would be:
<?php
class A { // Of course with better class name :)
public $mappings = array(
'fname' => 'Tested'
);
public function callback( $match)
{
if( isset( $this->mappings[$match[1]])){
return $this->mappings[$match[1]];
}
return $match[2];
}
}
$a = new A();
$match = '~\\{([^}]+)\\}(.*?)\\{/\\1\\}~is';
$source = 'Hello {fname}Default{/fname} {lname}Last{/lname}';
echo preg_replace_callback( $match, array($a, 'callback'), $source);
This results in:
[vyktor@grepfruit tmp]$ php stack.php
Hello Tested Last
Your regular expression is anchored to the end of the string, so your closing {/whatever} must be the last thing in your string. Also, since your opening and closing tags are simply .*, there's nothing in there to make sure they match up. What you want is to make sure that your closing tag matches your opening one: using a backreference like {(.+)}(.*?){/\1} will make sure they're the same.
I'm sure there's other gotchas in there - if you have control over the format of strings you're working with (IE - you're rolling your own templating language), I'd seriously consider moving to a simpler, easier to match format. Since you're not 'saving' the default values, having enclosing tags provides you with no added value but makes the parsing more complicated. Just using $VARNAME would work just as well and be easier to match (\$[A-Z]+), without involving back-references or having to explicitly state you're using non-greedy matching.
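A minimal sketch of the simpler `$VARNAME` format suggested here, assuming upper-case placeholder names and a plain array of values. The function name `render_template` and the choice to leave unknown placeholders untouched are assumptions made for this example.

```php
<?php
// Hypothetical helper: replace $NAME placeholders with values from $values.
function render_template(string $template, array $values): string
{
    return preg_replace_callback(
        '/\$([A-Z]+)/',
        function (array $m) use ($values) {
            // Unknown placeholders are left as they are rather than guessed.
            return array_key_exists($m[1], $values) ? $values[$m[1]] : $m[0];
        },
        $template
    );
}

echo render_template('Hello $FNAME $LNAME', ['FNAME' => 'Tested']);
// prints: Hello Tested $LNAME
```

Note that this format drops the inline default values of the original {fname}Default{/fname} syntax, so defaults would have to live in the values array instead.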

php: "sscanf" to 'consume' a string but allows a missing parameter

This is for an osCommerce contribution called
("Automatically add multiple products with attribute to cart from external source")
This existing code uses sscanf to 'explode' a string that represents a
- product ID,
- a productOption,
- and quantity:
sscanf('28{8}17[1]', '%d{%d}%d[%f]',
$productID, // 28
$productOptionID, $optionValueID, //{8}17 <--- Product Options!!!
$productQuantity //[1]
);
This works great if there is only 1 'set' of Product Options (e.g. {8}17).
But this procedure needs to be adapted so that it can handle multiple Product Options, and put them into an array, e.g.:
'28{8}17{7}15{9}19[1]' //array(8=>17, 7=>15, 9=>19)
OR
'28{8}17{7}15[1]' //array(8=>17, 7=>15)
OR
'28{8}17[1]' //array(8=>17)
Thanks in advance. (I'm a pascal programmer)
You should not try to do complex recursive parses with one sscanf. Stick it in a loop. Something like:
<?php
$str = "28{8}17{7}15{9}19[1]";
#$str = "28{8}17{7}15[1]";
#$str = "28{8}17[1]";
sscanf($str,"%d%s",$prod,$rest);
printf("Got prod %d\n", $prod);
while (sscanf($rest,"{%d}%d%s",$opt,$id,$rest))
{
printf("opt=%d id=%d\n",$opt,$id);
}
sscanf($rest,"[%d]",$quantity);
printf("Got qty %d\n",$quantity);
?>
Regular expressions may also be worth a look:
$a = '28{8}17{7}15{9}19[1]';
$matches = null;
preg_match_all('~\\{[0-9]{1,3}\\}[0-9]{1,3}~', $a, $matches);
To get the other values:
$id = (int) $a; // ;)
$quantity = substr($a, strrpos($a, '[')+ 1, -1);
A small update, following the comment:
$a = '28{8}17{7}15{9}19[1]';
$matches = null;
preg_match_all('~\\{([0-9]{1,3})\\}([0-9]{1,3})~', $a, $matches, PREG_SET_ORDER);
$result = array();
foreach ($matches as $entry) {
$result[$entry[1]] = $entry[2];
}
sscanf() is not the ideal tool for this task because it doesn't handle recurring patterns and I don't see any real benefit in type casting or formatting the matched subexpressions.
If this was purely a text extraction task (in other words your incoming data was guaranteed to be perfectly formatted and valid), then I could have recommended a cute solution that used strtr() and parse_str() to quickly generate a completely associative multi-dimensional output array.
However, when you commented "with sscanf I had an infinite loop if there is a missing bracket in the string (because it looks for open and closing {}s). Or if I leave out a value. But with your regex solution, if I drop a bracket or leave out a value", then this means that validation is an integral component of this process.
For that reason, I'll recommend a regex pattern that both validates the string and breaks the string into its meaningful parts. There are several logical aspects to the pattern but the hero here is the \G metacharacter that allows the pattern to "continue" matching where the pattern last finished matching in the string. This way we have an array of continuous fullstring matches to pull data from when creating your desired multidimensional output.
The pattern ^\d+(?=.+\[\d+]$)|\G(?!^)(?:{\K\d+}\d+|\[\K\d(?=]$)) in preg_match_all() generates the following type of output in the fullstring element ([0]):
[id], [option0, option1, ...](optional), [quantity]
The first branch in the pattern (^\d+(?=.+\[\d+]$)) validates the string to start with the id number and ends with a square brace wrapped number representing the quantity.
The second branch begins with the "continue" character and contains two logical branches itself. The first matches an option expression (and forgets the leading { thanks to \K) and the second matches the number in the quantity expression.
To create the associative array of options, target the "middle" elements (if there are any), then split the strings on the lingering } and assign these values as key-value pairs.
This is a direct solution because it only uses one preg_ call and it does an excellent job of validating and parsing the variable length data.
Code: (Demo with a battery of test cases)
if (!preg_match_all('~^\d+(?=.+\[\d+]$)|\G(?!^)(?:{\K\d+}\d+|\[\K\d(?=]$))~', $test, $m)) {
echo "invalid input";
} else {
var_export(
[
'id' => array_shift($m[0]),
'quantity' => array_pop($m[0]),
'options' => array_reduce(
$m[0],
function($result, $string) {
[$key, $result[$key]] = explode('}', $string, 2);
return $result;
},
[]
)
]
);
}
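To make the shape of `$m[0]` concrete, the sketch below runs the same pattern on one of the sample strings from the question and on a string that lacks the trailing quantity. The variable names `$none` and `$count` are chosen for this example only.

```php
<?php
$pattern = '~^\d+(?=.+\[\d+]$)|\G(?!^)(?:{\K\d+}\d+|\[\K\d(?=]$))~';

// Valid input: id, then each option as "key}value", then the quantity.
preg_match_all($pattern, '28{8}17{7}15[1]', $m);
print_r($m[0]); // 28, 8}17, 7}15, 1

// Without the trailing [quantity] the first branch fails, so nothing matches.
$count = preg_match_all($pattern, '28{8}17{7}15', $none);
var_dump($count); // int(0)
```

As written, the last branch uses `\d`, so a multi-digit quantity such as `[12]` would not be captured; changing it to `\d+` looks like the natural adjustment if larger quantities are expected.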

Regular Expressions: how to do "option split" replaces

Those regular expressions drive me crazy. I'm stuck with this one:
test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not
Task:
Remove all [[ and ]], and if there is an option split, choose the latter one, so the output should be:
test1:link test2:silver test3:out1insideout2 test4:this|not
I came up with (PHP)
$text = preg_replace("/\\[\\[|\\]\\]/",'',$text); // remove [[ or ]]
This works for part 1 of the task, but before that I think I should do the option split. My best solution:
$text = preg_replace("/\\[\\[(.*\|)(.*?)\\]\\]/",'$2',$text);
Result:
test1:silver test3:[[out1[[inside]]out2]] this|not
I'm stuck. Could someone with a few free minutes help me? Thanks!
I think the easiest way to do this would be multiple passes. Use a regular expression like:
\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]
This will replace option strings to give you the last option from the group. If you run it repeatedly until it no longer matches, you should get the right result (the first pass will replace [[out1[[inside]]out2]] with [[out1insideout2]] and the second will ditch the brackets).
Edit 1: By way of explanation,
\[\[ # Opening [[
(?: # A non-capturing group (we don't want this bit)
[^\[\]] # Non-bracket characters
* # Zero or more of anything but [ or ]
\| # A literal '|' character representing the end of the discarded options
)? # This group is optional: if there is only one option, it won't be present
( # The group we're actually interested in ($1)
[^\[\]] # All the non-bracket characters
+ # Must be at least one
) # End of $1
\]\] # End of the grouping.
Edit 2: Changed expression to ignore ']' as well as '[' (it works a bit better like that).
Edit 3: There is no need to know the number of nested brackets as you can do something like:
$oldtext = "";
$newtext = $text;
while ($newtext != $oldtext)
{
$oldtext = $newtext;
$newtext = preg_replace($regexp, $replace, $oldtext); // $regexp: the delimited pattern above, $replace: '$1'
}
$text = $newtext;
Basically, this keeps running the regular expression replace until the output is the same as the input.
Note that I don't know PHP, so there are probably syntax errors in the above.
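A PHP version of that loop might look like the sketch below. The function name `resolve_options` is invented for this example; the pattern is the one from this answer, and the loop relies on the optional fifth argument of `preg_replace`, which receives the number of replacements made.

```php
<?php
// Hypothetical helper: keep applying the pattern until nothing is replaced.
function resolve_options(string $text): string
{
    $regexp = '/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/';
    do {
        // Each pass resolves the innermost [[...]] groups, keeping the last option.
        $text = preg_replace($regexp, '$1', $text, -1, $count);
    } while ($count > 0);
    return $text;
}

echo resolve_options('test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not');
// prints: test1:link test2:silver test3:out1insideout2 test4:this|not
```

Every replacement shortens the string, so the loop is guaranteed to stop.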
This is impossible to do in one regular expression since you want to keep content in multiple "hierarchies" of the content. It would be possible otherwise, using a recursive regular expression.
Anyways, here's the simplest, most greedy regular expression I can think of. It should only replace if the content matches your exact requirements.
You will need to escape all backslashes when putting it into a string (\ becomes \\.)
\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]
As others have already explained, you use this with multiple passes. Keep looping while there are matches, performing replacement (only keeping match group 1.)
Difference from other regular expressions here is that it will allow you to have single brackets in the content, without breaking:
test1:[[link]] test2:[[gold|si[lv]er]]
test3:[[out1[[in[si]de]]out2]] test4:this|not
becomes
test1:[[link]] test2:si[lv]er
test3:out1in[si]deout2 test4:this|not
Why try to do it all in one go? Remove the [[ ]] first and then deal with the options; do it in two lines of code.
When trying to get something going favour clarity and simplicity.
Seems like you have all the pieces.
Why not just simply remove any brackets that are left?
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$str = preg_replace('/\\[\\[(?:[^|\\]]+\\|)+([^\\]]+)\\]\\]/', '$1', $str);
$str = str_replace(array('[', ']'), '', $str);
Well, I didn't stick to just regex, because I'm of a mind that trying to do stuff like this with one big regex leads you to the old joke about "Now you have two problems". However, give something like this a shot:
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$reg = '/(.*?):(.*?)( |$)/';
preg_match_all($reg, $str, $m);
foreach($m[2] as $pos => $match) {
if (strpos($match, '|') !== FALSE && strpos($match, '[[') !== FALSE ) {
$opt = explode('|', $match); $match = $opt[count($opt)-1];
}
$m[2][$pos] = str_replace(array('[', ']'),'', $match );
}
foreach($m[1] as $k=>$v) $result[$k] = $v.':'.$m[2][$k];
This is C# using only verbatim (non-escaped) strings, hence you will have to double the backslashes in other languages.
String input = "test1:[[link]] " +
"test2:[[gold|silver]] " +
"test3:[[out1[[inside]]out2]] " +
"test4:this|not";
String step1 = Regex.Replace(input, @"\[\[([^|]+)\|([^\]]+)\]\]", @"[[$2]]");
String step2 = Regex.Replace(step1, @"\[\[|\]\]", String.Empty);
// Prints "test1:silver test3:out1insideout2 test4:this|not"
Console.WriteLine(step2);
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$s = preg_split("/\s+/",$str);
foreach ($s as $k=>$v){
$v = preg_replace("/\[\[|\]\]/","",$v);
$j = explode(":",$v);
$j[1]=preg_replace("/.*\|/","",$j[1]);
print implode(":",$j)."\n";
}
