Examples:
input: (n!/(1+n))
output: frac{n!}{1+n}
input: ((n+11)!/(n-k)^(-1))
output: frac{(n+11)!}{(n-k)^(-1)}
input: (9/10)
output: frac{9}{10}
input: ((n+11)!/(n-k)^(-1))+(11)/(2)
output: frac{(n+11)!}{(n-k)^(-1)}+(11)/(2)
The following regex works if there are no nested parentheses:
\(([^\/\)]*)\/([^\)]*)\)
The following matches balanced parentheses:
#\((([^()]++|\((?:[^()]++|(?R))+\))+)\)#
I just cannot figure out how to "combine" them, that is, write a single regex that handles both division and balanced parentheses.
I think something like this should work:
((?:\w+|\((?1)\))(?:[+*^-](?1)|!)?)\/((?1))
Now, this probably isn't perfect, but here's the idea:
The first group, $1, is ((?:\w+|\((?1)\))(?:[+*^-](?1)|!)?), which is:
(A literal) or (a balanced expression wrapped in parentheses), followed by an optional operator and another balanced expression if needed.
Writing it that way, we can use (?1) anywhere in the regex to refer to another balanced expression.
Working example: http://ideone.com/PNLOD
I know you've already accepted a regexp, but the real answer to what you're trying to do is a proper parser... and no, you don't need to code this from scratch or reinvent the wheel. While a lot of people hate phpclasses, the evalMath class that you can find there is incredibly useful for parsing and evaluating mathematical formulae. See my answer to a similar question for details of how it can be used and extended.
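As a rough illustration of the parser route, here is a minimal Python sketch of a hand-written scanner that reproduces the four examples from the question. The function names (to_frac, match_paren, top_level_slash, strip_outer) are made up for this sketch, and the rule it applies (rewrite a parenthesised group only when it contains a slash at its own top level) is an assumption inferred from those examples, not a complete grammar.

```python
def match_paren(s, i):
    """Return the index of the ')' that closes the '(' at s[i]."""
    depth = 0
    for j in range(i, len(s)):
        if s[j] == '(':
            depth += 1
        elif s[j] == ')':
            depth -= 1
            if depth == 0:
                return j
    raise ValueError("unbalanced parentheses")


def top_level_slash(s):
    """Return the index of the first '/' outside any parentheses, or -1."""
    depth = 0
    for j, ch in enumerate(s):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == '/' and depth == 0:
            return j
    return -1


def strip_outer(s):
    """Drop one pair of parentheses, but only if it wraps the whole string."""
    if s.startswith('(') and match_paren(s, 0) == len(s) - 1:
        return s[1:-1]
    return s


def to_frac(s):
    """Rewrite every group of the form (a/b) as frac{a}{b}, recursively."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == '(':
            end = match_paren(s, i)
            inner = s[i + 1:end]
            k = top_level_slash(inner)
            if k >= 0:
                num = to_frac(strip_outer(inner[:k]))
                den = to_frac(strip_outer(inner[k + 1:]))
                out.append('frac{%s}{%s}' % (num, den))
            else:
                # No division directly inside this group: keep the parentheses.
                out.append('(' + to_frac(inner) + ')')
            i = end + 1
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)


print(to_frac('((n+11)!/(n-k)^(-1))+(11)/(2)'))
```

The counter-based helpers do the job that the recursive (?1) call does in the regex, which makes it easier to extend the rules later.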
Here's the problem: given a string like
"<p>The price for vehicles {capitalize(pluralize(vehicle))} is {format_number(value, language)}</p><span>{employee_name}</span><span>\{do not parse me}</span>"
I need (1) a regex pattern in PHP that matches all values between un-escaped pairs of curly brackets and (2) another regex pattern that matches function calls and nested function calls (once the first pattern is matched). Of course, if I could use one regex only for both tasks that would be awesome.
By the way, I can't use Smarty, Twig or any other library - that's the only reason I have to build a parsing mechanism myself.
Thanks a ton!
Solution
(1) A partial solution for the first problem can be found here. Basically, we use the regex (?={((?:[^{}]++|{(?1)})++)}) and find the matches at index 1 of the resulting array.
It's partial because I still need to find a way of ignoring escaped braces, though.
(2) I'm considering the use of recursive regex, as suggested by Mario. Will post result here.
Thanks, guys!
Copying the answer from the comments in order to remove this question from the "Unanswered" filter:
This appears to be what you're looking for:
(?<!\\){([^(){}]+)\(((?:[^(){}]+|\(((?2))\))*)\)}
Link: http://www.regex101.com/r/uI4qN0
~ answer per Jerry
Note: For comparison, simply ignoring escaped braces is accomplished by adding a negative lookbehind (?<!\\) to the beginning of the expression, like so:
(?<!\\)(?={((?:[^{}]++|{(?1)})++)})
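For comparison with the recursive patterns above, here is a small Python sketch that does both steps without regex recursion (Python's built-in re module has none). The helper names brace_fields and parse_expr are invented for this example, and the sketch assumes a backslash escapes exactly the next character and that arguments are separated by commas.

```python
def brace_fields(text):
    """Return the contents of every un-escaped, top-level {...} pair."""
    fields = []
    depth = 0
    start = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '\\' and i + 1 < len(text):
            i += 2  # skip the escaped character, e.g. "\{"
            continue
        if ch == '{':
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == '}' and depth > 0:
            depth -= 1
            if depth == 0:
                fields.append(text[start:i])
        i += 1
    return fields


def parse_expr(src):
    """Parse 'name' or 'name(arg, ...)' into a str or a (name, args) tuple."""
    src = src.strip()
    if '(' not in src or not src.endswith(')'):
        return src  # plain variable
    name, rest = src.split('(', 1)
    args, depth, cur = [], 0, ''
    for ch in rest[:-1]:  # drop the final ')'
        if ch == ',' and depth == 0:
            args.append(cur)
            cur = ''
            continue
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        cur += ch
    if cur.strip():
        args.append(cur)
    return (name.strip(), [parse_expr(a) for a in args])


sample = (r'<p>The price for vehicles {capitalize(pluralize(vehicle))} is '
          r'{format_number(value, language)}</p><span>{employee_name}</span>'
          r'<span>\{do not parse me}</span>')
for field in brace_fields(sample):
    print(parse_expr(field))
```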
I have the following problem.
Let's take the input (wikitext)
======hello((my first program)) world======
I want to match "hello", "my first program" and " world" (notice the space).
But for the input:
======hello(my first program)) world======
I want to match "hello(my first program" and " world".
In other words, I want to match any letters, spaces and additionally any single symbols (no double or more).
This should be done with the unicode character properties like \p{L}, \p{S} or \p{Z}, as documented here.
Any ideas?
Addendum 1
The regex has just to stop before any double symbol or punctuation in unicode terms, that is, before any \p{S}{2,} or \p{P}{2,}.
I'm not trying to parse the whole wikitext with this, read my question carefully. The regex I'm looking for IS for the lexer I'm working on, and making it match such inputs will simplify my parser incredibly.
Addendum 2
The pattern must work with preg_match(). I can imagine how I'd have to split it first. Perhaps it would use some lookahead, I don't know, I've tried everything that I could imagine.
Using only preg_match() is a requirement set in stone by the current implementation of the lexer. It must be that way, because that's the natural way of how lexers work: they match sequences in the input stream.
return preg_split('/([\pS\pP])\\1+/', $theString);
Result: http://www.ideone.com/YcbIf
(You need to get rid of the empty strings manually.)
Edit: as a preg_match regex:
'/(?:^|([\pS\pP])\\1+)((?:[^\pS\pP]|([\pS\pP])(?!\\3))*)/'
Take the 2nd capture group when it is matched. Example: http://www.ideone.com/ErTVA
But since your lexer is going to use more than one regex anyway, you could just consume ([\pS\pP])\\1+ and discard it or, if that doesn't match, consume (?:[^\pS\pP]|([\pS\pP])(?!\\3))* and record it.
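The consume-and-discard idea can be sketched in Python as follows. This is only an illustration: Python's re module has no \p{...} classes, so the sketch checks Unicode categories through unicodedata instead, and the names is_symbol and tokens are made up here. Like the patterns above, it treats a run of two or more identical symbol or punctuation characters as a delimiter.

```python
import unicodedata


def is_symbol(ch):
    """True for Unicode symbols (S*) and punctuation (P*)."""
    return unicodedata.category(ch)[0] in ('S', 'P')


def tokens(text):
    """Split text at runs of 2+ identical symbol/punctuation characters."""
    out, cur, i = [], [], 0
    while i < len(text):
        ch = text[i]
        j = i
        while j < len(text) and text[j] == ch:
            j += 1  # j ends just past the run of identical characters
        if is_symbol(ch) and j - i >= 2:
            # Doubled symbol: discard it and close the current token.
            if cur:
                out.append(''.join(cur))
                cur = []
            i = j
        else:
            cur.append(ch)
            i += 1
    if cur:
        out.append(''.join(cur))
    return out


print(tokens('======hello((my first program)) world======'))
print(tokens('======hello(my first program)) world======'))
```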
Regular expressions are notoriously overused and ill-suited for parsing languages like this. You can get away with it for a little while, but eventually you will find something that breaks your parser, requiring tweak after tweak and a huge library of unit tests to ensure compliance.
You should seriously consider writing a proper lexer and parser instead.
Hello I want to make something like a meta language which gets parsed and cached to be more performant. So I need to be able to parse the meta code into objects or arrays.
Startidentifier: {
Endidentifier: }
You can navigate through objects with a dot(.) but you can also do arithmetic/logic/relational operations.
Here is an example of what the meta language looks like:
{mySelf.mother.job.jobName}
or nested
{mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}
or with operations
{obj.val * (otherObj.intVal + myObj.longVal) == 1200}
or more logical
{obj.condition == !myObj.otherCondition}
I think most of you already understand what I want. At the moment I can only do simple operations (without nesting and with only 2 values), but nesting for getting values with dynamic property names works fine. Text concatenation also works fine,
e.g. "Hello {myObj.name}! How are you {myObj.type}?".
A short-if form such as (condition) ? (true-case) : (false-case) would also be nice, but I have no idea how to parse all of that. At the moment I use loops with some regex, but it would probably be faster and more maintainable if more of the work were done in regex.
So could anyone give me some hints or want to help me? Maybe visit the project site to understand what I need that for: http://sourceforge.net/projects/blazeframework/
Thanks in advance!
It is non-trivial to parse an indeterminate number of matching braces using regular expressions, because in general you will match either too much or too little.
For instance, consider Hello {myObj.name}! {mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}? to use two examples from your input in the same string:
If you use the first regular expression that probably comes to mind, \{.*\}, to match braces, you will get one match: {myObj.name}! {mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}. This is because regular expressions are greedy by default and will match as much as possible.
From there, we can try to use a non-greedy pattern \{.*?\}, which will match as little as possible between the opening and closing brace. Using the same string, this pattern will result in two matches: {myObj.name} and {mySelf.{myObj.{keys["ObjProps"][0]}. Obviously the second is not a full expression, but a non-greedy pattern will match as little as possible, and that is the smallest match that satisfies the pattern.
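The two behaviours described above are easy to reproduce. The following Python snippet is just a demonstration using the standard re module (the variable names are arbitrary); the same greedy and non-greedy semantics apply in PCRE.

```python
import re

text = ('Hello {myObj.name}! '
        '{mySelf.{myObj.{keys["ObjProps"][0]}.personAttribute.first}.size}?')

# Greedy: runs from the first '{' to the last '}' in the string.
greedy = re.findall(r'\{.*\}', text)

# Non-greedy: each match stops at the first '}' it can reach.
lazy = re.findall(r'\{.*?\}', text)

print(greedy)
print(lazy)
```

Neither pattern knows anything about nesting, which is why the second lazy match is cut off in the middle of the expression.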
PCRE does allow recursive regular expressions, but you're going to end up with a very complex pattern if you go down that route.
The best solution, in my opinion, would be to construct a tokenizer (which could be powered by regex) to turn your text into an array of tokens which can then be parsed.
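A regex-powered tokenizer of that kind might look like the following Python sketch. The token categories and the operator set are assumptions based only on the examples in the question, and the names TOKEN_RE and tokenize are invented for illustration; a real implementation would need the full operator list of the meta language.

```python
import re

# One named alternative per token type; the first alternative that matches wins.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<STRING>"[^"]*")
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>==|!=|<=|>=|&&|\|\||[-+*/<>!?:.])
  | (?P<PUNCT>[()\[\]{}])
  | (?P<WS>\s+)
""", re.VERBOSE)


def tokenize(src):
    """Turn source text into a list of (type, text) pairs, skipping whitespace."""
    result, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r} at {pos}")
        if m.lastgroup != 'WS':
            result.append((m.lastgroup, m.group()))
        pos = m.end()
    return result


for tok in tokenize('{obj.val * (otherObj.intVal + myObj.longVal) == 1200}'):
    print(tok)
```

Once the input is a flat token list, nesting of braces and parentheses can be handled by an ordinary recursive-descent parser instead of by the regex itself.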
Maybe have a look at the PREG_OFFSET_CAPTURE flag?
Having the following regular expression:
([a-z])([0-9])\1
It matches a5a. Is there any way for it to also match a5b, a5c, a5d and so on?
EDIT: Okay, I understand that I could just use ([a-z])([0-9])([a-z]) but I've a very long and complicated regular expression (matching sub-sub-sub-...-domains or matching an IPv4 address) that would really benefit from the behavior described above. Is that somehow possible to achieve with backreferences or anything else?
Anon. answer is what I need, but it seems to be erroneous.
The answer is not with backreferences
Backreference means match the value that was previously matched. It does not mean match the previous expression. But if your language allows it you can substitute a variable in a string into your expression before compiling it.
Tcl:
set exp1 {([a-z])}
regexp "${exp1}(\[0-9\])${exp1}+" $string
Javascript:
var exp1 = '([a-z])';
var regexp = new RegExp(exp1 + '([0-9])' + exp1 + '+');
string.match(regexp);
Perl:
my $exp1 = '([a-z])';
$string =~ /${exp1}([0-9])${exp1}+/;
You don't need back references if the second letter is independent of the first, right?
([a-z])([0-9])([a-z])+
EDIT
If you just don't want to repeat the last part over and over again, then:
([a-z])([0-9])([a-z])
Just taking away the '+'.
The whole point of a back-reference in a regular expression is to match the same thing as the indicated sub-expression, so there's no way to disable that behavior.
To get the behavior you want, of being able to reuse a part of a regular expression later, you could just define the parts of the regular expression you wish to reuse in a separate string, and (depending on the language you're working in) use string interpolation or concatenation to build the regular expression from the pieces.
For instance, in Ruby:
>> letter = '([a-z])'
=> "([a-z])"
>> /#{letter}([0-9])#{letter}+/ =~ "a5b"
=> 0
>> /#{letter}([0-9])#{letter}+/ =~ "a51"
=> nil
Or in JavaScript:
var letter = '([a-z])';
var re = new RegExp(letter + '([0-9])' + letter + '+');
"a5b".match(re)
I suspect you're wanting something similar to the Perl (?PARNO) construct (it's not just for recursion ;).
/([a-z])([0-9])(?1)+/
will match what you want - and any changes to the first capture group will be reflected in what the (?1) matches.
I don't follow your question?
[a-z][0-9][a-z] Exactly 1
[a-z][0-9][a-z]? One or 0
[a-z][0-9][a-z]+ 1 or more
[a-z][0-9][a-z]* 0 or more
Backreferences are for retrieving data from earlier in the regex and using it later on. They aren't for fixing stylistic issues. A regex with backreferences will not function as one without. You might just need to get used to regexes being repetitive and ugly.
Maybe try Python, which makes it easy to build regexes up from smaller blocks. Not clear if you're allowed to change your environment… you're lucky to have backreferences in the first place.
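To make the "build it from smaller blocks" idea concrete, here is a short Python sketch. The constant names are arbitrary, and the IPv4 octet pattern is one common way to write it, included only because the question mentions IPv4 addresses.

```python
import re

LETTER = r'[a-z]'
DIGIT = r'[0-9]'

# A backreference repeats the matched text, so only a5a-style strings match.
with_backref = re.compile(r'([a-z])([0-9])\1')

# Reusing the sub-pattern instead repeats the expression, so a5b matches too.
from_pieces = re.compile(f'({LETTER})({DIGIT})({LETTER})')

# The same trick keeps longer patterns readable, e.g. a dotted IPv4 address.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
IPV4 = re.compile(rf'{OCTET}(?:\.{OCTET}){{3}}')

print(bool(with_backref.fullmatch('a5b')))
print(bool(from_pieces.fullmatch('a5b')))
print(bool(IPV4.fullmatch('192.168.0.1')))
```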
I'm building a pseudo-variable parser in PHP to allow clean and simple syntax in views, and have implemented an engine for if-statements. Now I want to be able to use nested if-statements, and the simplest, most elegant solution I thought of was to use indentation as the marker for a block.
So this is basically the layout I'm looking for in the view:
{if x is empty}
    {if y is array}
        Hello World
    {endif}
{endif}
The script would find the first if-statement and match it with the endif on the same depth. If it evaluates to true the inside block will be parsed as well.
Now, I'm having trouble setting up the regular expression to use depth in the following code:
preg_match_all('|([\t ]?){if (.+?) is (.+?)}(.+?){endif}|s', $template, $match);
Basically I want the first match in ([\t ]?) to be placed before {endif} using some kind of variable, and make sure that the statement won't be complete if there is no matching {endif} on the same depth as the {if ...}.
Can you help me complete it?
You cannot in general use regular expressions for this problem, because the language you've defined is not regular (since it requires counting occurrences of {if} and {endif}).
What you've got is a variant of the classic matching parentheses problem.
You'd be better off using some kind of finite-state machine with a depth counter (or a stack) to keep track of occurrences of {if} and {endif}.
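Keeping track of {if} and {endif} occurrences can be sketched in Python with an explicit stack. This is only an illustration of the bookkeeping: the names TAG_RE and parse_blocks and the dictionary layout of the result are made up for the example, and the sketch pairs each {endif} with the nearest open {if ...} rather than looking at indentation.

```python
import re

# Matches either an opening "{if A is B}" tag or a closing "{endif}" tag.
TAG_RE = re.compile(r'\{if (.+?) is (.+?)\}|\{endif\}')


def parse_blocks(template):
    """Build a nested list of if-blocks, pairing each {if} with its {endif}."""
    root = {'cond': None, 'children': []}
    stack = [root]  # the innermost open block is always stack[-1]
    pos = 0
    for m in TAG_RE.finditer(template):
        text = template[pos:m.start()].strip()
        if text:
            stack[-1]['children'].append(text)
        if m.group(0) == '{endif}':
            if len(stack) == 1:
                raise ValueError("{endif} without a matching {if}")
            stack.pop()
        else:
            node = {'cond': (m.group(1), m.group(2)), 'children': []}
            stack[-1]['children'].append(node)
            stack.append(node)
        pos = m.end()
    if len(stack) != 1:
        raise ValueError("{if} without a matching {endif}")
    tail = template[pos:].strip()
    if tail:
        root['children'].append(tail)
    return root['children']


template = "{if x is empty}\n{if y is array}\nHello World\n{endif}\n{endif}"
print(parse_blocks(template))
```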
\1 will contain your first match, so you could do this:
preg_match_all('|([\t ]?){if (.+?) is (.+?)}(.+?)\1{endif}|s', $template, $match);
However, this won't necessarily match the correct occurrence of {endif}, only the next occurrence that is preceded by the correct number of tabs.
If you're using an explicit "endif" statement, your blocks are already closed and there's no need for any special indentation; just match '/{if.+?}(.+?){endif}/'.[1] If, on the other hand, you want Python-like indented blocks, you can get rid of {endif} (and the braces) and match only indent levels:
	if condition
		this line is in the if block
		this one too
	this one not
Your expression will look like this:
/([\t]+)if(.+)\n((\1[\t]+.+\n)+)/
"condition" will be match 2 and statement body match 3
[1] Actually, this should be something like /{if}((.(?!{if))+?){endif}/s to handle nested ifs properly.
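The indentation-based pattern above can be tried out directly in Python, whose re module supports the same \1 backreference. The sample input is an assumption: it indents the if line with one tab and the body with two, because the pattern requires at least one leading tab before "if".

```python
import re

# Same pattern as above: group 1 is the indent of the "if" line, group 2 the
# condition, group 3 every following line that is indented more deeply.
BLOCK_RE = re.compile(r'([\t]+)if(.+)\n((\1[\t]+.+\n)+)')

source = (
    "\tif condition\n"
    "\t\tthis line is in the if block\n"
    "\t\tthis one too\n"
    "\tthis one not\n"
)

m = BLOCK_RE.search(source)
condition = m.group(2).strip()
body = m.group(3)

print(condition)
print(body)
```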