What is the algorithm for parsing expressions in infix notation? - php

I would like to parse boolean expressions in PHP. As in:
A and B or C and (D or F or not G)
The terms can be considered simple identifiers. They will have a little structure, but the parser doesn't need to worry about that. It should just recognize the keywords and or not ( ). Everything else is a term.
I remember we wrote simple arithmetic expression evaluators at school, but I don't remember how it was done anymore. Nor do I know what keywords to look for in Google/SO.
A ready made library would be nice, but as I remember the algorithm was pretty simple so it might be fun and educational to re-implement it myself.

Recursive descent parsers are fun to write and easy to read. The first step is to write your grammar out.
Maybe this is the grammar you want.
expr = and_expr ('or' and_expr)*
and_expr = not_expr ('and' not_expr)*
not_expr = simple_expr | 'not' not_expr
simple_expr = term | '(' expr ')'
Turning this into a recursive descent parser is super easy. Just write one function per nonterminal.
def expr():
    x = and_expr()
    while peek() == 'or':
        consume('or')
        y = and_expr()
        x = OR(x, y)
    return x

def and_expr():
    x = not_expr()
    while peek() == 'and':
        consume('and')
        y = not_expr()
        x = AND(x, y)
    return x

def not_expr():
    if peek() == 'not':
        consume('not')
        x = not_expr()
        return NOT(x)
    else:
        return simple_expr()

def simple_expr():
    t = peek()
    if t == '(':
        consume('(')
        result = expr()
        consume(')')
        return result
    elif is_term(t):
        consume(t)
        return TERM(t)
    else:
        raise SyntaxError("expected term or (")
This isn't complete. You have to provide a little more code:
Input functions. consume, peek, and is_term are functions you provide. They'll be easy to implement using regular expressions. consume(s) reads the next token of input and throws an error if it doesn't match s. peek() simply returns a peek at the next token without consuming it. is_term(s) returns true if s is a term.
Output functions. OR, AND, NOT, and TERM are called each time a piece of the expression is successfully parsed. They can do whatever you want.
Wrapper function. Instead of just calling expr directly, you'll want to write a little wrapper function that initializes the variables used by consume and peek, then calls expr, and finally checks to make sure there's no leftover input that didn't get consumed.
Even with all this, it's still a tiny amount of code. In Python, the complete program is 84 lines, and that includes a few tests.
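Since the question asks for PHP, here is a rough PHP sketch of the same grammar. The names tokenize(), BoolParser and evaluate() are made up for this example, and instead of calling OR/AND/NOT/TERM it simply evaluates the expression against an array of term values, which is only one of the things the output functions could do.

```php
<?php
// Sketch only: tokenize(), BoolParser and evaluate() are illustrative names.
function tokenize(string $src): array {
    // A token is one parenthesis, or a run of non-space, non-parenthesis characters.
    preg_match_all('/[()]|[^\s()]+/', $src, $m);
    return $m[0];
}

final class BoolParser {
    private array $tokens;
    private array $vars;
    private int $pos = 0;

    public function __construct(array $tokens, array $vars) {
        $this->tokens = $tokens;
        $this->vars = $vars;
    }

    // Wrapper: parse one expression and make sure no input is left over.
    public function parse(): bool {
        $value = $this->expr();
        if ($this->peek() !== null) {
            throw new RuntimeException("unexpected token '{$this->peek()}'");
        }
        return $value;
    }

    private function peek(): ?string {
        return $this->tokens[$this->pos] ?? null;
    }

    private function consume(string $expected): void {
        if ($this->peek() !== $expected) {
            throw new RuntimeException("expected '$expected'");
        }
        $this->pos++;
    }

    // expr = and_expr ('or' and_expr)*
    private function expr(): bool {
        $x = $this->andExpr();
        while ($this->peek() === 'or') {
            $this->consume('or');
            $y = $this->andExpr(); // parse first, so no input is skipped by short-circuiting
            $x = $x || $y;
        }
        return $x;
    }

    // and_expr = not_expr ('and' not_expr)*
    private function andExpr(): bool {
        $x = $this->notExpr();
        while ($this->peek() === 'and') {
            $this->consume('and');
            $y = $this->notExpr();
            $x = $x && $y;
        }
        return $x;
    }

    // not_expr = simple_expr | 'not' not_expr
    private function notExpr(): bool {
        if ($this->peek() === 'not') {
            $this->consume('not');
            return !$this->notExpr();
        }
        return $this->simpleExpr();
    }

    // simple_expr = term | '(' expr ')'
    private function simpleExpr(): bool {
        $t = $this->peek();
        if ($t === '(') {
            $this->consume('(');
            $result = $this->expr();
            $this->consume(')');
            return $result;
        }
        if ($t === null || in_array($t, ['and', 'or', 'not', ')'], true)) {
            throw new RuntimeException('expected term or (');
        }
        $this->consume($t);
        if (!array_key_exists($t, $this->vars)) {
            throw new RuntimeException("unknown term '$t'");
        }
        return (bool) $this->vars[$t];
    }
}

function evaluate(string $src, array $vars): bool {
    return (new BoolParser(tokenize($src), $vars))->parse();
}
```

Calling evaluate('A and B or C and (D or F or not G)', $values) with an array that maps each term to true or false returns the boolean result; a malformed expression throws a RuntimeException.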

Why not just use the PHP parser?
$terms=array('and','or','not','A','B','C','D'...);
$values=array('*','+','!',1,1,0,0,1....);
$expression="A and B or C and (D or F or not G)";
$expression=str_replace($terms, $values, $expression);
$expression=preg_replace('/[^-+*!10() ]/', '', $expression);
$result=eval("return $expression;");
The second replacement is only required if you need to prevent code injection: it strips every character that is not an operator, a parenthesis, 0, 1 or a space before the string reaches eval. You get the idea.
C.

I'd go with Pratt parser. It's almost like recursive descent but smarter :) A decent explanation by Douglas Crockford (of JSLint fame) here.
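To give a feel for the idea, here is a small PHP sketch of a Pratt-style parser for the and/or/not grammar from the question. Everything in it (pratt_parse, parse_boolean, the binding-power numbers) is made up for illustration and is not taken from Crockford's article. Each operator gets a binding power, and the loop keeps absorbing infix operators as long as they bind tighter than what the caller asked for. The result is a prefix-notation string so the grouping is easy to see.

```php
<?php
// Sketch only: names and binding powers are illustrative assumptions.
const INFIX_POWER = ['or' => 10, 'and' => 20];
const PREFIX_POWER = ['not' => 30];

// Parses $tokens starting at $pos and returns the tree as a prefix-notation string.
function pratt_parse(array $tokens, int &$pos, int $minPower = 0): string {
    $tok = $tokens[$pos] ?? null;
    if ($tok === null) {
        throw new RuntimeException('unexpected end of input');
    }
    $pos++;

    // Prefix position: 'not', a parenthesised group, or a plain term.
    $prefixPower = PREFIX_POWER[$tok] ?? 0;
    if ($prefixPower > 0) {
        $left = '(' . $tok . ' ' . pratt_parse($tokens, $pos, $prefixPower) . ')';
    } elseif ($tok === '(') {
        $left = pratt_parse($tokens, $pos, 0);
        if (($tokens[$pos] ?? null) !== ')') {
            throw new RuntimeException("expected ')'");
        }
        $pos++;
    } elseif ($tok === ')' || (INFIX_POWER[$tok] ?? 0) > 0) {
        throw new RuntimeException("unexpected '$tok'");
    } else {
        $left = $tok;
    }

    // Infix position: absorb operators that bind tighter than $minPower.
    while (true) {
        $op = $tokens[$pos] ?? '';
        $power = INFIX_POWER[$op] ?? 0;
        if ($power <= $minPower) {
            break;
        }
        $pos++;
        // Passing the operator's own power makes it left-associative.
        $right = pratt_parse($tokens, $pos, $power);
        $left = "($op $left $right)";
    }
    return $left;
}

function parse_boolean(string $src): string {
    preg_match_all('/[()]|[^\s()]+/', $src, $m);
    $pos = 0;
    $tree = pratt_parse($m[0], $pos);
    if ($pos !== count($m[0])) {
        throw new RuntimeException("unexpected '{$m[0][$pos]}'");
    }
    return $tree;
}
```

Adding an operator then only means adding one entry to a table, which is the main attraction compared with writing one function per precedence level.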

Dijkstra's shunting yard algorithm is the traditional one for going from infix to postfix/graph.

I've implemented the shunting yard algorithm as suggested by plinth. However, this algorithm just gives you the postfix notation, aka reverse Polish notation (RPN). You still have to evaluate it, but that's quite easy once you have the expression in RPN (described for instance here).
The code below might not be good PHP style, my PHP knowledge is somewhat limited. It should be enough to get the idea though.
$operators = array("and", "or", "not");
$num_operands = array("and" => 2, "or" => 2, "not" => 1);
$parenthesis = array("(", ")");
function is_operator($token) {
global $operators;
return in_array($token, $operators);
}
function is_right_parenthesis($token) {
global $parenthesis;
return $token == $parenthesis[1];
}
function is_left_parenthesis($token) {
global $parenthesis;
return $token == $parenthesis[0];
}
function is_parenthesis($token) {
return is_right_parenthesis($token) || is_left_parenthesis($token);
}
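// check whether the precedence of $a is less than or equal to that of $b
function is_precedence_less_or_equal($a, $b) {
// an incoming "not" is a prefix operator, so it never pops anything off the stack
if ($a == "not")
return false;
// "not" on the stack binds tightest, so it is always popped first
if ($b == "not")
return true;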
if ($a == "or" and $b == "and")
return true;
if ($a == "and" and $b == "or")
return false;
// otherwise they're equal
return true;
}
function shunting_yard($input_tokens) {
$stack = array();
$output_queue = array();
foreach ($input_tokens as $token) {
if (is_operator($token)) {
while (count($stack) > 0 && is_operator($stack[count($stack)-1]) && is_precedence_less_or_equal($token, $stack[count($stack)-1])) {
$o2 = array_pop($stack);
array_push($output_queue, $o2);
}
array_push($stack, $token);
} else if (is_parenthesis($token)) {
if (is_left_parenthesis($token)) {
array_push($stack, $token);
} else {
while (count($stack) > 0 && !is_left_parenthesis($stack[count($stack)-1])) {
array_push($output_queue, array_pop($stack));
}
if (count($stack) == 0) {
echo ("parse error");
die();
}
$lp = array_pop($stack);
}
} else {
array_push($output_queue, $token);
}
}
while (count($stack) > 0) {
$op = array_pop($stack);
if (is_parenthesis($op))
die("mismatched parenthesis");
array_push($output_queue, $op);
}
return $output_queue;
}
function str2bool($s) {
if ($s == "true")
return true;
if ($s == "false")
return false;
die('$s doesn\'t contain valid boolean string: '.$s."\n");
}
function apply_operator($operator, $a, $b) {
if (is_string($a))
$a = str2bool($a);
if (!is_null($b) and is_string($b))
$b = str2bool($b);
if ($operator == "and")
return $a and $b;
else if ($operator == "or")
return $a or $b;
else if ($operator == "not")
return ! $a;
else die("unknown operator `$operator'");
}
function get_num_operands($operator) {
global $num_operands;
return $num_operands[$operator];
}
function is_unary($operator) {
return get_num_operands($operator) == 1;
}
function is_binary($operator) {
return get_num_operands($operator) == 2;
}
function eval_rpn($tokens) {
$stack = array();
foreach ($tokens as $t) {
if (is_operator($t)) {
if (is_unary($t)) {
$o1 = array_pop($stack);
$r = apply_operator($t, $o1, null);
array_push($stack, $r);
} else { // binary
$o1 = array_pop($stack);
$o2 = array_pop($stack);
$r = apply_operator($t, $o1, $o2);
array_push($stack, $r);
}
} else { // operand
array_push($stack, $t);
}
}
if (count($stack) != 1)
die("invalid token array");
return $stack[0];
}
// $input = array("A", "and", "B", "or", "C", "and", "(", "D", "or", "F", "or", "not", "G", ")");
$input = array("false", "and", "true", "or", "true", "and", "(", "false", "or", "false", "or", "not", "true", ")");
$tokens = shunting_yard($input);
$result = eval_rpn($tokens);
foreach($input as $t)
echo $t." ";
echo "==> ".($result ? "true" : "false")."\n";
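The code above expects the input to be split into tokens already. A possible way to get from the original string to such an array is a single preg_match_all call; tokenize_boolean() is a hypothetical helper name, and the rule that a term is any run of characters without whitespace or parentheses is an assumption.

```php
<?php
// Sketch only: tokenize_boolean() is a hypothetical helper, not part of the code above.
function tokenize_boolean(string $expression): array {
    // A token is one parenthesis, or a run of characters that are
    // neither whitespace nor parentheses.
    preg_match_all('/[()]|[^\s()]+/', $expression, $matches);

    // Lower-case the keywords so "AND" and "and" are treated alike; terms keep their case.
    return array_map(function (string $token): string {
        $lower = strtolower($token);
        return in_array($lower, ['and', 'or', 'not'], true) ? $lower : $token;
    }, $matches[0]);
}
```

The resulting array can be passed to shunting_yard() in place of the hand-written $input.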

You could use an LR parser to build a parse tree and then evaluate the tree to obtain the result. A detailed description including examples can be found in Wikipedia. If you haven't coded it yourself already I will write a small example tonight.

The simplest way is to use regexes that convert your expression into an expression in PHP syntax and then use eval, as suggested by symcbean. But I'm not sure you would want to use that in production code.
The other way is to code your own simple recursive descent parser. It isn't as hard as it might sound. For a simple grammar such as yours (boolean expressions), you can easily code one from scratch. You can also use a parser generator similar to ANTLR for PHP; searching for a PHP parser generator would probably turn up something.

Related

Is it possible to check a mathematical expression string?

I want to check that all brackets open and close properly, and also whether the given string is a mathematical expression or not.
Example:
$str1 = "(A1+A2*A3)+A5+(B3^B5)*(C1*((A3/C2)+(B2+C1)))";
$str2 = "(A1+A2*A3)+A5)*C1+(B3^B5*(C1*((A3/C2)+(B2+C1)))";
$str3 = "(A1+A2*A3)+A5++(B2+C1)))";
$str4 = "(A1+A2*A3)+A5+(B3^B5)*(C1*(A3/C2)+(B2+C1))";
In the example above, $str1 and $str4 are valid strings.
Please help.
You'll need some kind of parser. I don't think you can handle this with a regular expression, because you have to check the number and the order of the parentheses, including nested ones. The class below is a quick PHP port of a Python-based parentheses validator for math expressions that I found:
class MathExpression {
private static $parentheses_open = array('(', '{', '[');
private static $parentheses_close = array(')', '}', ']');
protected static function getParenthesesType( $c ) {
if(in_array($c,MathExpression::$parentheses_open)) {
return array_search($c, MathExpression::$parentheses_open);
} elseif(in_array($c,MathExpression::$parentheses_close)) {
return array_search($c, MathExpression::$parentheses_close);
} else {
return false;
}
}
public static function validate( $expression ) {
$size = strlen( $expression );
$tmp = array();
for ($i=0; $i<$size; $i++) {
if(in_array($expression[$i],MathExpression::$parentheses_open)) {
$tmp[] = $expression[$i];
} elseif(in_array($expression[$i],MathExpression::$parentheses_close)) {
if (count($tmp) == 0 ) {
return false;
}
if(MathExpression::getParenthesesType(array_pop($tmp))
!= MathExpression::getParenthesesType($expression[$i])) {
return false;
}
}
}
if (count($tmp) == 0 ) {
return true;
} else {
return false;
}
}
}
//Mathematical expressions to validate
$tests = array(
'(A1+A2*A3)+A5+(B3^B5)*(C1*((A3/C2)+(B2+C1)))',
'(A1+A2*A3)+A5)*C1+(B3^B5*(C1*((A3/C2)+(B2+C1)))',
'(A1+A2*A3)+A5++(B2+C1)))',
'(A1+A2*A3)+A5+(B3^B5)*(C1*(A3/C2)+(B2+C1))'
);
// running the tests...
foreach($tests as $test) {
$isValid = MathExpression::validate( $test );
echo 'test of: '. $test .'<br>';
var_dump($isValid);
}
Well, I suppose that what you are looking for is a context-free grammar or a pushdown automaton. It cannot be done using regular expressions alone (at least there is no easy or nice way).
That is because you are dealing with nested structures. An idea for an implementation can be found here: Regular expression to detect semi-colon terminated C++ for & while loops
Use a regular expression that tells you how many opening brackets and closing brackets there are,
then compare the two counts: if they are equal, your expression is right, otherwise it is wrong.
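Comparing the two counts alone would accept a string like ")(", so the order has to be tracked as well. Below is a rough sketch that walks the tokens once, keeps a running nesting depth and also checks that operands and operators alternate. is_valid_expression() is a made-up name; the sketch assumes operands are runs of letters, digits and dots (as in the question's examples) and does not handle unary minus.

```php
<?php
// Sketch only: is_valid_expression() is a hypothetical name. Assumes operands
// are runs of letters, digits and dots, and that unary minus is not needed.
function is_valid_expression(string $expr): bool {
    // Reject any character that cannot be part of an operand, operator or bracket.
    if (preg_match('/[^A-Za-z0-9.\s()+\-*\/^]/', $expr)) {
        return false;
    }
    preg_match_all('/[A-Za-z0-9.]+|[()+\-*\/^]/', $expr, $m);

    $depth = 0;            // number of currently open brackets
    $expectOperand = true; // true at the start and after an operator or '('
    foreach ($m[0] as $tok) {
        if ($tok === '(') {
            if (!$expectOperand) return false;                 // e.g. "A1("
            $depth++;
        } elseif ($tok === ')') {
            if ($expectOperand || $depth === 0) return false;  // "()", "+)" or unmatched ")"
            $depth--;
        } elseif (in_array($tok, ['+', '-', '*', '/', '^'], true)) {
            if ($expectOperand) return false;                  // e.g. "++"
            $expectOperand = true;
        } else {
            if (!$expectOperand) return false;                 // two operands in a row
            $expectOperand = false;
        }
    }
    // Valid only if every bracket was closed and the string does not end on an operator.
    return $depth === 0 && !$expectOperand;
}
```

With the four strings from the question this returns true for $str1 and $str4 and false for $str2 and $str3.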

Can I store a logical comparison in a string?

I am trying to modularise a lengthy if..else function.
$condition = "$a < $b";
if($condition)
{
$c++;
}
Is there any way of translating the literal string into a logical expression?
I am trying to modularise a lengthy if..else function.
You don't need to put the condition in a string for that, just store the boolean true or false:
$condition = ($a < $b);
if($condition)
{
$c++;
}
the values of $a and $b may change between definition of $condition and its usage
One solution would be a Closure (assuming that definition and usage are happening in the same scope):
$condition = function() use (&$a, &$b) {
return $a < $b;
};
$a = 1;
$b = 2;
if ($condition()) {
echo 'a is less than b';
}
But I don't know if this makes sense for you without remotely knowing what you are trying to accomplish.
Use a lambda if you know which variables are enough to determine the result:
$f = function ($a, $b) { return $a < $b; };
if ($f($x, $y)){}
you could do this using eval. not sure why you wouldn't just evaluate the condition immediately, though.
<?php
$a=0;
$b=1;
function resultofcondition()
{
global $a,$b;
return $a<$b;
}
if(resultofcondition()){
echo " You are dumb,";
}else{
echo " I am dumb,";
}
$a=1;
$b=0;
if(resultofcondition()){
echo " You were correct.";
}else{
echo " in your face.";
}
?>
Indeed thanks for commenting that out, was missing the GLOBAL parameter, for those who voted down, what would that code output? ¬_¬ w/e have fun xD
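If the condition really has to live in a string (for example because it comes from a config file), one option that avoids eval is to parse a very small "left operator right" format and look the operator up in a table of closures. This is only a sketch: evaluate_condition() and the format it accepts are assumptions made for this example, and it handles only a single comparison between variable names or integer literals.

```php
<?php
// Sketch only: evaluate_condition() and its "left op right" format are assumptions.
function evaluate_condition(string $condition, array $vars): bool {
    $operators = [
        '<'  => fn($l, $r) => $l < $r,
        '<=' => fn($l, $r) => $l <= $r,
        '>'  => fn($l, $r) => $l > $r,
        '>=' => fn($l, $r) => $l >= $r,
        '==' => fn($l, $r) => $l == $r,
        '!=' => fn($l, $r) => $l != $r,
    ];

    // Two-character operators come first so "<=" is not read as "<".
    if (!preg_match('/^\s*(\w+)\s*(<=|>=|==|!=|<|>)\s*(\w+)\s*$/', $condition, $m)) {
        throw new InvalidArgumentException("cannot parse condition: $condition");
    }

    // An operand is either a numeric literal or the name of an entry in $vars.
    $resolve = function (string $name) use ($vars) {
        if (is_numeric($name)) {
            return $name + 0;
        }
        if (!array_key_exists($name, $vars)) {
            throw new InvalidArgumentException("unknown variable: $name");
        }
        return $vars[$name];
    };

    return $operators[$m[2]]($resolve($m[1]), $resolve($m[3]));
}
```

Because the values are passed in at call time, the stored string can be evaluated again after $a and $b have changed.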

How to mathematically evaluate a string like "2-1" to produce "1"?

I was just wondering if PHP has a function that can take a string like 2-1 and produce the arithmetic result of it?
Or will I have to do this manually with explode() to get the values left and right of the arithmetic operator?
I know this question is old, but I came across it last night while searching for something that wasn't quite related, and every single answer here is bad. Not just bad, very bad. The examples I give here will be from a class that I created back in 2005 and spent the past few hours updating for PHP5 because of this question. Other systems do exist, and were around before this question was posted, so it baffles me why every answer here tells you to use eval, when the caution from PHP is:
The eval() language construct is very dangerous because it allows execution of arbitrary PHP code. Its use thus is discouraged. If you have carefully verified that there is no other option than to use this construct, pay special attention not to pass any user provided data into it without properly validating it beforehand.
Before I jump into the example: the class I will be using is available on either PHPClasses or GitHub. Both eos.class.php and stack.class.php are required, but they can be combined into the same file.
The reason for using a class like this is that it includes an infix to postfix (RPN) parser, and then an RPN solver. With these, you never have to use the eval function and open your system up to vulnerabilities. Once you have the classes, the following code is all that is needed to solve a simple (or more complex) equation such as your 2-1 example.
require_once "eos.class.php";
$equation = "2-1";
$eq = new eqEOS();
$result = $eq->solveIF($equation);
That's it! You can use a parser like this for most equations, however complicated and nested without ever having to resort to the 'evil eval'.
Because I really don't want this answer to have only my class in it, here are some other options. I am just familiar with my own since I've been using it for 8 years. ^^
Wolfram|Alpha API
Sage
A fairly bad parser
phpdicecalc
Not quite sure what happened to others that I had found previously - came across another one on GitHub before as well, unfortunately I didn't bookmark it, but it was related to large float operations that included a parser as well.
Anyways, I wanted to make sure an answer to solving equations in PHP on here wasn't pointing all future searchers to eval as this was at the top of a google search. ^^
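For readers who want to see what "infix to postfix, then solve the RPN" means without downloading anything, here is a compact sketch of the general technique. It is not the code of the eos class, and solve_infix() is a made-up name. It assumes only non-negative numbers, the four basic operators and parentheses (no unary minus, no functions).

```php
<?php
// Sketch only: solve_infix() is a hypothetical helper, not the API of the eos class.
function solve_infix(string $expression): float {
    $prec = ['+' => 1, '-' => 1, '*' => 2, '/' => 2];

    preg_match_all('/\d+(?:\.\d+)?|[-+*\/()]/', $expression, $m);
    // If joining the tokens does not reproduce the input (minus whitespace),
    // the input contained something that is not a number, operator or parenthesis.
    if (implode('', $m[0]) !== preg_replace('/\s+/', '', $expression)) {
        throw new InvalidArgumentException('unexpected character in expression');
    }

    // Step 1: shunting yard, infix tokens -> RPN queue.
    $output = [];
    $stack = [];
    foreach ($m[0] as $tok) {
        if (is_numeric($tok)) {
            $output[] = (float) $tok;
        } elseif (isset($prec[$tok])) {
            // Pop operators of higher or equal precedence (all four are left-associative).
            while ($stack && isset($prec[end($stack)]) && $prec[end($stack)] >= $prec[$tok]) {
                $output[] = array_pop($stack);
            }
            $stack[] = $tok;
        } elseif ($tok === '(') {
            $stack[] = $tok;
        } else { // ')'
            while ($stack && end($stack) !== '(') {
                $output[] = array_pop($stack);
            }
            if (!$stack) {
                throw new InvalidArgumentException('mismatched parentheses');
            }
            array_pop($stack); // discard the '('
        }
    }
    while ($stack) {
        $op = array_pop($stack);
        if ($op === '(') {
            throw new InvalidArgumentException('mismatched parentheses');
        }
        $output[] = $op;
    }

    // Step 2: evaluate the RPN queue with a value stack.
    $values = [];
    foreach ($output as $item) {
        if (is_float($item)) {
            $values[] = $item;
            continue;
        }
        if (count($values) < 2) {
            throw new InvalidArgumentException('operator is missing an operand');
        }
        $b = array_pop($values); // right operand
        $a = array_pop($values); // left operand
        switch ($item) {
            case '+': $values[] = $a + $b; break;
            case '-': $values[] = $a - $b; break;
            case '*': $values[] = $a * $b; break;
            case '/':
                if ($b == 0.0) {
                    throw new DivisionByZeroError('division by zero');
                }
                $values[] = $a / $b;
                break;
        }
    }
    if (count($values) !== 1) {
        throw new InvalidArgumentException('malformed expression');
    }
    return $values[0];
}
```

For the question's example, solve_infix('2-1') returns the float 1.0; nothing in the input string is ever executed as PHP.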
$operation='2-1';
eval("\$value = $operation;");
or
$value=eval("return ($operation);");
This is one of the cases where eval comes in handy:
$expression = '2 - 1';
eval( '$result = (' . $expression . ');' );
echo $result;
You can use BC Math arbitrary precision
echo bcsub('5', '4'); // 1
echo bcsub('1.234', '5'); // -3
echo bcsub('1.234', '5', 4); // -3.7660
http://www.php.net/manual/en/function.bcsub.php
In this forum someone made it without eval. Maybe you can try it? Credits to them, I just found it.
function calculate_string( $mathString ) {
$mathString = trim($mathString); // trim white spaces
$mathString = preg_replace('/[^0-9+\-*\/() ]/', '', $mathString); // remove any non-number chars, except math operators (ereg_replace() was removed in PHP 7.0)
$compute = create_function("", "return (" . $mathString . ");" ); // note: create_function() was removed in PHP 8.0
return 0 + $compute();
}
$string = " (1 + 1) * (2 + 2)";
echo calculate_string($string); // outputs 8
Also see this answer here: Evaluating a string of simple mathematical expressions
Please note this solution does NOT conform to BODMAS, but you can use brackets in your evaluation string to overcome this.
function callback1($m) {
return string_to_math($m[1]);
}
function callback2($n,$m) {
$o=$m[0];
$m[0]=' ';
return $o=='+' ? $n+$m : ($o=='-' ? $n-$m : ($o=='*' ? $n*$m : $n/$m));
}
function string_to_math($s){
while ($s != ($t = preg_replace_callback('/\(([^()]*)\)/','callback1',$s))) $s=$t;
preg_match_all('![-+/*].*?[\d.]+!', "+$s", $m);
return array_reduce($m[0], 'callback2');
}
echo string_to_math('2-1'); //returns 1
create_function was deprecated and I badly needed a lightweight alternative for evaluating a string as math. After spending a couple of hours on it, I came up with the following. By the way, I did not care about parentheses, as I don't need them in my case. I just needed something that handles operator precedence correctly.
Update: I have added parentheses support as well. Please check this project Evaluate Math String
function evalAsMath($str) {
$error = false;
$div_mul = false;
$add_sub = false;
$result = 0;
$str = preg_replace('/[^\d\.\+\-\*\/]/i','',$str);
$str = rtrim(trim($str, '/*+'),'-');
if ((strpos($str, '/') !== false || strpos($str, '*') !== false)) {
$div_mul = true;
$operators = array('*','/');
while(!$error && $operators) {
$operator = array_pop($operators);
while($operator && strpos($str, $operator) !== false) {
if ($error) {
break;
}
$regex = '/([\d\.]+)\\'.$operator.'(\-?[\d\.]+)/';
preg_match($regex, $str, $matches);
if (isset($matches[1]) && isset($matches[2])) {
if ($operator=='+') $result = (float)$matches[1] + (float)$matches[2];
if ($operator=='-') $result = (float)$matches[1] - (float)$matches[2];
if ($operator=='*') $result = (float)$matches[1] * (float)$matches[2];
if ($operator=='/') {
if ((float)$matches[2]) {
$result = (float)$matches[1] / (float)$matches[2];
} else {
$error = true;
}
}
$str = preg_replace($regex, $result, $str, 1);
$str = str_replace(array('++','--','-+','+-'), array('+','+','-','-'), $str);
} else {
$error = true;
}
}
}
}
if (!$error && (strpos($str, '+') !== false || strpos($str, '-') !== false)) {
$add_sub = true;
preg_match_all('/([\d\.]+|[\+\-])/', $str, $matches);
if (isset($matches[0])) {
$result = 0;
$operator = '+';
$tokens = $matches[0];
$count = count($tokens);
for ($i=0; $i < $count; $i++) {
if ($tokens[$i] == '+' || $tokens[$i] == '-') {
$operator = $tokens[$i];
} else {
$result = ($operator == '+') ? ($result + (float)$tokens[$i]) : ($result - (float)$tokens[$i]);
}
}
}
}
if (!$error && !$div_mul && !$add_sub) {
$result = (float)$str;
}
return $error ? 0 : $result;
}
Demo: http://sandbox.onlinephpfunctions.com/code/fdffa9652b748ac8c6887d91f9b10fe62366c650
Here is a somewhat verbose bit of code I rolled for another SO question. It does conform to BOMDAS without eval(), but is not equipped to do complex/higher-order/parenthetical expressions. This library-free approach pulls the expression apart and systematically reduces the array of components until all of the operators are removed. It certainly works for your sample expression: 2-1 ;)
preg_match() checks that each operator has a numeric substring on each side.
preg_split() divides the string into an array of alternating numbers and operators.
array_search() finds the index of the targeted operator, while it exists in the array.
array_splice() replaces the operator element and the elements on either side of it with a new element that contains the mathematical result of the three elements removed.
** updated to allow negative numbers **
Code: (Demo)
$expression = "-11+3*1*4/-6-12";
if (!preg_match('~^-?\d*\.?\d+([*/+-]-?\d*\.?\d+)*$~', $expression)) {
echo "invalid expression";
} else {
$components = preg_split('~(?<=\d)([*/+-])~', $expression, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
var_export($components); // ['-11','+','3','*','1','*','4','/','-6','-','12']
while (($index = array_search('*',$components)) !== false) {
array_splice($components, $index - 1, 3, $components[$index - 1] * $components[$index + 1]);
var_export($components);
// ['-11','+','3','*','4','/','-6','-','12']
// ['-11','+','12','/','-6','-','12']
}
while (($index = array_search('/', $components)) !== false) {
array_splice($components, $index - 1, 3, $components[$index - 1] / $components[$index + 1]);
var_export($components); // ['-11','+','-2','-','12']
}
while (($index = array_search('+', $components)) !== false) {
array_splice($components, $index - 1, 3, $components[$index - 1] + $components[$index + 1]);
var_export($components); // ['-13','-','12']
}
while (($index = array_search('-', $components)) !== false) {
array_splice($components, $index - 1, 3, $components[$index - 1] - $components[$index + 1]);
var_export($components); // [-25]
}
echo current($components); // -25
}
Here is a demo of the BOMDAS version that uses php's pow() when ^ is encountered between two numbers (positive or negative).
I don't think I'll ever bother writing a version that handles parenthetical expressions ... but we'll see how bored I get.
You can do it by eval function.
Here is how you can do this.
<?php
$exp = "2-1;";
$res = eval("return $exp");
echo $res; // it will return 1
?>
You can use $res anywhere in the code to get the result.
You can use it in form by changing $exp value.
Here is an example of creating a web calculator.
<?php
if (isset($_POST['submit'])) {
$exp = $_POST['calc'];
$res = eval("return $exp;"); //Here we have to add ; after $exp to make a complete code.
echo $res;
}
?>
// html code
<form method="post">
<input type="text" name="calc">
<input type="submit" name="submit">
</form>

detect a string contained by another discontinuously

Recently I've been working on a bad-content filter (for things such as advertising posts) for a BBS, and I wrote a function to detect whether a string is contained in another string non-contiguously. The code is below:
$str = 'helloguys';
$substr1 = 'hlu';
$substr2 = 'elf';
function detect($a,$b) //function that detect a in b
{
$c = '';
for($i=0;$i<strlen($a);$i++)
{
for($j=0;$j<strlen($b);$j++)
{
if($a[$i] == $b[$j])
{
$b=substr($b,$j+1);
$c .=$a[$i];
break;
}
}
}
if($c == $a) return true;
else return false;
}
var_dump(detect($substr1,$str)); //true
var_dump(detect($substr2,$str)); //false
Since the filter runs before the users' posts are published, I think efficiency is important here. Is there a better solution? Thanks!
a faster way to do this is converting $a to a regular expression and match it with $b, so that you just leave the optimization to the PCRE module itself which is written in C code.
for example:
detect("hlu",$b) is equal to preg_match("/h.*l.*u/", $b)
(detect("hlu",$b) || detect("elf",$b)) is equal to preg_match("/(h.*l.*u|e.*l.*f)/", $b)
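Building the pattern by hand for every word gets tedious, so here is a small sketch that derives it from the needle. detect_regex() is a made-up name. It quotes every character with preg_quote() so that characters like "." or "*" in the needle are matched literally, and it assumes single-byte characters because str_split() works on bytes.

```php
<?php
// Sketch only: detect_regex() is an illustrative name. Assumes single-byte characters.
function detect_regex(string $needle, string $haystack): bool {
    if ($needle === '') {
        return true; // the empty string is contained in everything
    }
    // Quote each character so regex metacharacters in the needle are matched literally.
    $parts = array_map(fn(string $ch): string => preg_quote($ch, '/'), str_split($needle));
    // "hlu" becomes /h.*l.*u/s; the "s" modifier lets "." also match newlines in a post.
    $pattern = '/' . implode('.*', $parts) . '/s';
    return preg_match($pattern, $haystack) === 1;
}
```

Whether this is actually faster than the loop depends on the input, so it is worth measuring on real posts before relying on it.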
not sure why you would want to do this. but i was bored
function detect( $a,$b ) {
return count( array_intersect( str_split($b), str_split($a) ) ) == strlen($b);
}
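Note that a set-based check such as array_intersect() does not take the order of the characters into account. If order matters, as in the question, a single pass over the longer string with one index into the shorter one is enough. The sketch below uses a made-up name, is_subsequence(), and again assumes single-byte characters.

```php
<?php
// Sketch only: is_subsequence() is an illustrative name. Assumes single-byte characters.
function is_subsequence(string $needle, string $haystack): bool {
    $n = strlen($needle);
    if ($n === 0) {
        return true;
    }
    $i = 0; // index of the next needle character we are looking for
    $len = strlen($haystack);
    for ($j = 0; $j < $len; $j++) {
        // Advance in the needle only when the current haystack character matches.
        if ($haystack[$j] === $needle[$i] && ++$i === $n) {
            return true; // every needle character was found, in order
        }
    }
    return false;
}
```

This touches each character of the haystack at most once and makes no copies of it, unlike the substr() call inside the original loop.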

Regex to parse define() contents, possible?

I am very new to regex, and this is way too advanced for me. So I am asking the experts over here.
Problem
I would like to retrieve the constants / values from a php define()
DEFINE('TEXT', 'VALUE');
Basically I would like a regex to be able to return the name of constant, and the value of constant from the above line. Just TEXT and VALUE . Is this even possible?
Why I need it? I am dealing with language file and I want to get all couples (name, value) and put them in array. I managed to do it with str_replace() and trim() etc.. but this way is long and I am sure it could be made easier with single line of regex.
Note: The VALUE may contain escaped single quotes as well. example:
DEFINE('TEXT', 'J\'ai');
I hope I am not asking for something too complicated. :)
Regards
For any kind of grammar-based parsing, regular expressions are usually an awful solution. Even simple grammars (like arithmetic) have nesting, and it's on nesting (in particular) that regular expressions just fall over.
Fortunately PHP provides a far, far better solution for you by giving you access to the same lexical analyzer used by the PHP interpreter via the token_get_all() function. Give it a character stream of PHP code and it'll parse it into tokens ("lexemes"), which you can do a bit of simple parsing on with a pretty simple finite state machine.
Run this program (it's run as test.php so it tries it on itself). The file is deliberately formatted badly so you can see it handles that with ease.
<?php
define('CONST1', 'value' );
define (CONST2, 'value2');
define( 'CONST3', time());
define('define', 'define');
define("test", VALUE4);
define('const5', //
'weird declaration'
) ;
define('CONST7', 3.14);
define ( /* comment */ 'foo', 'bar');
$defn = 'blah';
define($defn, 'foo');
define( 'CONST4', define('CONST5', 6));
header('Content-Type: text/plain');
$defines = array();
$state = 0;
$key = '';
$value = '';
$file = file_get_contents('test.php');
$tokens = token_get_all($file);
$token = reset($tokens);
while ($token) {
// dump($state, $token);
if (is_array($token)) {
if ($token[0] == T_WHITESPACE || $token[0] == T_COMMENT || $token[0] == T_DOC_COMMENT) {
// do nothing
} else if ($token[0] == T_STRING && strtolower($token[1]) == 'define') {
$state = 1;
} else if ($state == 2 && is_constant($token[0])) {
$key = $token[1];
$state = 3;
} else if ($state == 4 && is_constant($token[0])) {
$value = $token[1];
$state = 5;
}
} else {
$symbol = trim($token);
if ($symbol == '(' && $state == 1) {
$state = 2;
} else if ($symbol == ',' && $state == 3) {
$state = 4;
} else if ($symbol == ')' && $state == 5) {
$defines[strip($key)] = strip($value);
$state = 0;
}
}
$token = next($tokens);
}
foreach ($defines as $k => $v) {
echo "'$k' => '$v'\n";
}
function is_constant($token) {
return $token == T_CONSTANT_ENCAPSED_STRING || $token == T_STRING ||
$token == T_LNUMBER || $token == T_DNUMBER;
}
function dump($state, $token) {
if (is_array($token)) {
echo "$state: " . token_name($token[0]) . " [$token[1]] on line $token[2]\n";
} else {
echo "$state: Symbol '$token'\n";
}
}
function strip($value) {
return preg_replace('!^([\'"])(.*)\1$!', '$2', $value);
}
?>
Output:
'CONST1' => 'value'
'CONST2' => 'value2'
'CONST3' => 'time'
'define' => 'define'
'test' => 'VALUE4'
'const5' => 'weird declaration'
'CONST7' => '3.14'
'foo' => 'bar'
'CONST5' => '6'
This is basically a finite state machine that looks for the pattern:
function name ('define')
open parenthesis
constant
comma
constant
close parenthesis
in the lexical stream of a PHP source file and treats the two constants as a (name,value) pair. In doing so it handles nested define() statements (as per the results) and ignores whitespace and comments as well as working across multiple lines.
Note: I've deliberately made it ignore the case where functions and variables are used as constant names or values, but you can extend it to handle that if you wish.
It's also worth pointing out that PHP is quite forgiving when it comes to strings. They can be declared with single quotes, double quotes or (in certain circumstances) with no quotes at all. An unquoted string can (as pointed out by Gumbo) be an ambiguous reference to a constant, and you have no way of knowing which it is (no guaranteed way anyway), giving you the choice of:
Ignoring that style of strings (T_STRING);
Seeing if a constant has already been declared with that name and replacing its value. There's no way you can know what other files have been called though, nor can you process any defines that are conditionally created, so you can't say with any certainty if anything is definitely a constant or not, nor what value it has; or
You can just live with the possibility that these might be constants (which is unlikely) and just treat them as strings.
Personally I would go for (1) then (3).
This is possible, but I would rather use get_defined_constants(). But make sure all your translations have something in common (like all translations starting with T), so you can tell them apart from other constants.
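As a rough illustration of that approach: get_defined_constants(true) groups the constants by category, and the ones created with define() end up under the 'user' key. The LANG_ prefix, the sample constants and the helper name constants_with_prefix() below are all made up for this sketch. In practice the language file would be included first.

```php
<?php
// Sketch only: the LANG_ prefix, the sample constants and constants_with_prefix()
// are made-up examples. In real code the defines would come from the language file.
define('LANG_HELLO', 'J\'ai');
define('LANG_BYE', 'Au revoir');
define('OTHER_SETTING', 1);

function constants_with_prefix(string $prefix): array {
    // With true as the argument, constants are grouped by category;
    // everything created via define() is listed under 'user'.
    $user = get_defined_constants(true)['user'] ?? [];
    return array_filter(
        $user,
        fn(string $name): bool => strpos($name, $prefix) === 0,
        ARRAY_FILTER_USE_KEY
    );
}
```

Calling constants_with_prefix('LANG_') then gives the (name, value) pairs as an array, with escaped quotes already resolved by PHP itself.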
Try this regular expression to find the define calls:
/\bdefine\(\s*("(?:[^"\\]+|\\(?:\\\\)*.)*"|'(?:[^'\\]+|\\(?:\\\\)*.)*')\s*,\s*("(?:[^"\\]+|\\(?:\\\\)*.)*"|'(?:[^'\\]+|\\(?:\\\\)*.)*')\s*\);/is
So:
$pattern = '/\\bdefine\\(\\s*("(?:[^"\\\\]+|\\\\(?:\\\\\\\\)*.)*"|\'(?:[^\'\\\\]+|\\\\(?:\\\\\\\\)*.)*\')\\s*,\\s*("(?:[^"\\\\]+|\\\\(?:\\\\\\\\)*.)*"|\'(?:[^\'\\\\]+|\\\\(?:\\\\\\\\)*.)*\')\\s*\\);/is';
$str = '<?php define(\'foo\', \'bar\'); define("define(\\\'foo\\\', \\\'bar\\\')", "define(\'foo\', \'bar\')"); ?>';
preg_match_all($pattern, $str, $matches, PREG_SET_ORDER);
var_dump($matches);
I know that eval is evil. But that’s the best way to evaluate the string expressions:
$constants = array();
foreach ($matches as $match) {
eval('$constants['.$match[1].'] = '.$match[2].';');
}
var_dump($constants);
You might not need to go overboard with the regex complexity - something like this will probably suffice
/DEFINE\('(.*?)',\s*'(.*)'\);/
Here's a PHP sample showing how you might use it
$lines=file("myconstants.php");
foreach($lines as $line) {
$matches=array();
if (preg_match('/DEFINE\(\'(.*?)\',\s*\'(.*)\'\);/i', $line, $matches)) {
$name=$matches[1];
$value=$matches[2];
echo "$name = $value\n";
}
}
Not every problem with text should be solved with a regexp, so I'd suggest you state what you want to achieve and not how.
So, instead of using php's parser which is not really useful, or instead of using a completely undebuggable regexp, why not write a simple parser?
<?php
$str = "define('nam\\'e', 'va\\\\\\'lue');\ndefine('na\\\\me2', 'value\\'2');\nDEFINE('a', 'b');";
function getDefined($str) {
$lines = array();
preg_match_all('#^define[(][ ]*(.*?)[ ]*[)];$#mi', $str, $lines);
$res = array();
foreach ($lines[1] as $cnt) {
$p = 0;
$key = parseString($cnt, $p);
// Skip comma
$p++;
// Skip space
while ($cnt[$p] == " ") {
$p++;
}
$value = parseString($cnt, $p);
$res[$key] = $value;
}
return $res;
}
function parseString($s, &$p) {
$quotechar = $s[$p];
if (! in_array($quotechar, array("'", '"'))) {
throw new Exception("Invalid quote character '" . $quotechar . "', input is " . var_export($s, true) . " # " . $p);
}
$len = strlen($s);
$quoted = false;
$res = "";
for ($p++;$p < $len;$p++) {
if ($quoted) {
$quoted = false;
$res .= $s[$p];
} else {
if ($s[$p] == "\\") {
$quoted = true;
continue;
}
if ($s[$p] == $quotechar) {
$p++;
return $res;
}
$res .= $s[$p];
}
}
throw new Exception("Premature end of line");
}
var_dump(getDefined($str));
Output:
array(3) {
["nam'e"]=>
string(7) "va\'lue"
["na\me2"]=>
string(7) "value'2"
["a"]=>
string(1) "b"
}
