Validate user inputted PHP code before passing it to eval() - php

Before passing a string to eval() I would like to make sure the syntax is correct and allow:
Two functions: a() and b()
Four operators: /*-+
Brackets: ()
Numbers: 1.2, -1, 1
How can I do this, maybe it has something to do with PHP Tokenizer?
I'm actually trying to make a simple formula interpreter so a() and b() will be replaced by ln() and exp(). I don't want to write a tokenizer and parser from scratch.

As far as validation is concerned, the following character tokens are valid:
operator: [/*+-]
funcs: (a\(|b\()
brackets: [()]
numbers: \d+(\.\d+)?
space: [ ]
A simple validation could then check whether the input string consists only of these tokens, in any combination. Because the funcs token is quite specific and barely clashes with the other tokens, this validation should be fairly robust without implementing any syntax or grammar yet:
$tokens = array(
    'operator' => '[/*+-]',
    'funcs' => '(a\(|b\()',
    'brackets' => '[()]',
    'numbers' => '\d+(\.\d+)?',
    'space' => '[ ]',
);
$pattern = '';
foreach ($tokens as $token)
{
    $pattern .= sprintf('|(?:%s)', $token);
}
$pattern = sprintf('~^(%s)*$~', ltrim($pattern, '|'));
echo $pattern;
The input validates only if the whole string matches the token-based pattern. It might still be syntactically wrong PHP, but you can ensure it is built only from the specified tokens:
~^((?:[/*+-])|(?:(a\(|b\())|(?:[()])|(?:\d+(\.\d+)?)|(?:[ ]))*$~
If you build the pattern dynamically - as in the example - you're able to modify your language tokens later on more easily.
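As a rough usage sketch, the generated pattern can be wrapped in a small helper and run against a few inputs. The helper name uses_only_allowed_tokens() is made up for illustration, and the inner groups are written as non-capturing here, which does not change what the pattern accepts. The D modifier is an addition so that a trailing newline is not silently allowed.

```php
<?php
// Token patterns from the answer above; the array keys are only labels.
$tokens = [
    'operator' => '[/*+-]',
    'funcs'    => '(?:a\(|b\()',
    'brackets' => '[()]',
    'numbers'  => '\d+(?:\.\d+)?',
    'space'    => '[ ]',
];

// Hypothetical helper: true if $input consists solely of the allowed tokens.
function uses_only_allowed_tokens(string $input, array $tokens): bool
{
    $alternatives = [];
    foreach ($tokens as $token) {
        $alternatives[] = sprintf('(?:%s)', $token);
    }
    // D: "$" matches only at the very end, not before a trailing newline.
    $pattern = sprintf('~^(?:%s)*$~D', implode('|', $alternatives));

    return preg_match($pattern, $input) === 1;
}

var_dump(uses_only_allowed_tokens('a(1.2) + b(-1) / 3', $tokens)); // bool(true)
var_dump(uses_only_allowed_tokens('system("ls")', $tokens));       // bool(false)
var_dump(uses_only_allowed_tokens('a(+*))(', $tokens));            // bool(true)
```

The last call shows the limitation mentioned above: every token is allowed, yet the expression is not valid syntax, so eval() would still fail on it.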
Additionally, this can be the first step towards your own tokenizer / lexer. The token stream can then be passed on to a parser, which can syntactically validate and interpret it. That's the part user187291 wrote about.
As an alternative to writing a full lexer and parser, if you need to validate the syntax, you can formulate your grammar in terms of tokens and then match a regex-based token grammar against the token representation of the input.
The tokens are the words you use in your grammar. You will need to describe parentheses and function definitions more precisely than in the tokens above, and the tokenizer should follow clearer rules about which token supersedes another. The concept is outlined in another question of mine. It also uses regex for grammar formulation and syntax validation, but it still does not parse. In your case eval would be the parser you're making use of.

Parser generators have indeed already been written for PHP, and "LIME" in particular comes with the typical "calculator" example, which would be an obvious starting point for your "mini language": http://sourceforge.net/projects/lime-php/
It's been years since I last played with LIME, but it was already mature & stable then.
Notes:
1) Using a full-on parser generator gives you the advantage of avoiding PHP eval() entirely if you wish - you can make LIME emit a parser which effectively provides an "eval" function for expressions written in your mini language (with validation baked in). This gives you the additional advantage of allowing you to add support for new functions, as needed.
2) It may seem like overkill at first to use a parser generator for such an apparently small task, but once you get the examples working you'll be impressed by how easy it is to modify and extend them. And it's very easy to underestimate the difficulty of writing a bug-free parser (even a "trivial" one) from scratch.

Yes, you need the Tokenizer, or something similar, but it's only part of the story. A tokenizer (more commonly called a "lexer") can only read and parse elements of an expression, but has no means to detect that something like "foo()+*bar)" is invalid. You need the second part, called a parser, which would be able to arrange tokens in a kind of tree (called an "AST") or provide an error message when failing to do so. Ironically, once you've got a tree, "eval" is not needed anymore; you can evaluate your expression directly from the tree.
I would recommend you to write a parser by hand because it's a very useful exercise and a lot of fun. Recursive descent parsers are quite easy to program.
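To give an idea of the size of such a parser, here is a minimal recursive descent sketch for the mini language from the question. It evaluates while parsing instead of building a tree first, and it maps a() to log() and b() to exp() as the question describes. The class name FormulaParser and the grammar in the comments are assumptions made for this example; it needs PHP 7.4 or later for typed properties.

```php
<?php
// Hypothetical evaluator: numbers, + - * /, parentheses, a(x) = ln(x), b(x) = exp(x).
final class FormulaParser
{
    private string $src = '';
    private int $pos = 0;

    public function evaluate(string $src): float
    {
        $this->src = $src;
        $this->pos = 0;
        $value = $this->expr();
        if ($this->pos < strlen($this->src)) {
            throw new InvalidArgumentException("Unexpected '{$this->peek()}' at offset {$this->pos}");
        }
        return $value;
    }

    // Skips spaces and returns the next character ('' at end of input).
    private function peek(): string
    {
        while (($this->src[$this->pos] ?? '') === ' ') {
            $this->pos++;
        }
        return $this->src[$this->pos] ?? '';
    }

    private function expect(string $char): void
    {
        if ($this->peek() !== $char) {
            throw new InvalidArgumentException("Expected '$char' at offset {$this->pos}");
        }
        $this->pos++;
    }

    // expr := term (('+' | '-') term)*
    private function expr(): float
    {
        $value = $this->term();
        while (($op = $this->peek()) === '+' || $op === '-') {
            $this->pos++;
            $rhs = $this->term();
            $value = $op === '+' ? $value + $rhs : $value - $rhs;
        }
        return $value;
    }

    // term := factor (('*' | '/') factor)*
    private function term(): float
    {
        $value = $this->factor();
        while (($op = $this->peek()) === '*' || $op === '/') {
            $this->pos++;
            $rhs = $this->factor();
            $value = $op === '*' ? $value * $rhs : $value / $rhs;
        }
        return $value;
    }

    // factor := '-' factor | ('a' | 'b') '(' expr ')' | '(' expr ')' | number
    private function factor(): float
    {
        $char = $this->peek();
        if ($char === '-') {
            $this->pos++;
            return -$this->factor();
        }
        if ($char === 'a' || $char === 'b') {
            $this->pos++;
            $this->expect('(');
            $arg = $this->expr();
            $this->expect(')');
            return $char === 'a' ? log($arg) : exp($arg);
        }
        if ($char === '(') {
            $this->pos++;
            $value = $this->expr();
            $this->expect(')');
            return $value;
        }
        // \G anchors the match at the current offset.
        if (preg_match('~\G\d+(?:\.\d+)?~', $this->src, $m, 0, $this->pos) === 1) {
            $this->pos += strlen($m[0]);
            return (float) $m[0];
        }
        throw new InvalidArgumentException("Unexpected input at offset {$this->pos}");
    }
}
```

Calling (new FormulaParser())->evaluate('1 + 2 * 3') respects operator precedence, while malformed input such as 'a(1)+*b(2))' raises an InvalidArgumentException instead of ever reaching eval().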

You could use token_get_all(), inspect each token, and abort at the first invalid token.
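A minimal sketch of that idea follows, assuming the tokenizer extension is available. The function name has_only_allowed_tokens() and the whitelists are illustrative choices, not part of any library. Like the regex approach, this only checks which tokens occur, not whether they form a valid expression.

```php
<?php
// Hypothetical whitelist check built on token_get_all().
function has_only_allowed_tokens(string $formula): bool
{
    $allowedChars = ['+', '-', '*', '/', '(', ')'];
    $allowedIds   = [T_LNUMBER, T_DNUMBER, T_WHITESPACE];
    $allowedFuncs = ['a', 'b'];

    // token_get_all() only treats input as PHP code after an opening tag.
    $tokens = token_get_all('<?php ' . $formula);
    array_shift($tokens); // drop the T_OPEN_TAG token added above

    foreach ($tokens as $i => $token) {
        // Single-character tokens are returned as plain strings.
        if (is_string($token)) {
            if (!in_array($token, $allowedChars, true)) {
                return false;
            }
            continue;
        }

        [$id, $text] = $token;
        if (in_array($id, $allowedIds, true)) {
            continue;
        }
        // A whitelisted function name must be followed directly by "(".
        if ($id === T_STRING
            && in_array($text, $allowedFuncs, true)
            && ($tokens[$i + 1] ?? null) === '('
        ) {
            continue;
        }
        return false; // abort at the first invalid token
    }

    return true;
}

var_dump(has_only_allowed_tokens('a(1.2) + b(-1)')); // bool(true)
var_dump(has_only_allowed_tokens('system("ls")'));   // bool(false)
var_dump(has_only_allowed_tokens('$x + 1'));         // bool(false)
```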

hakre's answer, using regexes is a nice solution, but is a wee bit complicated. Also handling a whitelist of functions becomes rather messy. And if this does go wrong it could have a very nasty effect on your system.
Is there a reason you don't use the javascript 'eval' instead?

Related

RegEx vs. Manually Parsing String (PHP Performance)

Is there any problem (performance-wise) with manually parsing a string as follows, as opposed to using Regular Expressions or the built in string replacement functions?
for ($i=0;$i<strlen($string);$i++) {
$thisChar = $string[$i];
//do more stuff
}
Thanks!
There are things that are done more efficiently with custom code than with regex.
As long as both have the same big-O complexity and you're not handling humongous strings, readability and maintainability should be an equally or even more important argument.
For the actual performance just do a benchmark to compare the two solutions.
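A benchmark along those lines might look like the sketch below. The task (counting digits) and the function names are invented for illustration, since the question does not say what "do more stuff" is, and the timings will vary by machine and PHP version. It assumes PHP 7.3 or later for hrtime() and the ctype extension for ctype_digit().

```php
<?php
// Hypothetical task for the comparison: count the digits in a string.
function count_digits_loop(string $string): int
{
    $count = 0;
    $length = strlen($string); // computed once, not on every iteration
    for ($i = 0; $i < $length; $i++) {
        $thisChar = $string[$i];
        if (ctype_digit($thisChar)) {
            $count++;
        }
    }
    return $count;
}

function count_digits_regex(string $string): int
{
    // preg_match_all() returns the number of full matches.
    return (int) preg_match_all('/\d/', $string);
}

// Returns the elapsed wall-clock time in milliseconds for $runs calls.
function time_it(callable $fn, string $input, int $runs = 100): float
{
    $start = hrtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn($input);
    }
    return (hrtime(true) - $start) / 1e6;
}

$input = str_repeat('abc123', 10000);
printf("loop:  %.2f ms\n", time_it('count_digits_loop', $input));
printf("regex: %.2f ms\n", time_it('count_digits_regex', $input));
```

Before comparing the timings, it is worth checking that both variants return the same result for the same input.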

Regex to match all top-level functions

I'd like to parse PHP code with a regex to find all the top-level functions declared in our codebase.
The simple:
^\s*function\s*([\w_-]+)\(
works pretty well, but catches the extra
class Foo {
function bar() {...}
}
Any ideas on how to skip non-top-level functions that don't have scope declared?
Note: I know, I know, I should use a real parser but I want something quick and dirty that can run in grep -R -P over a very large codebase.
On a well-indented code base,
^function\s*([\w_-]+)\(
should catch only top-level functions. If you expect leading spaces, you could use a zero-width negative look-behind for a {, to avoid functions right at the beginning of a class declaration:
(?<!{)\s*function\s*([\w_-]+)\(
First of all, I have to say that this sort of thing depends largely on how disciplined your code is. For myself, I start all top-level functions immediately at the beginning of lines. So if I wanted to find non-top-level functions (in vim), I simply do
/^[[:space:]]\+function[[:space:]]\+\w\+\>
and
/^function[[:space:]]\+\w\+\>
for all top-level functions.
However, as I said, it depends on how well formatted your codebase is. Good luck!
If you're willing to use ruby (or basically anything with named capture groups), you could use something like this:
^\s*(?<type>\w+)\s*(?<name>[\w_-]+)(?<function>\([^()]*\))?\s*(?<body>{((?>[^{}]+)|(\g<body>))*})
The ones that are functions will have brackets in the function capturing group. The ones that are classes won't.
http://rubular.com/r/3dXZts6OYF
Extremely brittle though.

Hypothetical concatenation predicament

So I am working on a simple micro language/alternative syntax for PHP.
Its syntax takes a lot from JavaScript and CoffeeScript including a few of my own concepts. I have hand written the parser (no parser generator used) in PHP to convert the code into PHP then execute it. It is more of a proof of concept/learning tool rather than anything else but I'd be lying if I said I didn't want to see it used on an actual project one day.
Anyway here is a little problem I have come across that I thought I would impose on you great intellects:
As you know in PHP the period ( . ) is used for string concatenation. However in JavaScript it is used for method chaining.
Now one thing that annoys me in PHP is having to use that bloody arrow (->) for my method chains, so I went the JavaScript way and implemented the period (.) for use with objects.
(I think you can see the problem already)
Because I'm currently only writing a 'dumb' parser that merely does a huge search and replace, there is no way to distinguish whether a period (.) is being used for concatenation or for method chaining.
"So if you are trying to be like JavaScript, just use the addition (+) operator Franky!", I hear you scream. Well I would but because the addition (+) operator is used for math in PHP I would merely be putting myself in the same situation.
Unless I can make my parser smart enough (with a crap load of work) to know that when the addition (+) operator is working with integers then don't convert it into a period (.) for concatenation I am pretty much screwed.
But here is the cool thing. Because this is pretty much a new language. I don't have to use the period or addition operator for concatenation.
So my question is: If I was to decide to introduce a new method of string concatenation, what character would make the most sense?
Does it have to be one character? .. could work!
Any myriad of combinations, like ~~ or >: even!
If you don't want to use + or ., then I would recommend ^ because that's used in some other languages for string concatenation and I don't believe that it's used for anything in PHP.
Edit: It's been pointed out that it's used for XOR. One option would be to use ^ anyway since bitwise XOR is not commonly used and then to map something else like ^^ to XOR. Another option would be to use .. for concatenation. The problem is that the single characters are mostly taken.
Another option would be to use +, but map it to a function which concatenates when one argument is a string and adds otherwise. In order to not break things which rely on strings which are numbers being treated as their values, we should probably treat numeric strings as numbers for these purposes. Here's the function that I would use.
function smart_add($arg1, $arg2) {
    // Add when both arguments are numeric (including numeric strings); otherwise concatenate.
    if (is_numeric($arg1) && is_numeric($arg2)) {
        return $arg1 + $arg2;
    } else {
        return $arg1 . $arg2;
    }
}
Then a + b + c + d just gets turned into smart_add(smart_add(smart_add(a,b),c),d)
This may not be perfect in all cases, but it should work pretty well most of the time and has clear rules for use.
So my question is: If I was to decide to introduce a new method of
string concatenation, what character would make the most sense?
As you're well aware, you'll need to choose a character that is not being used as one of PHP's operators. Since string concatenation is a common technique, I would try to avoid characters that require pressing SHIFT to type, as they will be a hindrance.
Instead of trying to assign one character for string concatenation (as most are already in use), perhaps you should define your own syntax for string concatenation (or any other operation you need to overwrite with a different operator), as a shorthand operator (sort of). Something like:
[$string, $string]
Should be easy to pick up by a parser and form the resulting concatenated string.
Edit: I should also note that whether you're using literal strings or variables, there's no way (as far as I know) to confuse this syntax with any other PHP functionality, since the comma in the middle is invalid for array manipulations. So, all of the following would still be recognized as string concatenation and not something else in PHP.
["stack", "overflow"]
["stack", $overflow]
[$stack, $overflow]
Edit: Since this conflicts with JSON notation, the following alternative variations exist:
Changing the delimiter
Omitting the delimiter
Example:
[$stack $overflow $string $concatenation] // Use nothing (but really need space)

PHP preg_match Math Function

I'm writing a script that will allow a user to input a string that is a math statement, to then be evaluated. However, I have hit a roadblock. I cannot figure out how, using preg_match, to disallow statements that have variables in them.
Using this, $calc = create_function("", "return (" . $string . ");" ); $calc();, allows users to input a string that will be evaluated, but it crashes whenever something like echo 'foo'; is put in place of the variable $string.
I've seen this post, but it does not allow for math functions inside the string, such as $string = 'sin(45)';.
For a stack-based parser implemented in PHP that uses Dijkstra's shunting yard algorithm to convert infix to postfix notation, with support for functions with a varying number of arguments, you can look at the source for the PHPExcel calculation engine (which does not use eval).
Also have a look at the responses to this question
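For orientation, here is a much smaller sketch of the shunting-yard idea. It is not the PHPExcel code. It assumes binary + - * / only (no unary minus), single-argument functions, and PHP 8 for the match expression. The names to_postfix(), eval_postfix(), PRECEDENCE and FUNCTIONS are made up for this example.

```php
<?php
// Sketch of the shunting-yard algorithm: infix string -> postfix (RPN) tokens -> value.
const PRECEDENCE = ['+' => 1, '-' => 1, '*' => 2, '/' => 2];
const FUNCTIONS  = ['sin', 'cos', 'sqrt'];

function to_postfix(string $infix): array
{
    preg_match_all('~\d+(?:\.\d+)?|[a-z]+|[-+*/()]~', $infix, $m);
    // Reject input containing characters the tokenizer pattern skipped over.
    if (implode('', $m[0]) !== preg_replace('~\s+~', '', $infix)) {
        throw new InvalidArgumentException('Unexpected character in input');
    }

    $output = [];
    $stack = [];
    foreach ($m[0] as $token) {
        if (is_numeric($token)) {
            $output[] = $token;
        } elseif (in_array($token, FUNCTIONS, true) || $token === '(') {
            $stack[] = $token;
        } elseif (array_key_exists($token, PRECEDENCE)) {
            // Pop operators of higher or equal precedence (all four are left-associative).
            while ($stack && array_key_exists(end($stack), PRECEDENCE)
                && PRECEDENCE[end($stack)] >= PRECEDENCE[$token]) {
                $output[] = array_pop($stack);
            }
            $stack[] = $token;
        } elseif ($token === ')') {
            while ($stack && end($stack) !== '(') {
                $output[] = array_pop($stack);
            }
            if (!$stack) {
                throw new InvalidArgumentException('Unbalanced ")"');
            }
            array_pop($stack); // discard the "("
            if ($stack && in_array(end($stack), FUNCTIONS, true)) {
                $output[] = array_pop($stack); // function call is complete
            }
        } else {
            throw new InvalidArgumentException("Unknown token: $token");
        }
    }
    while ($stack) {
        $top = array_pop($stack);
        if ($top === '(') {
            throw new InvalidArgumentException('Unbalanced "("');
        }
        $output[] = $top;
    }
    return $output;
}

function eval_postfix(array $postfix): float
{
    $stack = [];
    foreach ($postfix as $token) {
        if (is_numeric($token)) {
            $stack[] = (float) $token;
        } elseif (in_array($token, FUNCTIONS, true)) {
            if (!$stack) {
                throw new InvalidArgumentException("Missing argument for $token");
            }
            $stack[] = $token(array_pop($stack)); // whitelisted name, safe to call
        } else {
            if (count($stack) < 2) {
                throw new InvalidArgumentException("Missing operand for $token");
            }
            $b = array_pop($stack);
            $a = array_pop($stack);
            $stack[] = match ($token) {
                '+' => $a + $b,
                '-' => $a - $b,
                '*' => $a * $b,
                '/' => $a / $b,
            };
        }
    }
    if (count($stack) !== 1) {
        throw new InvalidArgumentException('Malformed expression');
    }
    return $stack[0];
}
```

With these definitions, to_postfix('3 + 4 * 2') yields the tokens 3 4 2 * +, and eval_postfix() runs that list on a plain stack without any call to eval().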
How complex of a math function do you need to allow? If you only need basic math, then you might be able to get away with only allowing whitespace + the characters 0123456789.+/-* or some such.
In general, however, using the language's eval-type capabilities to just do math is probably a bad idea.
Something like:
^([\d\(\)\+\-*/ ,.]|sin\(|cos\(|sqrt\(|...)+$
would allow only numbers, brackets, math operations and provided math functions. But it won't check if provided expression is valid, so something like +++sin()))333((( would be still accepted.
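A short sketch makes that behaviour concrete. Only sin, cos and sqrt are whitelisted here as an assumption for the example, since the full function list in the pattern above is elided, and looks_like_math() is a made-up name.

```php
<?php
// Character/function whitelist in the spirit of the pattern above.
// Assumption: only sin, cos and sqrt are whitelisted in this sketch.
function looks_like_math(string $expression): bool
{
    $pattern = '~^(?:[\d()+\-*/ ,.]|sin\(|cos\(|sqrt\()+$~D';

    return preg_match($pattern, $expression) === 1;
}

var_dump(looks_like_math('sin(45) + sqrt(2)*3')); // bool(true)
var_dump(looks_like_math("echo 'foo';"));         // bool(false)
var_dump(looks_like_math('+++sin()))333((('));    // bool(true): allowed characters, invalid expression
```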
I wonder if this class would help you? Found that doing a search on Google for "php math expressions".

Dynamic logical expression parsing/evaluation in PHP?

I have a need to evaluate user-defined logical expressions of arbitrary complexity on some PHP pages. Assuming that form fields are the primary variables, it would need to:
substitute "variables" for form field values;
handle comparison operators, minimally ==, <, <=, >= and >, by symbol or name (e.g. eq, lt, le, ge, gt respectively);
handle boolean operators not, and, or and possibly xor, by name or symbol (e.g. !, &&, || and ^^ respectively);
handle literal values for strings and numbers;
be plaintext not XML (e.g. "firstname == '' or lastname == ''"); and
be reasonably performant.
Now in years gone by I've written recursive descent parsers that could build an expression tree and do this kind of thing, but that's not a task I'm relishing in PHP, so I'm hoping there are things out there that will at least get me some of the way there.
Suggestions?
Much time has gone by since this question was asked, and I happened to be looking for an expression parser for php. I chose to use the ExpressionLanguage component from Symfony 2.4. It can be installed with no dependencies from composer via packagist.
composer require symfony/expression-language
Check create_function, it creates an anonymous function from the string parameters passed, I'm not sure about its performance, but it's very flexible...
If I understand the problem correctly, you want the users to write out functions in non-PHP, and then have PHP interpret it?
If so, you could simply take their string and replace "lt" with "<" and "gt" with ">" ... then do eval().
I have a hunch the problem isn't this simple, but if it is, eval() could do the job. Of course, then you're opening yourself up for any kind of attack.
Take a look at my infix to postfix example I think you could port it to PHP with relative ease. It only uses an array and some switches. No trees. A stack is only needed to run the postfix result.
Check out this function: http://pluginphp.com/plug-in31.php
You can try adapting my Evaluator class (https://github.com/djfm/Evaluator), it does arithmetic expressions (for now) and you can use variables too. All the major PHP operators are implemented.
