PHP preg_match Math Function

I'm writing a script that lets a user input a string containing a math statement, which is then evaluated. However, I have hit a roadblock: I cannot figure out how to use preg_match to disallow statements that have variables in them.
Using $calc = create_function("", "return (" . $string . ");"); $calc(); lets users input a string that will be evaluated, but it crashes whenever something like echo 'foo'; is put in place of the variable $string.
I've seen this post, but it does not allow for math functions inside the string, such as $string = 'sin(45)';.

For a stack-based parser implemented in PHP that uses Dijkstra's shunting-yard algorithm to convert infix to postfix notation, with support for functions taking a varying number of arguments, you can look at the source of the PHPExcel calculation engine (which does not use eval).
Also have a look at the responses to this question
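To give an idea of what the shunting-yard approach looks like, here is a minimal sketch. It is not PHPExcel's code: the function name evaluate_infix and the whitelist of sin, cos and sqrt are assumptions made for illustration, and it only handles binary + - * /, parentheses and one-argument functions (no unary minus, no varying argument counts).

```php
<?php
// Minimal sketch: tokenize, convert infix to postfix (shunting yard), then evaluate.
// evaluate_infix() and the function whitelist are hypothetical, not from any library.
function evaluate_infix(string $expr): float
{
    $functions = ['sin' => 'sin', 'cos' => 'cos', 'sqrt' => 'sqrt'];
    $prec = ['+' => 1, '-' => 1, '*' => 2, '/' => 2];

    // Tokenize into numbers, identifiers, operators and parentheses.
    $stripped = preg_replace('/\s+/', '', $expr);
    preg_match_all('~\d+(?:\.\d+)?|[a-z]+|[-+*/()]~', $stripped, $m);
    if (implode('', $m[0]) !== $stripped) {
        throw new InvalidArgumentException('Unexpected character in expression');
    }

    // Shunting yard: build the postfix queue in $output.
    $output = [];
    $stack = [];
    foreach ($m[0] as $tok) {
        if (is_numeric($tok)) {
            $output[] = (float) $tok;
        } elseif (isset($functions[$tok])) {
            $stack[] = $tok;
        } elseif (isset($prec[$tok])) {
            // All four operators are left-associative: pop while precedence is >=.
            while ($stack && isset($prec[end($stack)]) && $prec[end($stack)] >= $prec[$tok]) {
                $output[] = array_pop($stack);
            }
            $stack[] = $tok;
        } elseif ($tok === '(') {
            $stack[] = $tok;
        } elseif ($tok === ')') {
            while ($stack && end($stack) !== '(') {
                $output[] = array_pop($stack);
            }
            if (!$stack) {
                throw new InvalidArgumentException('Mismatched parentheses');
            }
            array_pop($stack); // discard the "("
            if ($stack && isset($functions[end($stack)])) {
                $output[] = array_pop($stack); // the function call is complete
            }
        } else {
            throw new InvalidArgumentException("Unknown token: $tok");
        }
    }
    while ($stack) {
        $tok = array_pop($stack);
        if ($tok === '(') {
            throw new InvalidArgumentException('Mismatched parentheses');
        }
        $output[] = $tok;
    }

    // Evaluate the postfix queue with a value stack.
    $values = [];
    foreach ($output as $tok) {
        if (is_float($tok)) {
            $values[] = $tok;
        } elseif (isset($functions[$tok])) {
            if (count($values) < 1) {
                throw new InvalidArgumentException('Missing function argument');
            }
            $values[] = (float) $functions[$tok](array_pop($values));
        } else {
            if (count($values) < 2) {
                throw new InvalidArgumentException('Missing operand');
            }
            $b = array_pop($values);
            $a = array_pop($values);
            switch ($tok) {
                case '+': $values[] = $a + $b; break;
                case '-': $values[] = $a - $b; break;
                case '*': $values[] = $a * $b; break;
                case '/': $values[] = $a / $b; break;
            }
        }
    }
    if (count($values) !== 1) {
        throw new InvalidArgumentException('Malformed expression');
    }
    return $values[0];
}

echo evaluate_infix('sqrt(16) / 2 + 1 * 3'), "\n"; // 5
```

Because the user's string is never handed to eval or create_function, input like echo 'foo'; simply fails with an exception instead of running.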

How complex of a math function do you need to allow? If you only need basic math, then you might be able to get away with only allowing whitespace + the characters 0123456789.+/-* or some such.
In general, however, using the language's eval-type capabilities to just do math is probably a bad idea.

Something like:
^([\d\(\)\+\-*/ ,.]|sin\(|cos\(|sqrt\(|...)+$
would allow only numbers, brackets, math operators and the listed math functions. But it won't check whether the expression is valid, so something like +++sin()))333((( would still be accepted.
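Here is how that pattern might be wired into preg_match. This is only a sketch: the helper name looks_like_math is made up, and the list of functions is cut down to sin, cos and sqrt for the example.

```php
<?php
// Hypothetical helper: true when $expr is built only from whitelisted pieces.
function looks_like_math(string $expr): bool
{
    // "~" is used as the delimiter so that "/" needs no escaping.
    $pattern = '~^(?:[\d()+\-*/ ,.]|sin\(|cos\(|sqrt\()+$~';
    return preg_match($pattern, $expr) === 1;
}

var_dump(looks_like_math('sin(45) + 2 * 3'));   // bool(true)
var_dump(looks_like_math("echo 'foo';"));       // bool(false)
var_dump(looks_like_math('+++sin()))333((('));  // bool(true): allowed characters, invalid syntax
```

The last call shows the limitation mentioned above: the check filters characters, not grammar, so a syntactically broken string still passes.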

I wonder if this class would help you? Found that doing a search on Google for "php math expressions".

Related

PHP preg_replace RCE [duplicate]

I'm currently improving my knowledge about security holes in HTML, PHP, JavaScript etc.
A few hours ago, I stumbled across the /e modifier in regular expressions and I still don't get how it works. I've taken a look at the documentation, but that didn't really help.
What I understood is that this modifier can be manipulated to give someone the opportunity to execute PHP code (for example, in preg_replace()). I've seen the following example describing a security hole, but it wasn't explained, so could someone please explain to me how to call phpinfo() in the following code?
$input = htmlentities("");
if (strpos($input, 'bla'))
{
echo preg_replace("/" .$input ."/", $input ."<img src='".$input.".png'>", "bla");
}

The e Regex Modifier in PHP with example vulnerability & alternatives
What e does, with an example...
The e modifier is a deprecated regex modifier which allows you to use PHP code within your regular expression. This means that whatever you pass in will be evaluated as a part of your program.
For example, we can use something like this:
$input = "Bet you want a BMW.";
echo preg_replace("/([a-z]*)/e", "strtoupper('\\1')", $input);
This will output BET YOU WANT A BMW.
Without the e modifier, we get this very different output:
strtoupper('')Bstrtoupper('et')strtoupper('') strtoupper('you')strtoupper('') strtoupper('want')strtoupper('') strtoupper('a')strtoupper('') strtoupper('')Bstrtoupper('')Mstrtoupper('')Wstrtoupper('').strtoupper('')
Potential security issues with e...
The e modifier is deprecated for security reasons. Here's an example of an issue you can run into very easily with e:
$password = 'secret';
...
$input = $_GET['input'];
echo preg_replace('|^(.*)$|e', '"\1"', $input);
If I submit my input as "$password", the output to this function will be secret. It's very easy, therefore, for me to access session variables, all variables being used on the back-end and even take deeper levels of control over your application (eval('cat /etc/passwd');?) through this simple piece of poorly written code.
Like the similarly deprecated mysql libraries, this doesn't mean that you cannot write code which is not subject to vulnerability using e, just that it's more difficult to do so.
What you should use instead...
You should use preg_replace_callback in nearly all places you would consider using the e modifier. The code is definitely not as brief in this case but don't let that fool you -- it's twice as fast:
$input = "Bet you want a BMW.";
echo preg_replace_callback(
    "/([a-z]*)/",
    function ($matches) {
        // $matches[0] holds the full match; return its uppercase form
        return strtoupper($matches[0]);
    },
    $input
);
On performance, there's no reason to use e...
Unlike the mysql libraries (which were also deprecated for security purposes), e is not quicker than its alternatives for most operations. For the example given, it's twice as slow: preg_replace_callback (0.14 sec for 50,000 operations) vs e modifier (0.32 sec for 50,000 operations)

The e modifier is a PHP-specific modifier that triggers PHP to run the resulting string as PHP code. It is basically a eval() wrapped inside a regex engine.
eval() on its own is considered a security risk and a performance problem; wrapping it inside a regex amplifies both those issues significantly.
It is therefore considered bad practice; it was formally deprecated in PHP 5.5 and removed in PHP 7.0.
PHP has provided for several versions now an alternative solution in the form of preg_replace_callback(), which uses callback functions instead of using eval(). This is the recommended method of doing this kind of thing.
With specific regard to the code you've quoted:
I don't see an e modifier in the sample code you've given in the question. It has a slash at each end as the regex delimiter; the e would have to be outside of that, and it isn't. Therefore I don't think the code you've quoted is likely to be directly vulnerable to having an e modifier injected into it.
However, if $input contains any / characters, it will be vulnerable to being entirely broken (i.e. throwing an error due to an invalid regex). The same would apply if it had anything else that made it an invalid regular expression.
Because of this, it is a bad idea to use an unvalidated user input string as part of a regex pattern - even if you are sure that it can't be hacked to use the e modifier, there's plenty of other mischief that could be achieved with it.
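If user input really does have to end up inside a pattern, preg_quote() is the usual way to neutralise it. A small sketch, where the value of $input is just a stand-in for untrusted data:

```php
<?php
$input = 'a/b.c'; // stands in for untrusted user input

// preg_quote() escapes regex metacharacters; passing '/' as the second
// argument also escapes the delimiter used below.
$pattern = '/' . preg_quote($input, '/') . '/';

echo $pattern, "\n";                                        // /a\/b\.c/
echo preg_replace($pattern, '[match]', 'x a/b.c y'), "\n";  // x [match] y
```

This makes the input match literally, so it can neither break the pattern nor smuggle in modifiers.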

As explained in the manual, the /e modifier evaluates the replacement string as PHP code after the backreferences have been substituted into it. The example given in the manual is:
$html = preg_replace(
'(<h([1-6])>(.*?)</h\1>)e',
'"<h$1>" . strtoupper("$2") . "</h$1>"',
$html
);
This matches any "<hX>XXXXX</hX>" text (i.e. headline HTML tags), replaces this text with "<hX>" . strtoupper("XXXXXX") . "</hX>", then executes "<hX>" . strtoupper("XXXXXX") . "</hX>" as PHP code, then puts the result back into the string.
If you run this on arbitrary user input, any user has a chance to slip something in which will actually be evaluated as PHP code. If he does it correctly, the user can use this opportunity to execute any code he wants to. In the above example, imagine if in the second step the text would be "<hX>" . strtoupper("" . shell_exec('rm -rf /') . "") . "</hX>".

It's evil, that's all you need to know :p
More specifically, it generates the replacement string as normal, but then runs it through eval.
You should use preg_replace_callback instead.
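For illustration, the headline example from the manual could be rewritten with preg_replace_callback roughly like this. The sample $html value is made up, and the delimiter is changed to ~ to keep the pattern easy to read:

```php
<?php
$html = '<h1>hello world</h1><p>text</p>'; // hypothetical input

$html = preg_replace_callback(
    '~<h([1-6])>(.*?)</h\1>~',
    function (array $m): string {
        // $m[1] is the heading level, $m[2] the heading text.
        // The matched text is only ever treated as data, never as code.
        return "<h{$m[1]}>" . strtoupper($m[2]) . "</h{$m[1]}>";
    },
    $html
);

echo $html, "\n"; // <h1>HELLO WORLD</h1><p>text</p>
```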

Using php regex to translate output buffer, but not within HTML tags

I have an array with strings to translate ($translation), and I want to use it to translate the output buffer. However, it should not replace within html tags. I have tried using php DOM, but this is too slow and probably too complex for what I want to do.
I currently use this code, but this of course also translates between tags.
$output = ob_get_clean();
foreach($translation as $original => $translated) {
$output = str_replace($original,utf8_encode($translated),$output);
}
I guess I should use a regular expression to replace not within HTML tags, but I can't seem to find the correct expression to do this. Can anyone help? Thanks.

Aside from opinions on the original idea:
I would not use regexp for that, for performance reasons. You could use strpos($html, '<') and strpos($html, '>') in combination with substr to extract the text string by string.
But if somebody (including you) ever has to change the results at another point, then I suggest you go the extra mile and implement a 'proper' translation.
My recommendation:
look into gettext
filter out the strings like mentioned above and generate a .mo -file
encapsulate the strings between the tags with the gettext-functions (like here)
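For a quick fix in the meantime, one way to leave tags alone is to split the buffer into tags and text and translate only the text. Note that this sketch uses preg_split rather than the strpos/substr loop suggested above, the $translation and $output values are made up, and it assumes no ">" appears inside attribute values and that script or style contents do not need protecting.

```php
<?php
// Hypothetical translation table and buffered markup.
$translation = ['Home' => 'Start', 'title' => 'Titel'];
$output = '<a title="Home" href="/">Home</a> <span class="title">title</span>';

// Split on tags; PREG_SPLIT_DELIM_CAPTURE keeps the tags in the result.
$parts = preg_split('/(<[^>]*>)/', $output, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $i => $part) {
    // Tags land on odd indexes, so only even-indexed parts are plain text.
    if ($i % 2 === 0) {
        $parts[$i] = strtr($part, $translation);
    }
}

$output = implode('', $parts);
echo $output, "\n";
// <a title="Home" href="/">Start</a> <span class="title">Titel</span>
```

strtr() with an array does all replacements in one pass, so an already translated string is not translated a second time.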

Hypothetical concatenation predicament

So I am working on a simple micro language/alternative syntax for PHP.
Its syntax takes a lot from JavaScript and CoffeeScript including a few of my own concepts. I have hand written the parser (no parser generator used) in PHP to convert the code into PHP then execute it. It is more of a proof of concept/learning tool rather than anything else but I'd be lying if I said I didn't want to see it used on an actual project one day.
Anyway here is a little problem I have come across that I thought I would impose on you great intellects:
As you know in PHP the period ( . ) is used for string concatenation. However in JavaScript it is used for method chaining.
Now one thing that annoys me in PHP is having to use that bloody arrow (->) for my method chains, so I went the JavaScript way and implemented the period (.) for use with objects.
(I think you can see the problem already)
Because I'm currently only writing a 'dumb' parser that merely does a huge search and replace, there is no way to distinguish whether a period (.) is being used for concatenation or for method chaining.
"So if you are trying to be like JavaScript, just use the addition (+) operator Franky!", I hear you scream. Well I would but because the addition (+) operator is used for math in PHP I would merely be putting myself in the same situation.
Unless I can make my parser smart enough (with a crap load of work) to know that when the addition (+) operator is working with integers then don't convert it into a period (.) for concatenation I am pretty much screwed.
But here is the cool thing. Because this is pretty much a new language. I don't have to use the period or addition operator for concatenation.
So my question is: If I was to decide to introduce a new method of string concatenation, what character would make the most sense?

Does it have to be one character? .. could work!
Any myriad of combinations, like ~~ or >: even!

If you don't want to use + or ., then I would recommend ^ because that's used in some other languages for string concatenation and I don't believe that it's used for anything in PHP.
Edit: It's been pointed out that it's used for XOR. One option would be to use ^ anyway since bitwise XOR is not commonly used and then to map something else like ^^ to XOR. Another option would be to use .. for concatenation. The problem is that the single characters are mostly taken.
Another option would be to use +, but map it to a function which concatenates when one argument is a string and adds otherwise. In order to not break things which rely on strings which are numbers being treated as their values, we should probably treat numeric strings as numbers for these purposes. Here's the function that I would use.
function smart_add($arg1, $arg2) {
    // Numeric strings count as numbers, so "2" + "3" still adds
    if (is_numeric($arg1) && is_numeric($arg2)) {
        return $arg1 + $arg2;
    } else {
        return $arg1 . $arg2;
    }
}
Then a + b + c + d just gets turned into smart_add(smart_add(smart_add(a,b),c),d)
This may not be perfect in all cases, but it should work pretty well most of the time and has clear rules for use.

So my question is: If I was to decide to introduce a new method of
string concatenation, what character would make the most sense?
As you're well aware, you'll need to choose a character that is not already used as one of PHP's operators. Since string concatenation is a common operation, I would try to avoid characters that need SHIFT to type, as those will be a hindrance.
Instead of trying to assign one character for string concatenation (as most are already in use), perhaps you should define your own syntax for string concatenation (or any other operation you need to overwrite with a different operator), as a shorthand operator (sort of). Something like:
[$string, $string]
Should be easy to pick up by a parser and form the resulting concatenated string.
Edit: I should also note that whether you're using literal strings or variables, there's no way (as far as I know) to confuse this syntax with PHP's array access, since a comma inside the square brackets is invalid there. Be aware that PHP 5.4 and later parse a standalone [a, b] as a short array literal, so your parser has to translate it before PHP sees it. With that in place, all of the following would be recognized as string concatenation.
["stack", "overflow"]
["stack", $overflow]
[$stack, $overflow]
Edit: Since this conflicts with JSON notation, the following alternative variations exist:
Changing the delimiter
Omitting the delimiter
Example:
[$stack $overflow $string $concatenation] // Use nothing (but really need space)

Validate user inputted PHP code before passing it to eval()

Before passing a string to eval() I would like to make sure the syntax is correct and allow:
Two functions: a() and b()
Four operators: /*-+
Brackets: ()
Numbers: 1.2, -1, 1
How can I do this, maybe it has something to do with PHP Tokenizer?
I'm actually trying to make a simple formula interpreter so a() and b() will be replaced by ln() and exp(). I don't want to write a tokenizer and parser from scratch.

As far as validation is concerned, the following character tokens are valid:
operator: [/*+-]
funcs: (a\(|b\()
brackets: [()]
numbers: \d+(\.\d+)?
space: [ ]
A simple validation could then check whether the input string consists only of a combination of these patterns. Because the funcs token is quite precise and does not clash much with the other tokens, this validation should be fairly stable without implementing any syntax or grammar yet:
$tokens = array(
'operator' => '[/*+-]',
'funcs' => '(a\(|b\()',
'brackets' => '[()]',
'numbers' => '\d+(\.\d+)?',
'space' => '[ ]',
);
$pattern = '';
foreach($tokens as $token)
{
$pattern .= sprintf('|(?:%s)', $token);
}
$pattern = sprintf('~^(%s)*$~', ltrim($pattern, '|'));
echo $pattern;
The input validates only if the whole string matches the token-based pattern. It might still be syntactically wrong PHP, but you can ensure it is built only from the specified tokens:
~^((?:[/*+-])|(?:(a\(|b\())|(?:[()])|(?:\d+(\.\d+)?)|(?:[ ]))*$~
If you build the pattern dynamically - as in the example - you're able to modify your language tokens later on more easily.
Additionally, this can be the first step towards your own tokenizer / lexer. The token stream can then be passed on to a parser, which can syntactically validate and interpret it. That's the part user187291 wrote about.
As an alternative to writing a full lexer and parser, if you only need to validate the syntax, you can formulate your grammar in terms of tokens as well and then run a regex-based grammar check on the token representation of the input.
The tokens are the words you use in your grammar. You will need to describe parentheses and function definitions more precisely than in the tokens above, and the tokenizer should follow clearer rules about which token supersedes another. The concept is outlined in another question of mine. It uses regex as well for grammar formulation and syntax validation, but it still does not parse. In your case eval would be the parser you're making use of.

Parser generators have indeed already been written for PHP, and "LIME" in particular comes with the typical "calculator" example, which would be an obvious starting point for your "mini language": http://sourceforge.net/projects/lime-php/
It's been years since I last played with LIME, but it was already mature & stable then.
Notes:
1) Using a full-on parser generator gives you the advantage of avoiding PHP eval() entirely if you wish - you can make LIME emit a parser which effectively provides an "eval" function for expressions written in your mini language (with validation baked in). This gives you the additional advantage of allowing you to add support for new functions, as needed.
2) It may seem like overkill at first to use a parser generator for such an apparently small task, but once you get the examples working you'll be impressed by how easy it is to modify and extend them. And it's very easy to underestimate the difficulty of writing a bug-free parser (even a "trivial" one) from scratch.

Yes, you need the Tokenizer, or something similar, but it's only part of the story. A tokenizer (more commonly called a "lexer") can only read and split an expression into its elements; it has no means to detect that something like "foo()+*bar)" is invalid. You need the second part, called a parser, which arranges the tokens into a kind of tree (called an "AST") or provides an error message when it fails to do so. Ironically, once you've got a tree, "eval" is not needed anymore: you can evaluate your expression directly from the tree.
I would recommend you to write a parser by hand because it's a very useful exercise and a lot of fun. Recursive descent parsers are quite easy to program.
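To show how little code that takes, here is a rough sketch of a recursive descent evaluator for the formula language in the question. The class and method names are made up, and mapping a() to log() and b() to exp() follows what the question describes; it evaluates while parsing instead of building a tree, which is a shortcut.

```php
<?php
// Sketch only. Grammar handled:
//   expr   := term (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := number | '-' factor | '(' expr ')' | ('a' | 'b') '(' expr ')'
final class FormulaParser
{
    private $tokens = [];
    private $pos = 0;

    public function evaluate(string $src): float
    {
        $stripped = preg_replace('/\s+/', '', $src);
        preg_match_all('~\d+(?:\.\d+)?|[ab](?=\()|[-+*/()]~', $stripped, $m);
        if (implode('', $m[0]) !== $stripped) {
            throw new InvalidArgumentException('Invalid character in formula');
        }
        $this->tokens = $m[0];
        $this->pos = 0;
        $value = $this->expr();
        if ($this->peek() !== null) {
            throw new InvalidArgumentException('Unexpected token: ' . $this->peek());
        }
        return $value;
    }

    private function peek(): ?string
    {
        return $this->tokens[$this->pos] ?? null;
    }

    private function expect(string $tok): void
    {
        if ($this->peek() !== $tok) {
            throw new InvalidArgumentException("Expected $tok");
        }
        $this->pos++;
    }

    private function expr(): float
    {
        $value = $this->term();
        while (in_array($this->peek(), ['+', '-'], true)) {
            $op = $this->tokens[$this->pos++];
            $rhs = $this->term();
            $value = $op === '+' ? $value + $rhs : $value - $rhs;
        }
        return $value;
    }

    private function term(): float
    {
        $value = $this->factor();
        while (in_array($this->peek(), ['*', '/'], true)) {
            $op = $this->tokens[$this->pos++];
            $rhs = $this->factor();
            $value = $op === '*' ? $value * $rhs : $value / $rhs;
        }
        return $value;
    }

    private function factor(): float
    {
        $tok = $this->peek();
        if ($tok === null) {
            throw new InvalidArgumentException('Unexpected end of formula');
        }
        $this->pos++;
        if (is_numeric($tok)) {
            return (float) $tok;
        }
        if ($tok === '-') {
            return -$this->factor();
        }
        if ($tok === '(') {
            $value = $this->expr();
            $this->expect(')');
            return $value;
        }
        if ($tok === 'a' || $tok === 'b') {
            $this->expect('(');
            $arg = $this->expr();
            $this->expect(')');
            return $tok === 'a' ? log($arg) : exp($arg);
        }
        throw new InvalidArgumentException("Unexpected token: $tok");
    }
}

$parser = new FormulaParser();
echo $parser->evaluate('(1 + 2) * 3 - 4 / 2'), "\n"; // 7
```

An input like 2+*3 is rejected with an exception, which is exactly the kind of error a tokenizer alone cannot catch.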

You could use token_get_all(), inspect each token, and abort at the first invalid token.
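A rough sketch of that idea follows. The function name only_allowed_tokens is made up, and note that PHP's own number tokens also cover forms such as 0x1A or 1e5, which may be more than you want to allow.

```php
<?php
// Hypothetical validator: true when every token of $expr is on the whitelist.
function only_allowed_tokens(string $expr): bool
{
    $allowedChars = ['+', '-', '*', '/', '(', ')'];
    $allowedFuncs = ['a', 'b'];

    // token_get_all() only tokenizes PHP code after an opening tag, so prepend one.
    $tokens = token_get_all('<?php ' . $expr);
    array_shift($tokens); // drop the T_OPEN_TAG we added ourselves

    foreach ($tokens as $token) {
        if (is_string($token)) {
            // Single-character tokens come back as plain strings.
            if (!in_array($token, $allowedChars, true)) {
                return false;
            }
            continue;
        }
        [$id, $text] = $token;
        if ($id === T_LNUMBER || $id === T_DNUMBER || $id === T_WHITESPACE) {
            continue;
        }
        if ($id === T_STRING && in_array($text, $allowedFuncs, true)) {
            continue;
        }
        return false; // abort at the first token that is not whitelisted
    }
    return true;
}

var_dump(only_allowed_tokens('a(1.2) + b(-1) / 2')); // bool(true)
var_dump(only_allowed_tokens("echo 'foo';"));        // bool(false)
var_dump(only_allowed_tokens('a()+*b)'));            // bool(true): tokens are fine, syntax is not
```

As the last line shows, this checks tokens only; a string that passes can still be a parse error when it reaches eval.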

hakre's answer using regexes is a nice solution, but it is a wee bit complicated. Handling a whitelist of functions also becomes rather messy. And if this does go wrong, it could have a very nasty effect on your system.
Is there a reason you don't use the javascript 'eval' instead?

PHP dealing with huge string

I have to replace xmlns with ns in my incoming XML in order to fix SimpleXMLElement's xpath() function. Most functions do not have a performance problem, but there always seems to be an overhead as the string grows.
E.g. preg_replace on a 2 MB string takes 50 ms to process, even if I limit the replacements to 1 and the replacement is done at the very beginning.
If I substr the first few characters and replace only that part, it is slightly faster, but that is not really what I want.
Is there any PHP method that would perform better for my problem? And if there is no option, could a simple PHP extension help that just does Replace => SimpleXMLElement in C?

If you know exactly where the offending "x", "m" and "l" are, you can just use something like $xml[$x_pos] = ' '; $xml[$m_pos] = ' '; $xml[$l_pos] = ' ' to transform them into spaces. Or transform them into ns___ (where _ = space).

You're always going to get an overhead when trying to do this: you're dealing with a char array and trying to replace multiple matching elements of the array (i.e. words).
50ms is not much of an overhead, unless (as I suspect) you're trying to do this in a loop?

50ms sounds pretty reasonable to me, for something like this. The requirement itself smells of something being wrong.
Is there any particular reason that you're using regular expressions? Why do people keep jumping to the overkill regex solution?
There is a bog-standard string replace function called str_replace that may do what you want in a fraction of the time (though whether this is right for you depends on how complex your search/replace is).
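Since str_replace has no way to stop after the first hit, replacing only the first occurrence without a regex could look something like the sketch below. The sample XML and the ' xmlns=' needle are assumptions about what the input looks like.

```php
<?php
$xml = '<root xmlns="http://example.com/ns"><item/></root>'; // hypothetical input

// Locate the first occurrence, then splice in the replacement.
$needle = ' xmlns=';
$pos = strpos($xml, $needle);
if ($pos !== false) {
    $xml = substr_replace($xml, ' ns=', $pos, strlen($needle));
}

echo $xml, "\n"; // <root ns="http://example.com/ns"><item/></root>
```

This still builds a new string, so some cost proportional to the length probably remains, but it avoids the regex engine entirely.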

From the PHP source, for example here:
http://svn.php.net/repository/php/php-src/branches/PHP_5_2/ext/standard/string.c
I don't see any copies, but I'm not an expert in C. On the other hand, there are many convert-to-string calls, which at first sight could copy values. If they do copy values, then we are in trouble here.
Only if we are in trouble:
Try to reinvent the str_replace wheel with character-by-character string processing. For example, take the string $somestring = "somevalue". In PHP we can access its characters by index: echo $somestring[0] gives "s" and echo $somestring[2] gives "m" (the older $somestring{0} form was removed in PHP 8). I'm not sure about this approach, but it may help if the official implementations don't use references, as they should.
