What's the point of strspn? - php

At work today we were trying to come up with any reason you would use strspn.
I searched google code to see if it's ever been implemented in a useful way and came up blank. I just can't imagine a situation in which I would really need to know the length of the first segment of a string that contains only characters from another string. Any ideas?

Although you link to the PHP manual, the strspn() function comes from C libraries, along with strlen(), strcpy(), strcmp(), etc.
strspn() is a convenient alternative to picking through a string character by character, testing if the characters match one of a set of values. It's useful when writing tokenizers. The alternative to strspn() would be lots of repetitive and error-prone code like the following:
for (p = stringbuf; *p; p++) {
    if (*p == 'a' || *p == 'b' || *p = 'c' ... || *p == 'z') {
        /* still parsing current token */
    }
}
Can you spot the error? :-)
Of course in a language with builtin support for regular expression matching, strspn() makes little sense. But when writing a rudimentary parser for a DSL in C, it's pretty nifty.
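For what it's worth, the same tokenizer idea carries over to PHP. The following is only a minimal sketch: the function name tokenize and the three token kinds are made up for illustration, and it assumes lowercase words and unsigned integers are the only multi-character tokens.

```php
<?php
// Sketch of a strspn()-based tokenizer (hypothetical helper, not from the answer above).
function tokenize(string $input): array
{
    $letters = 'abcdefghijklmnopqrstuvwxyz';
    $digits  = '0123456789';
    $tokens  = [];
    $pos     = 0;
    $len     = strlen($input);

    while ($pos < $len) {
        // strspn() with an offset returns how many characters, starting at
        // $pos, belong to the given set.
        if (($n = strspn($input, $letters, $pos)) > 0) {
            $tokens[] = ['word', substr($input, $pos, $n)];
        } elseif (($n = strspn($input, $digits, $pos)) > 0) {
            $tokens[] = ['number', substr($input, $pos, $n)];
        } else {
            // Anything else is a one-character symbol; spaces are skipped.
            $n = 1;
            if ($input[$pos] !== ' ') {
                $tokens[] = ['symbol', $input[$pos]];
            }
        }
        $pos += $n;
    }
    return $tokens;
}

print_r(tokenize('abc 123+x'));
```

Each call to strspn() replaces one of those long chains of character comparisons.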

It's based on the ANSI C function strspn(). It can be useful in low-level C parsing code, where there is no high-level string class. It's considerably less useful in PHP, which has lots of useful string parsing functions.

Well, by my understanding, it's the same thing as this regex:
^[set]*
Where set is the string containing the characters to be found.
You could use it to search for any number or text at the beginning of a string and split.
It seems it would be useful when porting code to PHP.
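A quick way to convince yourself of that equivalence, as a sketch only (the sample input and variable names are invented for the example):

```php
<?php
$set   = '0123456789';
$input = '2024-report.txt';

// Length of the leading run of characters taken from $set.
$viaStrspn = strspn($input, $set);

// The same length via the regex ^[set]*; preg_quote() guards special characters.
preg_match('/^[' . preg_quote($set, '/') . ']*/', $input, $m);
$viaRegex = strlen($m[0]);

// Splitting the string at that point.
$head = substr($input, 0, $viaStrspn);
$tail = substr($input, $viaStrspn);

var_dump($viaStrspn, $viaRegex, $head, $tail);
```

Both lengths should agree, and the split gives the leading number and the rest of the string.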

I think it's great for blacklisting and for letting the user know where the error starts, much like MySQL returns the part of the query where the error occurred.
Please see this function, which lets the user know which part of their comment is not valid:
function blacklistChars($yourComment){
    $blacklistedChars = "!@#$%^&*()";
    // strcspn() returns the length of the leading run with no blacklisted characters
    $validLength = strcspn($yourComment, $blacklistedChars);
    if ($validLength !== strlen($yourComment))
    {
        $error = "Your comment contains invalid chars starting from here: `" .
            substr($yourComment, $validLength) . "`";
        return $error;
    }
    return false;
}
$yourComment = "Hello, why can you not type and $ dollar sign in the text?";
$yourCommentError = blacklistChars($yourComment);
if ($yourCommentError !== false)
    echo $yourCommentError;

It is useful specifically for functions like atoi, where you have a string you want to convert to a number and you don't want to deal with anything that isn't in the set "-.0123456789".
But yes, it has limited use.
-Adam
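As a rough illustration of that atoi-style use (a sketch only; numericPrefix is a made-up helper name):

```php
<?php
// Keep only the leading run of characters that could belong to a number.
function numericPrefix(string $s): string
{
    return substr($s, 0, strspn($s, '-.0123456789'));
}

var_dump(numericPrefix('-12.5kg'));
var_dump(numericPrefix('abc'));
var_dump((float) numericPrefix('3.14 is pi'));
```

Note that strspn() only checks set membership, so a prefix like "1-2.3.4" would also pass; a real converter would still have to validate the shape of the number.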

Related

Is there a built-in PHP function to check if a given string is a reserved keyword?

I'm looking at this: https://www.php.net/manual/en/reserved.php
I've made numerous search queries for things like: "php determine if string is reserved keyword".
I find nothing, and I'm starting to seriously sweat. Please don't tell me I'm going to have to code a complicated script to regularly scrape the PHP manual for all these various kinds of reserved keywords and build my own database!
Please let there be a nice, simple function to simply check:
var_dump(is_reserved_php_keyword('if'));
And it gives a true/false.
I went a different way to Andrew and instead went for having PHP figure it out rather than hard coding the list.
function isPhpKeyword($testString) {
    // First check it's actually a word and not an expression/number
    if (!preg_match('/^[a-z]+$/i', $testString)) {
        return false;
    }
    $tokenised = token_get_all('<?php ' . $testString . '; ?>');
    // tokenised[0] = opening PHP tag, tokenised[1] = our test string
    return reset($tokenised[1]) !== T_STRING;
}
https://3v4l.org/WA6dr
This has a few advantages:
It doesn't need the list to be maintained as PHP's own parser says what's valid or not.
It's a lot simpler to understand.
Unfortunately, I'm not aware of any built-in function for what you describe. But based on the RegEx pattern you can find among the contributed notes on PHP.net, you can test it like this:
$reserved_pattern = "/\b((a(bstract|nd|rray|s))|(c(a(llable|se|tch)|l(ass|one)|on(st|tinue)))|(d(e(clare|fault)|ie|o))|(e(cho|lse(if)?|mpty|nd(declare|for(each)?|if|switch|while)|val|x(it|tends)))|(f(inal|or(each)?|unction))|(g(lobal|oto))|(i(f|mplements|n(clude(_once)?|st(anceof|eadof)|terface)|sset))|(n(amespace|ew))|(p(r(i(nt|vate)|otected)|ublic))|(re(quire(_once)?|turn))|(s(tatic|witch))|(t(hrow|r(ait|y)))|(u(nset|se))|(__halt_compiler|break|list|(x)?or|var|while))\b/";
if(!preg_match($reserved_pattern, $myString)) {
    // It is not reserved!
}
It may not be the most elegant-looking piece of PHP code out there, but it gets the job done.
UPDATE: See online demo of function here: https://3v4l.org/CdIjt

Combine regex and string analysis to specify a required pattern for string input validation

I should firstly apologize for my probably rookie question, but I've just got no clue how to achieve that relatively complex task being a complete newbie regarding regex. What I need is to specify a validation pattern for a string input and perform separate checks on the separate segments of that pattern. So let's begin with the task itself. I'm working with php7.0 on laravel 5.4 (which should genuinely not make any difference) and I need to somehow produce a matching pattern for a string input, which pattern is the following:
header1: expression1; header2: expression2; header3: expression3 //etc...
What I'd need here is to check if each header is present and if it's present in a special validation list of available headers. So I'd need to extract each header.
Furthermore the expressions are built as follows
expression1 = (a1 + a2)*(a3-a1)
expression2 = b1*(b2 - b3)/b4
//etc...
The point is that each expression contains some numeric parameters which should form a valid arithmetic calculation. Those parameters should also be contained in a special list of available parameter placeholders, so I'd need to check them too. So, is there a simple efficient way (using regex and string analysis in pure php) to specify that strict structure or should I do everything step by step with exploding and try-catching?
An optimal solution would be a shorthand logic (or regex expression?) of a kind like:
$value->match("^n(header: expression)")
->delimitedBy(';')
->where(in_array($header, $allowed_headers))
->where(strtr($expression, array_fill_keys($available_param_placeholders, 0))->isValidArithmeticExpression())
I hope you can follow my logic. The code above would read as: Match N repetitions of the pattern "header: expression", delimited by ';', where 'header' (given that $header is its value) is in an array and where 'expression' (given that $expression is its value) forms a valid arithmetic expression when all available parameter placeholders have been replaced by 0. That's it all. Each deviation of that strict pattern should return false.
As an alternative I'm currently thinking of first exploding the string by the main delimiter (the semicolon) and then analysing each part separately. I'd then have to check whether a colon is present, whether everything to the left of the colon matches a valid header name, and whether everything to the right of the colon forms a valid arithmetic expression when all param names from the list are replaced by an arbitrary value (like 0, just to check whether the code executes, which I also don't know how to do). Anyway, that way seems like overkill and I'm sure there should be a smoother way to specify the needed pattern.
I hope I've explained everything well enough, and sorry if I'm being too messy explaining my problem. Thanks in advance for each piece of advice/help! Greatly appreciated!
Using eval() must always be Plan Z. With my understanding of your input string, this method may sufficiently validate the headers and expressions (if not, I think it should sufficiently sanitize the string for arithmetic parsing). I don't code in Laravel, so if this can be converted to Laravel syntax I'll leave that job for you.
Code: (Demo)
$test = "header1: (a1 + a2)*(a3-a1); header2: b1*(b2 - b3)/b4; header3: c1 * (((c2); header4: ((a1 * (a2 - b1))/(a3-a1))+b2";
$allowed_headers=['header1','header3','header4'];
$pairs=explode('; ',$test);
foreach($pairs as $pair){
    list($header,$expression)=explode(': ',$pair,2);
    if(!in_array($header,$allowed_headers)){
        echo "$header is not permitted.";
    }elseif(!preg_match('~^((?:[-+*/ ]+|[a-z]\d+|\((?1)\))*)$~',$expression)){ // based on https://stackoverflow.com/a/562729/2943403
        echo "Invalid expression # $header: $expression";
    }else{
        echo "$header passed.";
    }
    echo "\n---\n";
}
Output:
header1 passed.
---
header2 is not permitted.
---
Invalid expression # header3: c1 * (((c2)
---
header4 passed.
---
I will admit the above pattern will match (+ )( +) so it is not the best pattern. So perhaps your question may be a candidate for using eval(). Although you may want to consider/research some of the GitHub creations / plugins / parsers that can parse/tokenize arithmetic expressions first.
Perhaps:
calculate math expression from a string using eval
How to evaluate formula passed as string in PHP?
Parse math operations with PHP
How to mathematically evaluate a string like "2-1" to produce "1"?
Any $pair that gets past the if and the elseif can move onto the evaluation process in the else.
I'll give you a headstart/hint about some general handling, but I'll shy away from giving any direct instruction to avoid the wrath of a certain population of critics.
}else{
// replace all variables with 0
//$expression=preg_replace('/[a-z]\d+/','0',$expression);
// or replace each unique variable with a whole number
$expression=preg_match_all('/[a-z]\d+/',$expression,$out)?strtr($expression,array_flip($out[0])):$expression; // variables become incremented whole numbers
// ... from here use $expression with eval() in a style/intent of your choosing.
// ... set a battery of try and catch statements to handle unsavory outcomes.
// https://www.sitepoint.com/a-crash-course-of-changes-to-exception-handling-in-php-7/
}
$test = "header1: (a1 + a2)*(a3-a1); header2: b1*(b2 - b3)/b4; header3: expression3";
$pairs = explode(';', $test);
$headers = [];
$expressions = [];
foreach ($pairs as $p) {
    $he = explode(':', $p, 2);
    $headers[] = trim($he[0]);
    $expressions[] = trim($he[1]);
}
foreach ($headers as $h) {
    if (!in_array($h, $allowed_headers)) {
        return false;
    }
}
foreach ($expressions as $e) {
    preg_match_all('/[a-z0-9]+/', $e, $matches);
    // $matches[0] holds the full matches, i.e. each parameter name found in $e
    foreach ($matches[0] as $m) {
        if (param_fails($m)) {
            echo "Expression $e contains forbidden param $m.";
        }
    }
}
Regex turned out to be less complicated than I thought when posting this question, so I've managed to build the complete pattern myself, with the initial head start owed to @mickmackusa. What I finally came up with is here, explained by regex101 itself: https://regex101.com/r/UHMrqL/1
The logic it's based on is described in the initial question. The only thing missing is the verification of the header values and the param names, but that's easy to match afterwards with preg_match_all and verify with pure PHP checks. Thanks again for the attention and the help! :)

PHP regex parsing - splitting tokens in my own language. Is there a better way?

I am creating my own language.
The goal is to "compile" it to PHP or Javascript, and, ultimately, to interpret and run it on the same language, to make it look like a "middle-level" language.
Right now, I'm focusing on the aspect of interpreting it in PHP and run it.
At the moment, I'm using regex to split the string and extract the multiple tokens.
This is the regex I have:
/\:((?:cons#(?:\d+(?:\.\d+)?|(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|(?:[a-z]+(?:#[a-z]+)?|\^?[\~\&](?:[a-z]+|\d+|\-1)))/g
This is quite hard to read and maintain, even though it works.
Is there a better way of doing this?
Here is an example of the code for my language:
:define:&0:factorial
:param:~0:static
:case
:lower#equal:cons#1
:case:end
:scope
:return:cons#1
:scope:end
:scope
:define:~0:static
:define:~1:static
:require:static
:call:static#sub:^~0:~1 :store:~0
:call:&-1:~0 :store:~1
:call:static#sum:^~0:~1 :store:~0
:return:~0
:scope:end
:define:end
This defines a recursive function to calculate the factorial (not so well written, that isn't important).
The goal is to capture what comes after each :, including the #. For example, :static#sub is one whole token, stored without the leading :.
Every token follows that form except :cons, which can take a value after it. The value is either a number (integer or float, called static or dynamic in the language, respectively) or a string, which must start and end with " and supports escapes like \". Multi-line strings aren't supported.
Variables are written like ~0; prefixing one with ^ refers to the value in the :scope above.
Functions are similar but use &0 instead, and &-1 points to the current function (no need for ^&-1 here).
That said, is there a better way to get the tokens?
Here you can see it in action: http://regex101.com/r/nF7oF9/2
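[Update] To address the pattern's complexity and maintainability, you can split it up with the PCRE_EXTENDED flag (the x modifier) and add comments. Note that PHP has no g modifier (use preg_match_all instead), and that a literal # must be escaped in extended mode:
preg_match_all('/
    # read constant (?)
    \:((?:cons\#(?:\d+(?:\.\d+)?|
    # read a string (?)
    (?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|
    # read an identifier (?)
    (?:[a-z]+(?:\#[a-z]+)?|
    # read whatever
    \^?[\~\&](?:[a-z]+|\d+|\-1)))
/x', $input, $matches);
Beware that in extended mode all whitespace in the pattern is ignored unless it is escaped or inside a character class, and an unescaped # starts a comment that runs to the end of the line.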
Now, if you want to pimp your lexer and parser, then read on:
What (f)lex [the free equivalent of LEX] does is simply let you pass a list of regexps and, optionally, a "group". You can also try ANTLR and the PHP Target Runtime to get the work done.
As for your request, I made a lexer in the past following the principle of FLEX. The idea is to cycle through the regexps like FLEX does:
// Each regexp must be anchored at the start of the input (e.g. with ^),
// otherwise substr() below would skip over unmatched text.
$regexp = [reg1 => STRING, reg2 => ID, reg3 => WS];
$input = ...;
$tokens = [];
while ($input !== '') {
    $best = null;
    $k = null;
    foreach ($regexp as $re => $kind) {
        if (preg_match($re, $input, $match)) {
            $best = $match[0];
            $k = $kind;
            break;
        }
    }
    if (null === $best) {
        throw new Exception("could not analyze input, invalid token");
    }
    $tokens[] = ['kind' => $k, 'value' => $best];
    $input = substr($input, strlen($best)); // move.
}
Since FLEX and Yacc/Bison integrate, the usual pattern is to read only until the next token (that is, they don't run a loop that reads all the input before parsing).
The $regexp array can be anything. I expected it to be a "regexp" => "kind" key/value map, but you can also use an array like this:
$regexp = [['reg' => '...', 'kind' => STRING], ...]
You can also enable/disable regexps using groups (like FLEX groups work). For example, consider the following code:
class Foobar {
const FOOBAR = "arg";
function x() {...}
}
There is no need to activate the string regexp until you need to read an expression (here, the expression is what comes after the "="). And there is no need to activate the class identifier when you are actually in a class.
FLEX's groups make it possible to read comments: a first regexp activates a group that ignores the other regexps until some match is found (like "*/").
Note that this is a naïve approach: a lexer like FLEX will actually generate an automaton, which uses different states to represent your needs (a regexp is itself an automaton).
It uses an algorithm of packed indexes or something similar (I used the naïve "for each" because I did not understand the algorithm well enough), which is memory- and speed-efficient.
As I said, it was something I made in the past - something like 6/7 years ago.
It was on Windows.
It was not particularly quick (well it is O(N²) because of the two loops).
I also think PHP was compiling the regexp each time. Now that I work in Java, I use the Pattern implementation, which compiles the regexp once and lets you reuse it. I don't know whether PHP does the same by first looking in a regexp cache for an already-compiled regexp.
I was using preg_match with an offset, to avoid doing the substr($input, ...) at the end.
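To make the offset idea concrete, here is a rough sketch rather than the lexer I actually wrote back then. The function name lex, the rule names and the sample patterns are all made up for illustration. Each pattern starts with \G, which anchors the match at the offset passed to preg_match, so the input never has to be copied with substr():

```php
<?php
// Sketch of a first-match lexer driven by an ordered kind => regexp table.
function lex(string $input, array $rules): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        $matched = false;
        foreach ($rules as $kind => $pattern) {
            // The 5th argument is the byte offset where matching starts;
            // \G in the pattern forces the match to begin exactly there.
            if (preg_match($pattern, $input, $m, 0, $offset) === 1 && $m[0] !== '') {
                if ($kind !== 'WS') { // whitespace is recognised but dropped
                    $tokens[] = ['kind' => $kind, 'value' => $m[0]];
                }
                $offset += strlen($m[0]);
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new RuntimeException("could not analyze input at offset $offset");
        }
    }
    return $tokens;
}

// Hypothetical rule table; order matters because the first match wins.
$rules = [
    'STRING' => '/\G"(?:[^"\\\\]|\\\\.)*"/',
    'NUMBER' => '/\G\d+(?:\.\d+)?/',
    'ID'     => '/\G[a-z_][a-z0-9_]*/i',
    'WS'     => '/\G\s+/',
];

print_r(lex('foo 42 "bar"', $rules));
```

This is still the naïve approach (every rule is tried at every position), just without the repeated string copies.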
You should try the ANTLR3 PHP Code Generation Target, since the ANTLR grammar editor is pretty easy to use, and you will end up with much more readable/maintainable code :)

Apply XOR operator in a lambda function

My issue is a bit haywire, I must admit before I carry on. So please do not ask me why I need this. Here goes:
Suppose I have an anonymous function of this sort:
$_ = function(){return true;};
What I aim to achieve is to alter the syntax using the XOR operator as follows:
$_ = ("&"^"@").('*'^'_').("."^"@").('<'^'_').("+"^"_").("@"^")").("/"^"@").("."^"@").(){return true;};
This is met as invalid syntax by PHP. Same goes if I try to append the value of the string 'function' to a variable and then use it as shown below:
$__ = ("&"^"@").('*'^'_').("."^"@").('<'^'_').("+"^"_").("@"^")").("/"^"@").("."^"@")
$_ = $__(){return true;}
Therefore, my question is how could I possibly approach this case and use a XORed value of the keyword 'function'. I know it is possible but fail to perceive how it's being realised.
Thank you in advance for any solutions/guidelines/answers!
Unfortunately for you, PHP doesn't allow you to use a calculated value as a keyword. To over-simplify, PHP has three stages: lexing, parsing, and execution. Keywords are used during the parsing process, and your XORs are calculated during execution. To use your calculated value as a keyword, you'd have to redo the parsing process.
Fortunately for you, in PHP, that's possible using eval, although it has to be a whole new piece of code rather than, say, a single function token. eval needs a whole chunk of code, so you'll need to assemble the whole thing into a string:
$myKeyword = 'function'; // XORs don't matter; the problem is it's calculated
$code = '$myResult = ' . $myKeyword . '() { return true; };';
Then you can pass that to eval:
eval($code); // you could, of course, bypass the intermediate $code variable
Your function is now in $myResult:
$myResult(); // => true
Of course, you'd never want to use this in code you intend to be readable, but I'm almost certain you're just trying to obfuscate your code, in which case readability is intended to be poor.
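If you want to see the lexing-versus-execution distinction for yourself, here is a small sketch. It assumes the XOR operands are "@" and "_" (so each pair yields one letter of 'function'), and tokenNames is a made-up helper built on the tokenizer extension:

```php
<?php
// At run time, the XOR expression evaluates to the string "function".
$keyword = ("&"^"@").('*'^'_').("."^"@").('<'^'_').("+"^"_").("@"^")").("/"^"@").("."^"@");

// Hypothetical helper: the names of all multi-character tokens in a source string.
function tokenNames(string $source): array
{
    $names = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            $names[] = token_name($token[0]);
        }
    }
    return $names;
}

// The literal keyword is recognised as T_FUNCTION by the lexer ...
$literal = tokenNames('<?php $f = function () { return true; };');
// ... while the XOR expression is only a series of string constants.
$xored = tokenNames('<?php $f = ("&"^"@").("*"^"_");');

var_dump($keyword === 'function');
var_dump(in_array('T_FUNCTION', $literal, true));
var_dump(in_array('T_FUNCTION', $xored, true));
```

The string only becomes "function" after the parser has already finished, which is why the calculated value has to go through eval to be parsed as a keyword.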

safely using the eval function in php: modifying user input to avoid security issues

I am taking over some webgame code that uses the eval() function in PHP. I know that this is potentially a serious security issue, so I would like some help vetting the code that checks its argument before I decide whether or not to nix that part of the code. Currently I have removed this section of code from the game until I am sure it's safe, but the loss of functionality is not ideal. I'd rather security-proof this than redesign the entire segment to avoid using eval(), assuming such a thing is possible. The relevant code snippet which supposedly prevents malicious code injection is below. $value is a user-input string which we know does not contain ";".
1 $value = eregi_replace("[ \t\r]","",$value);
2 $value = addslashes($value);
3 $value = ereg_replace("[A-z0-9_][\(]","-",$value);
4 $value = ereg_replace("[\$]","-",$value);
5 #eval("\$val = $value;");
Here is my understanding so far:
1) removes all whitespace from $value
2) escapes characters that would need it for a database call (why this is needed is not clear to me)
3) looks for alphanumeric characters followed immediately by \ or ( and replaces the combination of them with -. Presumably this is to remove anything resembling function calls in the string, though why it also removes the character preceding is unclear to me, as is why it would also remove \ after line 2 explicitly adds them.
4) replaces all instances of $ with - in order to avoid anything resembling references to php variables in the string.
So: have any holes been left here? And am I misunderstanding any of the regex above? Finally, is there any way to security-proof this without excluding ( characters? The string to be input is ideally a mathematical formula, and allowing ( would allow for manipulation of order of operations, which currently is impossible.
Evaluate the code inside a VM - see Runkit_Sandbox
Or create a parser for your math. I suggest you use the built-in tokenizer. You would need to iterate tokens and keep track of brackets, T_DNUMBER, T_LNUMBER, operators and maybe T_CONSTANT_ENCAPSED_STRING. Ignore everything else. Then you can safely evaluate the resulting expression.
A quick google search revealed this library. It does exactly what you want...
A simple example using the tokenizer:
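$tokens = token_get_all("<?php {$input}");
$expr = '';
foreach($tokens as $token){
    if(is_string($token)){
        // single-character tokens: keep only brackets and arithmetic operators
        if(in_array($token, array('(', ')', '+', '-', '/', '*'), true)){
            $expr .= $token;
        }
        continue;
    }
    list($id, $text) = $token;
    // keep only integer and floating-point literals
    if(in_array($id, array(T_DNUMBER, T_LNUMBER))){
        $expr .= $text;
    }
}
// eval() expects code without an opening tag, and needs "return" to yield a value
$result = eval("return {$expr};");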
(test)
This will only work if the input is a valid math expression. Otherwise you'll get a parse error in your eval'd code because of empty brackets and stuff like that. If you need to handle this too, then sanitize the output expression inside another loop. This should take care of most of the invalid parts:
while(strpos($expr, '()') !== false)
    $expr = str_replace('()', '', $expr);
$expr = trim($expr, '+-/*');
Matching what is allowed instead of removing some characters is the best approach here.
I see that you do not filter ` (backtick) that can be used to execute system commands. God only knows what else is also not prevented by trying to sanitize the string... No matter how many holes are found, there is no guarantee that there cannot be more.
Assuming that your language is not quite complex, it may not be that hard to implement it yourself without the use of eval.
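To give an idea of the effort involved, here is a minimal sketch of a recursive-descent evaluator that never calls eval. It assumes PHP 7.4 or later and that the formulas contain only numbers, + - * / and parentheses; the class name MathParser and its methods are made up for the example.

```php
<?php
// Grammar: expr   := term (('+' | '-') term)*
//          term   := factor (('*' | '/') factor)*
//          factor := number | '(' expr ')' | '-' factor
final class MathParser
{
    private array $tokens = [];
    private int $pos = 0;

    public function evaluate(string $input): float
    {
        // Whitelist: digits, dot, operators, parentheses and spaces only.
        if (!preg_match('~^[0-9.+\-*/() ]+$~', $input)) {
            throw new InvalidArgumentException('Illegal character in expression');
        }
        preg_match_all('~\d+(?:\.\d+)?|[-+*/()]~', $input, $m);
        // Every non-space character must have ended up in some token.
        if (implode('', $m[0]) !== str_replace(' ', '', $input)) {
            throw new InvalidArgumentException('Malformed number');
        }
        $this->tokens = $m[0];
        $this->pos = 0;
        $value = $this->expr();
        if ($this->pos !== count($this->tokens)) {
            throw new InvalidArgumentException('Unexpected token');
        }
        return $value;
    }

    private function peek(): ?string
    {
        return $this->tokens[$this->pos] ?? null;
    }

    private function expr(): float
    {
        $value = $this->term();
        while (in_array($this->peek(), ['+', '-'], true)) {
            $op = $this->tokens[$this->pos++];
            $rhs = $this->term();
            $value = $op === '+' ? $value + $rhs : $value - $rhs;
        }
        return $value;
    }

    private function term(): float
    {
        $value = $this->factor();
        while (in_array($this->peek(), ['*', '/'], true)) {
            $op = $this->tokens[$this->pos++];
            $rhs = $this->factor();
            if ($op === '/' && $rhs == 0.0) {
                throw new DivisionByZeroError('Division by zero');
            }
            $value = $op === '*' ? $value * $rhs : $value / $rhs;
        }
        return $value;
    }

    private function factor(): float
    {
        $token = $this->peek();
        if ($token === null) {
            throw new InvalidArgumentException('Unexpected end of expression');
        }
        $this->pos++;
        if ($token === '-') {
            return -$this->factor(); // unary minus
        }
        if ($token === '(') {
            $value = $this->expr();
            if ($this->peek() !== ')') {
                throw new InvalidArgumentException('Missing closing parenthesis');
            }
            $this->pos++;
            return $value;
        }
        if (is_numeric($token)) {
            return (float) $token;
        }
        throw new InvalidArgumentException("Unexpected token '$token'");
    }
}

var_dump((new MathParser())->evaluate('2*(3+4)-1'));
```

Because only whitelisted characters ever reach the parser and nothing is executed as PHP, things like backticks or function calls are rejected up front instead of having to be filtered out.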
The following code is our own attempt to answer the same sort of question:
$szCode = "whatever code you would like to submit to eval";
/* First check against language construct or instructions you don't allow such as (but not limited to) "require", "include", ..." : a simple string search will do */
if ( illegalInstructions( $szCode ) )
{
    die( "ILLEGAL" );
}
/* This simple regex detects functions (spend more time on the regex to
   fine-tune the function detection if needed) */
if ( preg_match_all( '/(?P<fnc>[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*) ?\(.*?\)/si',$szCode,$aFunctions,PREG_PATTERN_ORDER ) )
{
    /* For each function call */
    foreach( $aFunctions['fnc'] as $szFnc )
    {
        /* Check whether we can accept this function */
        if ( ! isFunctionAllowed( $szFnc ) )
        {
            die( "'{$szFnc}' IS ILLEGAL" );
        } /* if ( ! isFunctionAllowed( $szFnc ) ) */
    }
}
/* If you got up to here ... it means that you accept the risk of evaluating
   the PHP code that was submitted */
eval( $szCode );
