Regular Expression how to make regex take second /** as starting point - php

Here is the $source example
/**
* These functions can be replaced via plugins. If plugins do not redefine these
* functions, then these will be used instead.
*/
if ( !function_exists('wp_set_current_user') ) :
/**
* Changes the current user by ID or name.
*
*/
function wp_set_current_user($id, $name = '') {
Attention: some don't have the function_exists line.
For my special purpose, I'm trying to parse the docblock with regular expression.
Here is the regex
$t = preg_match_all("#(/\*\*.*?\*/\nfunction\s.*?\(.*?\))\s{#mis",$source,$m);
I expect to get:
/**
* Changes the current user by ID or name.
*
*/
function wp_set_current_user($id, $name = '') {
but instead, it returns me the whole code example.
Any help would be appreciated.
I find out some people ask me my purpose, I don't think this is important here though.
I'm using geany and I find out existing wordpress code hint isn't complete.
And the docblock parsers I found that don't parse function name and function arguments.
So I try to parse them on my own.
the code hint format of geany is
wp_set_current_user|Changes the current user by ID or name.|($id, $name = '')|
However, my point of this question is how to make regex take second "/**" as starting point?
I'm sorry for my poor English that confused you all.

You can parse comment out by regexp like this (check out Regex look around tutorial):
/\*\*/(?:(?:.(?!\*\*/))*)\*\*/
Then any number of white spaces can occur:
[\s]*
What keywords can function have in php? static, virtual, final, public, private, protected correct me if I'm forgetting something.
(?:(?:static|virtual|final|public|private|protected)\s+)*
Okay, now function header and braces:
function\s+(?P<name>\w\d_+)\s*\(...\)
The ... parts get's complicated because it can contain default value which can be complicated php string ($remove_characters = '\'"\n\r '), so parsing value (string, string, number, constant):
"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"
\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*'
[\d.]+
\w+
Resulting to one large value regexp:
("[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*'|[\d.]+|\w+)
And every function argument has a format $var or $var = data (of course any number of spaces + I'm omitting array $input = array()) so this is simplified var name matching:
\\$[\w_][\w\d_]*
Type matching:
([\w_]+\s+)?
So function arguments can be:
\s*([\w_]+\s+)?(\\$[\w_][\w\d_]*|\\$[\w_][\w\d_]*\s*=\s*<value>)
And complete regexp for function would look like:
function\s+(?P<name>\w\d_+)\s*\(\s*|<argument>((,<argument>)*)\)
I won't be testing those regexp for you, it's your job to do so at this point, my goal was to show you what you need if you want to do this really correctly (but feel free to edit my answer if you find a mistake).You may also use really simplified version (like just one regexp for function arguments eating everything).

If you want the easy dirty trick, use a lookahead assertion
(?<=if\ (\ !function_exists('wp_set_current_user')\ )\ :)
Appending this to your search should do the trick. (You might have to escape the single quotes.)

Related

PHP regex parsing - splitting tokens in my own language. Is there a better way?

I am creating my own language.
The goal is to "compile" it to PHP or Javascript, and, ultimately, to interpret and run it on the same language, to make it look like a "middle-level" language.
Right now, I'm focusing on the aspect of interpreting it in PHP and run it.
At the moment, I'm using regex to split the string and extract the multiple tokens.
This is the regex I have:
/\:((?:cons#(?:\d+(?:\.\d+)?|(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|(?:[a-z]+(?:#[a-z]+)?|\^?[\~\&](?:[a-z]+|\d+|\-1)))/g
This is quite hard to read and maintain, even though it works.
Is there a better way of doing this?
Here is an example of the code for my language:
:define:&0:factorial
:param:~0:static
:case
:lower#equal:cons#1
:case:end
:scope
:return:cons#1
:scope:end
:scope
:define:~0:static
:define:~1:static
:require:static
:call:static#sub:^~0:~1 :store:~0
:call:&-1:~0 :store:~1
:call:static#sum:^~0:~1 :store:~0
:return:~0
:scope:end
:define:end
This defines a recursive function to calculate the factorial (not so well written, that isn't important).
The goal is to get what is after the :, including the #. :static#sub is a whole token, saving it without the :.
Everything is the same, except for the token :cons, which can take a value after. The value is a numerical value (integer or float, called static or dynamic in the language, respectively) or a string, which must start and end with ", supporting escaping like \". Multi-line strings aren't supported.
Variables are the ones with ~0, using ^ before will get the value to the above :scope.
Functions are similar, being used &0 instead and &-1 points to the current function (no need for ^&-1 here).
Said this, Is there a better way to get the tokens?
Here you can see it in action: http://regex101.com/r/nF7oF9/2
[Update] To issue the pattern being complicated and maintainability, you can split it using PCRE_EXTENDED, and comments:
preg_match('/
# read constant (?)
\:((?:cons#(?:\d+(?:\.\d+)?|
# read a string (?)
(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|
# read an identifier (?)
(?:[a-z]+(?:#[a-z]+)?|
# read whatever
\^?[\~\&](?:[a-z]+|\d+|\-1)))
/gx
', $input)
Beware that all space are ignored, except under certain conditions (\n is normally "safe").
Now, if you want to pimp you lexer and parser, then read that:
What does (f)lex [GNU equivalent of LEX] is simply let you pass a list of regexp, and eventually a "group". You can also try ANTLR and PHP Target Runtime to get the work done.
As for you request, I've made a lexer in the past, following the principle of FLEX. The idea is to cycle through the regexp like FLEX does:
$regexp = [reg1 => STRING, reg2 => ID, reg3 => WS];
$input = ...;
$tokens = [];
while ($input) {
$best = null;
$k = null;
for ($regexp as $re => $kind) {
if (preg_match($re, $input, $match)) {
$best = $match[0];
$k = $kind;
break;
}
}
if (null === $best) {
throw new Exception("could not analyze input, invalid token");
}
$tokens[] = ['kind' => $kind, 'value' => $best];
$input = substr($input, strlen($best)); // move.
}
Since FLEX and Yacc/Bison integrates, the usual pattern is to read until next token (that is, they don't do a loop that read all input before parsing).
The $regexp array can be anything, I expected it to be a "regexp" => "kind" key/value, but you can also an array like that:
$regexp = [['reg' => '...', 'kind' => STRING], ...]
You can also enable/disable regexp using groups (like FLEX groups works): for example, consider the following code:
class Foobar {
const FOOBAR = "arg";
function x() {...}
}
There is no need to activate the string regexp until you need to read an expression (here, the expression is what come after the "="). And there is no need to activate the class identifier when you are actually in a class.
FLEX's group permits to read comments, using a first regexp, activating some group that would ignore other regexp, until some matches is done (like "*/").
Note that this approach is a naïve approach: a lexer like FLEX will actually generate an automaton, which use different state to represent your need (the regexp is itself an automaton).
This use an algorithm of packed indexes or something alike (I used the naïve "for each" because I did not understand the algorithm enough) which is memory and speed efficient.
As I said, it was something I made in the past - something like 6/7 years ago.
It was on Windows.
It was not particularly quick (well it is O(N²) because of the two loops).
I think also that PHP was compiling the regexp each times. Now that I do Java, I use the Pattern implementation which compile the regexp once, and let you reuse it. I don't know PHP does the same by first looking into a regexp cache if there was already a compiled regexp.
I was using preg_match with an offset, to avoid doing the substr($input, ...) at the end.
You should try to use the ANTLR3 PHP Code Generation Target, since the ANTLR grammar editor is pretty easy to use, and you will have a really more readable/maintainable code :)

safely using the eval function in php: modifying user input to avoid security issues

I am taking over over some webgame code that uses the eval() function in php. I know that this is potentially a serious security issue, so I would like some help vetting the code that checks its argument before I decide whether or not to nix that part of the code. Currently I have removed this section of code from the game until I am sure it's safe, but the loss of functionality is not ideal. I'd rather security-proof this than redesign the entire segment to avoid using eval(), assuming such a thing is possible. The relevant code snip which supposedly prevents malicious code injection is below. $value is a user-input string which we know does not contain ";".
1 $value = eregi_replace("[ \t\r]","",$value);
2 $value = addslashes($value);
3 $value = ereg_replace("[A-z0-9_][\(]","-",$value);
4 $value = ereg_replace("[\$]","-",$value);
5 #eval("\$val = $value;");
Here is my understanding so far:
1) removes all whitespace from $value
2) escapes characters that would need it for a database call (why this is needed is not clear to me)
3) looks for alphanumeric characters followed immediately by \ or ( and replaces the combination of them with -. Presumably this is to remove anything resembling function calls in the string, though why it also removes the character preceding is unclear to me, as is why it would also remove \ after line 2 explicitly adds them.
4) replaces all instances of $ with - in order to avoid anything resembling references to php variables in the string.
So: have any holes been left here? And am I misunderstanding any of the regex above? Finally, is there any way to security-proof this without excluding ( characters? The string to be input is ideally a mathematical formula, and allowing ( would allow for manipulation of order of operations, which currently is impossible.
Evaluate the code inside a VM - see Runkit_Sandbox
Or create a parser for your math. I suggest you use the built-in tokenizer. You would need to iterate tokens and keep track of brackets, T_DNUMBER, T_LNUMBER, operators and maybe T_CONSTANT_ENCAPSED_STRING. Ignore everything else. Then you can safely evaluate the resulting expression.
A quick google search revealed this library. It does exactly what you want...
A simple example using the tokenizer:
$tokens = token_get_all("<?php {$input}");
$expr = '';
foreach($tokens as $token){
if(is_string($token)){
if(in_array($token, array('(', ')', '+', '-', '/', '*'), true))
$expr .= $token;
continue;
}
list($id, $text) = $token;
if(in_array($id, array(T_DNUMBER, T_LNUMBER)))
$expr .= $text;
}
$result = eval("<?php {$expr}");
(test)
This will only work if the input is a valid math expression. Otherwise you'll get a parse error in your eval`d code because of empty brackets and stuff like that. If you need to handle this too, then sanitize the output expression inside another loop. This should take care of the most of the invalid parts:
while(strpos($expr, '()') !== false)
$expr = str_replace('()', '', $expr);
$expr = trim($expr, '+-/*');
Matching what is allowed instead of removing some characters is the best approach here.
I see that you do not filter ` (backtick) that can be used to execute system commands. God only knows what else is also not prevented by trying to sanitize the string... No matter how many holes are found, there is no guarantee that there cannot be more.
Assuming that your language is not quite complex, it may not be that hard to implement it yourself without the use of eval.
The following code is our own attempt to answer the same sort of question:
$szCode = "whatever code you would like to submit to eval";
/* First check against language construct or instructions you don't allow such as (but not limited to) "require", "include", ..." : a simple string search will do */
if ( illegalInstructions( $szCode ) )
{
die( "ILLEGAL" );
}
/* This simple regex detects functions (spend more time on the regex to
fine-tune the function detection if needed) */
if ( preg_match_all( '/(?P<fnc>[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*) ?\(.*?\)/si',$szCode,$aFunctions,PREG_PATTERN_ORDER ) )
{
/* For each function call */
foreach( $aFunctions['fnc'] as $szFnc )
{
/* Check whether we can accept this function */
if ( ! isFunctionAllowed( $szFnc ) )
{
die( "'{$szFnc}' IS ILLEGAL" );
} /* if ( ! q_isFncAllowed( $szFnc ) ) */
}
}
/* If you got up to here ... it means that you accept the risk of evaluating
the PHP code that was submitted */
eval( $szCode );

how to warn user if it is used a strange character in eval operation

$op=$_POST['param'];
$statement=eval("return ($op);");
in 'param', if there is a strange character such letter 'jfsdf' or something else the eval function does not work. How can I solve this problem?
eval function works only well defined entries such as '54+4*3' on the other hand if it is used entries such '6p+87+4' it gives an parse error. Is there any way to warn user in order to enter a well defined statement
First of all, depending on what you mean by "strange" character, you should implement function that verifies the string coming from outside.
$allowed_chars = array('function', 'this' ); //..add desirable ones
function is_alloved_char($char){
return in_array($char, $allowed_chars);
}
RegEx is great for this when You Do except some performance down.
In your situation str_pos() is great enough to match undesirable chars.
so, your another function should be similar to:
/**
*
* #param string
* #return TRUE on success, FALSE otherwise
*/
function is_really_safe($char){
global $allowed_chars; //just make this array visible here
foreach ($allowed_chars as $allowed_char){
if ( str_pos($allowed_char, $char) ){
return false;
}
}
return true;
}
Also, be VERY VERY careful with eval()
If you use this function in product, then you are doing something wrong.
If it's about evaluation a constrained math expression, then the lazy option would be to use a pattern assertion prior evaluating. This one's also easier as it returns you the expression result implicitly:
$result = preg_replace('#^\s*\d+([-+*/%]\s*\d+\s*)*$#e', '$0', $input);
Would need something more complex if you wanted to assert (..) parens are correctly structured as well.
If it doesn't match, it will return the original input string. Check that for generating a warning.

What does #param mean in a class?

What does the #param mean when creating a class? As far as I understand it is used to tell the script what kind of datatype the variables are and what kind of value a function returns, is that right?
For example:
/**
* #param string $some
* #param array $some2
* #return void
*/
Isn´t there another way to do that, I am thinking of things like: void function() { ... } or something like that. And for variables maybe (int)$test;
#param doesn't have special meaning in PHP, it's typically used within a comment to write up documentation. The example you've provided shows just that.
If you use a documentation tool, it will scour the code for #param, #summary, and other similar values (depending on the sophistication of the tool) to automatically generate well formatted documentation for the code.
As others have mentioned the #param you are referring to is part of a comment and not syntactically significant. It is simply there to provide hints to a documentation generator. You can find more information on the supported directives and syntax by looking at the PHPDoc project.
Speaking to the second part of your question... As far as I know, there is not a way to specify the return type in PHP. You also can't force the parameters to be of a specific primitive type (string, integer, etc).
You can, however, required that the parameters be either an array, an object, or a specific type of object as of PHP 5.1 or so.
function importArray(array $myArray)
{
}
PHP will throw an error if you try and call this method with anything besides an array.
class MyClass
{
}
function doStuff(MyClass $o)
{
}
If you attempt to call doStuff with an object of any type except MyClass, PHP will again throw an error.
I know this is old but just for reference, the answer to the second part of the question is now, PHP7.
// this function accepts and returns an int
function age(int $age): int{
return 18;
}
PHP is entirely oblivious to comments. It is used to describe the parameters the method takes only for making the code more readable and understandable.
Additionally, good code practice dedicates to use certain names (such as #param), for documentation generators.
Some IDEs will include the #param and other information in the tooltip when using a hovering over the relevant method.
#param is a part of the description comment that tells you what the input parameter type is. It has nothing to do with code syntax. Use a color supported editor like Notepad++ to easily see whats code and whats comments.

PHP reflection; extracting non-block comments

I've recently become familiar with Reflection, and have been experimenting with it, especially getDocComment(), however it appears that it only supports /** */ comment blocks.
/** foobar */
class MyClass{}
$refl = new ReflectionClass('MyClass');
// produces /** foobar */
echo $refl->getDocComment();
-Versus-
# foobar
class MyClass{}
$refl = new ReflectionClass('MyClass');
// produces nothing
echo $refl->getDocComment();
Is it not possible to capture this without resorting to any sort of file_get_contents(__FILE__) nonsense?
As per dader51's answer, I suppose my best approach would be something along these lines:
// random comment
#[annotation]
/**
* another comment with a # hash
*/
#[another annotation]
$annotations
= array_filter(token_get_all(file_get_contents(__FILE__)), function(&$token){
return (($token[0] == T_COMMENT) && ($token = strstr($token[1], '#')));
});
print_r($annotations);
Outputs:
Array
(
[4] => #[annotation]
[8] => #[another annotation]
)
DocComments distinguish themselves by saying something about how your classes are to be used, compared to regular comments that could assist a developer in reading the code. That's also why the method isn't called getComment() instead.
Of course it's all text parsing, and someone just made a choice in docComments always being these multiline comments, but that choice has apparently been made, and reading regular comments is not something in the reflection category.
I was trying to do just a you a few days ago, and here is my trick. You can just use the php internal tokenizer ( http://www.php.net/manual/en/function.token-get-all.php ) , and then walk the array returned to select only the comments, here is a sample code :
$a = token_get_all(file_get_contents('path/to/your/file.php'));
print_r($a); // display an array of all tokens found in the file file.php
Here is a list of all tokens php recognize : http://www.php.net/manual/en/tokens.php
And the comment you will get by this method include ( from php.net site ) :
T_COMMENT : // or #, and /* */ in PHP 5
Hope it helps !
AFAIK, for a comment to become documentation it is needed to start with /** not even with standard multi-line comment.
A doc comment as the name implies, is a documentation comment, not a standard comment, otherwise when you are grabbing comments for apps such as doxygen it will try to document any commented code from testing/debuggung, etc, which often gets left behind and is not important to the user of the API.
As you can read here in the first User Contributed Note:
The doc comment (T_DOC_COMMENT) must begin with a /** - that's two asterisks, not one. The comment continues until the first */. A normal multi-line comment /*...*/ (T_COMMENT) does not count as a doc comment.
So only /** */ blocks are given by this method.
I don't know any other method with php to get the other comments as using file_get_contents and filter the comments with e.g. a regex

Categories