I want to extract the whole function given its name and starting line-number.
Output should be something like
function function_name( $a = null, $b = true ) {
$i_am_test = 'foo';
return $i_am_test;
}
or whatever the function definition is. Most tools (including grep etc.) only return the first line function function_name( $a = null, $b = true ) { but I need the entire function definition.
To accurately extract a function (or variable, class, ...) from a program source file, especially for PHP, you need a real parser for that language.
Otherwise you'll have some kind of hack that fails for an amazing variety of crazy reasons. Some have to do with strings and comments confusing the extraction machinery (and trying to skip string literals in PHP is a nightmare), and some have to do with funny language rules you don't discover until you trip over them (what happens if your PHP file contains HTML that contains stuff that looks like PHP source code?).
Our DMS Software Reengineering Toolkit has a full PHP5 front end that understands PHP syntax completely. It can parse PHP source files, and then be configured to analyze/extract whatever code you want. The parser accurately captures line numbers on its internal ASTs, so it is quite easy to find the code in a file at a particular line number; given the code/AST, it is quite easy to print the AST that represents the code at that line number. If you find a function identifier on a particular line and print out the relevant AST, you'll get exactly the function source code.
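For simple cases, PHP's bundled tokenizer can stand in for a full parser, because it already knows where strings and comments begin and end. The following is a minimal sketch of that idea, not DMS and not a complete solution: extract_function is a hypothetical helper name, it assumes PHP 7.1 or later, and it assumes the function has a brace-delimited body (abstract and interface methods are not handled).

```php
<?php
// Hypothetical helper (not part of PHP or DMS): returns the source text of the
// function $name whose "function" keyword is on line $startLine, or null.
function extract_function(string $source, string $name, int $startLine): ?string
{
    $tokens = token_get_all($source);
    $count  = count($tokens);

    for ($i = 0; $i < $count; $i++) {
        $tok = $tokens[$i];
        if (!is_array($tok) || $tok[0] !== T_FUNCTION || $tok[2] !== $startLine) {
            continue;
        }

        // Find the name: the first T_STRING before the parameter list opens.
        $j = $i + 1;
        while ($j < $count && $tokens[$j] !== '('
            && !(is_array($tokens[$j]) && $tokens[$j][0] === T_STRING)) {
            $j++;
        }
        if ($j >= $count || !is_array($tokens[$j]) || $tokens[$j][1] !== $name) {
            continue;
        }

        // Copy tokens until the brace that closes the function body.
        // Braces inside strings and comments are part of larger tokens,
        // so they never show up here as a bare '{' or '}'.
        $out   = '';
        $depth = 0;
        for ($k = $i; $k < $count; $k++) {
            $t    = $tokens[$k];
            $out .= is_array($t) ? $t[1] : $t;

            // "{$var}" and "${var}" inside strings open with special tokens
            // but close with a plain '}', so they must be counted as well.
            $opensBrace = $t === '{' || (is_array($t)
                && in_array($t[0], [T_CURLY_OPEN, T_DOLLAR_OPEN_CURLY_BRACES], true));
            if ($opensBrace) {
                $depth++;
            } elseif ($t === '}' && --$depth === 0) {
                return $out;
            }
        }
    }
    return null;
}

$source = <<<'PHP'
<?php
function function_name( $a = null, $b = true ) {
    $i_am_test = 'foo } {';
    return $i_am_test;
}
PHP;

echo extract_function($source, 'function_name', 2), "\n";
```

The braces inside the string literal do not confuse the brace counting, which is exactly where a grep-style approach tends to break.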
I am using the MediaWiki coding standard for php_codesniffer. Thing is, that is created for PHP version <7.0.0. Let's take the following not-formatted code snippet:
function test(){}
The sniffer will report an error saying that it needs a space between ) and { (the rule is Generic.Functions.OpeningFunctionBraceKernighanRitchie.SpaceAfterBracket).
Now that's ok, it is normal (for me at least) to write
function test() {}
But when it comes to PHP 7 and the function has a return type hint, I want it formatted like this
function test(): string {}
So no space between ) and :, but spaces between string and other tokens there. What is the rule I have to write to achieve this?
There is already a sniff for that in Slevomat/CodingStandard: https://github.com/slevomat/coding-standard/blob/master/README.md#slevomatcodingstandardtypehintstypehintdeclaration-
It is a high-quality package I have used for over a year. They also have a branch ready for 3.0. Also check out its other sniffs; they're awesome and help with refactoring to PHP 7.0 and 7.1.
It was noted in another question that wrapping the result of a PHP function call in parentheses can somehow convert the result into a fully-fledged expression, such that the following works:
<?php
error_reporting(E_ALL | E_STRICT);
function get_array() {
return array();
}
function foo() {
// return reset(get_array());
// ^ error: "Only variables should be passed by reference"
return reset((get_array()));
// ^ OK
}
foo();
I'm trying to find anything in the documentation to explicitly and unambiguously explain what is happening here. Unlike in C++, I don't know enough about the PHP grammar and its treatment of statements/expressions to derive it myself.
Is there anything hidden in the documentation regarding this behaviour? If not, can somebody else explain it without resorting to supposition?
Update
I first found this EBNF purporting to represent the PHP grammar, and tried to decode my scripts myself, but eventually gave up.
Then, using phc to generate a .dot file of the two foo() variants, I produced AST images for both scripts using the following commands:
$ yum install phc graphviz
$ phc --dump-ast-dot test1.php > test1.dot
$ dot -Tpng test1.dot > test1.png
$ phc --dump-ast-dot test2.php > test2.dot
$ dot -Tpng test2.dot > test2.png
In both cases the result was exactly the same:
This behavior could be classified as a bug, so you should definitely not rely on it.
The (simplified) conditions for the message not to be thrown on a function call are as follows (see the definition of the opcode ZEND_SEND_VAR_NO_REF):
the argument is not a function call (or if it is, it returns by reference), and
the argument is either a reference or it has reference count 1 (if it has reference count 1, it's turned into a reference).
Let's analyze these in more detail.
First point is true (not a function call)
Due to the additional parentheses, PHP no longer detects that the argument is a function call.
When parsing a non-empty function argument list, there are three possibilities for PHP:
An expr_without_variable
A variable
(A & followed by a variable, for the removed call-time pass by reference feature)
When writing just get_array() PHP sees this as a variable.
(get_array()) on the other hand does not qualify as a variable. It is an expr_without_variable.
This ultimately affects the way the code compiles, namely the extended value of the opcode SEND_VAR_NO_REF will no longer include the flag ZEND_ARG_SEND_FUNCTION, which is the way the function call is detected in the opcode implementation.
Second point is true (the reference count is 1)
At several points, the Zend Engine allows non-references with reference count 1 where references are expected. These details should not be exposed to the user, but unfortunately they are here.
In your example you're returning an array that's not referenced from anywhere else. If it were, you would still get the message, i.e. this second point would not be true.
So the following very similar example does not work:
<?php
$a = array();
function get_array() {
return $GLOBALS['a'];
}
return reset((get_array()));
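Since the parentheses trick depends on engine internals, a version that does not rely on it is to give reset() a real variable. A minimal sketch, reusing the get_array() and foo() names from the question with made-up array contents:

```php
<?php
error_reporting(E_ALL | E_STRICT);

function get_array() {
    return array('first', 'second');
}

function foo() {
    // Store the result in a local variable first, so that reset()
    // receives something it can legitimately take by reference.
    $items = get_array();
    return reset($items);
}

var_dump(foo());
```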
A) To understand what's happening here, one needs to understand PHP's handling of values/variables and references (PDF, 1.2MB). As stated throughout the documentation: "references are not pointers"; and you can only return variables by reference from a function - nothing else.
In my opinion, that means, any function in PHP will return a reference. But some functions (built in PHP) require values/variables as arguments. Now, if you are nesting function-calls, the inner one returns a reference, while the outer one expects a value. This leads to the 'famous' E_STRICT-error "Only variables should be passed by reference".
$fileName = 'example.txt';
$fileExtension = array_pop(explode('.', $fileName));
// will result in Error 2048: Only variables should be passed by reference in…
B) I found a line in the PHP-syntax description linked in the question.
expr_without_variable = "(" expr ")"
In combination with this sentence from the documentation: "In PHP, almost anything you write is an expression. The simplest yet most accurate way to define an expression is 'anything that has a value'.", this leads me to the conclusion that even (5) is an expression in PHP, which evaluates to an integer with the value 5.
(As $a = 5 is not only an assignment but also an expression, which evaluates to 5.)
Conclusion
If you pass a reference to the expression (...), this expression will return a value, which then may be passed as argument to the outer function. If that (my line of thought) is true, the following two lines should work equivalently:
// what I've used over years: (spaces only added for readability)
$fileExtension = array_pop( ( explode('.', $fileName) ) );
// vs
$fileExtension = array_pop( $tmp = explode('.', $fileName) );
See also PHP 5.0.5: Fatal error: Only variables can be passed by reference; 13.09.2005
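Note that the PHP 7.0 migration notes list redundant parentheses around function arguments as no longer affecting behaviour, so the first variant above should not be expected to stay silent on newer versions. A sketch of two alternatives for the file extension example that do not depend on how the engine treats temporaries ($parts and $viaPathinfo are names introduced here for illustration):

```php
<?php
$fileName = 'example.txt';

// Option 1: keep the exploded parts in a variable, then pop from it.
$parts = explode('.', $fileName);
$fileExtension = array_pop($parts);

// Option 2: let pathinfo() do the work; no reference is involved at all.
$viaPathinfo = pathinfo($fileName, PATHINFO_EXTENSION);

var_dump($fileExtension, $viaPathinfo);
```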
Is there any way to easily fix this issue or do I really need to rewrite all the legacy code?
PHP Fatal error: Call-time pass-by-reference has been removed in ... on line 30
This happens everywhere as variables are passed into functions as references throughout the code.
You should be denoting the call by reference in the function definition, not the actual call. Since PHP started showing the deprecation errors in version 5.3, I would say it would be a good idea to rewrite the code.
From the documentation:
There is no reference sign on a function call - only on function definitions. Function definitions alone are enough to correctly pass the argument by reference. As of PHP 5.3.0, you will get a warning saying that "call-time pass-by-reference" is deprecated when you use & in foo(&$a);.
For example, instead of using:
// Wrong way!
myFunc(&$arg); # Deprecated pass-by-reference argument
function myFunc($arg) { }
Use:
// Right way!
myFunc($var); # no & at the call site; the definition below makes it by-reference
function myFunc(&$arg) { }
For anyone who, like me, reads this because they need to update a giant legacy project to 5.6: as the answers here point out, there is no quick fix: you really do need to find each occurrence of the problem manually, and fix it.
The most convenient way I found to find all problematic lines in a project (short of using a full-blown static code analyzer, which is very accurate but I don't know any that take you to the correct position in the editor right away) was using Visual Studio Code, which has a nice PHP linter built in, and its search feature which allows searching by Regex. (Of course, you can use any IDE/Code editor for this that does PHP linting and Regex searches.)
Using this regex:
^(?!.*function).*(\&\$)
it is possible to search project-wide for the occurrence of &$ only in lines that are not a function definition.
This still turns up a lot of false positives, but it does make the job easier.
VSCode's search results browser makes walking through and finding the offending lines super easy: you just click through each result, and look out for those that the linter underlines red. Those you need to fix.
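If you would rather run the same search from a script than from an editor, the regex above can be applied line by line. This is only a sketch: find_call_time_reference_candidates is a hypothetical helper name, and it inherits all the false positives mentioned above (for example, ordinary reference assignments).

```php
<?php
// Hypothetical helper: returns the 1-based line numbers that contain "&$"
// but do not contain the word "function" anywhere on the line.
function find_call_time_reference_candidates(string $source): array
{
    $hits = [];
    foreach (preg_split('/\R/', $source) as $index => $line) {
        if (preg_match('/^(?!.*function).*(\&\$)/', $line) === 1) {
            $hits[] = $index + 1;
        }
    }
    return $hits;
}

$sample = <<<'PHP'
function myFunc(&$arg) { }
myFunc(&$var);
$r = &$v;
myFunc($var);
PHP;

// Line 2 is a real call-time pass-by-reference; line 3 is a false positive.
print_r(find_call_time_reference_candidates($sample));
```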
PHP and references are somewhat unintuitive. If used appropriately references in the right places can provide large performance improvements or avoid very ugly workarounds and unusual code.
The following will produce an error:
function f(&$v){$v = true;}
f(&$v);
function f($v){$v = true;}
f(&$v);
Neither of these would have to fail, as they could follow the rules below, but they have no doubt been removed or disabled to prevent a lot of legacy confusion.
If they did work, both would involve a redundant conversion to a reference, and the second would also involve a redundant conversion back to a locally scoped variable.
The second one used to be possible, allowing a reference to be passed to code that wasn't intended to work with references. This is extremely ugly for maintainability.
This will do nothing:
function f($v){$v = true;}
$r = &$v;
f($r);
More specifically, it turns the reference back into a normal variable as you have not asked for a reference.
This will work:
function f(&$v){$v = true;}
f($v);
This sees that you are passing a non-reference but want a reference so turns it into a reference.
What this means is that you can't pass a reference to a function where a reference is not explicitly asked for, making this one of the few areas where PHP is strict about passing types (or, in this case, more of a meta type).
If you need more dynamic behaviour this will work:
function f(&$v){$v = true;}
$v = array(false,false,false);
$r = &$v[1];
f($r);
Here it sees that you want a reference and already have a reference so leaves it alone. It may also chain the reference but I doubt this.
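The working cases above can be checked side by side in one script. This is a small sketch; the function names set_true and set_true_copy are made up here so that both definitions can live in the same file.

```php
<?php
function set_true(&$v) { $v = true; }      // asks for a reference
function set_true_copy($v) { $v = true; }  // works on its own copy

// Passing a plain variable where a reference is requested: the caller sees the change.
$a = false;
set_true($a);

// Passing a reference where none is requested: the function gets a copy.
$b = false;
$r = &$b;
set_true_copy($r);

// Passing an existing reference to an array element where a reference is requested.
$list = array(false, false, false);
$item = &$list[1];
set_true($item);
unset($item); // drop the reference once it is no longer needed

var_dump($a, $b, $list);
```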
I am creating my own language.
The goal is to "compile" it to PHP or Javascript, and, ultimately, to interpret and run it on the same language, to make it look like a "middle-level" language.
Right now, I'm focusing on interpreting it in PHP and running it.
At the moment, I'm using regex to split the string and extract the multiple tokens.
This is the regex I have:
/\:((?:cons#(?:\d+(?:\.\d+)?|(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|(?:[a-z]+(?:#[a-z]+)?|\^?[\~\&](?:[a-z]+|\d+|\-1)))/g
This is quite hard to read and maintain, even though it works.
Is there a better way of doing this?
Here is an example of the code for my language:
:define:&0:factorial
:param:~0:static
:case
:lower#equal:cons#1
:case:end
:scope
:return:cons#1
:scope:end
:scope
:define:~0:static
:define:~1:static
:require:static
:call:static#sub:^~0:~1 :store:~0
:call:&-1:~0 :store:~1
:call:static#sum:^~0:~1 :store:~0
:return:~0
:scope:end
:define:end
This defines a recursive function to calculate the factorial (not so well written, that isn't important).
The goal is to get what is after the :, including the #. :static#sub is a whole token, saving it without the :.
Everything is the same, except for the token :cons, which can take a value after. The value is a numerical value (integer or float, called static or dynamic in the language, respectively) or a string, which must start and end with ", supporting escaping like \". Multi-line strings aren't supported.
Variables are the ones with ~0, using ^ before will get the value to the above :scope.
Functions are similar, being used &0 instead and &-1 points to the current function (no need for ^&-1 here).
That said, is there a better way to get the tokens?
Here you can see it in action: http://regex101.com/r/nF7oF9/2
[Update] To address the pattern's complexity and maintainability, you can split it up using PCRE_EXTENDED (the x modifier) and comments:
preg_match_all('/
# read constant (?)
\:((?:cons\#(?:\d+(?:\.\d+)?|
# read a string (?)
(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|
# read an identifier (?)
(?:[a-z]+(?:\#[a-z]+)?|
# read whatever
\^?[\~\&](?:[a-z]+|\d+|\-1)))
/x', $input, $matches);
Beware that in this mode all whitespace in the pattern is ignored, except under certain conditions (\n is normally "safe"), and that a literal # has to be escaped as \# because an unescaped # starts a comment. PHP has no g modifier; preg_match_all returns every match instead.
Now, if you want to improve your lexer and parser, then read on:
What (f)lex [GNU equivalent of LEX] does is simply let you pass a list of regexps, and optionally a "group". You can also try ANTLR and the PHP Target Runtime to get the work done.
As for your request, I've made a lexer in the past, following the principle of FLEX. The idea is to cycle through the regexps like FLEX does:
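$regexp = [reg1 => STRING, reg2 => ID, reg3 => WS]; // each regexp must be anchored at the start, e.g. '/^.../'
$input = ...;
$tokens = [];
while ($input !== '') {
    $best = null;
    $k = null;
    foreach ($regexp as $re => $kind) {
        // ignore empty matches, otherwise the loop would never advance
        if (preg_match($re, $input, $match) && $match[0] !== '') {
            $best = $match[0];
            $k = $kind;
            break; // the first regexp that matches wins
        }
    }
    if (null === $best) {
        throw new Exception("could not analyze input, invalid token");
    }
    $tokens[] = ['kind' => $k, 'value' => $best];
    $input = (string) substr($input, strlen($best)); // move past the token
}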
Since FLEX and Yacc/Bison integrate with each other, the usual pattern is to read only until the next token (that is, they don't run a loop that reads all the input before parsing).
The $regexp array can be anything. I expected it to be a "regexp" => "kind" key/value map, but you can also use an array like this:
$regexp = [['reg' => '...', 'kind' => STRING], ...]
You can also enable/disable regexps using groups (the way FLEX groups work). For example, consider the following code:
class Foobar {
const FOOBAR = "arg";
function x() {...}
}
There is no need to activate the string regexp until you need to read an expression (here, the expression is what comes after the "="). And there is no need to activate the class identifier regexp when you are already inside a class.
FLEX's groups make it possible to read comments: a first regexp activates a group that ignores the other regexps until some match occurs (like "*/").
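The comment case can be sketched in PHP with one rule list per state. This is an illustration under my own assumptions, not how FLEX is implemented: the state names INITIAL and COMMENT, the rule layout and the function name lex_with_states are all made up for the example, and the token kinds are deliberately minimal.

```php
<?php
// Hypothetical sketch of a lexer with FLEX-like groups (states).
function lex_with_states(string $input): array
{
    // Each rule: [regexp, token kind or null to discard, state to switch to or null].
    // The A modifier anchors every match at the current offset.
    $rules = [
        'INITIAL' => [
            ['#/\*#A',          null,     'COMMENT'], // comment start: switch group
            ['#\s+#A',          null,     null],
            ['#"[^"]*"#A',      'STRING', null],
            ['#[A-Za-z_]\w*#A', 'ID',     null],
            ['#[=;{}()]#A',     'PUNCT',  null],
        ],
        'COMMENT' => [
            ['#\*/#A', null, 'INITIAL'],              // comment end: switch back
            ['#.#As',  null, null],                   // anything else is skipped
        ],
    ];

    $state  = 'INITIAL';
    $offset = 0;
    $length = strlen($input);
    $tokens = [];

    while ($offset < $length) {
        foreach ($rules[$state] as list($re, $kind, $next)) {
            if (preg_match($re, $input, $m, 0, $offset) === 1) {
                if ($kind !== null) {
                    $tokens[] = ['kind' => $kind, 'value' => $m[0]];
                }
                $offset += strlen($m[0]);
                if ($next !== null) {
                    $state = $next;
                }
                continue 2; // restart rule matching at the new offset
            }
        }
        throw new Exception("invalid token at offset $offset");
    }
    return $tokens;
}

print_r(lex_with_states('const FOOBAR = "arg"; /* "hidden" */ x'));
```

While the COMMENT state is active, the string rule is simply not in the list, so the quoted text inside the comment never becomes a STRING token.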
Note that this is a naïve approach: a lexer like FLEX actually generates an automaton, which uses different states to represent your needs (a regexp is itself an automaton).
It uses an algorithm based on packed indexes or something similar (I used the naïve "for each" because I did not understand the algorithm well enough), which is memory- and speed-efficient.
As I said, it was something I made in the past - something like 6/7 years ago.
It was on Windows.
It was not particularly quick (well it is O(N²) because of the two loops).
I also think that PHP was compiling the regexp each time. Now that I do Java, I use the Pattern implementation, which compiles the regexp once and lets you reuse it. I don't know whether PHP does the same by first looking in a regexp cache for an already compiled regexp.
I was using preg_match with an offset, to avoid doing the substr($input, ...) at the end.
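Here is a runnable sketch of that offset-based variant, applied to the language from the question. The rule set is my own guess at the token classes (it is not the original regex split one-to-one), tokenize is a hypothetical function name, and the first matching rule wins rather than the longest match as in FLEX.

```php
<?php
// Hypothetical sketch: a rule-table lexer that advances an offset
// instead of cutting the input with substr().
function tokenize(string $input): array
{
    // The A modifier anchors each match at the offset passed to preg_match.
    // The cons rules come first so that ":cons#1" is not split after ":cons".
    $rules = [
        'WS'     => '/\s+/A',
        'NUMBER' => '/:cons#\d+(?:\.\d+)?/A',
        'STRING' => '/:cons#"(?:\\\\.|[^"\\\\\r\n])*"/A',
        'REF'    => '/:\^?[~&](?:\d+|-1)/A',
        'WORD'   => '/:[a-z]+(?:#[a-z]+)?/A',
    ];

    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        $matched = false;
        foreach ($rules as $kind => $re) {
            if (preg_match($re, $input, $m, 0, $offset) === 1) {
                if ($kind !== 'WS') {
                    // Store the token without its leading ':'.
                    $tokens[] = ['kind' => $kind, 'value' => substr($m[0], 1)];
                }
                $offset += strlen($m[0]);
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new Exception("could not analyze input, invalid token at offset $offset");
        }
    }
    return $tokens;
}

print_r(tokenize(':call:static#sub:^~0:~1 :store:~0'));
```

Keeping each token class in its own small regexp is what makes this easier to maintain than one combined pattern: adding a new class means adding one line to the table.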
You should try the ANTLR3 PHP Code Generation Target, since the ANTLR grammar editor is pretty easy to use, and you will end up with much more readable and maintainable code :)
I am thinking about how one would go about creating a PHP equivalent for a couple of libraries I found for CSS and JS.
One is Less CSS, which is a dynamic stylesheet language. The basic idea behind Less CSS is that it allows you to create more dynamic CSS rules containing entities that "regular" CSS does not support, such as mixins, functions etc., and then Less CSS compiles that syntax into regular CSS.
Another interesting JS library which behaves in a (kind of) similar pattern is CoffeeScript, where you can write "tidier & simpler" code which then gets compiled into regular Javascript.
How would one go about creating a simple similar interface for PHP? Just as a proof of concept; I am only trying to learn stuff. Let's just take a simple use case of extending classes.
class a
{
function a_test()
{
echo "This is test in a ";
}
}
class b extends a
{
function b_test()
{
parent::a_test();
echo "This is test in b";
}
}
$b = new b();
$b->b_test();
Suppose I want to let the user write class b as (just for the example):
class b[a] //would mean b extends a
{
function b_test()
{
[a_test] //would mean parent::a_test()
echo "This is test in b";
}
}
And let them later have that code "resolve" to regular PHP (usually by running a separate command/process, I would believe). My question is how I would go about creating something like this. Can it be done in PHP, or would I need to use something like C/C++? How should I approach this problem if I were to go at it? Are there any resources online? Any pointers are deeply appreciated!
Language transcoders are not as easy as one might think.
The example you gave can be implemented very easily with a preg_replace that looks for class definitions and replaces [a] with extends a.
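A minimal sketch of that preg_replace approach, extended to the [a_test] shorthand from the question. The function name resolve_shorthand is hypothetical, and the patterns are deliberately naive: they would also rewrite matching text inside strings or comments, and the second one would mangle a statement that starts with a one-element array literal.

```php
<?php
// Hypothetical helper: rewrites the two shorthand forms from the question.
function resolve_shorthand(string $code): string
{
    // class b[a]  ->  class b extends a
    $code = preg_replace('/\bclass\s+(\w+)\s*\[(\w+)\]/', 'class ${1} extends ${2}', $code);

    // A line starting with [a_test]  ->  parent::a_test();
    $code = preg_replace('/^([ \t]*)\[(\w+)\]/m', '${1}parent::${2}();', $code);

    return $code;
}

$input = "class b[a]\n{\n    function b_test()\n    {\n        [a_test]\n    }\n}";
echo resolve_shorthand($input), "\n";
```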
But more complex features need a transcoder which is a suite of smaller logical pieces of code.
In most programmer jargon, people incorrectly call transcoders compilers, but the difference between the two is that compilers read source code and output raw binary machine code, while transcoders read source code and output (a different) source code.
The PHP (or JavaScript) runtime for example is neither compiler nor transcoder, it's an interpreter.
But enough about jargon let's talk about transcoders:
To build a transcoder you must first build a tokenizer, it breaks apart the source code into tokens, meaning that if it sees an entire word such as 'class' or the name of a class or 'function' or the name of a function, it captures that word and considers it a token. When it encounters another token such as an opening round bracket or an opening brace or a square bracket etc. it considers that another token.
Luckily all of the recognized tokens available in PHP are already easily scanned by token_get_all which is a function PHP is bundled with. You may have some trouble because PHP assumes some things about how you use symbols but all in all you can make use of this function.
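To see what token_get_all gives you, here is a small sketch that lists the token kinds for a piece of code. token_kinds is a hypothetical helper name; token_get_all and token_name are the bundled functions.

```php
<?php
// Hypothetical helper: returns the kind of every non-whitespace token.
function token_kinds(string $code): array
{
    $kinds = [];
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Array tokens are [id, text, line number].
            if ($token[0] === T_WHITESPACE) {
                continue;
            }
            $kinds[] = token_name($token[0]);
        } else {
            // Single-character tokens such as '{' come back as plain strings.
            $kinds[] = $token;
        }
    }
    return $kinds;
}

print_r(token_kinds('<?php class b extends a {}'));
print_r(token_kinds('<?php class b[a] {}'));
```

The second call shows that the tokenizer does not reject the custom class b[a] syntax: it only splits the text into tokens, and deciding what they mean is left to your parser.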
The tokenizer creates a flat list of all the tokens it finds and gives it to the parser.
The parser is the second phase of your transcoder. It reads the list of tokens and decides things like "if token[0] is a class and token[1] is a name_value then we have a class", etc. After running through the entire list of tokens, we should have an abstract syntax tree.
The abstract syntax tree is a structure that symbolically retains only the relevant information about the source code.
$ast = array(
'my_derived_class' => array(
'implements' => array(
'my_interface_1',
'my_interface_2',
'my_interface_3'),
'extends' => 'my_base_class',
'members' => array(
'my_property_name' => 'my_default_value',
'my_method_name' => array( /* ... */ )
)
)
);
After you get an abstract syntax tree you need to walk through it and output the destination source code.
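As a sketch of that last step, the walker below turns an array shaped like the $ast above into PHP source. The function name emit_classes, the choice of public visibility and the empty method bodies are assumptions made for the example; a real transcoder would carry much more information in its tree.

```php
<?php
// Hypothetical code generator for the array-based AST shape shown above.
function emit_classes(array $ast): string
{
    $out = '';
    foreach ($ast as $className => $class) {
        $out .= 'class ' . $className;
        if (!empty($class['extends'])) {
            $out .= ' extends ' . $class['extends'];
        }
        if (!empty($class['implements'])) {
            $out .= ' implements ' . implode(', ', $class['implements']);
        }
        $out .= "\n{\n";
        foreach ($class['members'] ?? [] as $name => $member) {
            if (is_array($member)) {
                // An array stands for a method; its body is left empty here.
                $out .= "    public function {$name}()\n    {\n    }\n";
            } else {
                // Anything else is a property with a default value.
                $out .= '    public $' . $name . ' = ' . var_export($member, true) . ";\n";
            }
        }
        $out .= "}\n";
    }
    return $out;
}

$ast = array(
    'my_derived_class' => array(
        'implements' => array('my_interface_1', 'my_interface_2'),
        'extends' => 'my_base_class',
        'members' => array(
            'my_property_name' => 'my_default_value',
            'my_method_name' => array(),
        ),
    ),
);

echo emit_classes($ast);
```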
The really tricky part is the parser, which (depending on the complexity of the language you are parsing) may need a backtracking algorithm or some other form of pattern matching to differentiate similar cases from one another.
I recommend reading about this in Terence Parr's book http://pragprog.com/book/tpdsl/language-implementation-patterns which describes in detail the design patterns needed to write a transcoder.
In Terence's book you'll find out why some languages such as HTML or CSS are much simpler (structurally) than PHP or JavaScript, and how that relates to the complexity of the language parser.