Right now, I have a script which uses PHP's tokenizer to look for certain functions within a PHP source code file. The pattern I am currently looking for is:
T_STRING + T_WHITESPACE (optional) + "("
This seems to match all of my test cases so far except variable functions, which I am ignoring for the purposes of this question.
The obvious problem here is that this pattern produces a lot of false positives, like matching function definitions:
public function foo() { // foo() should not be matched
My question is, is there a more reliable/accurate method for looking at source code and plucking out all the function invocations? Maybe a better method than using the tokenizer at all?
Edit:
In particular, I'm looking to emulate the functionality of the disable_functions PHP directive within a class file. So, if exec() should be disallowed, I'm trying to find any uses of that function within the analyzed file. I do realize that variable functions make this terribly difficult, so I am detecting these and disallowing them as well.
You first run the tokenizer (available in PHP). Then you run a parser on top of the tokens. The parser needs to read the tokens and should be able to tell your what a specific token has been used for. It depends on the reliability of your parser how reliable the outcome is.
If your current parser (you have not shown any code) is not reliable enough, you need to write a better parser. That simple it is. Probably you're not doing much more than just tokenizing and then reading as it passes which just might not be enough.
Instead of using the tokenizer, consider instead using a higher-level parser to analyze your code. For example, PHP-Parser can explicitly identify function declarations, as well as variable function calls.
Related
Is there a way to detect or find files containing class and code outside it that would be executed while autoloaded or required? Like examples or so:
<?php
class SomeClass
{
private $variable;
public function __construct($variable) {
$this->variable = $variable;
}
}
// example usage
$foo = new SomeClass('test');
var_dump($foo);
// some other code
I need to find them in one older project and there are too many files to check them manually.
Thanks.
EDIT: I cannot autoload all of them (as I dont know in what files the unwanted code might be) and even that wouldnt necessary show me any executed code unless it would have some error in it.
Maybe you search for the word class and pipe the result on the word new ?
grep -r --include "*.php" 'class' | grep 'new'
The proper (and least error-prone but hard) way to do this would be using PHP's Tokenizer. You could use glob to get all the PHP files. Then go over them one by one using the tokenizer marking them as a class file when they contain the T_CLASS (or T_NAMESPACE) token, then skipping over the class itself and looking for anything but those tokens (and namespaces, comments and such) outputting you the filename when you find it.
The not so proper method (and surely one prone to false positives) would be writing your own simple parser (which is probably harder than using the Tokenizer) or using some simple regular expression with awk combination to find out stuff that is around classes and namespaces. I'm not too experienced with that though (and I'm pretty sure regexp alone isn't enough) so I can't really provide an example.
I am thinking about how one would go about creating a PHP equivalent for a couple of libraries I found for CSS and JS.
One is Less CSS which is a dynamic stylesheet language. The basic idea behind Less CSS is that it allows you to create more dynamic CSS rules containing entities that "regular" CSS does not support such as mixins, functions etc and then the final Less CSS compiles those syntax into regular CSS.
Another interesting JS library which behaves in a (kind of) similar pattern is CoffeeScript where you can write "tidier & simpler" code which then gets compiled into regular Javascript.
How would one go about creating a simple similar interface for PHP? Just as a proof of concept; I am only trying to learn stuff. Lets just take a simple use case of extending classes.
class a
{
function a_test()
{
echo "This is test in a ";
}
}
class b extends a
{
function b_test()
{
parent::a_test();
echo "This is test in b";
}
}
$b = new b();
$b->b_test();
Suppose I want to let the user write class b as (just for the example):
class b[a] //would mean b extends a
{
function b_test()
{
[a_test] //would mean parent::a_test()
echo "This is test in b";
}
}
And let them later have that code "resolve" to regular PHP (Usually by running a separate command/process I would believe). My question is how would I go about creating something like this. Can it be done in PHP, would I require to use something like C/C++. How should I approach this problem if I were to go at it? Are there any resources online? Any pointers are deeply appreciated!
Language transcoders are not as easy as one might think.
The example you gave can be implemented very easily with a preg_replace that looks for class definitions and replaces [a] with extends a.
But more complex features need a transcoder which is a suite of smaller logical pieces of code.
In most programmer jargon people incorrectly call transcoders compilers but the difference between compilers and transcoders is that compilers read source code and output raw binary machine code while transcoders read source code and output (a different) source code.
The PHP (or JavaScript) runtime for example is neither compiler nor transcoder, it's an interpreter.
But enough about jargon let's talk about transcoders:
To build a transcoder you must first build a tokenizer, it breaks apart the source code into tokens, meaning that if it sees an entire word such as 'class' or the name of a class or 'function' or the name of a function, it captures that word and considers it a token. When it encounters another token such as an opening round bracket or an opening brace or a square bracket etc. it considers that another token.
Luckily all of the recognized tokens available in PHP are already easily scanned by token_get_all which is a function PHP is bundled with. You may have some trouble because PHP assumes some things about how you use symbols but all in all you can make use of this function.
The tokenizer creates a flat list of all the tokens it finds and gives it to the parser.
The parser is the second phase of your transcoder, it reads the list of tokens and decides stuff like "if token[0] is a class and token[1] is a name_value then we have a class" etc.. after running through the entire list of tokens we should have an abstract syntax tree.
The abstract syntax tree is a structure that symbolically retains only the relevant information about a the source code.
$ast = array(
'my_derived_class' => array(
'implements' => array(
'my_interface_1',
'my_interface_2',
'my_interface_3'),
'extends' => 'my_base_class',
'members' => array(
'my_property_name' => 'my_default_value',
'my_method_name' => array( /* ... */ )
)
)
);
After you get an abstract syntax tree you need to walk through it and output the destination source code.
The real tricky part is the parser which (depending on the complexity of the language you are parsing) may need a backtracking algorithm or some other form of pattern matching to differentiate similar cases against one another.
I recommend reading about this in Terence Parr' book http://pragprog.com/book/tpdsl/language-implementation-patterns which describes in detail the design patterns needed to write a transcoder.
In Terrence' book you'll find out why some languages such as HTML or CSS are much simpler (structurally) than PHP or JavaScript and how that relates the complexity of the language parser.
I was reading up on Paul Bigger's http://blog.paulbiggar.com/archive/a-rant-about-php-compilers-in-general-and-hiphop-in-particular/ and he mentions that HPHP doesn't fully support dynamic constructs. He then states, "Still, a naive approach is to just stick a switch statement in, and compile everything that makes sense." Is he saying that instead of a dynamic include, you could use switch statements to include the proper file? If so, why would this work and why is it "easier" for a compiler to compile? As always, thx for your time!
from my understanding, if you've got this
include "$foo.php";
the compiler would have no clue what you're going to include. On the other side, with this
switch($foo) {
case 'bar' : include "bar.php";
case 'quux' : include "quux.php";
}
they can simply compile "bar" and "quux" and wrap them in an if statement which checks $foo and executes whatever is appropriate.
A compiler expects to be able to identify all of the source and binary files that might be used by the program being compiled.
include($random_file);
If the file named in $random_file declares constants, classes, variables, the compiler will have no way knowing because the value of $random_file is not known at compile time. Your code using those constants, classes and variables will fail in difficult-to-debug ways. The switch statement would make known the list of possible files so the compiler can discover any relevant declarations.
Languages designed to be compiled have dynamic linkers and foreign function interfaces that combine to provide similar functionality to include($random_file) without needing the explicit switch.
I'm pretty new to PHP, but I've been programming in similar languages for years. I was flummoxed by the following:
class Foo {
public $path = array(
realpath(".")
);
}
It produced a syntax error: Parse error: syntax error, unexpected '(', expecting ')' in test.php on line 5 which is the realpath call.
But this works fine:
$path = array(
realpath(".")
);
After banging my head against this for a while, I was told you can't call functions in an attribute default; you have to do it in __construct. My question is: why?! Is this a "feature" or sloppy implementation? What's the rationale?
The compiler code suggests that this is by design, though I don't know what the official reasoning behind that is. I'm also not sure how much effort it would take to reliably implement this functionality, but there are definitely some limitations in the way that things are currently done.
Though my knowledge of the PHP compiler isn't extensive, I'm going try and illustrate what I believe goes on so that you can see where there is an issue. Your code sample makes a good candidate for this process, so we'll be using that:
class Foo {
public $path = array(
realpath(".")
);
}
As you're well aware, this causes a syntax error. This is a result of the PHP grammar, which makes the following relevant definition:
class_variable_declaration:
//...
| T_VARIABLE '=' static_scalar //...
;
So, when defining the values of variables such as $path, the expected value must match the definition of a static scalar. Unsurprisingly, this is somewhat of a misnomer given that the definition of a static scalar also includes array types whose values are also static scalars:
static_scalar: /* compile-time evaluated scalars */
//...
| T_ARRAY '(' static_array_pair_list ')' // ...
//...
;
Let's assume for a second that the grammar was different, and the noted line in the class variable delcaration rule looked something more like the following which would match your code sample (despite breaking otherwise valid assignments):
class_variable_declaration:
//...
| T_VARIABLE '=' T_ARRAY '(' array_pair_list ')' // ...
;
After recompiling PHP, the sample script would no longer fail with that syntax error. Instead, it would fail with the compile time error "Invalid binding type". Since the code is now valid based on the grammar, this indicates that there actually is something specific in the design of the compiler that's causing trouble. To figure out what that is, let's revert to the original grammar for a moment and imagine that the code sample had a valid assignment of $path = array( 2 );.
Using the grammar as a guide, it's possible to walk through the actions invoked in the compiler code when parsing this code sample. I've left some less important parts out, but the process looks something like this:
// ...
// Begins the class declaration
zend_do_begin_class_declaration(znode, "Foo", znode);
// Set some modifiers on the current znode...
// ...
// Create the array
array_init(znode);
// Add the value we specified
zend_do_add_static_array_element(znode, NULL, 2);
// Declare the property as a member of the class
zend_do_declare_property('$path', znode);
// End the class declaration
zend_do_end_class_declaration(znode, "Foo");
// ...
zend_do_early_binding();
// ...
zend_do_end_compilation();
While the compiler does a lot in these various methods, it's important to note a few things.
A call to zend_do_begin_class_declaration() results in a call to get_next_op(). This means that it adds a new opcode to the current opcode array.
array_init() and zend_do_add_static_array_element() do not generate new opcodes. Instead, the array is immediately created and added to the current class' properties table. Method declarations work in a similar way, via a special case in zend_do_begin_function_declaration().
zend_do_early_binding() consumes the last opcode on the current opcode array, checking for one of the following types before setting it to a NOP:
ZEND_DECLARE_FUNCTION
ZEND_DECLARE_CLASS
ZEND_DECLARE_INHERITED_CLASS
ZEND_VERIFY_ABSTRACT_CLASS
ZEND_ADD_INTERFACE
Note that in the last case, if the opcode type is not one of the expected types, an error is thrown – The "Invalid binding type" error. From this, we can tell that allowing the non-static values to be assigned somehow causes the last opcode to be something other than expected. So, what happens when we use a non-static array with the modified grammar?
Instead of calling array_init(), the compiler prepares the arguments and calls zend_do_init_array(). This in turn calls get_next_op() and adds a new INIT_ARRAY opcode, producing something like the following:
DECLARE_CLASS 'Foo'
SEND_VAL '.'
DO_FCALL 'realpath'
INIT_ARRAY
Herein lies the root of the problem. By adding these opcodes, zend_do_early_binding() gets an unexpected input and throws an exception. As the process of early binding class and function definitions seems fairly integral to the PHP compilation process, it can't just be ignored (though the DECLARE_CLASS production/consumption is kind of messy). Likewise, it's not practical to try and evaluate these additional opcodes inline (you can't be sure that a given function or class has been resolved yet), so there's no way to avoid generating the opcodes.
A potential solution would be to build a new opcode array that was scoped to the class variable declaration, similar to how method definitions are handled. The problem with doing that is deciding when to evaluate such a run-once sequence. Would it be done when the file containing the class is loaded, when the property is first accessed, or when an object of that type is constructed?
As you've pointed out, other dynamic languages have found a way to handle this scenario, so it's not impossible to make that decision and get it to work. From what I can tell though, doing so in the case of PHP wouldn't be a one-line fix, and the language designers seem to have decided that it wasn't something worth including at this point.
My question is: why?! Is this a "feature" or sloppy implementation?
I'd say it's definitely a feature. A class definition is a code blueprint, and not supposed to execute code at the time of is definition. It would break the object's abstraction and encapsulation.
However, this is only my view. I can't say for sure what idea the developers had when defining this.
You can probably achieve something similar like this:
class Foo
{
public $path = __DIR__;
}
IIRC __DIR__ needs php 5.3+, __FILE__ has been around longer
It's a sloppy parser implementation. I don't have the correct terminology to describe it (I think the term "beta reduction" fits in somehow...), but the PHP language parser is more complex and more complicated than it needs to be, and so all sorts of special-casing is required for different language constructs.
My guess would be that you won't be able to have a correct stack trace if the error does not occur on an executable line... Since there can't be any error with initializing values with constants, there's no problem with that, but function can throw exceptions/errors and need to be called within an executable line, and not a declarative one.
I'm looking to improve my PHP coding and am wondering what PHP-specific techniques other programmers use to improve productivity or workaround PHP limitations.
Some examples:
Class naming convention to handle namespaces: Part1_Part2_ClassName maps to file Part1/Part2/ClassName.php
if ( count($arrayName) ) // handles $arrayName being unset or empty
Variable function names, e.g. $func = 'foo'; $func($bar); // calls foo($bar);
Ultimately, you'll get the most out of PHP first by learning generally good programming practices, before focusing on anything PHP-specific. Having said that...
Apply liberally for fun and profit:
Iterators in foreach loops. There's almost never a wrong time.
Design around class autoloading. Use spl_autoload_register(), not __autoload(). For bonus points, have it scan a directory tree recursively, then feel free to reorganize your classes into a more logical directory structure.
Typehint everywhere. Use assertions for scalars.
function f(SomeClass $x, array $y, $z) {
assert(is_bool($z))
}
Output something other than HTML.
header('Content-type: text/xml'); // or text/css, application/pdf, or...
Learn to use exceptions. Write an error handler that converts errors into exceptions.
Replace your define() global constants with class constants.
Replace your Unix timestamps with a proper Date class.
In long functions, unset() variables when you're done with them.
Use with guilty pleasure:
Loop over an object's data members like an array. Feel guilty that they aren't declared private. This isn't some heathen language like Python or Lisp.
Use output buffers for assembling long strings.
ob_start();
echo "whatever\n";
debug_print_backtrace();
$s = ob_get_clean();
Avoid unless absolutely necessary, and probably not even then, unless you really hate maintenance programmers, and yourself:
Magic methods (__get, __set, __call)
extract()
Structured arrays -- use an object
My experience with PHP has taught me a few things. To name a few:
Always output errors. These are the first two lines of my typical project (in development mode):
ini_set('display_errors', '1');
error_reporting(E_ALL);
Never use automagic. Stuff like autoLoad may bite you in the future.
Always require dependent classes using require_once. That way you can be sure you'll have your dependencies straight.
Use if(isset($array[$key])) instead of if($array[$key]). The second will raise a warning if the key isn't defined.
When defining variables (even with for cycles) give them verbose names ($listIndex instead of $j)
Comment, comment, comment. If a particular snippet of code doesn't seem obvious, leave a comment. Later on you might need to review it and might not remember what it's purpose is.
Other than that, class, function and variable naming conventions are up to you and your team. Lately I've been using Zend Framework's naming conventions because they feel right to me.
Also, and when in development mode, I set an error handler that will output an error page at the slightest error (even warnings), giving me the full backtrace.
Fortunately, namespaces are in 5.3 and 6. I would highly recommend against using the Path_To_ClassName idiom. It makes messy code, and you can never change your library structure... ever.
The SPL's autoload is great. If you're organized, it can save you the typical 20-line block of includes and requires at the top of every file. You can also change things around in your code library, and as long as PHP can include from those directories, nothing breaks.
Make liberal use of === over ==. For instance:
if (array_search('needle',$array) == false) {
// it's not there, i think...
}
will give a false negative if 'needle' is at key zero. Instead:
if (array_search('needle',$array) === false) {
// it's not there!
}
will always be accurate.
See this question: Hidden Features of PHP. It has a lot of really useful PHP tips, the best of which have bubbled up to the top of the list.
There are a few things I do in PHP that tend to be PHP-specific.
Assemble strings with an array.
A lot of string manipulation is expensive in PHP, so I tend to write algorithms that reduce the discrete number of string manipulations I do. The classic example is building a string with a loop. Start with an array(), instead, and do array concatenation in the loop. Then implode() it at the end. (This also neatly solves the trailing-comma problem.)
Array constants are nifty for implementing named parameters to functions.
Enable NOTICE, and if you realy want to STRICT error reporting. It prevents a lot of errors and code smell: ini_set('display_errors', 1); error_reporting(E_ALL && $_STRICT);
Stay away from global variables
Keep as many functions as possible short. It reads easier, and is easy to maintain. Some people say that you should be able to see the whole function on your screen, or, at least, that the beginning and end curly brackets of loops and structures in the function should both be on your screen
Don't trust user input!
I've been developing with PHP (and MySQL) for the last 5 years. Most recently I started using a framework (Zend) with a solid javascript library (Dojo) and it's changed the way I work forever (in a good way, I think).
The thing that made me think of this was your first bullet: Zend framework does exactly this as it's standard way of accessing 'controllers' and 'actions'.
In terms of encapsulating and abstracting issues with different databases, Zend_Db this very well. Dojo does an excellent job of ironing out javascript inconsistencies between different browsers.
Overall, it's worth getting into good OOP techniques and using (and READING ABOUT!) frameworks has been a very hands-on way of getting to understand OOP issues.
For some standalone tools worth using, see also:
Smarty (template engine)
ADODB (database access abstraction)
Declare variables before using them!
Get to know the different types and the === operator, it's essential for some functions like strpos() and you'll start to use return false yourself.