Creating custom PHP Syntax Parser - php

I am thinking about how one would go about creating a PHP equivalent for a couple of libraries I found for CSS and JS.
One is Less CSS which is a dynamic stylesheet language. The basic idea behind Less CSS is that it allows you to create more dynamic CSS rules containing entities that "regular" CSS does not support such as mixins, functions etc and then the final Less CSS compiles those syntax into regular CSS.
Another interesting JS library which behaves in a (kind of) similar pattern is CoffeeScript where you can write "tidier & simpler" code which then gets compiled into regular Javascript.
How would one go about creating a simple similar interface for PHP? Just as a proof of concept; I am only trying to learn stuff. Lets just take a simple use case of extending classes.
class a
{
function a_test()
{
echo "This is test in a ";
}
}
class b extends a
{
function b_test()
{
parent::a_test();
echo "This is test in b";
}
}
$b = new b();
$b->b_test();
Suppose I want to let the user write class b as (just for the example):
class b[a] //would mean b extends a
{
function b_test()
{
[a_test] //would mean parent::a_test()
echo "This is test in b";
}
}
And let them later have that code "resolve" to regular PHP (Usually by running a separate command/process I would believe). My question is how would I go about creating something like this. Can it be done in PHP, would I require to use something like C/C++. How should I approach this problem if I were to go at it? Are there any resources online? Any pointers are deeply appreciated!

Language transcoders are not as easy as one might think.
The example you gave can be implemented very easily with a preg_replace that looks for class definitions and replaces [a] with extends a.
But more complex features need a transcoder which is a suite of smaller logical pieces of code.
In most programmer jargon people incorrectly call transcoders compilers but the difference between compilers and transcoders is that compilers read source code and output raw binary machine code while transcoders read source code and output (a different) source code.
The PHP (or JavaScript) runtime for example is neither compiler nor transcoder, it's an interpreter.
But enough about jargon let's talk about transcoders:
To build a transcoder you must first build a tokenizer, it breaks apart the source code into tokens, meaning that if it sees an entire word such as 'class' or the name of a class or 'function' or the name of a function, it captures that word and considers it a token. When it encounters another token such as an opening round bracket or an opening brace or a square bracket etc. it considers that another token.
Luckily all of the recognized tokens available in PHP are already easily scanned by token_get_all which is a function PHP is bundled with. You may have some trouble because PHP assumes some things about how you use symbols but all in all you can make use of this function.
The tokenizer creates a flat list of all the tokens it finds and gives it to the parser.
The parser is the second phase of your transcoder, it reads the list of tokens and decides stuff like "if token[0] is a class and token[1] is a name_value then we have a class" etc.. after running through the entire list of tokens we should have an abstract syntax tree.
The abstract syntax tree is a structure that symbolically retains only the relevant information about a the source code.
$ast = array(
'my_derived_class' => array(
'implements' => array(
'my_interface_1',
'my_interface_2',
'my_interface_3'),
'extends' => 'my_base_class',
'members' => array(
'my_property_name' => 'my_default_value',
'my_method_name' => array( /* ... */ )
)
)
);
After you get an abstract syntax tree you need to walk through it and output the destination source code.
The real tricky part is the parser which (depending on the complexity of the language you are parsing) may need a backtracking algorithm or some other form of pattern matching to differentiate similar cases against one another.
I recommend reading about this in Terence Parr' book http://pragprog.com/book/tpdsl/language-implementation-patterns which describes in detail the design patterns needed to write a transcoder.
In Terrence' book you'll find out why some languages such as HTML or CSS are much simpler (structurally) than PHP or JavaScript and how that relates the complexity of the language parser.

Related

How to find php files containing class and code outside it that would be executed during require/include?

Is there a way to detect or find files containing class and code outside it that would be executed while autoloaded or required? Like examples or so:
<?php
class SomeClass
{
private $variable;
public function __construct($variable) {
$this->variable = $variable;
}
}
// example usage
$foo = new SomeClass('test');
var_dump($foo);
// some other code
I need to find them in one older project and there are too many files to check them manually.
Thanks.
EDIT: I cannot autoload all of them (as I dont know in what files the unwanted code might be) and even that wouldnt necessary show me any executed code unless it would have some error in it.
Maybe you search for the word class and pipe the result on the word new ?
grep -r --include "*.php" 'class' | grep 'new'
The proper (and least error-prone but hard) way to do this would be using PHP's Tokenizer. You could use glob to get all the PHP files. Then go over them one by one using the tokenizer marking them as a class file when they contain the T_CLASS (or T_NAMESPACE) token, then skipping over the class itself and looking for anything but those tokens (and namespaces, comments and such) outputting you the filename when you find it.
The not so proper method (and surely one prone to false positives) would be writing your own simple parser (which is probably harder than using the Tokenizer) or using some simple regular expression with awk combination to find out stuff that is around classes and namespaces. I'm not too experienced with that though (and I'm pretty sure regexp alone isn't enough) so I can't really provide an example.

Structuring Decorators in PHP

I'm sort of a novice developer trying to expand my toolbox and learn some more tricks. I recently came across a pattern in Python called "decoration" and I was wondering if/how I could implement this in PHP as I have an existing PHP code base.
Here is a short example of what I mean:
import time
def log_calls(func):
def wrapper(*args, **kwargs):
now = time.time()
print("Calling {0} with {1} and {2}".format(
func.__name__,
args,
kwargs
))
return_value = func(*args, **kwargs)
print("Executed {0} in {1}ms".format(
func.__name__,
time.time() - now
))
return return_value
return wrapper
#log_calls
def test1(a,b,c):
print("\ttest1 called")
#log_calls
def test2(a,b):
print("\ttest2 called")
#log_calls
def test3(a,b):
print("\ttest3 called")
time.sleep(1)
test1(1,2,3)
test2(4,b=5)
test3(6,7)
It doesn't necessarily have to be that syntactically pretty; I understand that all languages have their nuances and I know PHP does not support that syntax. But I still want to be able to achieve the same effect while rewriting as little code as possible.
Basically no, it's not supported in PHP in any way at all. As far as I know, it's not even on the roadmap for future PHP versions.
Of interest, and slightly relevant: The closest I could think of in PHP-land to what you're after is if you use phpUnit to test your code. phpUnit implements something along these lines for its own use, using references in docblock type comments above a method. eg:
/**
* #dataProvider myProviderFunc
*/
public function myTestFunc($argsFromProvider) {
....
}
public function myProviderFunc() {
return array(....);
}
Thus, when phpUnit wants to call myTestFunc(), it first calls myProviderFunc(), and passes the output of that function into myTestFunc().
This strikes me as being close to, but not quite the same as the decorator syntax you're describing. However, it's not standard PHP syntax; phpUnit implements all of this stuff itself. It reads the source code and does a load of pre-processing on as it parses the comment blocks before running the tests, so it's not exactly efficient. Suitable for a unit testing tool, but not for a production system.
But no, the short answer to your question is that what you want can't be done in PHP. Sorry.
PHP has no syntactic support for the decorator pattern, but nothing really hinders you from implementing it yourself.
You can look into the following discussions, which might be relevant to your question:
how to implement a decorator in PHP?
Trying to implement (understand) the Decorator pattern in php
Here is another resource, with UML diagrams and code samples in multiple languages, including PHP.
Decorator Design Pattern

Looking for functions with PHP tokenizer

Right now, I have a script which uses PHP's tokenizer to look for certain functions within a PHP source code file. The pattern I am currently looking for is:
T_STRING + T_WHITESPACE (optional) + "("
This seems to match all of my test cases so far except variable functions, which I am ignoring for the purposes of this question.
The obvious problem here is that this pattern produces a lot of false positives, like matching function definitions:
public function foo() { // foo() should not be matched
My question is, is there a more reliable/accurate method for looking at source code and plucking out all the function invocations? Maybe a better method than using the tokenizer at all?
Edit:
In particular, I'm looking to emulate the functionality of the disable_functions PHP directive within a class file. So, if exec() should be disallowed, I'm trying to find any uses of that function within the analyzed file. I do realize that variable functions make this terribly difficult, so I am detecting these and disallowing them as well.
You first run the tokenizer (available in PHP). Then you run a parser on top of the tokens. The parser needs to read the tokens and should be able to tell your what a specific token has been used for. It depends on the reliability of your parser how reliable the outcome is.
If your current parser (you have not shown any code) is not reliable enough, you need to write a better parser. That simple it is. Probably you're not doing much more than just tokenizing and then reading as it passes which just might not be enough.
Instead of using the tokenizer, consider instead using a higher-level parser to analyze your code. For example, PHP-Parser can explicitly identify function declarations, as well as variable function calls.

Why don't PHP attributes allow functions?

I'm pretty new to PHP, but I've been programming in similar languages for years. I was flummoxed by the following:
class Foo {
public $path = array(
realpath(".")
);
}
It produced a syntax error: Parse error: syntax error, unexpected '(', expecting ')' in test.php on line 5 which is the realpath call.
But this works fine:
$path = array(
realpath(".")
);
After banging my head against this for a while, I was told you can't call functions in an attribute default; you have to do it in __construct. My question is: why?! Is this a "feature" or sloppy implementation? What's the rationale?
The compiler code suggests that this is by design, though I don't know what the official reasoning behind that is. I'm also not sure how much effort it would take to reliably implement this functionality, but there are definitely some limitations in the way that things are currently done.
Though my knowledge of the PHP compiler isn't extensive, I'm going try and illustrate what I believe goes on so that you can see where there is an issue. Your code sample makes a good candidate for this process, so we'll be using that:
class Foo {
public $path = array(
realpath(".")
);
}
As you're well aware, this causes a syntax error. This is a result of the PHP grammar, which makes the following relevant definition:
class_variable_declaration:
//...
| T_VARIABLE '=' static_scalar //...
;
So, when defining the values of variables such as $path, the expected value must match the definition of a static scalar. Unsurprisingly, this is somewhat of a misnomer given that the definition of a static scalar also includes array types whose values are also static scalars:
static_scalar: /* compile-time evaluated scalars */
//...
| T_ARRAY '(' static_array_pair_list ')' // ...
//...
;
Let's assume for a second that the grammar was different, and the noted line in the class variable delcaration rule looked something more like the following which would match your code sample (despite breaking otherwise valid assignments):
class_variable_declaration:
//...
| T_VARIABLE '=' T_ARRAY '(' array_pair_list ')' // ...
;
After recompiling PHP, the sample script would no longer fail with that syntax error. Instead, it would fail with the compile time error "Invalid binding type". Since the code is now valid based on the grammar, this indicates that there actually is something specific in the design of the compiler that's causing trouble. To figure out what that is, let's revert to the original grammar for a moment and imagine that the code sample had a valid assignment of $path = array( 2 );.
Using the grammar as a guide, it's possible to walk through the actions invoked in the compiler code when parsing this code sample. I've left some less important parts out, but the process looks something like this:
// ...
// Begins the class declaration
zend_do_begin_class_declaration(znode, "Foo", znode);
// Set some modifiers on the current znode...
// ...
// Create the array
array_init(znode);
// Add the value we specified
zend_do_add_static_array_element(znode, NULL, 2);
// Declare the property as a member of the class
zend_do_declare_property('$path', znode);
// End the class declaration
zend_do_end_class_declaration(znode, "Foo");
// ...
zend_do_early_binding();
// ...
zend_do_end_compilation();
While the compiler does a lot in these various methods, it's important to note a few things.
A call to zend_do_begin_class_declaration() results in a call to get_next_op(). This means that it adds a new opcode to the current opcode array.
array_init() and zend_do_add_static_array_element() do not generate new opcodes. Instead, the array is immediately created and added to the current class' properties table. Method declarations work in a similar way, via a special case in zend_do_begin_function_declaration().
zend_do_early_binding() consumes the last opcode on the current opcode array, checking for one of the following types before setting it to a NOP:
ZEND_DECLARE_FUNCTION
ZEND_DECLARE_CLASS
ZEND_DECLARE_INHERITED_CLASS
ZEND_VERIFY_ABSTRACT_CLASS
ZEND_ADD_INTERFACE
Note that in the last case, if the opcode type is not one of the expected types, an error is thrown – The "Invalid binding type" error. From this, we can tell that allowing the non-static values to be assigned somehow causes the last opcode to be something other than expected. So, what happens when we use a non-static array with the modified grammar?
Instead of calling array_init(), the compiler prepares the arguments and calls zend_do_init_array(). This in turn calls get_next_op() and adds a new INIT_ARRAY opcode, producing something like the following:
DECLARE_CLASS 'Foo'
SEND_VAL '.'
DO_FCALL 'realpath'
INIT_ARRAY
Herein lies the root of the problem. By adding these opcodes, zend_do_early_binding() gets an unexpected input and throws an exception. As the process of early binding class and function definitions seems fairly integral to the PHP compilation process, it can't just be ignored (though the DECLARE_CLASS production/consumption is kind of messy). Likewise, it's not practical to try and evaluate these additional opcodes inline (you can't be sure that a given function or class has been resolved yet), so there's no way to avoid generating the opcodes.
A potential solution would be to build a new opcode array that was scoped to the class variable declaration, similar to how method definitions are handled. The problem with doing that is deciding when to evaluate such a run-once sequence. Would it be done when the file containing the class is loaded, when the property is first accessed, or when an object of that type is constructed?
As you've pointed out, other dynamic languages have found a way to handle this scenario, so it's not impossible to make that decision and get it to work. From what I can tell though, doing so in the case of PHP wouldn't be a one-line fix, and the language designers seem to have decided that it wasn't something worth including at this point.
My question is: why?! Is this a "feature" or sloppy implementation?
I'd say it's definitely a feature. A class definition is a code blueprint, and not supposed to execute code at the time of is definition. It would break the object's abstraction and encapsulation.
However, this is only my view. I can't say for sure what idea the developers had when defining this.
You can probably achieve something similar like this:
class Foo
{
public $path = __DIR__;
}
IIRC __DIR__ needs php 5.3+, __FILE__ has been around longer
It's a sloppy parser implementation. I don't have the correct terminology to describe it (I think the term "beta reduction" fits in somehow...), but the PHP language parser is more complex and more complicated than it needs to be, and so all sorts of special-casing is required for different language constructs.
My guess would be that you won't be able to have a correct stack trace if the error does not occur on an executable line... Since there can't be any error with initializing values with constants, there's no problem with that, but function can throw exceptions/errors and need to be called within an executable line, and not a declarative one.

What are some useful PHP Idioms?

I'm looking to improve my PHP coding and am wondering what PHP-specific techniques other programmers use to improve productivity or workaround PHP limitations.
Some examples:
Class naming convention to handle namespaces: Part1_Part2_ClassName maps to file Part1/Part2/ClassName.php
if ( count($arrayName) ) // handles $arrayName being unset or empty
Variable function names, e.g. $func = 'foo'; $func($bar); // calls foo($bar);
Ultimately, you'll get the most out of PHP first by learning generally good programming practices, before focusing on anything PHP-specific. Having said that...
Apply liberally for fun and profit:
Iterators in foreach loops. There's almost never a wrong time.
Design around class autoloading. Use spl_autoload_register(), not __autoload(). For bonus points, have it scan a directory tree recursively, then feel free to reorganize your classes into a more logical directory structure.
Typehint everywhere. Use assertions for scalars.
function f(SomeClass $x, array $y, $z) {
assert(is_bool($z))
}
Output something other than HTML.
header('Content-type: text/xml'); // or text/css, application/pdf, or...
Learn to use exceptions. Write an error handler that converts errors into exceptions.
Replace your define() global constants with class constants.
Replace your Unix timestamps with a proper Date class.
In long functions, unset() variables when you're done with them.
Use with guilty pleasure:
Loop over an object's data members like an array. Feel guilty that they aren't declared private. This isn't some heathen language like Python or Lisp.
Use output buffers for assembling long strings.
ob_start();
echo "whatever\n";
debug_print_backtrace();
$s = ob_get_clean();
Avoid unless absolutely necessary, and probably not even then, unless you really hate maintenance programmers, and yourself:
Magic methods (__get, __set, __call)
extract()
Structured arrays -- use an object
My experience with PHP has taught me a few things. To name a few:
Always output errors. These are the first two lines of my typical project (in development mode):
ini_set('display_errors', '1');
error_reporting(E_ALL);
Never use automagic. Stuff like autoLoad may bite you in the future.
Always require dependent classes using require_once. That way you can be sure you'll have your dependencies straight.
Use if(isset($array[$key])) instead of if($array[$key]). The second will raise a warning if the key isn't defined.
When defining variables (even with for cycles) give them verbose names ($listIndex instead of $j)
Comment, comment, comment. If a particular snippet of code doesn't seem obvious, leave a comment. Later on you might need to review it and might not remember what it's purpose is.
Other than that, class, function and variable naming conventions are up to you and your team. Lately I've been using Zend Framework's naming conventions because they feel right to me.
Also, and when in development mode, I set an error handler that will output an error page at the slightest error (even warnings), giving me the full backtrace.
Fortunately, namespaces are in 5.3 and 6. I would highly recommend against using the Path_To_ClassName idiom. It makes messy code, and you can never change your library structure... ever.
The SPL's autoload is great. If you're organized, it can save you the typical 20-line block of includes and requires at the top of every file. You can also change things around in your code library, and as long as PHP can include from those directories, nothing breaks.
Make liberal use of === over ==. For instance:
if (array_search('needle',$array) == false) {
// it's not there, i think...
}
will give a false negative if 'needle' is at key zero. Instead:
if (array_search('needle',$array) === false) {
// it's not there!
}
will always be accurate.
See this question: Hidden Features of PHP. It has a lot of really useful PHP tips, the best of which have bubbled up to the top of the list.
There are a few things I do in PHP that tend to be PHP-specific.
Assemble strings with an array.
A lot of string manipulation is expensive in PHP, so I tend to write algorithms that reduce the discrete number of string manipulations I do. The classic example is building a string with a loop. Start with an array(), instead, and do array concatenation in the loop. Then implode() it at the end. (This also neatly solves the trailing-comma problem.)
Array constants are nifty for implementing named parameters to functions.
Enable NOTICE, and if you realy want to STRICT error reporting. It prevents a lot of errors and code smell: ini_set('display_errors', 1); error_reporting(E_ALL && $_STRICT);
Stay away from global variables
Keep as many functions as possible short. It reads easier, and is easy to maintain. Some people say that you should be able to see the whole function on your screen, or, at least, that the beginning and end curly brackets of loops and structures in the function should both be on your screen
Don't trust user input!
I've been developing with PHP (and MySQL) for the last 5 years. Most recently I started using a framework (Zend) with a solid javascript library (Dojo) and it's changed the way I work forever (in a good way, I think).
The thing that made me think of this was your first bullet: Zend framework does exactly this as it's standard way of accessing 'controllers' and 'actions'.
In terms of encapsulating and abstracting issues with different databases, Zend_Db this very well. Dojo does an excellent job of ironing out javascript inconsistencies between different browsers.
Overall, it's worth getting into good OOP techniques and using (and READING ABOUT!) frameworks has been a very hands-on way of getting to understand OOP issues.
For some standalone tools worth using, see also:
Smarty (template engine)
ADODB (database access abstraction)
Declare variables before using them!
Get to know the different types and the === operator, it's essential for some functions like strpos() and you'll start to use return false yourself.

Categories