I'd like to parse PHP code with a regex to find all the top-level functions declared in our codebase.
The simple:
^\s*function\s*([\w_-]+)\(
works pretty well, but catches the extra
class Foo {
function bar() {...}
}
Any ideas on how to skip non-top-level functions that don't have scope declared?
Note: I know, I know, I should use a real parser but I want something quick and dirty that can run in grep -R -P over a very large codebase.
On a well-indented code base,
^function\s*([\w_-]+)\(
should catch only top-level functions. If you expect leading spaces, you could use a zero-width negative look-behind for a {, to avoid functions right at the beginning of a class declaration:
(?<!{)\s*function\s*([\w_-]+)\(
First of all, I have to say that this sort of thing depends largely on how disciplined your code is. Personally, I start all top-level functions at the very beginning of a line. So if I want to find non-top-level functions (in vim), I simply do
/^[[:space:]]\+function[[:space:]]\+\w\+\>
and
/^function[[:space:]]\+\w\+\>
for all top-level functions.
However, as I said, it depends on how well formatted your codebase is. Good luck!
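As a minimal sketch of the "starts in column 0" approach in PHP itself: this assumes a consistently indented code base where methods are always indented, and the helper name find_top_level_functions and the sample source are made up for illustration.

```php
<?php
// Hypothetical helper: list the names of functions declared in column 0.
// Assumes methods inside classes are always indented.
function find_top_level_functions(string $source): array
{
    // The /m modifier makes ^ match at the start of every line.
    // &? allows for functions that return by reference.
    preg_match_all('/^function\s+&?\s*(\w+)\s*\(/m', $source, $matches);
    return $matches[1];
}

$sample = <<<'PHP'
function top_one() {}
class Foo {
    function bar() {}
}
function top_two($x) { return $x; }
PHP;

print_r(find_top_level_functions($sample));
```

On this sample only top_one and top_two are reported; the indented bar() is skipped. The same pattern should work with grep -R -P, since grep matches line by line.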
If you're willing to use ruby (or basically anything with named capture groups), you could use something like this:
^\s*(?<type>\w+)\s*(?<name>[\w_-]+)(?<function>\([^()]*\))?\s*(?<body>{((?>[^{}]+)|(\g<body>))*})
The ones that are functions will have brackets in the function capturing group. The ones that are classes won't.
http://rubular.com/r/3dXZts6OYF
Extremely brittle though.
So I am working on a simple micro language/alternative syntax for PHP.
Its syntax takes a lot from JavaScript and CoffeeScript including a few of my own concepts. I have hand written the parser (no parser generator used) in PHP to convert the code into PHP then execute it. It is more of a proof of concept/learning tool rather than anything else but I'd be lying if I said I didn't want to see it used on an actual project one day.
Anyway here is a little problem I have come across that I thought I would impose on you great intellects:
As you know in PHP the period ( . ) is used for string concatenation. However in JavaScript it is used for method chaining.
Now one thing that annoys me in PHP is having to use that bloody arrow (->) for my method chains, so I went the JavaScript way and implemented the period (.) for use with objects.
(I think you can see the problem already)
Because I'm currently only writing a 'dumb' parser that merely does a huge search and replace, there is no way to distinguish whether a period (.) is being used for concatenation or for method chaining.
"So if you are trying to be like JavaScript, just use the addition (+) operator Franky!", I hear you scream. Well I would but because the addition (+) operator is used for math in PHP I would merely be putting myself in the same situation.
Unless I can make my parser smart enough (with a crap load of work) to know that when the addition (+) operator is working with integers then don't convert it into a period (.) for concatenation I am pretty much screwed.
But here is the cool thing. Because this is pretty much a new language. I don't have to use the period or addition operator for concatenation.
So my question is: If I was to decide to introduce a new method of string concatenation, what character would make the most sense?
Does it have to be one character? .. could work!
Any myriad of combinations, like ~~ or >: even!
If you don't want to use + or ., then I would recommend ^ because that's used in some other languages for string concatenation and I don't believe that it's used for anything in PHP.
Edit: It's been pointed out that it's used for XOR. One option would be to use ^ anyway since bitwise XOR is not commonly used and then to map something else like ^^ to XOR. Another option would be to use .. for concatenation. The problem is that the single characters are mostly taken.
Another option would be to use +, but map it to a function which concatenates when one argument is a string and adds otherwise. In order to not break things which rely on strings which are numbers being treated as their values, we should probably treat numeric strings as numbers for these purposes. Here's the function that I would use.
function smart_add($arg1, $arg2) {
    // is_numeric() is a plain function, not a method on the value
    if (is_numeric($arg1) && is_numeric($arg2)) {
        return $arg1 + $arg2;
    } else {
        return $arg1 . $arg2;
    }
}
Then a + b + c + d just gets turned into smart_add(smart_add(smart_add(a,b),c),d)
This may not be perfect in all cases, but it should work pretty well most of the time and has clear rules for use.
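To make the left-to-right rewriting concrete, here is a small sketch. smart_add is repeated so the snippet runs on its own, and smart_add_all is a hypothetical helper standing in for what the translator would emit for a chain of + operators.

```php
<?php
function smart_add($arg1, $arg2)
{
    if (is_numeric($arg1) && is_numeric($arg2)) {
        return $arg1 + $arg2;   // both look like numbers: add
    }
    return $arg1 . $arg2;       // otherwise: concatenate
}

// a + b + c + d  ==>  smart_add(smart_add(smart_add(a, b), c), d)
function smart_add_all(array $operands)
{
    $first = array_shift($operands);
    return array_reduce($operands, 'smart_add', $first);
}

var_dump(smart_add_all([1, 2, '3']));   // int(6)
var_dump(smart_add_all(['a', 1, 2]));   // string(3) "a12"
var_dump(smart_add_all([1, 2, 'a']));   // string(2) "3a"
```

The last two calls show one of the imperfect cases: because evaluation goes left to right, the position of the first non-numeric operand decides whether the numbers before it get added or concatenated.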
So my question is: If I was to decide to introduce a new method of
string concatenation, what character would make the most sense?
As you're well aware, you'll need to choose a character that is not being used as one of PHP's operators. Since string concatenation is a common technique, I would try to avoid using characters that you need to press SHIFT to type, as those characters will be a hindrance.
Instead of trying to assign one character for string concatenation (as most are already in use), perhaps you should define your own syntax for string concatenation (or any other operation you need to overwrite with a different operator), as a shorthand operator (sort of). Something like:
[$string, $string]
Should be easy to pick up by a parser and form the resulting concatenated string.
Edit: I should also note that whether you're using literal strings or variables, there's no way (as far as I know) to confuse this syntax with any other PHP functionality, since the comma in the middle is invalid for array manipulations. So, all of the following would still be recognized as string concatenation and not something else in PHP.
["stack", "overflow"]
["stack", $overflow]
[$stack, $overflow]
Edit: Since this conflicts with JSON notation (and, from PHP 5.4 on, with the short array syntax), the following alternative variations exist:
Changing the delimiter
Omitting the delimiter
Example:
[$stack $overflow $string $concatenation] // Use nothing (but really need space)
Before passing a string to eval() I would like to make sure the syntax is correct and allow:
Two functions: a() and b()
Four operators: /*-+
Brackets: ()
Numbers: 1.2, -1, 1
How can I do this, maybe it has something to do with PHP Tokenizer?
I'm actually trying to make a simple formula interpreter so a() and b() will be replaced by ln() and exp(). I don't want to write a tokenizer and parser from scratch.
As far as validation is concerned, the following character tokens are valid:
operator: [/*+-]
funcs: (a\(|b\()
brackets: [()]
numbers: \d+(\.\d+)?
space: [ ]
A simple validation could then check if the input string matches any combination of these patterns. Because the funcs token is pretty precise and does not clash much with other tokens, this validation should be quite stable without the need to implement any syntax/grammar yet:
$tokens = array(
'operator' => '[/*+-]',
'funcs' => '(a\(|b\()',
'brackets' => '[()]',
'numbers' => '\d+(\.\d+)?',
'space' => '[ ]',
);
$pattern = '';
foreach($tokens as $token)
{
$pattern .= sprintf('|(?:%s)', $token);
}
$pattern = sprintf('~^(%s)*$~', ltrim($pattern, '|'));
echo $pattern;
The input validates only if the whole string matches the token-based pattern. It still might be syntactically wrong PHP, but you can ensure it is built only from the specified tokens:
~^((?:[/*+-])|(?:(a\(|b\())|(?:[()])|(?:\d+(\.\d+)?)|(?:[ ]))*$~
If you build the pattern dynamically - as in the example - you're able to modify your language tokens later on more easily.
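A hedged sketch of how the generated pattern could be used as a gatekeeper. The function name is_token_valid is made up, and two details differ from the pattern shown above and are my own assumptions: the D modifier, so that $ does not tolerate a trailing newline, and possessive quantifiers on the digits to avoid heavy backtracking on long digit runs.

```php
<?php
$tokens = array(
    'operator' => '[/*+-]',
    'funcs'    => 'a\(|b\(',
    'brackets' => '[()]',
    'numbers'  => '\d++(?:\.\d++)?',   // possessive: assumption, see above
    'space'    => '[ ]',
);

$parts = array();
foreach ($tokens as $token) {
    $parts[] = sprintf('(?:%s)', $token);
}
// D modifier: $ matches only at the very end of the input.
$pattern = sprintf('~^(?:%s)*$~D', implode('|', $parts));

// Hypothetical helper: true if the input consists of known tokens only.
function is_token_valid(string $input, string $pattern): bool
{
    return preg_match($pattern, $input) === 1;
}

var_dump(is_token_valid('a(1.5) + b(2) * (3 - 1)', $pattern)); // bool(true)
var_dump(is_token_valid('a(1) + system(1)', $pattern));        // bool(false)
var_dump(is_token_valid('a(+*)', $pattern));                   // bool(true)
```

The last call illustrates the limit of this check: a(+*) is made of valid tokens only, so it passes, even though it is not a valid expression.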
Additionally, this can be the first step to your own tokenizer / lexer. The token stream can then be passed on to a parser, which can syntactically validate and interpret it. That's the part user187291 wrote about.
As an alternative to writing a full lexer and parser, if you need to validate the syntax, you can formulate your grammar in terms of tokens as well and then do a regex-based grammar check on the token representation of the input.
The tokens are the words you use in your grammar. You will need to describe parentheses and function definitions more precisely than in tokens, and the tokenizer should follow clearer rules about which token supersedes another. The concept is outlined in another question of mine. It uses regex as well for grammar formulation and syntax validation, but it still does not parse. In your case eval would be the parser you're making use of.
Parser generators have indeed already been written for PHP, and "LIME" in particular comes with the typical "calculator" example, which would be an obvious starting point for your "mini language": http://sourceforge.net/projects/lime-php/
It's been years since I last played with LIME, but it was already mature & stable then.
Notes:
1) Using a full-on parser generator gives you the advantage of avoiding PHP eval() entirely if you wish - you can make LIME emit a parser which effectively provides an "eval" function for expressions written in your mini language (with validation baked in). This gives you the additional advantage of allowing you to add support for new functions, as needed.
2) It may seem like overkill at first to use a parser generator for such an apparently small task, but once you get the examples working you'll be impressed by how easy it is to modify and extend them. And it's very easy to underestimate the difficulty of writing a bug-free parser (even a "trivial" one) from scratch.
Yes, you need the Tokenizer, or something similar, but it's only part of the story. A tokenizer (more commonly called "lexer") can only read and parse elements of an expression, but has no means to detect that something like "foo()+*bar)" is invalid. You need the second part, called parser, which would be able to arrange tokens in a kind of a tree (called "AST") or provide an error message when failing to do so. Ironically, once you've got a tree, "eval" is not needed anymore; you can evaluate your expression directly from the tree.
I would recommend you to write a parser by hand because it's a very useful exercise and a lot of fun. Recursive descent parsers are quite easy to program.
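To give an idea of the size of such a parser, here is a rough sketch for the formula language from the question. The class name FormulaParser and the grammar in the comments are my own assumptions; it evaluates while parsing instead of building a tree, and maps a() to log() and b() to exp() as the question describes. It needs PHP 7.4 or later for the typed properties.

```php
<?php
// Hypothetical recursive descent evaluator: numbers, + - * /,
// parentheses, a(x) = ln(x), b(x) = exp(x).
final class FormulaParser
{
    private array $tokens;
    private int $pos = 0;

    public function __construct(string $input)
    {
        $lexeme = '~\d+(?:\.\d+)?|[ab()+*/-]~';
        // Anything left after removing lexemes and whitespace is illegal.
        if (trim(preg_replace($lexeme, '', $input)) !== '') {
            throw new InvalidArgumentException('Unexpected character in input');
        }
        preg_match_all($lexeme, $input, $m);
        $this->tokens = $m[0];
    }

    public function evaluate(): float
    {
        $value = $this->expr();
        if ($this->peek() !== null) {
            throw new InvalidArgumentException('Unexpected token: ' . $this->peek());
        }
        return $value;
    }

    private function peek(): ?string
    {
        return $this->tokens[$this->pos] ?? null;
    }

    private function expect(string $token): void
    {
        if ($this->peek() !== $token) {
            throw new InvalidArgumentException("Expected '$token'");
        }
        $this->pos++;
    }

    // expr := term (('+' | '-') term)*
    private function expr(): float
    {
        $value = $this->term();
        while (in_array($this->peek(), ['+', '-'], true)) {
            $op = $this->tokens[$this->pos++];
            $rhs = $this->term();
            $value = $op === '+' ? $value + $rhs : $value - $rhs;
        }
        return $value;
    }

    // term := factor (('*' | '/') factor)*
    private function term(): float
    {
        $value = $this->factor();
        while (in_array($this->peek(), ['*', '/'], true)) {
            $op = $this->tokens[$this->pos++];
            $rhs = $this->factor();
            $value = $op === '*' ? $value * $rhs : $value / $rhs;
        }
        return $value;
    }

    // factor := '-' factor | number | '(' expr ')' | ('a' | 'b') '(' expr ')'
    private function factor(): float
    {
        $token = $this->peek();
        if ($token === null) {
            throw new InvalidArgumentException('Unexpected end of input');
        }
        $this->pos++;
        if ($token === '-') {
            return -$this->factor();
        }
        if (is_numeric($token)) {
            return (float) $token;
        }
        if ($token === 'a' || $token === 'b') {
            $this->expect('(');
            $arg = $this->expr();
            $this->expect(')');
            return $token === 'a' ? log($arg) : exp($arg);
        }
        if ($token === '(') {
            $value = $this->expr();
            $this->expect(')');
            return $value;
        }
        throw new InvalidArgumentException("Unexpected token: $token");
    }
}

echo (new FormulaParser('b(0) + 2 * (3 - 1)'))->evaluate(), "\n"; // prints 5
```

Each grammar rule becomes one method, which is why this style is easy to extend. An input such as a(+*) is rejected with an exception, and eval() is never called.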
You could use token_get_all(), inspect each token, and abort at the first invalid token.
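A minimal sketch of that whitelist idea, assuming the tokenizer extension is enabled. The function name is_safe_formula is hypothetical. Note that this only checks which tokens occur, not whether they form a valid expression, and that PHP's own number tokens also cover forms such as hex or exponent notation.

```php
<?php
// Hypothetical whitelist check built on token_get_all().
function is_safe_formula(string $expr): bool
{
    $allowedChars = ['(', ')', '+', '-', '*', '/'];
    $allowedNames = ['a', 'b'];

    foreach (token_get_all('<?php ' . $expr) as $token) {
        if (is_string($token)) {
            // Single-character tokens arrive as plain strings.
            if (!in_array($token, $allowedChars, true)) {
                return false;
            }
            continue;
        }
        [$id, $text] = $token;
        switch ($id) {
            case T_OPEN_TAG:     // the '<?php ' prefix added above
            case T_WHITESPACE:
            case T_LNUMBER:      // integers
            case T_DNUMBER:      // floats
                break;
            case T_STRING:       // bare names: only a and b are allowed
                if (!in_array($text, $allowedNames, true)) {
                    return false;
                }
                break;
            default:
                return false;    // abort at the first token not on the list
        }
    }
    return true;
}

var_dump(is_safe_formula('a(1.5) + b(2) * (3 - 1)')); // bool(true)
var_dump(is_safe_formula('system(1)'));               // bool(false)
```

Comments, variables, strings and backticks all produce token types that fall into the default branch, so they are rejected.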
hakre's answer, using regexes is a nice solution, but is a wee bit complicated. Also handling a whitelist of functions becomes rather messy. And if this does go wrong it could have a very nasty effect on your system.
Is there a reason you don't use the javascript 'eval' instead?
I would like to use paredit in combination with php-mode, but it doesn't add a closing curly bracket "}". Might this have something to do with the fact that an electric brace is bound to "{"? And how could I overwrite this?
Thanks.
Using paredit in php-mode is a bad idea - it's mostly suited for Lisp code editing. There is a very nice alternative for general purpose development though - autopair-mode. It's very easy to use and inserts braces, brackets and quotes in a manner similar to the one present in most IDEs.
Some time ago, I wrote such a thing for C, but you can easily use it for PHP as well:
(define-minor-mode c-helpers-minor-mode
  "This mode contains little helpers for C development."
  nil
  ""
  ;; Backquote so that (kbd "{") is evaluated when the keymap is built.
  `((,(kbd "{") . insert-c-block-parentheses)))

(defun insert-c-block-parentheses ()
  "Insert a brace pair and leave point on the indented line between them."
  (interactive)
  (insert "{")
  (newline)
  (newline)
  (insert "}")
  (indent-for-tab-command)
  (previous-line)
  (indent-for-tab-command))

(add-hook 'php-mode-hook 'c-helpers-minor-mode)
In my experience, autopair-mode felt extremely sluggish when a large number of buffers were open (plus, paredit-mode ensures that delimiters are never unbalanced, unlike autopair-mode). So if, like me, you absolutely want to use paredit-mode and nothing else will do, have a look at this answer. In the elisp snippet given there, just replace slime-repl-mode-map and slime-repl-mode-hook with the corresponding variables for php (most likely php-mode-map and php-mode-hook)
Let's assume I do preg_replace as follows:
preg_replace ("/<my_tag>(.*)<\/my_tag>/U", "<my_new_tag>$1</my_new_tag>", $source);
That works but I do also want to grab the attribute of the my_tag - how would I do it with this:
<my_tag my_attribute_that_know_the_name_of="some_value">tra-la-la</my_tag>
You don't use regex. You use a real parser, because this stuff cannot be parsed with regular expressions. You'll never know if you've got all the corner cases quite right and then your regex has turned into a giant bloated monster and you'll wish you'd just taken fredley's advice and used a real parser.
For a humorous take, see this famous post.
preg_replace('#<my_tag\b([^>]*)>(.*?)</my_tag>#',
'<my_new_tag$1>$2</my_new_tag>', $source)
The ([^>]*) captures anything after the tag name and before the closing >. Of course, > is legal inside HTML attribute values, so watch out for that (but I've never seen it in the wild). The \b prevents matches of tag names that happen to start with my_tag, preventing bogus matches like this:
<my_tag_xyz>ooga-booga</my_tag_xyz><my_tag>tra-la-la</my_tag>
But that will still break on <my_tag> elements wrapped in other <my_tag> elements, yielding results like this:
<my_tag><my_tag>tra-la-la</my_tag>
If you know you'll never need to match tags with other tags inside them, you can replace the (.*?) with ([^<>]++).
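For illustration, here is the replacement run on a small made-up input, plus one way to read the attribute value itself. The sample string and the attribute name my_attribute are assumptions, and the caveats above about nesting and > inside attribute values still apply.

```php
<?php
$source = '<my_tag my_attribute="some_value">tra-la-la</my_tag>'
        . '<my_tag_xyz>ooga-booga</my_tag_xyz>';

// $1 carries everything between the tag name and the closing >,
// so the attributes are copied over to the new tag unchanged.
$result = preg_replace(
    '#<my_tag\b([^>]*)>(.*?)</my_tag>#',
    '<my_new_tag$1>$2</my_new_tag>',
    $source
);
echo $result, "\n";

// Reading a single, known attribute out of the opening tag.
preg_match('#<my_tag\b[^>]*\bmy_attribute="([^"]*)"#', $source, $m);
echo $m[1], "\n"; // some_value
```

The my_tag_xyz element is left alone because of the \b after the tag name.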
I get pretty tired of the glib "don't do that" answers too, but as you can see, there are good reasons behind them--I could come up with this many more without having to consult any references. When you ask "How do I do this?" with no background or qualification, we have no idea how much of this you already know.
Forget regex's, use this instead:
http://simplehtmldom.sourceforge.net/
I've created a simple template 'engine' in PHP to substitute PHP-generated data into the HTML page. Here's how it works:
In my main template file, I have variables like so:
<title><!-- %{title}% --></title>
I then assign data into those variables for the main page load
$assign = array (
'title' => 'my website - '
);
I then have separate template blocks that get loaded for the content pages. The above really just handles the header and the footer. In one of these 'content template files', I have variables like so:
<!-- %{title=content page}% -->
Once this gets executed, the main template data is edited to include the content page variables resulting in:
<title>my website - content page</title>
It does this with the following code:
if (preg_match('/<!-- %{title=\s*(.*?)}% -->/s', $string, $matches)) {
    // Find variable names in the form of %{varName=new data to append}%
    // If found, append that new data to the existing data
    $string = preg_replace('/<!-- %{title=\s*(.*?)}% -->/s', '', $string);
    $varData[$i] .= $matches[1];
}
This basically removes the template variables and then assigns the variable data to the existing variable. Now, this all works fine. What I'm having issues with is nesting template variables. If I do something like:
<!-- %{title=content page (author: <!-- %{name}% -->) -->
The pattern, at times, messes up the opening and closing tags of each variable.
How can I fix my regular expression to prevent this?
Thank you.
The answer is you don't do this with regex. Regular expressions describe regular languages. Once you start nesting things, the language is no longer regular. It is, at a minimum, a context-free language ("CFL"). CFLs can only be processed (assuming they're unambiguous) with a stack.
Specifically, regular languages can be represented with a finite state machine ("FSM"). CFLs require a pushdown automaton ("PDA").
An example of the difference is nested tags in HTML:
<div>
<div>inner</div>
</div>
My advice is don't write your own template language. That's been done. Many times. Use Smarty or something in Zend, Kohana or whatever. If you do write your own, do it properly. Parse it.
Why are you rolling your own template engine? If you want this kind of complexity, there's a lot of places that have already come up with solutions for it. You should just plug in Smarty or something like that.
If you're asking what I think you're asking, it's literally impossible. If I read your question correctly, you want to match arbitrarily-nested <!-- ... --> sequences with particular things inside. Unfortunately, regular expressions can only match certain classes of strings; any regular expression can match only a regular language. One well-known language which is not regular is the language of balanced parentheses (also known as the Dyck language), which is exactly what you're trying to match. In order to match arbitrarily-nested comment strings, you need a more powerful tool. I'm fairly sure there are pre-existing PHP template engines; you might look into one of those.
To resolve your problem you should
replace preg_match() with preg_match_all();
find the pattern, and replace them from the last one to the first one;
use a more restrictive pattern like '/<!-- %{title=\s*([^}]*?)}% -->/s'.
I've done something similar in the past, and I have encountered the same nesting issue you did. In your case, what I would do is repeatedly search your text for matches (rather than searching once and looping through the matches) and extract the strings you want by searching for anything that doesn't include your closing string.
In your case, it would probably look like this:
/(<!--([^(-->)]*?)-->)/
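Regexes like this are a nightmare to explain, but basically, ([^(-->)]*?) is meant to find any string that doesn't include your closing tag (let's call that AAA). It will be inside a matching group that is, itself, your template tag, (<!--AAA-->). Be aware that [^...] is a character class, so it actually rejects individual characters such as (, ), - and > rather than the sequence -->; a negative lookahead such as (?:(?!-->).)* excludes the sequence itself.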
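The "search repeatedly for a tag with no other tag inside it" idea could be sketched like this. It uses a negative lookahead instead of the character class, and it assumes the <!-- %{name}% --> delimiters from the question; the function name resolve_innermost and the $vars array are made up for the example.

```php
<?php
// Hypothetical helper: keep replacing innermost variables until none are left.
function resolve_innermost(string $text, array $vars): string
{
    // The body may contain neither an opening nor a closing delimiter,
    // so only innermost variables can match.
    $pattern = '~<!-- %\{((?:(?!<!-- %\{|\}% -->).)*)\}% -->~s';
    do {
        $text = preg_replace_callback(
            $pattern,
            function (array $m) use ($vars) {
                return $vars[$m[1]] ?? '';   // unknown names become empty
            },
            $text,
            -1,
            $count
        );
    } while ($count > 0);
    return $text;
}

$vars = ['key' => 'name', 'name' => 'Alice'];
echo resolve_innermost('Hi <!-- %{<!-- %{key}% -->}% -->!', $vars), "\n"; // Hi Alice!
```

Each pass resolves one nesting level. The loop only terminates if the replacement values do not themselves keep introducing new template variables.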
I'm convinced this sort of templating method is the wrong way to do things, but I've never known enough to do it better. It's always bothered me in ASP and ColdFusion that you had to nest your scripting tags inside HTML and when I started to do it myself, I considered it a personal failure.
Most Regexes I do now are in JavaScript and so I may be missing some of the awesome nuances PHP has via Perl. I'd be happy if someone can write this more cleanly.
I too have run into this problem in the past, although I didn't use regular expressions.
If instead you search from right to left for the opening tag, <!-- %{ in your syntax, using strrpos (PHP5+), then search forwards for the first occurrence of the next closing tag, and then replace that chunk first, you will end up replacing the inner-most nested variables first. This should resolve your problem.
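A rough sketch of that right-to-left search, assuming well-formed delimiters of the form <!-- %{ ... }% -->. The function name resolve_nested, the callback and the sample template are hypothetical.

```php
<?php
// Hypothetical helper: resolve template variables innermost-first.
// $resolve receives the text between the delimiters and returns its replacement.
function resolve_nested(string $text, callable $resolve): string
{
    $open = '<!-- %{';
    $close = '}% -->';

    // The right-most opening tag cannot contain another opening tag,
    // so the chunk it starts is always an innermost variable.
    while (($start = strrpos($text, $open)) !== false) {
        $end = strpos($text, $close, $start);
        if ($end === false) {
            break; // unbalanced template: leave the rest untouched
        }
        $bodyStart = $start + strlen($open);
        $inner = substr($text, $bodyStart, $end - $bodyStart);
        $text = substr_replace(
            $text,
            $resolve($inner),
            $start,
            $end + strlen($close) - $start
        );
    }
    return $text;
}

$vars = ['name' => 'Alice'];
$assigned = [];
$out = resolve_nested(
    'X<!-- %{title=content page (author: <!-- %{name}% -->)}% -->Y',
    function (string $inner) use ($vars, &$assigned): string {
        if (strpos($inner, '=') !== false) {
            [$key, $value] = explode('=', $inner, 2);
            $assigned[$key] = $value;   // remember "name=value" assignments
            return '';                  // and drop them from the output
        }
        return $vars[$inner] ?? '';
    }
);

echo $out, "\n";               // XY
echo $assigned['title'], "\n"; // content page (author: Alice)
```

As with any loop of this kind, it assumes that the replacement text does not reintroduce opening delimiters indefinitely.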
You can also do it the other way around and find the first occurrence of a closing tag, and work backwards to find its corresponding opening tag.