parse/collator for php - php

I'm pretty much a newbie at php (at the "install an app and try to tweak it a bit" stage).
Is there a tool anywhere which can take a script which is spread over many files and show you all the code which is processed (for a given set of arguments passed to the script) in a single output?
For example, I want to make a call to zen cart from a script in a different language, which returns a category listing without any surrounding page. So I want to be able to trace what the actual process is to generate that then strip off all the unwanted bits to create a custom script.

One thing I've found very helpful when looking at new / complicated codebases is to use an IDE with some sort of code intelligence. I use php eclipse, and what it does is allow you to jump into function and variable definitions either by means of hyperlinking, or popups. This can be incredibly helpful for navigating through sprawling projects because you don't have to go through all the trouble to search by hand.
In your case, with php, the best thing to do is find the entry point for a page that pulls in your list of categories. Once you find that, you can use eclipse to expand out the various function calls it makes. Being a beginner, it's very helpful to read through code in this manner, as it exposes you to lots of different ways of doing things. An additional bonus of using something like eclipse is that it provides integration with the PHP manual. So anytime you encounter a function you don't know, you can hover over, see the manual, and also how it would be used in context.

What you want is called a "backward slice" ("all the code that contributes to a specific computed result") in the computing theory literature. To compute the backward slice, something needs to parse the language, compute all the influences (control and dataflow) on a selected point in the program, and then display those points to you.
Slicing tools exist for languages like C. They may exist for Java (as academic versions). I don't know of any that exist for PHP.
Another way to discover the code involved in an action is to run a test coverage tool. Such a tool marks all the code (across many files) that gets executed for a specific action (usually a "unit test" but test coverage tools really don't care). Then you simply exercise the action you care about, and look at the test coverage data. A graphical display will make it easy to see what code was executed; the part you want is buried in all the executed code.
A PHP Test Coverage tool does exist and will provide nice displays of the covered code.
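As a rough illustration of the coverage idea, here is a minimal sketch that assumes the Xdebug extension is loaded with coverage enabled. The helper names (captureCoverage, executedLines) and the sample file paths are made up for the example; only the xdebug_* calls belong to Xdebug. The sample array mimics the shape Xdebug reports, where 1 marks an executed line and -1 an unexecuted one.

```php
<?php
/**
 * Run $action with Xdebug coverage switched on and return the raw data.
 * Assumes Xdebug is installed and configured for coverage.
 */
function captureCoverage(callable $action): array
{
    if (!function_exists('xdebug_start_code_coverage')) {
        throw new RuntimeException('Xdebug is not loaded');
    }
    xdebug_start_code_coverage(XDEBUG_CC_UNUSED);
    $action();
    $coverage = xdebug_get_code_coverage();
    xdebug_stop_code_coverage();
    return $coverage;
}

/**
 * Reduce a coverage map (file => [line => status]) to file => executed lines.
 * Files in which nothing ran are dropped.
 */
function executedLines(array $coverage): array
{
    $result = [];
    foreach ($coverage as $file => $lines) {
        $hit = array_keys(array_filter($lines, fn($status) => $status > 0));
        if ($hit !== []) {
            sort($hit);
            $result[$file] = $hit;
        }
    }
    return $result;
}

// Hand-written sample data in the same shape, so the sketch runs without Xdebug.
$sample = [
    '/hypothetical/app/categories.php' => [10 => 1, 11 => -1, 12 => 1],
    '/hypothetical/app/checkout.php'   => [3 => -1, 4 => -1],
];
$executed = executedLines($sample);
print_r($executed);
```

In a real run you would pass something like captureCoverage(function () { include 'index.php'; }) to executedLines() instead of the sample array; the entry point name is an assumption.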

If you are looking for a debugger of some sort, have a look at XDebug or ZendDebugger.

Related

Searching for Lime parser generator grammar examples... Just cannot find any

I'm writing a messaging system for the users of our site, which implements segmentation to allow for individual messages to target dynamic segments of users. Because a given message's segment definition may contain multiple individual segment matches, it's necessary for the content of the message body also to be segmented. I've attempted to do this by writing what turned out to be a custom lexer/parser (without me even knowing about lexers or parsers) until a chance conversation with a much more experienced programmer suggested I take a look at lexers and parser generators. I've done a bit of research, and found that the PHP native Lime parser generator seems to be my best option, seeing as the code I'm writing is PHP.
I've looked at the grammar file for the calculator example, and at the metagrammar (in fact, I've spent a few hours analyzing most of the source code), but I'm really having trouble wrapping my head around how to construct even a simple grammar file. Does anyone know of any example grammar files specifically for Lime? It seems to use its own grammar definition, rather than that of Lemon or any of the other PGs.
Should you be willing and able to provide concrete examples, I'm specifically trying to write conditionals in the format of something like the following:
This is a text block all users will see.
{{IF user.modules.sms}}
This is a text block only visible to users with the sms module enabled
{{/IF}}
{{IF user.modules.anothermodule AND user.previouslogin < (now() - 3600)}}
This is a text block only visible to users with the anothermodule module enabled, whose previous login was more than an hour ago
{{/IF}}
Or just in general, if anyone has any suggestions on possible other methods of implementing such a feature, I'm open to advice! Just bear in mind it's not possible to use PHP, as the people writing these messages will be project managers and marketers.
I haven't done any parser generator work since the mid 90s when I used lex & yacc to build C programs, but I'll offer this - since I see you haven't gotten a satisfactory answer or updated your question since 2012:
In general, it looks like lime is an OK substitute for yacc when you want a parser generator to emit PHP code, but the tokenize() method shown in the calculator example is an extremely weak replacement for lex. So in general if your goal is to embed bits of programming logic inside "messages" then you can expect writing the tokenizer logic "from scratch" to be a challenge (less so if the message format is highly constrained).
But your proposed example message raises the larger question:
How exactly will the PHP code which is to be emitted by your parser generator be used?
Specifically:
Will these chunks of parser generated code be "standalone" web pages - addressable directly via URL and rendered directly by the webserver (in which case the next question is how you're going to tell the webserver to execute the PHP code, e.g. by making them into CGI scripts)? Or will they run inside some sort of application framework (or "message renderer")?
How will (PHP) program state be persisted? Your example refers to "user.previouslogin", which suggests persistence not just across page views but also "sessions" of some sort.
Will the logic which you're proposing to embed in your messages inside tags really be some variant of PHP or Javascript, or something genuinely new?
Embedding logic inside static pages is an old idea (Server Side Includes were popular in the 90s, after all), and modern templating engines (as suggested in the answer by Ugo Meda) are quite powerful. Whether it really makes sense to roll your own message parsing + rendering system really depends on the constraints imposed by the application context which you're referring to when you write "user.modules.*" in your example.
Don't reinvent the wheel. Maybe you should use something like Smarty to implement this. Beware, this should be used by trusted users since it executes code, which may be dangerous.
If you don't plan on implementing hundreds of functionalities, proper regexes should do the trick.
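To make the regex suggestion concrete, here is a minimal sketch that handles only the simplest form from the question: {{IF some.dotted.path}} ... {{/IF}} with a truthy check. It does not support nesting, AND, comparisons or function calls such as now(); those are exactly where a real lexer/parser starts to pay off. The function names and the shape of the $context array are assumptions for illustration.

```php
<?php
/** Look up a dotted path such as "user.modules.sms" in a nested array. */
function lookupPath(array $context, string $path)
{
    $value = $context;
    foreach (explode('.', $path) as $key) {
        if (!is_array($value) || !array_key_exists($key, $value)) {
            return null; // unknown paths count as "not set"
        }
        $value = $value[$key];
    }
    return $value;
}

/** Keep the body of each {{IF path}}...{{/IF}} block only when the path is truthy. */
function renderConditionals(string $template, array $context): string
{
    $pattern = '/\{\{IF\s+([A-Za-z_][\w.]*)\s*\}\}(.*?)\{\{\/IF\}\}/s';
    return preg_replace_callback(
        $pattern,
        function (array $m) use ($context): string {
            return lookupPath($context, $m[1]) ? $m[2] : '';
        },
        $template
    );
}

$template = "Hello.{{IF user.modules.sms}} SMS on.{{/IF}}{{IF user.modules.fax}} Fax on.{{/IF}}";
$context  = ['user' => ['modules' => ['sms' => true, 'fax' => false]]];
$rendered = renderConditionals($template, $context);
echo $rendered, PHP_EOL;
```

Blocks whose condition is anything other than a single dotted path are left untouched by this pattern, so unsupported syntax stays visible in the output instead of being silently dropped.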

Code Hinting custom functions/objects/constants, and on chaining, commentary in Adobe Dreamweaver CS5

In Dreamweaver CS5 there's something called Code Hinting (let's call it CH for short).
CH has a bunch of information about functions, constants and objects built in the core library.
When you press CTRL+SPACEBAR or begin structuring a statement starting with $,
a window with lots of information pops up, giving me the information about it without having to look it up myself. If I press ENTER while the CH is up and something is selected, it will automatically fill in the rest for me.
I love this feature, I really do. Reminds me a little of Intellisense.
It saves me lots of time.
The issues I face, and haven't found any solutions to, are straightforward.
Issue #1 Chained methods do not display a code hint
Since PHP implemented the Classes and Objects, I've been able to chain my methods within classes/objects. Chaining is actually easy, by returning $this (the instance of that class), you can have a continuous chain of calls
class Object_Factory{
    public function foo(){
        echo "foo";
        return $this;
    }
    public function bar(){
        echo "bar";
        return $this;
    }
}
$objf = new Object_Factory;
//chaining
$objf->foo()
    ->bar();
Calling them separately shows the CH.
$objf->foo();
$objf->bar();
The problem is that after the first method has been called and I try to chain another method, there's no CH to display the next calls information.
So, here's my first question:
Is there a way, in Dreamweaver CS5, to make the code hints appear on chaining?
Plugins, some settings I haven't found, anything?
if("no") "Could you explain why?";
Issue #2 Code hinting for custom functions, objects and constants
As shown in the first picture, there's a lot of information popping up. In fact, there's a document just like it on the online library. Constants usually have a very small piece of information, such as a number.
In this image, MYSQL_BOTH represents 3.
Here's my second question:
Is it possible to get some information to the CH window for custom functions, objects and constants?
For example, with Intellisense you can use a setup with HTML tags and three slashes ///
///<summary>
///This is test function
///</summary>
public void TestFunction(){
//Do something...
}
Can something similar be done here?
Changing some settings, a plugin, anything?
Update
I thought I'd found something that might be the answer to at least issue #1, but it costs money, and I'm not going to pay for anything until I know it actually does what I want.
Has anyone tried it, or know it won't solve any of the issues?
The search continues...
In case none of these are possible to fix, here's hoping one of the developers notices this question and implements it in an update/new release.
I just switched to NetBeans after 10 years of using Dreamweaver. My impressions may help you. (I'll call them NB and DW respectively from now on)
Code Hints / Documentation
PHP built-in functions
Both DW and NB show all of the built-in PHP functions and constants. A nice feature is that they also provide a link that opens the related PHP documentation page.
DW is much slower to update the definitions (through sporadic Adobe updates or on the next release) and updating them doesn't look easy (on the other hand, I quickly found the .zip files that NB uses for the PHP/HTML/CSS reference, in case I wanted to manually edit/update them).
However, since documentation can be opened so easily, I do not consider this to be a problem.
Custom functions/classes
This is where NB is clearly better; it instantly learns from your project's code. Hints for function parameters are smart in many cases, suggesting the most likely variable first.
Method chaining works wonderfully, as seen here:
(This would address question #1)
PHPDoc Support
I was greatly impressed with this feature. Take for example the above screenshot. I just typed /** followed by Enter and NB automatically completed the comment with the return type hint (also function parameters if present).
<?php
/**
 *
 * @return \Object_Factory
 */
public function foo(){
    echo "foo";
    return $this;
}
?>
Another example:
(This would address question #2)
You can include HTML code as well as some special @ tags in your PHPDoc comments to include external links, references, examples, etc.
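For reference, a minimal sketch of what fully annotated chainable methods can look like. It reuses the Object_Factory name from the question; the $calls property and the calls() method are additions made only so the example has something to show. Whether a given editor picks these annotations up for hinting depends on the editor.

```php
<?php
/**
 * Small fluent class used only to illustrate PHPDoc annotations.
 */
class Object_Factory
{
    /** @var string[] Names of the methods called so far. */
    private $calls = [];

    /**
     * Record a call to foo.
     *
     * @return $this The same instance, so calls can be chained.
     */
    public function foo()
    {
        $this->calls[] = 'foo';
        return $this;
    }

    /**
     * Record a call to bar.
     *
     * @return $this The same instance, so calls can be chained.
     */
    public function bar()
    {
        $this->calls[] = 'bar';
        return $this;
    }

    /**
     * @return string[] The recorded method names, in call order.
     */
    public function calls()
    {
        return $this->calls;
    }
}

$objf = new Object_Factory();
$objf->foo()
     ->bar();
print_r($objf->calls());
```

The @return tag is what gives a tool a type to continue the chain from, which is the piece missing when a method's return type cannot be inferred.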
Debugging tools
Also noteworthy IMHO are the debugging tools included with NB. You can trace all variables (also superglobals!) while you advance step-by-step.
Configuring xDebug is very easy, just uncomment some lines in your php.ini and that's it!
Other stuff
The refactoring (i.e. renaming or safely deleting functions/variables) in NB is really nice. It gives you a very graphically detailed preview of the changes before committing them.
However, the search/replace functions of DW are vastly better. I really miss the "search for specific tag with attribute..." function. NB only provides a RegEx search/replace.
NB has a nice color chooser but it almost never suggests it; I thought for a while there wasn't one until I accidentally discovered it. Now I know how to invoke it (CTRL+SPACE, start typing Color chooser and Enter). Very cumbersome, indeed.
I haven't used FTP a lot since I moved to NB, but I have the feeling that DW was also much better, especially for syncing local/remote folders.
NB has really good native support for SVN, Mercurial and Git. When you activate versioning support, you can see every change next to the line number (the green part on my screenshots means those lines are new). I can click on a block and compare/revert those changes, see who originally committed every line (and when), etc.
Even when [team] versioning is deactivated, NB has a built-in local history that helps you recover previous versions as well as deleted files.
Conclusion
Starting with Macromedia Dreamweaver and seeing how it slowly stayed behind the Internet as Adobe struggled to integrate and adapt their products is a painful process. (To this day DW still doesn't render correctly, even with LiveView. To be fair, NB doesn't have a built-in renderer)
Certainly, the Adobe-ization of DW has had its advantages, but this humble programmer was having a hard time justifying a $399 USD ~400MB IDE vs a very comparable free 49MB multi-platform IDE.
After the initial learning curve, I'm very comfortable with NetBeans and I don't think I'll be returning to Dreamweaver any time soon.
I know this doesn't directly answer your questions regarding DW, but I hope it helps anyway.
Use the Site-Specific Code Hinting feature
Make your own structure; just add the files where your functions, classes, etc. are stored. Save the structure and you're done. It just worked for me!
I know it is an older question and this is not the complete answer. But it will help someone for sure.
http://tv.adobe.com/watch/learn-dreamweaver-cs5/sitespecific-code-hinting-in-dreamweaver-cs5/
"Use Dreamweaver CS5 to view code hints related to content management
system frameworks such as WordPress, Drupal, and Joomla. Learn how to
set up site-specific code hinting for a CMS so you can easily work
with your PHP website in Dreamweaver. "
For #1: the complication with a scripting language is that it is not strictly typed. The function/method could return null, false, true, int, array, string...
So the 'intellisense' has no type to base a hint on unless it recompiles the code and checks every possible return type.
For #2: the hinting is based on a clip definition file that exists for each version of PHP. With Microsoft products, the current project's (compiled) definitions are added. With PHP there is no compiling, checking or (automatic) addition to the clip database. Some editors, like PSPad, give you a CodeExplorer that lists each function and class in that file, but the only means I know of to get them to show up in hinting is to add them to the clip definition. I don't know where that is, or whether it's possible, in Dreamweaver. Zend Studio and others do custom compiling and inclusion.

Understanding large php code, what techniques to use?

I have been handed a large, undocumented code base of an application written in PHP, as the original coder went AWOL. My task is to add new features, but I can't do that without understanding the code. I started poking around. Honestly, I am overwhelmed by the amount of source code. I have found:
It's well written, based upon MVC architecture, DB persistence, templating & OOP
It's modular; there is a concept of URL-based routing and basic templating
It uses a custom-written PHP framework which has no documentation. And there is no source control history (oops!)
There are over 500 files, each containing hundreds of lines of code. And every file has 3-4 require_once statements which include tons of other files, so it's kind of hard to tell which function/class/method is coming from where
Now I am looking for some techniques that I can use to understand this code. For example, consider the following code snippet:
class SiteController extends Common {
    private $shared;
    private $view;
    protected function init(){
        $this->loadShared();
        $this->loadView();
    }
    private function loadShared(){
        $this->shared = new Home();
    }
    private function loadView(){
        $this->view = new HomeView();
    }
}
I want to know
Where are HomeView() & Home() defined? Where do $this->shared & $this->view come from? I checked the rest of the file; there is no method named shared or view, so obviously they are coming from one of the hundreds of classes being included using require_once(). But which one? How can I find out?
Can I get a list of all the functions or methods that are being executed? If yes, then how?
This class SiteController overrides a base Common class, but I am unable to find out where this Common class is located. How can I tell?
Further, please share some techniques that can be used to understand existing code written in PHP.
First, in this kind of situation, I try to get an overview of the application : some kind of global idea of :
What the application (not the code !) does
How the code is globally organized : where are the models, the templates, the controllers, ...
How each type of component is structured -- once you know how a Model class works, others will typically work the same way.
Once you have that global idea, a possibility to start understanding how the code works, if you have some time before you, is to use a PHP Debugger.
About that, Xdebug + Eclipse PDT is a possibility -- but pretty much all modern IDEs support that.
It'll allow you to go through the generation of a page step by step, line by line, understanding what is called, when, from where, ...
Of course, you will not do that for the whole application !
But as your application uses a Framework, there are high chances that all parts of the application work kind of the same way -- which means that really understanding one component should help understanding the other more easily.
As a couple of tools to understand what calls what and how and where, you might want to take a look at :
The inclued extension (quoting) : Allows you trace through and dump the hierarchy of file inclusions and class inheritance at runtime
Xdebug + KCacheGrind will allow you to generate call-graphs ; XHProf should do the same kind of thing.
Using your IDE (Eclipse PDT, Zend Studio, phpStorm, netbeans, ...), ctrl+click on a class/method should bring you to its declaration.
Also note that an application is not only code : I often find it very useful to reverse-engineer the database, to generate a diagram of all tables.
If you are lucky, there are foreign keys in your database -- and you'll have links between tables, this way ; which will help you understand how they relate to each other.
You need an IDE. I use netbeans for PHP and it works great. This will allow you to find out where the homeview/home classes are by right clicking and selecting a "find where defined" option or something similar.
You can get a list. This is called the stack. Setting up a debugger like xdebug with the IDE will allow you to do this.
grep is the only thing makes me survive such codez
Look inside of the script where you found this code snippet for additional included or required pages that PHP imported into the main script. Those scripts should define those classes that are being instantiated.
Sorry, not sure if you can find which functions/methods have been executed. I know you can find if they exist, and you can find the generated output of them... but not sure if they have been executed.
It is important to note that SiteController doesn't override the Common class; it extends, or builds on top of, it, like how a building is built on a foundation. The Common class is the foundation. Again, check the included and required scripts to see where Common was defined.
Hope that helps,
spryno724
I would start with:
throwing exception at certain points to see a stacktrace where the call originated.
grep for Class Common for example
create a directory listing to get a feeling for the organization of the software
use get_included_files(); to see what is actually used for a certain call
Start documenting what I find out
Start working with an IDE, like NetBeans, Eclipse or Zend Studio
Figuring out class hierarchies with maybe this "php: determining class hierarchy of an object at runtime" approach
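Several of the points above can be done with PHP's built-in reflection and introspection functions. A minimal sketch follows; the two stub classes stand in for the real ones (which would come from the project's require_once'd files), and describeClass is a made-up helper name.

```php
<?php
// Stand-in classes; in the real project these are loaded via require_once.
class Common {}
class SiteController extends Common {}

/**
 * Report where a class is declared and which classes it inherits from.
 */
function describeClass(string $class): array
{
    $ref = new ReflectionClass($class);
    return [
        'file'    => $ref->getFileName(),   // path of the file declaring the class
        'line'    => $ref->getStartLine(),  // line where the declaration starts
        'parents' => array_values(class_parents($class)),
    ];
}

$info = describeClass('SiteController');
print_r($info);

// Every file pulled in so far by include/require, in load order.
print_r(get_included_files());
```

Dropping a call like describeClass('HomeView') into the controller answers the "which of the hundreds of files defines this?" question directly, and debug_print_backtrace() at the same spot prints the chain of calls that led there without having to throw an exception.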
You seem to realize that you can't read/digest every file, so you've got to focus on the important ones. Looks like you've started that process with SiteController.
Hopefully between reading the requires and using your IDE you can chase down the Home() and HomeView()
There might be a few key XML files that dictate the mappings from URLs to controller files, so you'll want to figure out how they work also.
I've worked with a poorly documented (but decently working) custom framework before, and your situation seems pretty similar. I found things pretty smooth once I understood the main controller and basically formed an understanding for how URL requests were processed.
1) You can use a search tool such as grep to find code, including definitions. But on a big code base, grep is slow, and it gives a lot of false positives because it has no understanding of the PHP language.
Our Search Engine is a GUI-based tool that indexes your source code to achieve extremely fast lookup, indexing by the language elements (variable names, constants, keywords, strings, ..) and allowing you to formulate queries that honor the language structure (e.g., it ignores whitespace and comments unless you say you want to see them). A query shows hits in a hit window, and a click takes you to the file/line in which the hit occurs. With some tiny bit of additional configuration, you can go from the code window into your favorite editor.
2) Sometimes you want to know where specific functionality exists, but you have no clue what to search for. Here a test coverage tool can really help. Simply set up test coverage for the (working) application, and exercise the functionality manually; what is "covered" is potentially the code you care about. Exercise something which is NOT the feature; what is covered is NOT the code you want. This is way easier than trying to run a debugger to find the code of interest. Our PHP Test Coverage tool can provide you this coverage, and not only show you the covered code in a GUI, but also do that "coverage subtraction" so that you can see just the relevant code.
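The "coverage subtraction" idea is easy to sketch independently of any particular tool. The example below assumes you already have two coverage maps of the form file => list of executed line numbers (one recorded while exercising the feature, one while exercising something else); the function name and the sample paths are invented for the illustration and are not the API of the tool mentioned above.

```php
<?php
/**
 * Return the lines executed in the feature run but not in the baseline run.
 * Both arguments map a file path to a list of executed line numbers.
 */
function coverageSubtract(array $feature, array $baseline): array
{
    $only = [];
    foreach ($feature as $file => $lines) {
        $rest = array_values(array_diff($lines, $baseline[$file] ?? []));
        if ($rest !== []) {
            $only[$file] = $rest;
        }
    }
    return $only;
}

// Hypothetical data: bootstrap code runs in both, category code only in one.
$featureRun  = ['bootstrap.php' => [1, 2, 3, 10], 'categories.php' => [5, 6]];
$baselineRun = ['bootstrap.php' => [1, 2, 3]];

$relevant = coverageSubtract($featureRun, $baselineRun);
print_r($relevant);
```

What remains is the code that ran only for the feature, which is usually a much smaller set to read than everything that was executed.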
Start from the entry point of the application (usually index.php) and go deeper on what gets called when.
Give PhpStorm a go; it's an IDE with excellent code analysis features. It can go to the definition of any class or variable, show the inheritance hierarchy, find usages and do many other useful things.
I'll also plug my own tool:
http://raveren.github.io/kint/
It works with zero setup and is extremely useful to get a grip on what's going on where. Use Kint::trace(); to see a pretty execution backtrace and d(get_defined_vars()); to see what is defined in the current context, and eventually you'll get there.

PHP - Cleanup the Junk

I have inherited a very messy project. There are at least 3 versions that I can tell in it.
Is there a utility that can trace the PHP code from the main index.php so that I can figure out what isn't being used and what is, or am I stuck doing a manual cleanup?
Thanks
*Update*
I don't think I've been clear about what I'm looking for, that or I'm not understanding how the products mentioned work. What I'm looking for is something that can run on a folder (directory) and step through the project and give me a report of which files are actually referenced or used (in the case of images, CSS, etc).
This project has several thousand files and it's a very small project. I'm trying to clean it up and when I do a "search in files" in my IDE I get 3 or 4 references and can't easily tell which one is the right one.
Hope that makes it a little clearer.
Cross referencing software really lets you explore which functions are used for what.
PHPXref is quite good..
For example Yoast used it to cross reference the Wordpress PHP code. Take a look at the Wordpress example of how powerful it is.
For example, start by browsing the WP trunk. Click on some of the file names on the left and observe how the required files are listed, along with defined classes and methods, etc., etc.
There are several utilities that can do this. What first comes to mind is Zend Studio's built-in Optimizer, which will run through your files and issue notices on a per-file basis, including unused variables, warnings, etc. Alternatively, you can run your program in E_STRICT and PHP will notify you of some of your issues.
Be very careful of such cleanup tools, especially in PHP or Javascript. They work reasonably well in languages like Java, but any language that allows Eval() can trip an automated tool up, sometimes in devilishly clever ways, depending on how clever the original code developer thought they were.
You need the inclued extension. You can generate include graphs using GraphViz, see below for example code.
There are some useful examples on PHP.net: http://www.php.net/manual/en/inclued.examples-implementation.php
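If installing an extension is not an option, a cruder version of the same idea can be sketched with the built-in get_included_files(), which is a swapped-in technique and not the inclued API. The helper below lists the .php files under a directory that were not included during the current request; its name is made up for the example. It only sees what one request loaded, so a file must be missing from many different requests before it is a deletion candidate, and it says nothing about images or CSS.

```php
<?php
/**
 * List .php files under $root that do not appear in $included.
 * $included is typically the result of get_included_files().
 */
function findUnincludedPhpFiles(string $root, array $included): array
{
    // Normalise paths so symlinked directories compare equal.
    $included = array_filter(array_map('realpath', $included));
    $unused = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && strtolower($file->getExtension()) === 'php') {
            $path = $file->getRealPath();
            if (!in_array($path, $included, true)) {
                $unused[] = $path;
            }
        }
    }
    sort($unused);
    return $unused;
}

// Possible usage at the end of a request (the log file name is an assumption):
// register_shutdown_function(function () {
//     $unused = findUnincludedPhpFiles(__DIR__, get_included_files());
//     file_put_contents('/tmp/unused.log', implode("\n", $unused) . "\n", FILE_APPEND);
// });
```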
You might want to check xdebug's code coverage, possibly as an auto_append. However, it's rather limited and it would require you to have either 100% test-cases (which I doubt as you say the project is a mess), or the tenacity to go through every possible action on the site, and even then you'll have to apply good judgement whether you can remove a portion of code because it isn't used, or leave it there because a certain condition just hasn't been met yet in your cases. On a side note: stepping through the code with xdebug's remote debugger has really helped me in the past to quickly get the different mechanisms & flows in unknown projects.
I would try opening the whole project in NetBeans PHP; it's a great tool which we use for huge projects. You can easily see warnings and notifications and also follow usage of functions/classes easily. Try it!
I would recommend against automatic cleanups and the like. Even if the code seems to work afterwards, I wouldn't sleep very well at night...

What kinds of patterns could I enforce on the code to make it easier to translate to another programming language? [closed]

I am setting out to do a side project that has the goal of translating code from one programming language to another. The languages I am starting with are PHP and Python (Python to PHP should be easier to start with), but ideally I would be able to add other languages with (relative) ease. The plan is:
This is geared towards web development. The original and target code will be sitting on top of frameworks (which I will also have to write). These frameworks will embrace an MVC design pattern and follow strict coding conventions. This should make translation somewhat easier.
I am also looking at IOC and dependency injection, as they might make the translation process easier and less error prone.
I'll make use of Python's parser module, which lets me fiddle with the Abstract Syntax Tree. Apparently the closest I can get with PHP is token_get_all(), which is a start.
From then on I can build the AST, symbol tables and control flow.
Then I believe I can start outputting code. I don't need a perfect translation. I'll still have to review the generated code and fix problems. Ideally the translator should flag problematic translations.
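To give a feel for what token_get_all() provides as a starting point, here is a minimal sketch that walks the token stream and collects the names of declared functions and methods. It is only the lexing step: there is no tree, no symbol table and no flow analysis here. The helper name is made up, and functions declared as returning by reference (function &foo) are deliberately not handled.

```php
<?php
/**
 * Collect the names of functions and methods declared in a PHP source string.
 * Closures (function keyword followed directly by a parenthesis) are skipped.
 */
function declaredFunctionNames(string $source): array
{
    $names  = [];
    $tokens = token_get_all($source);
    $count  = count($tokens);
    for ($i = 0; $i < $count; $i++) {
        // Tokens are either single-character strings or [id, text, line] arrays.
        if (!is_array($tokens[$i]) || $tokens[$i][0] !== T_FUNCTION) {
            continue;
        }
        for ($j = $i + 1; $j < $count; $j++) {
            $t = $tokens[$j];
            if (is_array($t) && $t[0] === T_WHITESPACE) {
                continue; // skip the gap between "function" and the name
            }
            if (is_array($t) && $t[0] === T_STRING) {
                $names[] = $t[1];
            }
            break; // anything else (e.g. "(") means there is no plain name
        }
    }
    return $names;
}

$source = '<?php function foo() {} class A { public function bar() { return function () {}; } }';
$names  = declaredFunctionNames($source);
print_r($names);
```

Even this small task needs lookahead and special cases, which hints at how much work sits between a token list and a usable AST.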
Before you ask "What the hell is the point of this?" The answer is... It'll be an interesting learning experience. If you have any insights on how to make this less daunting, please let me know.
EDIT:
I am more interested in knowing what kinds of patterns (e.g. IoC, SOA?) I could enforce on the code to make it easier to translate than in how to do the translation.
I've been building tools (DMS Software Reengineering Toolkit) to do general purpose program manipulation (with language translation being a special case) since 1995, supported by a strong team of computer scientists. DMS provides generic parsing, AST building, symbol tables, control and data flow analysis, application of translation rules, regeneration of source text with comments, etc., all parameterized by explicit definitions of computer languages.
The amount of machinery you need to do this well is vast (especially if you want to be able to do this for multiple languages in a general way), and then you need reliable parsers for languages with unreliable definitions (PHP is a perfect example of this).
There's nothing wrong with you thinking about building a language-to-language translator or attempting it, but I think you'll find this a much bigger task for real languages than you expect. We have some 100 man-years invested in just DMS, and another 6-12 months in each "reliable" language definition (including the one we painfully built for PHP), much more for nasty languages such as C++. It will be a "hell of a learning experience"; it has been for us. (You might find the technical Papers section at the above website interesting to jump start that learning).
People often attempt to build some kind of generalized machinery by starting with some piece of technology with which they are familiar, that does a part of the job. (Python ASTs are a great example). The good news is that part of the job is done. The bad news is that the machinery has a zillion assumptions built into it, most of which you won't discover until you try to wrestle it into doing something else. At that point you find out the machinery is wired to do what it originally does, and will really, really resist your attempt to make it do something else. (I suspect trying to get the Python AST to model PHP is going to be a lot of fun).
The reason I started to build DMS originally was to build foundations that had very few such assumptions built in. It has some that give us headaches. So far, no black holes. (The hardest part of my job over the last 15 years is to try to prevent such assumptions from creeping in).
Lots of folks also make the mistake of assuming that if they can parse (and perhaps get an AST), they are well on the way to doing something complicated. One of the hard lessons is that you need symbol tables and flow analysis to do good program analysis or transformation. ASTs are necessary but not sufficient. This is the reason that Aho&Ullman's compiler book doesn't stop at chapter 2. (The OP has this right in that he is planning to build additional machinery beyond the AST). For more on this topic, see Life After Parsing.
The remark about "I don't need a perfect translation" is troublesome. What weak translators do is convert the "easy" 80% of the code, leaving the hard 20% to do by hand. If the application you intend to convert is pretty small, and you only intend to convert it once, then that 20% is OK. If you want to convert many applications (or even the same one with minor changes over time), this is not nice. If you attempt to convert 100K SLOC then 20% is 20,000 original lines of code that are hard to translate, understand and modify in the context of another 80,000 lines of translated program you already don't understand. That takes a huge amount of effort. At the million line level, this is simply impossible in practice. (Amazingly there are people that distrust automated tools and insist on translating million line systems by hand; that's even harder and they normally find out painfully with long time delays, high costs and often outright failure.)
What you have to shoot for to translate large-scale systems is high nineties percentage conversion rates, or it is likely that you can't complete the manual part of the translation activity.
Another key consideration is size of code to be translated. It takes a lot of energy to build a working, robust translator, even with good tools. While it seems sexy and cool to build a translator instead of simply doing a manual conversion, for small code bases (e.g., up to about 100K SLOC in our experience) the economics simply don't justify it. Nobody likes this answer, but if you really have to translate just 10K SLOC of code, you are probably better off just biting the bullet and doing it. And yes, that's painful.
I consider our tools to be extremely good (but then, I'm pretty biased). And it is still very hard to build a good translator; it takes us about 1.5-2 man-years and we know how to use our tools. The difference is that with this much machinery, we succeed considerably more often than we fail.
My answer will address the specific task of parsing Python in order to translate it to another language, and not the higher-level aspects which Ira addressed well in his answer.
In short: do not use the parser module, there's an easier way.
The ast module, available since Python 2.6, is much more suitable for your needs, since it gives you a ready-made AST to work with. I wrote an article on this last year, but in short, use the parse function of ast to parse Python source code into an AST. The parser module will give you a parse tree, not an AST. Be wary of the difference.
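As a minimal sketch of what ast.parse hands back (written for Python 3; the sample source string is made up for illustration), the result is a tree of node objects that you can inspect, walk or dump:

```python
import ast

# Hypothetical sample source; any valid Python text works here.
source = "total = price * 2 + 1"

tree = ast.parse(source)

# The root is a Module whose body is a list of statement nodes.
assign = tree.body[0]
print(type(assign).__name__)   # Assign
print(assign.targets[0].id)    # total

# ast.walk visits every node, so collecting the names used is short.
names = sorted({n.id for n in ast.walk(tree) if isinstance(n, ast.Name)})
print(names)                   # ['price', 'total']

# ast.dump shows the full structure, which is handy while prototyping.
print(ast.dump(tree))
```

Note that the tree already abstracts away parentheses, whitespace and operator precedence, which is exactly what a parse tree from the parser module would not do for you.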
Now, since Python's ASTs are quite detailed, given an AST the front-end job isn't terribly hard. I suppose you can have a simple prototype for some parts of the functionality ready quite quickly. However, getting to a complete solution will take more time, mainly because the semantics of the languages are different. A simple subset of the language (functions, basic types and so on) can be readily translated, but once you get into the more complex layers, you'll need heavy machinery to emulate one language's core in another. For example consider Python's generators and list comprehensions which don't exist in PHP (to my best knowledge, which is admittedly poor when PHP is involved).
To give you one final tip, consider the 2to3 tool created by the Python devs to translate Python 2 code to Python 3 code. Front-end-wise, it has most of the elements you need to translate Python to something. However, since the cores of Python 2 and 3 are similar, no emulation machinery is required there.
Writing a translator isn't impossible, especially considering that Joel's Intern did it over a summer.
If you want to do one language, it's easy. If you want to do more, it's a little more difficult, but not too much. The hardest part is that, while any Turing complete language can do what another Turing complete language does, built-in data types can change what a language does phenomenally.
For instance:
word = 'This is not a word'
print word[::-2]
takes a lot of C++ code to duplicate (ok, well you can do it fairly short with some looping constructs, but still).
That's a bit of an aside, I guess.
Have you ever written a tokenizer/parser based on a language grammar? You'll probably want to learn how to do that if you haven't, because that's the main part of this project. What I would do is come up with a basic Turing complete syntax - something fairly similar to Python bytecode. Then you create a lexer/parser that takes a language grammar (perhaps using BNF), and based on the grammar, compiles the language into your intermediate language. Then what you'll want to do is the reverse - create a generator that goes from your intermediate language into target languages, based on their grammars.
The most obvious problem I see is that at first you'll probably create horribly inefficient code, especially in more powerful* languages like Python.
But if you do it this way then you'll probably be able to figure out ways to optimize the output as you go along. To summarize:
read provided grammar
compile program into intermediate (but also Turing complete) syntax
compile intermediate program into final language (based on provided grammar)
...?
Profit!(?)
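The steps above can be sketched on a very small scale. This is only an illustration under several assumptions: Python's own ast module stands in for the grammar-driven front end, the instruction names (PUSH, LOAD, BINOP) are invented here, and only numeric +, - and * are handled.

```python
import ast

# Operators this sketch knows how to carry through the intermediate form.
OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}

def to_ir(node):
    """Front end: Python expression AST -> list of stack instructions."""
    if isinstance(node, ast.Expression):
        return to_ir(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return [("PUSH", node.value)]
    if isinstance(node, ast.Name):
        return [("LOAD", node.id)]
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return (to_ir(node.left) + to_ir(node.right)
                + [("BINOP", OPS[type(node.op)])])
    raise NotImplementedError(type(node).__name__)

def ir_to_php(ir):
    """Back end: stack instructions -> PHP expression text."""
    stack = []
    for op, arg in ir:
        if op == "PUSH":
            stack.append(repr(arg))
        elif op == "LOAD":
            stack.append("$" + arg)
        elif op == "BINOP":
            right = stack.pop()
            left = stack.pop()
            stack.append("(%s %s %s)" % (left, arg, right))
    return stack.pop()

ir = to_ir(ast.parse("a + 2 * b", mode="eval"))
print(ir)
print(ir_to_php(ir))  # ($a + (2 * $b))
```

The generated text is fully parenthesised and therefore uglier than hand-written code, which is a small taste of the inefficiency mentioned above.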
*by powerful I mean that this takes 4 lines:
myinput = raw_input("Enter something: ")
print myinput.replace('a', 'A')
print sum(ord(c) for c in myinput)
print myinput[::-1]
Show me another language that can do something like that in 4 lines, and I'll show you a language that's as powerful as Python.
There are a couple answers telling you not to bother. Well, how helpful is that? You want to learn? You can learn. This is compilation. It just so happens that your target language isn't machine code, but another high-level language. This is done all the time.
There's a relatively easy way to get started. First, go get http://sourceforge.net/projects/lime-php/ (if you want to work in PHP) or some such and go through the example code. Next, you can write a lexical analyzer using a sequence of regular expressions and feed tokens to the parser you generate. Your semantic actions can either output code directly in another language or build up some data structure (think objects, man) that you can massage and traverse to generate output code.
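A lexical analyzer built from a sequence of regular expressions can be quite short. The sketch below is in Python rather than PHP, and the token names and patterns are illustrative only; a real lexer for Python would also need strings, comments and indentation handling.

```python
import re

# Order matters: earlier patterns win when several could match.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("NAME",   r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()]"),
    ("SKIP",   r"[ \t]+"),
    ("ERROR",  r"."),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Yield (kind, text) pairs, dropping whitespace."""
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError("unexpected character %r" % match.group())
        yield kind, match.group()

tokens = list(tokenize("total = price * 2"))
print(tokens)
```

The resulting token stream is what you would feed to the generated parser.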
You're lucky with PHP and Python because in many respects they are the same language as each other, but with different syntax. The hard part is getting over the semantic differences between the grammar forms and data structures. For example, Python has lists and dictionaries, while PHP only has assoc arrays.
The "learner" approach is to build something that works OK for a restricted subset of the language (such as only print statements, simple math, and variable assignment), and then progressively remove limitations. That's basically what the "big" guys in the field all did.
Oh, and since you don't have static types in Python, it might be best to write and rely on PHP functions like "python_add" which adds numbers, strings, or objects according to the way Python does it.
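Here is a rough sketch of that restricted-subset approach, assuming Python 3 source and using the ast module as the front end. It handles only simple assignment, print with one argument, and +, and it emits calls to python_add, which is assumed to be a PHP helper you would write yourself (it is not defined here).

```python
import ast

def expr_to_php(node):
    if isinstance(node, ast.Constant):
        value = node.value
        if isinstance(value, str):
            # PHP single-quoted strings only need \\ and \' escaped.
            return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return repr(value)
    if isinstance(node, ast.Name):
        return "$" + node.id
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        # Python's + depends on runtime types, so defer to the PHP helper.
        return "python_add(%s, %s)" % (expr_to_php(node.left),
                                       expr_to_php(node.right))
    raise NotImplementedError(ast.dump(node))

def stmt_to_php(node):
    if (isinstance(node, ast.Assign) and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)):
        return "$%s = %s;" % (node.targets[0].id, expr_to_php(node.value))
    if (isinstance(node, ast.Expr) and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
            and node.value.func.id == "print"
            and len(node.value.args) == 1 and not node.value.keywords):
        return 'echo %s, "\\n";' % expr_to_php(node.value.args[0])
    raise NotImplementedError(ast.dump(node))

def translate(source):
    return "\n".join(stmt_to_php(s) for s in ast.parse(source).body)

php = translate("greeting = 'hi ' + name\nprint(greeting + '!')")
print(php)
```

Anything outside the subset raises NotImplementedError, so removing a limitation means adding one more branch and re-running your test programs.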
Obviously, this can get much bigger if you let it.
I will second @EliBendersky's point of view regarding using ast.parse instead of parser (which I did not know about before). I also warmly recommend that you review his blog. I used ast.parse to write a Python->JavaScript translator (https://bitbucket.org/amirouche/pythonium). I came up with the Pythonium design by reviewing other implementations and trying them on my own. I forked Pythonium from https://github.com/PythonJS/PythonJS, which I also started. It's actually a complete rewrite. The overall design is inspired by PyPy and the http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-89-1.pdf paper.
Here is everything I tried, from the beginning to the best solution. Even if it looks like Pythonium marketing, it really isn't (don't hesitate to tell me if something doesn't seem to fit the netiquette):
Implementing Python semantics in Plain Old JavaScript using prototype inheritance: AFAIK it's impossible to implement Python multiple inheritance using the JS prototype object system. I did try to do it using other tricks later (cf. getattribute). As far as I know there is no implementation of Python multiple inheritance in JavaScript; the best that exists is single inheritance + mixins, and I'm not sure they handle diamond inheritance. Kind of similar to Skulpt but without Google Closure.
I tried with Google Closure, just like Skulpt (compiler), instead of actually reading the Skulpt code #fail. Anyway, because of the JS prototype-based object system it was still impossible. Creating bindings was very, very difficult: you need to write JavaScript and a lot of boilerplate code (cf. https://github.com/skulpt/skulpt/issues/50 where I am the ghost). At that time there was no clear way to integrate the bindings in the build system. I think that Skulpt is a library and you just have to include your .py files in the html to be executed, with no compilation phase required from the developer.
I tried pyjaco (compiler), but creating bindings (calling JavaScript code from Python code) was very difficult; there was too much boilerplate code to create every time. Now I think pyjaco is the one closest to Pythonium. pyjaco is written in Python (with ast.parse too), but a lot is written in JavaScript and it uses prototype inheritance.
I never actually succeeded at running Pyjamas #fail and never tried to read the code #fail again. But in my mind Pyjamas was doing API->API translation (or framework to framework) and not Python to JavaScript translation. The JavaScript framework consumes data that is already in the page or data from the server. Python code is only "plumbing". After that I discovered that Pyjamas was actually a real Python->JS translator.
Still, I think it's possible to do API->API (or framework->framework) translation, and that's basically what I do in Pythonium but at a lower level. Pyjamas probably uses the same algorithm as Pythonium...
Then I discovered brython, fully written in JavaScript like Skulpt: no need for compilation and a lot of fluff... but written in JavaScript.
From the first line written in the course of this project, I knew about PyPy, even the JavaScript backend for PyPy. Yep, you can, if you find it, directly generate a Python interpreter in JavaScript from PyPy. People say it was a disaster. I have read nowhere why. But I think the reason is that the intermediate language they use to implement the interpreter, RPython, is a subset of Python tailored to be translated to C (and maybe asm). Ira Baxter says you always make assumptions when you build something, and you probably fine-tune it to be the best at what it's meant to do, in the case of PyPy: Python->C translation. Those assumptions might not be relevant in another context; worse, they can incur overhead. In other words, direct translation will most likely always be better.
Having the interpreter written in Python sounded like a (very) good idea. But I was more interested in a compiler for performance reasons; also, it's actually easier to compile Python to JavaScript than to interpret it.
I started PythonJS with the idea of putting together a subset of Python that I could easily translate to JavaScript. At first I didn't even bother to implement an OO system because of past experience. The subset of Python that I managed to translate to JavaScript is:
functions with full parameter semantics, both in definition and calling. This is the part I am most proud of.
while/if/elif/else
Python types were converted to JavaScript types (there are no Python types of any kind)
for could iterate over JavaScript arrays only (for a in array)
Transparent access to JavaScript: if you write Array in the Python code it will be translated to Array in JavaScript. This is the biggest achievement in terms of usability over its competitors.
You can pass functions defined in Python source to JavaScript functions. Default arguments will be taken into account.
It has a special function called new which is translated to JavaScript new, e.g. new(Python)(1, 2, spam, "egg") is translated to new Python(1, 2, spam, "egg").
"var" declarations are automatically handled by the translator (a very nice finding from Brett, a PythonJS contributor).
global keyword
closures
lambdas
list comprehensions
imports are supported via requirejs
single class inheritance + mixin via classyjs
This seems like a lot, but it is actually very narrow compared to the full-blown semantics of Python. It's really JavaScript with a Python syntax.
The generated JS is perfect, i.e. there is no overhead; it cannot be improved in terms of performance by further editing it. If you can improve the generated code, you can do it from the Python source file too. Also, the compiler does not rely on any of the JS tricks that you can find in .js files written by http://superherojs.com/, so the output is very readable.
The direct descendant of this part of PythonJS is the Pythonium Veloce mode. The full implementation can be found at https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pythonium/veloce/veloce.py?at=master (793 SLOC + around 100 SLOC of code shared with the other translator).
An adapted version of pystones.py can be translated in Veloce mode, cf. https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pystone/?at=master
After having set up basic Python->JavaScript translation I chose another path to translate full Python to JavaScript. It is the way glib does object-oriented, class-based code, except that the target language is JS, so you have access to arrays, map-like objects and many other tricks, and all of that part was written in Python. IIRC there is no JavaScript code written by hand in the Pythonium translator. Getting single inheritance is not difficult; here are the difficult parts of making Pythonium fully compliant with Python:
spam.egg in Python is always translated to getattribute(spam, "egg"). I did not profile this in particular, but I think that is where it loses a lot of time, and I'm not sure I can improve upon it with asm.js or anything else.
method resolution order: even with the algorithm written in Python, translating it to Python Veloce compatible code was a big endeavour.
getattribute: the actual getattribute resolution algorithm is kind of tricky and it still doesn't support data descriptors
metaclass class based: I know where to plug the code, but still...
last but not least: some_callable(...) is always translated to "call(some_callable)". AFAIK the translator doesn't use inference at all, so every time you do a call you need to check which kind of object it is in order to call it the way it's meant to be called.
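To give an idea of what the method resolution order item involves: CPython computes the MRO with C3 linearization. The sketch below is a plain Python version of that algorithm written for this explanation; it is not Pythonium's runtime code, and the function names are made up.

```python
def merge(sequences):
    """C3 merge: repeatedly pick the first head that is in no tail."""
    result = []
    sequences = [list(s) for s in sequences if s]
    while sequences:
        for seq in sequences:
            head = seq[0]
            if not any(head in other[1:] for other in sequences):
                break
        else:
            raise TypeError("inconsistent hierarchy")
        result.append(head)
        # Drop the chosen class everywhere, then discard empty sequences.
        sequences = [[c for c in s if c is not head] for s in sequences]
        sequences = [s for s in sequences if s]
    return result

def c3_mro(cls):
    """Linearize cls from the linearizations of its bases."""
    bases = list(cls.__bases__)
    return [cls] + merge([c3_mro(b) for b in bases] + [bases])

# Diamond inheritance, the case that single inheritance + mixins struggles with.
class A: pass
class B(A): pass
class C(A): pass
class D(B, C): pass

print([k.__name__ for k in c3_mro(D)])  # ['D', 'B', 'C', 'A', 'object']
```

A translator targeting a prototype-based language has to ship something like this in its runtime and consult the result on every attribute lookup, which is part of why getattribute is costly.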
This part is factored in https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pythonium/compliant/runtime.py?at=master It's written in Python compatible with Python Veloce.
The actual compliant translator https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pythonium/compliant/compliant.py?at=master doesn't generate JavaScript code directly and, most importantly, doesn't do ast->ast transformation. I tried the ast->ast thing, and the ast, even if nicer than the cst, is not nice to work with, even with ast.NodeTransformer; more importantly, I don't need to do ast->ast.
Doing python ast to python ast, in my case at least, would maybe be a performance improvement, since I sometimes inspect the content of a block before generating the code associated with it, for instance:
var/global: to be able to var something I must know what I need to var and what not to. Instead of generating a block, tracking which variables are created in it and inserting the declarations on top of the generated function block, I just look for the relevant variable assignments when I enter the block, before actually visiting the child nodes to generate the associated code.
yield: generators have, as of yet, a special syntax in JS, so I need to know which Python function is a generator when I want to write the "var my_generator = function"
So I don't really visit each node once for each phase of the translation.
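A pre-scan of this kind can be sketched with the standard ast module. This is an illustration of the idea, not Pythonium's code: the function name scan_function is hypothetical, and the sketch ignores some binding forms (for example nested function names and comprehension variables).

```python
import ast

NESTED_SCOPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Lambda)

def scan_function(func):
    """Return (names_to_declare, is_generator) for one function node,
    without descending into nested functions or classes."""
    assigned, declared_global, is_generator = set(), set(), False
    stack = list(func.body)
    while stack:
        node = stack.pop()
        if isinstance(node, NESTED_SCOPES):
            continue  # nested scopes own their variables and yields
        if isinstance(node, ast.Global):
            declared_global.update(node.names)
        if isinstance(node, (ast.Yield, ast.YieldFrom)):
            is_generator = True
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            assigned.add(node.id)
        stack.extend(ast.iter_child_nodes(node))
    # Globals must not be redeclared with var in the generated function.
    return assigned - declared_global, is_generator

source = """
def counter(limit):
    global calls
    calls = calls + 1
    i = 0
    while i < limit:
        yield i
        i = i + 1
"""
func = ast.parse(source).body[0]
print(scan_function(func))  # ({'i'}, True)
```

With this information available on entering the function, the code generator can emit the var line and pick the generator syntax before it visits any child node.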
The overall process can be described as:
Python source code -> Python ast -> Python source code compatible with Veloce mode -> Python ast -> JavaScript source code
Python builtins are written in Python code (!). IIRC there are a few restrictions related to bootstrapping types, but you have access to everything that Pythonium can translate in compliant mode. Have a look at https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pythonium/compliant/builtins/?at=master
The JS code generated by Pythonium's compliant mode can be understood, but source maps would help greatly.
The valuable advice I can give you in the light of this experience is the kind of thing old farts say:
extensively review the subject, both in the literature and in existing projects, closed source or free. When I reviewed the different existing projects I should have given it way more time and motivation.
ask questions! If I had known beforehand that the PyPy backend was useless because of the overhead due to the C/JavaScript semantic mismatch, I would maybe have had the Pythonium idea way before 6 months ago, maybe 3 years ago.
know what you want to do, have a target. For this project I had different objectives: practice a bit of JavaScript, learn more Python and be able to write Python code that would run in the browser (more on that below).
failure is experience
a small step is a step
start small
dream big
do demos
iterate
With Python Veloce mode only, I'm very happy! But along the way I discovered that what I was really looking for was liberating myself and others from JavaScript and, more importantly, being able to create in a comfortable way. This led me to Scheme, DSLs, models and eventually domain specific models (cf. http://dsmforum.org/).
About Ira Baxter's response:
The estimations are not helpful at all. It took me more or less 6 months of free time for both PythonJS and Pythonium. So I can expect more from 6 months full time. I think we all know what 100 man-years in an enterprise context can mean and not mean at all...
When someone says something is hard, or more often impossible, I answer that "it only takes time to find a solution for a problem that is impossible". In other words, nothing is impossible unless it's proven impossible, in this case by a math proof...
If it's not proven impossible then it leaves room for imagination:
finding a proof proving it's impossible
and
If it is impossible there may be an "inferior" problem that can have a solution.
or
if it's not impossible, finding a solution
It's not just optimistic thinking. When I started Python->JavaScript everybody was saying it was impossible. PyPy impossible. Metaclasses too hard. etc... I think that the only revolution that PyPy brings over the Scheme->C paper (which is 25 years old) is some automatic JIT generation (based on hints written in the RPython interpreter, I think).
Most people who say that a thing is "hard" or "impossible" don't provide the reasons. C++ is hard to parse? I know that, yet there are (free) C++ parsers. Evil is in the detail? I know that. Saying it's impossible alone is not helpful. It's even worse than "not helpful", it's discouraging, and some people mean to discourage others. I heard about this question via https://stackoverflow.com/questions/22621164/how-to-automatically-generate-a-parser-code-to-code-translator-from-a-corpus.
What would be perfection for you? That's how you define next goal and maybe reach the overall goal.
I am more interested in knowing what kinds of patterns I could enforce
on the code to make it easier to translate (ie: IoC, SOA ?) the code
than how to do the translation.
I see no patterns that cannot be translated from one language to another, at least in a less than perfect way. Since language to language translation is possible, you'd better aim for this first. I think, according to http://en.wikipedia.org/wiki/Graph_isomorphism_problem, that translation between two computer languages is a tree or DAG isomorphism. Even if we already know that they are both Turing complete, so...
Framework->Framework, which I better visualize as API->API translation, might still be something to keep in mind as a way to improve the generated code. E.g. Prolog has a very specific syntax, but you can still do Prolog-like computation by describing the same graph in Python... If I were to implement a Prolog to Python translator I wouldn't implement unification in Python but in a C library, and I would come up with a "Python syntax" that is very readable for a Pythonist. In the end, syntax is only "painting" to which we give a meaning (that's why I started Scheme). Evil is in the detail of the language, and I'm not talking about the syntax. Some concepts used in a language, like the getattribute hook, you can live without, but required VM features like tail-recursion optimisation can be difficult to deal with. You don't care if the initial program doesn't use tail recursion, and even if there is no tail recursion in the target language you can emulate it using greenlets or an event loop.
For target and source languages, look for:
Big and specific ideas
Tiny and common shared ideas
From this will emerge:
Things that are easy to translate
Things that are difficult to translate
You will also probably be able to know what will be translated to fast and slow code.
There is also the question of the stdlib or any other library, but there is no clear answer; it depends on your goals.
Idiomatic code and readable generated code also have solutions...
Targeting a platform like PHP is much easier than targeting browsers, since you can provide a C implementation of slow and/or critical paths.
Given that your first project is translating Python to PHP, at least for the PHP3 subset I know of, customising veloce.py is your best bet. If you can implement veloce.py for PHP then you will probably be able to run the compliant mode... Also, if you can translate PHP to the subset of PHP you can generate with php_veloce.py, it means that you can translate PHP to the subset of Python that veloce.py can consume, which would mean that you can translate PHP to JavaScript. Just saying...
You can also have a look at those libraries:
https://bitbucket.org/logilab/astroid
https://bitbucket.org/logilab/pylint-brain
Also you might be interested by this blog post (and comments): https://www.rfk.id.au/blog/entry/pypy-js-poc-jit/
This Google Tech Talk from Ira Baxter is interesting https://www.youtube.com/watch?v=C-_dw9iEzhA
You could take a look at the Vala compiler, which translates Vala (a C#-like language) into C.
