PHP - Cleanup the Junk

I have inherited a very messy project. There are at least 3 versions that I can tell in it.
Is there a utility that can trace the PHP code from the main index.php so that I can figure out what isn't being used and what is, or am I stuck doing a manual cleanup?
Thanks
*Update*
I don't think I've been clear about what I'm looking for, that or I'm not understanding how the products mentioned work. What I'm looking for is something that can run on a folder (directory) and step through the project and give me a report of which files are actually referenced or used (in the case of images, CSS, etc).
This project has several thousand files and it's a very small project. I'm trying to clean it up and when I do a "search in files" in my IDE I get 3 or 4 references and can't easily tell which one is the right one.
Hope that makes it a little clearer.

Cross referencing software really lets you explore which functions are used for what.
PHPXref is quite good.
For example, Yoast used it to cross-reference the WordPress PHP code. Take a look at the WordPress example to see how powerful it is.
For example, start by browsing the WP trunk. Click on some of the file names on the left and observe how the required files are listed, along with defined classes and methods, etc., etc.

There are several utilities that can do this. What first comes to mind is Zend Studio's built-in Optimizer, which will run through your files and issue notices on a per-file basis, including unused variables, warnings, etc. Alternatively, you can run your program with E_STRICT error reporting enabled and PHP will notify you of some of your issues.

Be very careful with such cleanup tools, especially in PHP or JavaScript. They work reasonably well in languages like Java, but any language that allows eval() can trip an automated tool up, sometimes in devilishly clever ways, depending on how clever the original code developer thought they were.

You need the inclued extension. You can generate include graphs using GraphViz; see below for example code.
There are some useful examples on PHP.net: http://www.php.net/manual/en/inclued.examples-implementation.php

You might want to check xdebug's code coverage, possibly as an auto_append. However, it's rather limited and it would require you to have either 100% test-cases (which I doubt as you say the project is a mess), or the tenacity to go through every possible action on the site, and even then you'll have to apply good judgement whether you can remove a portion of code because it isn't used, or leave it there because a certain condition just hasn't been met yet in your cases. On a side note: stepping through the code with xdebug's remote debugger has really helped me in the past to quickly get the different mechanisms & flows in unknown projects.
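As a rough illustration of the auto_append idea, here is a minimal sketch. It swaps in PHP's built-in get_included_files() as a coarser, file-level stand-in for Xdebug's line-level coverage, so it needs no extension. The function names log_included_files() and never_included() and the log location are made up for this example. The same caveat applies: a file missing from the log only means no request has hit it yet.

```php
<?php
// Hypothetical helper, meant to be called from a file loaded through
// auto_append_file (or at the very end of index.php). It appends every
// file PHP included during this request to a log, one path per line.
function log_included_files(string $logFile): array
{
    $files = get_included_files();
    file_put_contents($logFile, implode(PHP_EOL, $files) . PHP_EOL, FILE_APPEND | LOCK_EX);
    return $files;
}

// Hypothetical helper: compare the log against every .php file under the
// project root and return the ones that never showed up. These are
// candidates for review, not proof that a file is dead.
function never_included(string $projectRoot, string $logFile): array
{
    $seen = array_flip(array_filter(array_map('trim', file($logFile))));
    $missing = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($projectRoot, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && strtolower($file->getExtension()) === 'php') {
            $path = $file->getRealPath();
            if (!isset($seen[$path])) {
                $missing[] = $path;
            }
        }
    }
    sort($missing);
    return $missing;
}
```

After clicking through the site for a while (or leaving the log running for a few weeks), calling never_included() with the project directory and the log path gives a list of PHP files to inspect by hand. Images and CSS are not covered by this approach.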

I would try opening the whole project in NetBeans PHP; it's a great tool which we use for huge projects. You can easily see warnings and notifications and also follow usage of functions/classes easily. Try it!
I would recommend against automatic cleanups and the like. Even if the code seems to work afterwards, I wouldn't sleep very well at night...

Related

Understanding large php code, what techniques to use?

I have been handed a large, undocumented code base for an application written in PHP, as the original coder went AWOL. My task is to add new features, but I can't do that without understanding the code. I started poking around. Honestly, I am overwhelmed by the amount of source code. I have found:
It's well written, based on an MVC architecture, DB persistence, templating & OOP
It's modular; there is a concept of URL-based routing and basic templating
It uses a custom-written PHP framework which has no documentation. And there is no source control history (oops!)
There are over 500 files, each containing hundreds of lines of code. And every file has 3-4 require_once statements which include tons of other files, so it's kind of hard to tell which function/class/method is coming from where
Now I am looking for some techniques that I can use to understand this code. For example, consider the following code snippet:
class SiteController extends Common {
    private $shared;
    private $view;

    protected function init() {
        $this->loadShared();
        $this->loadView();
    }

    private function loadShared() {
        $this->shared = new Home();
    }

    private function loadView() {
        $this->view = new HomeView();
    }
}
I want to know:
Where are HomeView() & Home() defined? Where do $this->shared & $this->view come from? I checked the rest of the file, and there is no method named shared or view, so obviously they are coming from one of the hundreds of classes being included using require_once(). But which one? How can I find out?
Can I get a list of all the functions or methods that are being executed? If yes, then how?
This class SiteController overrides a base Common class, but I am unable to find out where this Common class is located. How can I tell?
Further, please share some techniques that can be used to understand existing code written in PHP.
First, in this kind of situation, I try to get an overview of the application : some kind of global idea of :
What the application (not the code !) does
How the code is globally organized : where are the models, the templates, the controllers, ...
How each type of component is structured -- once you know how a Model class works, others will typically work the same way.
Once you have that global idea, a possibility to start understanding how the code works, if you have some time before you, is to use a PHP Debugger.
About that, Xdebug + Eclipse PDT is a possibility -- but pretty much all modern IDEs support that.
It'll allow you to go through the generation of a page step by step, line by line, understanding what is called, when, from where, ...
Of course, you will not do that for the whole application !
But as your application uses a Framework, there are high chances that all parts of the application work kind of the same way -- which means that really understanding one component should help understanding the other more easily.
As a couple of tools to understand what calls what and how and where, you might want to take a look at :
The inclued extension (quoting) : Allows you trace through and dump the hierarchy of file inclusions and class inheritance at runtime
Xdebug + KCacheGrind will allow you to generate call-graphs ; XHProf should do the same kind of thing.
Using your IDE (Eclipse PDT, Zend Studio, PhpStorm, NetBeans, ...), ctrl+click on a class/method should bring you to its declaration.
Also note that an application is not only code: I often find it very useful to reverse-engineer the database, to generate a diagram of all tables.
If you are lucky, there are foreign keys in your database -- and you'll have links between tables, this way ; which will help you understand how they relate to each other.
You need an IDE. I use netbeans for PHP and it works great. This will allow you to find out where the homeview/home classes are by right clicking and selecting a "find where defined" option or something similar.
You can get a list. This is called the stack. Setting up a debugger like xdebug with the IDE will allow you to do this.
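If setting up a debugger is not an option right away, PHP's built-in debug_backtrace() gives the same stack from inside the code. A minimal sketch follows; the helper name call_chain() is made up for this example.

```php
<?php
// Hypothetical helper: return the current call chain as "Class->method" or
// "function" strings, innermost caller first.
function call_chain(): array
{
    $chain = [];
    // Frame 0 describes the call to call_chain() itself, so skip it.
    $frames = array_slice(debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS), 1);
    foreach ($frames as $frame) {
        $chain[] = ($frame['class'] ?? '') . ($frame['type'] ?? '') . $frame['function'];
    }
    return $chain;
}
```

Dropping error_log(implode(' <- ', call_chain())); into a method such as loadView() shows which methods led to that call for the current request. It only reports the chain that reached that one point, not every function executed; for the latter a profiler or function trace is the better fit.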
grep is the only thing that lets me survive code like this
Look inside of the script where you found this code snippet for additional included or required pages that PHP imported into the main script. Those scripts should define those classes that are being instantiated.
Sorry, not sure if you can find which functions/methods have been executed. I know you can find if they exist, and you can find the generated output of them... but not sure if they have been executed.
It is important to note that SiteController doesn't override the Common class; it extends it, or builds on top of it, the way a building is built on a foundation. The Common class is the foundation. Again, check the included and required scripts to see where Common was defined.
Hope that helps,
spryno724
I would start with:
Throwing an exception at certain points to see a stack trace showing where the call originated.
grep for "class Common", for example
create a directory listing to get a feeling for the organization of the software
use get_included_files(); to see what is actually used for a certain call
Start documenting what I find out
Start working with an IDE, like NetBeans, Eclipse or Zend Studio
Figuring out class hierarchies with maybe this "php: determining class hierarchy of an object at runtime" approach
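For the "where is this class defined, and what does it extend" part of that list, PHP's reflection API can answer at runtime. A minimal sketch: describe_class() is a made-up helper name, and the two empty classes only stand in for the ones in the question so that the example runs on its own.

```php
<?php
// Hypothetical helper: report the file and line where a loaded class is
// declared, plus its chain of parent classes (nearest parent first).
// The class must already be loaded, or be loadable by an autoloader.
function describe_class(string $className): array
{
    $ref = new ReflectionClass($className);
    return [
        'file'    => $ref->getFileName(),  // false for built-in classes
        'line'    => $ref->getStartLine(),
        'parents' => array_values(class_parents($className)),
    ];
}

// Stand-ins for the classes from the question.
class Common {}
class SiteController extends Common {}

print_r(describe_class('SiteController'));
```

In the real application you would drop the stand-in classes and call describe_class('Home') or describe_class('Common') somewhere after the framework has loaded its files, for example at the end of init().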
You seem to realize that you can't read/digest every file, so you've got to focus on the important ones. Looks like you've started that process with SiteController.
Hopefully between reading the requires and using your IDE you can chase down the Home() and HomeView()
There might be a few key XML files that dictate the mappings from URLs to controller files, so you'll want to figure out how they work also.
I've worked with a poorly documented (but decently working) custom framework before, and your situation seems pretty similar. I found things pretty smooth once I understood the main controller and basically formed an understanding for how URL requests were processed.
1) You can use a search tool such as grep to find code, including definitions. But on a big code base, grep is slow, and it gives a lot of false positives because it has no understanding of the PHP language.
Our Search Engine is a GUI-based tool that indexes your source code to achieve extremely fast lookup, indexing by the language elements (variable names, constants, keywords, strings, ..) and allowing you to formulate queries that honor the language structure (e.g., it ignores whitespace and comments unless you say you want to see them). A query shows hits in a hit window, and a click takes you to the file/line in which the hit occurs. With some tiny bit of additional configuration, you can go from the code window into your favorite editor.
2) Sometimes you want to know where specific functionality exists, but you have no clue what to search for. Here a test coverage tool can really help. Simply set up test coverage for the (working) application, and exercise the functionality manually; what is "covered" is potentially the code you care about. Exercise something which is NOT the feature; what is covered is NOT the code you want. This is way easier than trying to run a debugger to find the code of interest. Our PHP Test Coverage tool can provide you this coverage, and not only show you the covered code in a GUI, but also do that "coverage subtraction" so that you can see just the relevant code.
Start from the entry point of the application (usually index.php) and go deeper on what gets called when.
Give PhpStorm a go. It's an IDE with excellent code analysis features: it can go to the definition of any class or variable, show the inheritance hierarchy, find usages, and do many other useful things.
I'll also plug my own tool:
http://raveren.github.io/kint/
It works with zero setup and is extremely useful for getting a grip on what's going on where. Use Kint::trace(); to see a pretty execution backtrace and d(get_defined_vars()); to see what is defined in the current context, and eventually you'll get there.

parse/collator for php

I'm pretty much a newbie at php (at the "install an app and try to tweak it a bit" stage).
Is there a tool anywhere which can take a script which is spread over many files and show you all the code which is processed (for a given set of arguments passed to the script) in a single output?
For example, I want to make a call to zen cart from a script in a different language, which returns a category listing without any surrounding page. So I want to be able to trace what the actual process is to generate that then strip off all the unwanted bits to create a custom script.
One thing I've found very helpful when looking at new / complicated codebases is to use an IDE with some sort of code intelligence. I use php eclipse, and what it does is allow you to jump into function and variable definitions either by means of hyperlinking, or popups. This can be incredibly helpful for navigating through sprawling projects because you don't have to go through all the trouble to search by hand.
In your case, with php, the best thing to do is find the entry point for a page that pulls in your list of categories. Once you find that, you can use eclipse to expand out the various function calls it makes. Being a beginner, it's very helpful to read through code in this manner, as it exposes you to lots of different ways of doing things. An additional bonus of using something like eclipse is that it provides integration with the PHP manual. So anytime you encounter a function you don't know, you can hover over, see the manual, and also how it would be used in context.
What you want is called a "backwards slice" ("all the code that contributes to a specific computed result") in the computing theory literature. To compute the backward slice, something needs to parse the language, compute all the influences (control and dataflow) on a selected point in the program, and then display those points to you.
Slicing tools exist for languages like C. They may exist for Java (as academic versions). I don't know of any that exist for PHP.
Another way to discover the code involved in an action is to run a test coverage tool. Such a tool marks all the code (across many files) that gets executed for a specific action (usually a "unit test" but test coverage tools really don't care). Then you simply exercise the action you care about, and look at the test coverage data. A graphical display will make it easy to see what code was executed; the part you want is buried in all the executed code.
A PHP Test Coverage tool does exist and will provide nice displays of the covered code.
If you are looking for a debugger of some sort, have a look at XDebug or ZendDebugger.

Is using multiple PHP includes a bad idea?

I'm in the process of creating a PHP site. It uses multiple PHP classes, which are currently all in one PHP include.
However, I am using Aptana IDE, and the file is now starting to crash it (it's around 400 lines). So I was wondering whether there would be any negative impact of including all the files separately.
Current:
main file:
include("includes.php");
includes.php:
contains php classes
Suggested:
main file:
include("includes.php");
includes.php:
include("class1.php");
include("class2.php");
Multiple PHP includes are fine, and 400 lines should not be a big deal. My concern would be with the Aptana IDE before I'd even consider my code to be the problem.
Breaking up your code into multiple PHP modules helps you to take a more object-oriented approach and simplifies your code base. I recommend it.
An IDE crashing because of a 400 line file? I'd find a new IDE.
However, it is better to separate classes into separate files. Perhaps not strictly one class per file, but only closely related classes in the same file.
For just two files, the cost won't be too great; for hundreds of files, it might be a bit more... But then, another problem to consider is "how do I determine what goes into which file?"
A nice answer to that is "one class per file" and, for those, "one directory per functional item".
You might want to consider using an opcode cache, if you can install extensions on your server; for instance, I almost always work with APC (see also the PHP manual), which is quite easy to install and really good for performance (it can sometimes halve the CPU load of a server ^^ )
Just as a side note: if Aptana can't handle 400-line files, you should really think about using another IDE ^^
(Eclipse PDT is not bad if you have 2 GB of RAM -- it's Eclipse-based, like Aptana, so it shouldn't feel too "new")
Personally, I like to include the files separately. If you include every class on every page, it just increases parsing overhead by processing lots of code that probably isn't even used on that page along with the associated overhead of reading the files from disk, etc.
It's negative in the sense that it requires more disk I/O. However, in production you should use an opcode cache anyway, and this will negate much of the negative impact.
On the positive side, you will achieve a better code structure, where each class belongs to a single file. This makes testing easier, and also allows you to auto-load classes on demand, thus reading only the necessary files.
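To make "auto-load classes on demand" concrete, here is a minimal sketch built on spl_autoload_register(). It assumes one class per file, with each file named after its class and stored in a single directory; the helper name register_class_dir() and that layout are assumptions for the example, not something from the question.

```php
<?php
// Hypothetical helper: register an autoloader that maps a class name to
// "<dir>/<ClassName>.php". Namespace separators become directory separators.
function register_class_dir(string $dir): void
{
    spl_autoload_register(function (string $class) use ($dir): void {
        $relative = str_replace('\\', DIRECTORY_SEPARATOR, $class) . '.php';
        $file = $dir . DIRECTORY_SEPARATOR . $relative;
        if (is_file($file)) {
            require $file;  // loaded only when the class is first used
        }
    });
}
```

With register_class_dir(__DIR__ . '/classes'); at the top of the main file, the long list of include() calls goes away, and a page that never touches a given class never reads its file from disk.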
I think your includes should generally only go one 'level' deep, unless you have a really good reason otherwise. What will happen is you will end up chasing down some issue and going on wild goose chases through include files, and you might even end up using stuff like "include_once" or "require_once", which is almost certainly a code smell.
Multiple includes are the best way to organize your code well, and I recommend them too. In some cases, though (as in mine), only the first include gets executed; I don't know why, and I'm stuck on it.

Visualise OO PHP code

Does something exist that I can point to my PHP project and it can look at all the files (or just the ones that I specify) and generate a diagram based on the objects and function calls?
It would be a good way to verify that my design is actually being implemented :)
Background:
I'm trying to build a PHP website using OO principles and while, so far, it is working I still have a ways to go and already the complexity is getting out of control.
I mean, I understand basically what's going on but (and I don't think I'm alone here) it's really helpful to me if I can visualise the system at once and see the flow so I can optimise, remove unnecessary things and of course, build on the foundations.
I could sit down with pen & paper and draw it (and I have done that for parts) but if there was some program that would generate an image, it would be much simpler. Plus I could do it more often.
Thanks :)
This answer is I think still valid for PHP, but I am not sure if it is totally what you want. I know some of the tools (e.g. Doxygen) work with PHP
PHPDoc will create a class tree from your source code, but just in text (well, HTML). Not a pretty graph.
If you use profiling with xdebug you can get cachegrind files to open up with WinCacheGrind or similar. More info here.
You should check out nWire. The first nWire for PHP beta was just released. It is an interactive tool which lets you visualize practically every possible association in your code.

Tools for cleaning garbage PHP

I just inherited a 70k line PHP codebase that I now need to add enhancements onto. I've seen worse, at least this codebase uses an MVC architecture and is object oriented. However, there is no templating system and many classes are deprecated - only being called once. I think my method might be the following:
Find all the files on the live server that have not been touched in 48 hours and make them candidates for deletion (luckily there is a live server).
Implement a template system (Smarty) and try to find duplicate code in the templates.
A lot of the methods have copied and pasted code ... I don't know how much I want to mess with it.
My questions are: Are there steps that I should take or you would take? What is your method for dealing with this? Are there tools to help find duplicate PHP code?
Find all the files on the live server that have not been touched in 48 hours and make them candidates for deletion (luckily there is a live server)
By "touched" I'm assuming you'll stat the file to see if it's been accessed by any part of the system. I'd go a month and a half on this rather than 48 hours. In older PHP code bases you'll often find there's a bunch of code lying around that gets called via a local cron job once a week or once a month, or a third party is calling it remotely as a pseudo-service on a regular basis. By waiting 6 weeks you'll be more likely to catch any and all files that are being called.
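A minimal sketch of that stat-based check, using the access time PHP exposes through SplFileInfo::getATime(). The function name and the 42-day default are made up for the example. It also assumes the server's filesystem actually records access times: mounts using noatime never update them, and relatime only updates them about once a day, so check the mount options before trusting the result.

```php
<?php
// Hypothetical helper: list files under $root whose last access time is
// older than $days days. $now can be passed in to make the check repeatable.
function files_not_accessed_since(string $root, int $days = 42, ?int $now = null): array
{
    $cutoff = ($now ?? time()) - $days * 86400;
    $stale = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && $file->getATime() < $cutoff) {
            $stale[] = $file->getPathname();
        }
    }
    sort($stale);
    return $stale;
}
```

The result is a list of deletion candidates only. Moving them to a quarantine directory for a while before deleting anything is the safer route.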
Implement a template system (Smarty) and try to find duplicate code in the templates.
Why? Serious question, is there a reason to implement a template system? (non-PHP savvy designers, developers who get you into trouble by including too much logic in the Views, or you're the one creating templates, and you know you work much faster in smarty than in PHP). If not, avoid it and just use PHP.
Also, how realistic is it to implement a pure smarty template system? I'd give favorable odds that old PHP systems like this are going to have a ton of "business logic" mixed in with their views that can't be implemented in pure smarty, and if you allow mixed PHP/Smarty your developers will use PHP every time.
A lot of the methods have copied and pasted code ... I don't know how much I want to mess with it.
I don't know of any code analysis tools that will do this out of the box, but it should be possible to whip something up with the tokenizer functions.
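A minimal sketch of what "whipping something up" with token_get_all() could look like. It strips whitespace and comments, then hashes every run of N consecutive tokens and reports runs that occur more than once. The function names and the window size are made up for this example, and it only catches literal copies (renamed variables will slip through).

```php
<?php
// Reduce PHP source to a list of [text, line] pairs, ignoring whitespace,
// comments and the opening tag, so formatting does not hide copied code.
function significant_tokens(string $code): array
{
    $out = [];
    $line = 1;
    foreach (token_get_all($code) as $tok) {
        if (is_array($tok)) {
            [$id, $text, $line] = $tok;
            if (in_array($id, [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT, T_OPEN_TAG], true)) {
                continue;
            }
            $out[] = [$text, $line];
        } else {
            // Single-character token such as ";" or "{": it carries no line
            // number, so reuse the line of the previous array token.
            $out[] = [$tok, $line];
        }
    }
    return $out;
}

// Hash every run of $window consecutive tokens and return the runs that
// occur more than once, as hash => list of "name:line" start positions.
// $sources maps a display name (e.g. a file path) to its source code.
function find_duplicate_runs(array $sources, int $window = 20): array
{
    $seen = [];
    foreach ($sources as $name => $code) {
        $tokens = significant_tokens($code);
        for ($i = 0; $i + $window <= count($tokens); $i++) {
            $texts = array_column(array_slice($tokens, $i, $window), 0);
            $seen[md5(implode(' ', $texts))][] = $name . ':' . $tokens[$i][1];
        }
    }
    return array_filter($seen, fn(array $hits): bool => count($hits) > 1);
}
```

Feeding it an array built with file_get_contents() over the project's PHP files gives a rough map of copy/paste hot spots. A long copied block shows up as many overlapping runs, so expect to merge adjacent hits by eye.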
What You Should Really Do
I don't want to dissuade you or demoralize you, but why do you want to clean up this code? Right now it's doing what it's supposed to do. Stupidly, but it's doing it. Every re-factoring project is going to put current, undocumented, possibly business critical functionality at risk and at the end of that work you have an application that's doing the exact same thing. It's 70k lines of what sounds like shoddy code that only you care about fixing, no matter what other people are telling you their priorities are. If their priority was clean code, their code would already be clean. One person can't change a culture. Unless there's a straightforward business case for that code to be cleaned (open sourcing the project as a business strategy?), that legacy code isn't going anywhere.
Here's a different set of priorities to consider with legacy PHP applications:
Is there a singleton database object or pair of objects that allows developers to easily set up separate connections for read (slave) and write (master)? A lot of legacy PHP applications will instantiate multiple connections to the same database in a single page call, which is a performance nightmare.
Is there a straightforward way for developers to avoid SQL injection? Give this to them for new code (parameterized SQL), and consider fixing legacy SQL to use this new method, but also consider security steps you can take on the network level.
Get a test framework of some kind wrapped around all the legacy code and treat it as a black box. Use those tests to create a centralized API developers can use in place of the myriad function calls and copy/paste code they've been using.
Develop a centralized system for configuration values. Most legacy PHP code is some awful combination of defines and class constants, which means any config change means a code push, which means potential DOOM.
Develop a lint that's hooked into the source control system to enforce code sanity for all new code, not just for style, but to make sure that business logic stays out of the view, that the SQL is being constructed in a safe way, that those old copy/paste libraries aren't being used, etc.
Develop a sane, trackable build and/or push system and stop people from hacking on code live in production.
I don't know of any specific tools, but I have worked on re-factoring some fairly large PHP projects.
I would recommend a templating system, either Smarty or a strict PHP system that is clearly explained to anybody working on the project.
Take discrete, manageable sections and re-factor on a regular basis (e.g., this week, I'm going to re-write this). Don't bite off more than you can chew and don't plan to do a full rewrite.
Also, I do regular code searches (I use Eclipse and search through the files in my project) on suspect functions and files. Some people are too scared to make big changes, but I would rather err on the bold side rather than accept messy and poorly organized code. Just be prepared to test, test, test!
You need to identify a solid reason for refactoring. Removing duplicate code is not really a very good one; it needs to be coupled with a real desired improvement, such as reducing memory footprint (useful if the webservers are struggling).
Once you have that in mind, now you can start refactoring. And make sure you have a version-control repository, too. Just don't check in broken code.
Don't be too hasty about single-use classes. A lot of small PHP frameworks work like that. Often they could be abstracted better, though. Also, a lot of PHP code doesn't understand data layer abstraction, with the result that there is SQL code littered through the business logic or even the display code. This problem is often coupled with no custom database handler, which is a problem if you suddenly have to teach it about replication, or caching. This is the same abstraction problem from the other direction.
One very practical step: once you start abstracting repeated code away, you'll find reasons to have multiple files open. If you're using a shell and a Unix editor, then screen will help you immensely.
