Tips for refactoring a 20K lines library [closed]


Closed 8 years ago.
I've already awarded a 100 point bounty to mario's answer, but might start a second 100 point bounty if I see new good answers coming in. This is why I'm keeping the question open and will not choose a final answer, despite having awarded the bounty to mario.
This might seem like a simple question (study the code and refactor) but I'm hoping those with lots more experience can give me some solid advice.
The library is an open source 20,000 line library that's all in a single file and which I haven't written myself. The code looks badly written, and the single file is an even bigger problem, because it freezes Eclipse for at least half a minute every time I want to make a change. That is one of the reasons I think it's worth refactoring this library into smaller classes.
So aside from reading the code and trying to understand it, are there common (or not so common) tips when refactoring a library such as this? What do you advise to make my life a little easier?
Thanks to everyone for your comments.

A few generic principles apply:
Divide and conquer. Split the file into smaller, logical libraries and function groupings. You will learn more about the library this way, and make it easier to understand and test incrementally.
Remove duplication. Look for repeated functions and concepts, and replace them with standard library functions, or centralized functions within the library.
Add consistency. Smooth out parameters and naming.
Add unit tests. This is the most important part of refactoring a library. Use JUnit or a similar framework (PHPUnit, for PHP), and add tests that you can use to verify both that the functions are correct and that their behavior has not changed.
Add docs. Document your understanding of the consistent, improved library as you write your tests.

If the code is badly written, it is likely that it has a lot of cloning. Finding and getting rid of the clones would then likely make it a lot more maintainable as well as reducing its size.
You can find a variety of clone detectors, these specifically for PHP:
Bergmann's PHPCPD
SourceForge PMD
Our CloneDR
ranked in least-to-most capability order (IMHO with my strong personal self-interest in CloneDR) in terms of qualitatively different ability to detect interesting clones.
If the code is badly written, a lot of it might be dead. It would be worthwhile to find out which part executes in practice, and which does not. A test coverage tool can give you good insight into the answer for this question, even in the absence of tests (you simply exercise your program by hand). What the test coverage tool says executes, obviously isn't dead. What doesn't execute... might be worth further investigation to see if you can remove it. A test coverage tool is also useful to tell you how much of the code is exercised by your unit tests, as suggested by another answer. Finally, a test coverage tool can help you find where some of the functionality is: exercise the functionality from the outside, and whatever code the test coverage tool says is executed is probably relevant.
Our PHP Test Coverage Tool can collect test coverage data.

If it's an open source library, ask the developers. First, it's very likely someone has already attempted a restructured version. And very occasionally the big bloated version of something was actually auto-generated from a more modular version.
I actually do that sometimes for one of my applications which is strictly pluginized, and allows a simple cat */*.php > monolithic.php, which eases distribution and handling. So ask if that might be the case there.
If you really want to restructure it, then use the time-proven incremental extension structure. Split up the class library into multiple files by segregating the original class. Split every ~2000 lines, and name the first part library0.php:
class library0 {
    var $var1, $var2, $var3, $var4;
    function method1() { /* ... */ }
    function method2() { /* ... */ }
    function method3() { /* ... */ }
    function method4() { /* ... */ }
    function method5() { /* ... */ }
}
The next part simply goes from there and holds the next few methods:
class library1 extends library0 {
    function method6() { /* ... */ }
    function method7() { /* ... */ }
    function method8() { /* ... */ }
    // ...
}
Do so until you have separated them all. Call the last file by its real name library.php, and class library extends library52 { should do it. That's so ridiculously simplistic, a regex script should be able to do it.
Now obviously, there are no memory savings here. And splitting it up like that buys you nothing in terms of structuring. With 20000 lines, however, it's difficult to get a quick overview and a sensible grouping right the first time. So start with an arbitrary restructuring in lieu of an obvious plan. Going from there, you could very well sort the code, put the least useful parts into the last file, and use the lighter base classes whenever they suffice. You'll need a dependency chart, however, to see if this is workable, or else errors might blow up at runtime.
(I haven't tried this approach with a huge project like that. But arbitrarily splitting something into three parts, and then reshuffling it for sensibility did work out. That one time.)

I assume you are planning to break the library up into thematically relevant classes. Definitely consider using autoloading. It's the best thing since sliced bread, and makes inter-dependencies easy to handle.
Document the code using phpDoc compatible comments from the start.
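A minimal sketch of the autoloading idea, assuming the split-up classes live one per file in a single directory and each file is named after its class. The `makeAutoloader` helper, the temporary directory and the `Greeter` class are hypothetical names used only for illustration; `spl_autoload_register` is PHP's standard hook for this.

```php
<?php
/**
 * Builds an autoloader that maps a class name to "<baseDir>/<ClassName>.php".
 *
 * @param string $baseDir Directory holding one class per file (assumed layout).
 * @return callable Autoloader suitable for spl_autoload_register().
 */
function makeAutoloader(string $baseDir): callable
{
    return function (string $class) use ($baseDir): void {
        // Namespace separators become directory separators.
        $file = $baseDir . '/' . str_replace('\\', '/', $class) . '.php';
        if (is_file($file)) {
            require $file;
        }
    };
}

// Demo: write a throwaway class file to a temp directory, then let PHP find it.
$dir = sys_get_temp_dir() . '/autoload_demo_' . uniqid();
mkdir($dir);
file_put_contents(
    $dir . '/Greeter.php',
    '<?php class Greeter { function hello() { return "hello"; } }'
);

spl_autoload_register(makeAutoloader($dir));

$greeter = new Greeter();   // no include/require needed at the call site
echo $greeter->hello(), "\n";
```

With something like this in place, moving a class to its own file no longer means hunting down every include statement that referenced the monolithic file.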

Calling Side Approach
If you know the library use is limited to a particular class, module, or project it can be easier to approach the problem from the calling side. You can then do the following to clean the code and refactor it. The point of approaching from the calling side is because there are very few calls into the library. The fewer the calls the (potentially) less code that is actually used in the lib.
Write the Calling Side Tests
Write a test that mimics the calls that are done against the library.
Bury the Dead Code
If there is a lot of dead code this will be a huge win. Trace the actual calls into the library and remove everything else. Run the test and verify.
Refactor What's Left
Since you have the tests it should be much easier to refactor (or even replace) the code in the library. You can then apply the standard refactoring rules (de-duplication, simplification, consolidation, etc.).
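The calling-side test can be sketched without any test framework as a "record, then compare" check. This is only an illustration: `legacy_slugify` is a hypothetical stand-in for whatever the calling code really invokes, and the snapshot file location is an assumption.

```php
<?php
// Hypothetical stand-in for a function inside the legacy library.
function legacy_slugify($title)
{
    $slug = strtolower(trim($title));
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    return trim($slug, '-');
}

/** Runs $fn on every input and returns input => output pairs. */
function characterize(callable $fn, array $inputs): array
{
    $results = [];
    foreach ($inputs as $input) {
        $results[$input] = $fn($input);
    }
    return $results;
}

/**
 * First run: records the current behaviour in $snapshotFile and returns [].
 * Later runs: returns the inputs whose output differs from the recording.
 */
function checkAgainstSnapshot(callable $fn, array $inputs, string $snapshotFile): array
{
    $current = characterize($fn, $inputs);
    if (!is_file($snapshotFile)) {
        file_put_contents($snapshotFile, json_encode($current));
        return [];
    }
    $recorded = json_decode(file_get_contents($snapshotFile), true);
    $changed = [];
    foreach ($current as $input => $output) {
        if (!array_key_exists($input, $recorded) || $recorded[$input] !== $output) {
            $changed[] = $input;
        }
    }
    return $changed;
}

// Mimic the calls the application actually makes (string inputs assumed).
$inputs = ['Hello World', '  Tips & Tricks  ', 'already-a-slug'];
$snapshot = sys_get_temp_dir() . '/slugify_snapshot_' . uniqid() . '.json';

$firstRun  = checkAgainstSnapshot('legacy_slugify', $inputs, $snapshot); // records
$secondRun = checkAgainstSnapshot('legacy_slugify', $inputs, $snapshot); // compares
```

Run the comparison after every removal or refactoring step; a non-empty result names the calls whose behaviour changed.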

Apart from what was already stated I suggest to have a look at Martin Fowler's Catalog of Refactorings based on his book. The page also contains a large number of additional sources useful in understanding how refactoring should be approached. A more detailed catalog listing can be found at sourcemaking. Note that not all of these techniques and patterns can be applied to PHP code.
There are also a lot of useful tools to assist you with refactoring (and in general) at http://phpqatools.org. Use these to analyze your code to find things like dead or duplicated code, high cyclomatic complexity, often executed code and so on. Not only will this give you a better overview of your code, but it will also tell you which portions of your code are critical (and better left untouched in the beginning) and which could be candidates for refactoring.
Whatever you do, do write Unit-Tests. You have to make sure you are not breaking code when refactoring. If the library is not unit-tested yet, add a test before you change any code. If you find you cannot write a test for a portion of code you want to change, check if doing a smaller refactoring in some other place might let you do so more easily. If not, do not attempt the refactoring until you can.

1. Write tests for the library such that all lines of the code are covered (i.e., 100% coverage).
2. Use TDD. Start from the higher-level module and refactor (top-to-bottom approach).
3. Run the tests mentioned in step 1 and verify them against the results of step 2.
I understand that 100% coverage (as mentioned in step 1) does not necessarily mean that all the features have been covered, but at least we are making sure that the output of the new system will be the same as the output of the current system.

A good book that answers your question with a lot of examples and details is: Working Effectively with Legacy Code, by Michael Feathers.

First of all, consider using a different IDE - Eclipse is notoriously terrible in terms of performance. Komodo is way faster. So is PhpStorm.
In terms of making the refactoring easier, I'd first try to identify the high-level picture - what functions are there? Are there classes? Can you put those classes into separate files just to start with?

http://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882
Refactoring depends on your goals and the type of solution. This book will help you understand the basic concepts of good code.

If your problem includes the headache of manually placing the functions in different files, then the strategy below may help.
Get your library file into a PHP variable:
$code = file_get_contents('path/to/your/library.php');
eliminate tags
$code = str_replace('<?php' ,'' ,$code);
$code = str_replace('?>' ,'' ,$code);
separate all the functions
$code_array = explode('function',$code);
now body of all the functions and their names are in array
create separate files for each of the functions in folder 'functions'
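foreach ($code_array as $function) {
    $funcTemp = explode('(', $function); // the text before the first "(" is the function name
    $function_name = trim($funcTemp[0]);
    if (!preg_match('/^\w+$/', $function_name)) {
        continue; // skip the leading chunk and anything that is not a plain function name
    }
    $function_text = '<?php function ' . $function;
    file_put_contents('functions/' . $function_name . '.php', $function_text);
}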
Now all the functions of your library are in separate files in a common folder, and the files are named after the functions. You can easily look up your functions in the folder view and apply your strategies to manage them.
You can also implement a __call() function so that existing calls keep the same form:
function __call($name, $params)
{
    include_once('functions/' . $name . '.php');
    return call_user_func_array($name, $params); // pass the arguments individually and return the result
}
Hope it helps :)

Usually, a general rule of thumb is to remove repeated code. Also make sure to have useful documentation. If you're using Java, Javadoc is very useful, but a suitable equivalent is available for other languages.

Related

What is the core purpose of the IDE "Refactor" option?

To change the name of a folder or file, NetBeans and Eclipse both offer an option called Refactor.
What actually is the core purpose of this option? Or, the other way around, how can we benefit from this option during development?
Please explain it with a "hello world" type example.

Refactoring is a major feature in agile software development. It came about as the third step in test-driven development:
Write a test
Write enough code to pass the test
Refactor
To refactor means the code is working, but it needs to be cleaned up to be more maintainable and extensible. This could be something simple like renaming a variable or extracting common code into a function or something complicated like extracting an interface or adding a design pattern.
I don't have numbers, but I think the majority of developers don't do TDD. Still, you should refactor often to clean up technical debt (another agile term) and make your code maintainable and extensible. It is such a huge help when you need to add a bunch of new features in a short amount of time.
So all these IDEs, regardless of language or platform, offer automated refactorings to make your life easier. For example, changing a method name in a Java class and then changing it everywhere it is referenced would be a huge pain to do manually.
Hope that helps.
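A "hello world" sized sketch of two of the refactorings mentioned above, renaming a variable and extracting common code into a function. All function names here are made up for the illustration; the point is only that the output stays identical while the structure improves.

```php
<?php
// Before: the greeting logic is duplicated and the variable name says nothing.
function buildGreetingsBefore(): string
{
    $x = 'World';
    $out = 'Hello, ' . ucfirst($x) . "!\n";
    $x = 'php';
    $out .= 'Hello, ' . ucfirst($x) . "!\n";
    return $out;
}

// After "Extract Function" and "Rename Variable": same output, clearer structure.
function greet(string $name): string
{
    return 'Hello, ' . ucfirst($name) . "!\n";
}

function buildGreetingsAfter(): string
{
    return greet('World') . greet('php');
}

echo buildGreetingsAfter();
```

An IDE's Refactor menu automates exactly these mechanical steps and updates every reference for you, which is where it saves time on anything larger than this toy.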

According to here:
"Refactoring is the process of changing a software system in such a
way that it does not alter the external behavior of the code yet
improves its internal structure."
The most common uses of this are usually renaming variables to something that makes more sense, fixing indentation, and adding comments, such as JavaDoc/NDoc.
In certain scenarios, one might also need to rewrite some logic, or entire modules of the code. This usually follows from either a bad design or a solution that has been changed so much that a rewrite will ease any future maintenance.

What should be the standard PHP code file length in LOC?

I do PHP coding a lot in my company and personal work. Usually my files get bigger, sometimes more than 2000-3000 lines long. Then, they get difficult to manage.
My Question: What should be (is) the standard length of a PHP code file in terms of lines-of-code. At what length do you guys split it up?
Note: No Object Oriented programming (I don't use classes). Please answer accordingly.
Clarification of not using classes:
I do use functions a lot.
I don't use classes because the code is legacy. I have to maintain that and add new features.
I was a C programmer before. So, going OO is somewhat tough for me. Like learning whole new way of doing things.

There is no good standard length. Some files grow bigger, some smaller.
A good guiding principle from Object Oriented Programming is separating tasks and concerns into classes, and splitting those classes into separate files.
That is the most logical separation, and allows using PHP 5's Autoloading. The basic principles may be worth adopting even if you don't want to get into serious OOP.
Related questions:
What are the advantages/disadvantages of monolithic PHP coding versus small specialized php scripts?

Code should not be split according to number of lines of code, it should be split according to functionality. Parts of your code that handle, say, templating, should go in different files (and possibly directories) than parts that handle, say, authentication. If you have a file that's thousands of lines long, it's almost certainly doing way too much and needs to be split up, if not refactored entirely.

Maybe you should start using classes then.
BTW, I definitely split the PHP code files at 1000 lines of code.

Use classes and OO programming. I once attended a workshop called "make love to your code" that advised avoiding functions that are longer than the space on your monitor (you should not have to scroll to see the whole function).

Even quite large code files can be reasonably easy to manage if you organise them well. You should keep your functions short, keep related functions together, and name them well.
You will also find it easier to manage if you use an IDE with a function lookup table. I use NetBeans, and on the left-hand side it gives me a panel with quick links to all the functions in my current file. It also gives me the ability to click on a line where a function is called and jump to the declaration (anywhere in the project).
On the other hand, if you have code files several thousand lines long which consist of a single function, then yes, the odds are it will be very hard to manage, and no amount of IDE cleverness will help.

How many lines of PHP code is too many for one file?

I'm creating a PHP file that does 2 mysql database calls and the rest of the script is if statements for things like file_exists and other simple variables. I have about 2000 lines of code in this file so far.
Is it better practice to include a separate file if a statement is true; or simply type the code directly in the if statement itself?
Is there a maximum number of lines of code for a single file that should be adhered to with PHP?

I would say there should not be any performance issue related to the number of lines in your php files, it can be as big as you need.
Now, for the patterns and best practices, I would say that you have to judge for yourself. I have seen many well-organized files of several thousand lines, and a lot of small files that were difficult to read.
My advice would be:
Judge the readability of the source code, and always organize it well.
It's important to have a logical separation to some extent. If your file does everything (heavy database access, writing, modification, HTML rendering, Ajax and so on), you may want to separate things or use an object-oriented approach.
Always look for a balance between logical separation and the amount of code. It should be neither messy nor extra-neat with a lot of 10-line files.

2000 lines of code in a single file is not exactly bad from a computer point of view but in most situations is probably avoidable, take a look into the MVC design pattern, it'll help you to better organize your code.
Also, bear in mind that including (a lot of) files will slow down the execution of your code.

You may want to read a book like Clean Code by Bob Martin. Here are a few nuggets from that book:
A class should have one responsibility
A function should do one thing and do it well
With PHP, if you aren't using the Class approach; you're going to run into duplication problems. Do yourself a favor and do some reading on the subject; it'll save you a lot more time in extending and maintenance.

Line count is not a good indicator of performance. Make sure that your code is organized efficiently, divided into logical classes or blocks and that you don't combine unrelated code into single modules.
One of the problems with a language like PHP is that, barring some creative caching, every line of every included file must be tokenized, zipped through a parse tree and turned into meaningful instructions every time the hosting page is requested. Compiled platforms like .NET and Java do not suffer from this performance killer.
Also, since one of the other posters mentioned MVC as a way to keep files short: good code organization is a function of experience and common sense and is in no way tied to any particular pattern or architecture. MVC is interesting, but isn't a solution to this problem.

Do you need to focus on the number of lines? No, not necessarily. Just make sure your code is organized, efficient, and not unnecessarily verbose.

It really doesn't matter, so long as you have documented your code properly, modularised as much as possible, and checked for any inefficiencies. You may well have a 10,000 line file. Although I usually split at around 500-1000 for each section of an application.

2k lines sound too much to me... Though it depends what code style you are following, e.g. many linebreaks, many little functions or good api-contract comments can increase the size though they are good practice. Also good code formatting can increase lines.
Regarding PHP it would be good to know: is it 2k lines with just one class, or just one big include with non-OOP PHP code? Is it mixed with template statements and program logic (as I often find in PHP code)?
Usually I don't count lines to decide when to split. These things have simply become habit. If code gets confusing, I react and refactor. Still, having looked into some code we wrote recently as a team, I can see some patterns:
extract function/method if size is bigger than 20LOC (without comments) and usage of if/else clauses
extract to another class if size >200-300LOC
extract to another package/folder if artifacts >10
Still, it depends on the kind of code I have. For instance, if loads of logic are involved (if/else/switch/for), the LOC per function decreases. If there is hardly any logic involved (simple, single-path statements), the limits increase. In the end the most important rule is: would a human understand the code? Will she or he be able to read it well?

I don't know any useful way to split code that's that simple, particularly if it all belongs together semantically.
It is probably more interesting to think about whether you can eliminate some of the code by refactoring. For example, if you often use a particular combination of checks with slightly different variables, it might help to outsource the combination of checks into a function and call it wherever appropriate.
I remember seeing a project once that was well-written for the most part, but it had a problem of that kind. For example, the code for parsing its configuration file was duplicated like this:
if (file_exists("configfile")) {
/* tons of code here */
} else if (file_exists("/etc/configfile")) {
/* almost the same code again */
}
That's an extreme example but you get the idea.
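One way the duplicated branches above could be collapsed, as a rough sketch: decide which file to use first, then run the parsing code once. The function names and the "key = value" file format are assumptions made for the example, not details of the project described.

```php
<?php
/** Returns the first path from $candidates that exists, or null. */
function findConfigFile(array $candidates): ?string
{
    foreach ($candidates as $path) {
        if (file_exists($path)) {
            return $path;
        }
    }
    return null;
}

/** The parsing code now exists exactly once. The "key = value" format is assumed. */
function parseConfigFile(string $path): array
{
    $settings = [];
    foreach (file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        if (strpos($line, '=') === false) {
            continue; // ignore lines without a key/value separator
        }
        [$key, $value] = explode('=', $line, 2);
        $settings[trim($key)] = trim($value);
    }
    return $settings;
}

// Same lookup order as the duplicated if / else if chain.
$path = findConfigFile(['configfile', '/etc/configfile']);
$config = $path !== null ? parseConfigFile($path) : [];
```

Adding a third location then means adding one array element instead of a third copy of the parser.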

How to fix an old coding style php script

Is there any advice on how to start fixing an old-fashioned-style php script?
A few days ago I received an offer for developing an old PHP project, and by old-fashioned I mean the structure did not use OOP coding method and it doesn't have a definite framework.
I am confused on where to start, and wanted to know what methods there are for developing an old script.
Note: They don't want to spend lots of money on starting a new project.
So what methods would you suggest for updating an old php script?

Joel Spolsky writes:
"[Netscape made] the single worst strategic mistake that any software company can make: They decided to rewrite the code from scratch."
So, whatever your course of action, priority is to work with the existing code. Refactoring will be one of the best methods you can use.
What can you not do, if the code base is not updated, that you absolutely must? How much and of what particularly do you need to upgrade for that action to be possible? Consider these two questions.

It depends what you mean by "old". Old as in written for PHP 4? Or old as in non-OOP? (Or both?)
Old as in PHP4:
As long as you sift through it and either suppress warnings or actually fix deprecated function calls everything should be fine. This is just simply boring work. Easy and cheap.
Old as in non-OOP:
One could theoretically develop a very stable and scalable app without OOP or a definite MVC (or other) framework. As a matter of fact, if the app is small in scale, there's no reason to add the spaghetti and meatball complexity of OOP or a framework. Re-writing everything in OOP with some framework is hard and expensive. And quite probably overkill.

Can you give us more detail, perhaps an example.
Even procedural code has elements of OOP in it. You can identify variables and procedures that relate to the same entity. You could go about rewriting it, but they're going to have a hard time finding value in it, especially if they are frugal, as you suggested.

When I do this, it's a multi-step process. Typically, there's an existing product to keep running. Rewriting from scratch is rarely an option, even though you end up doing it eventually.
Begin to ditch manual include statements and implement an autoloader, where possible (takes many passes)
Create a helper script to simulate magic quotes & register globals. This is so you can turn it off in PHP, while keeping the existing code running
Gradually remove excessive stripslashes or addslashes calls, if applicable. The helper script allows you to do this per file.
Ensure that your variables have proper scoping
Separate out your presentation code. Consider Smarty or alternate template system
Move the DB calls to PDO and use parameter substitution for everything
Look at the code and think about stubbing out a front controller
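The PDO step in the list above might look roughly like this. It is a sketch under assumptions: an in-memory SQLite database (which requires the pdo_sqlite extension) stands in for the real one, and the `users` table and `findUserIdsByName` function are invented for the example.

```php
<?php
// In-memory SQLite keeps the sketch self-contained; a real project would use its own DSN.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)');

// Legacy style (avoid): "... WHERE name = '" . addslashes($name) . "'"
// PDO style: the value travels separately from the SQL text.
function findUserIdsByName(PDO $pdo, string $name): array
{
    $stmt = $pdo->prepare('SELECT id FROM users WHERE name = :name ORDER BY id');
    $stmt->execute([':name' => $name]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

$insert = $pdo->prepare('INSERT INTO users (name) VALUES (:name)');
foreach (['alice', "o'brien"] as $name) {
    $insert->execute([':name' => $name]);   // the quote in o'brien needs no escaping
}

$ids = findUserIdsByName($pdo, "o'brien");
```

Because no value is ever concatenated into the SQL string, the addslashes and stripslashes calls from the earlier steps become unnecessary for queries converted this way.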
I then look at the project and determine how I'm going to alter the logic itself. Often, if there are no functions at all, my first pass is to wrap common behaviors into static methods. Get as much reuse without too much effort, so I'm not concerned with organization yet.
After the redundancy is reduced, then I get to organization. It's at this phase that I start planning out my class models and refactoring the functions into clean methods. This is also the time for automated tests (phpunit). Once I'm reasonably confident, I add some controllers and integrate the templates, then I'm done... barring one or two more passes.
For me, it's all about identifying where I am, where I want to be, and making a plan that can be executed in several small steps. Everybody has their own objectives, so there's no magic plan to follow except your own.

Perhaps your code right now looks like this
And you want it to look like this
Well, if it's just a script and not the whole project, I would convert it to an OOP coding standard.

Read their code. Talk to them.
Look at the requested change in terms of the existing code. Talk to them.
Decide how little of it you can change to do what they want. Talk to them.
Do that. Talk to them.
When they ask for functionality that can be more easily done by re-writing than by modifying, do that.
Work with an IDE that can assist with refactoring.

Tools for cleaning garbage PHP

I just inherited a 70k line PHP codebase that I now need to add enhancements onto. I've seen worse, at least this codebase uses an MVC architecture and is object oriented. However, there is no templating system and many classes are deprecated - only being called once. I think my method might be the following:
Find all the files on the live server that have not been touched in 48 hours and make them candidates for deletion (luckily there is a live server).
Implement a template system (Smarty) and try to find duplicate code in the templates.
A lot of the methods have copied and pasted code ... I don't know how much I want to mess with it.
My questions are: Are there steps that I should take or you would take? What is your method for dealing with this? Are there tools to help find duplicate PHP code?

Find all the files on the live server that have not been touched in 48 hours and make them candidates for deletion (luckily there is a live server)
By "touched" I'm assuming you'll stat the file to see if it's been accessed by any part of the system. I'd go a month and a half on this rather than 48 hours. In older PHP code bases you'll often find there's a bunch of code lying around that gets called via a local cron job once a week or once a month, or a third party is calling it remotely as a pseudo-service on a regular basis. By waiting 6 weeks you'll be more likely to catch any and all files that are being called.
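A rough sketch of the "stat the files" idea, assuming the server's filesystem actually records access times (a `noatime` mount, or an opcode cache that serves scripts without reading them, would make the numbers meaningless). The function name and the 42-day threshold are illustrative choices.

```php
<?php
/**
 * Lists .php files under $root whose last access time is older than $days days.
 * Only produces candidates for review; it does not delete anything.
 */
function findStalePhpFiles(string $root, int $days, ?int $now = null): array
{
    $now = $now ?? time();
    $cutoff = $now - $days * 86400;
    $stale = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile()
            && $file->getExtension() === 'php'
            && $file->getATime() < $cutoff
        ) {
            $stale[] = $file->getPathname();
        }
    }
    sort($stale);
    return $stale;
}

// Example call (path is hypothetical): six weeks, as suggested above.
// print_r(findStalePhpFiles('/var/www/app', 42));
```

Treat the result as a list of candidates to investigate, not as proof that a file is dead.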
Implement a template system (Smarty) and try to find duplicate code in the templates.
Why? Serious question, is there a reason to implement a template system? (non-PHP savvy designers, developers who get you into trouble by including too much logic in the Views, or you're the one creating templates, and you know you work much faster in smarty than in PHP). If not, avoid it and just use PHP.
Also, how realistic is it to implement a pure Smarty template system? I'd give favorable odds that old PHP systems like this are going to have a ton of "business logic" mixed in with their views that can't be implemented in pure Smarty, and if you allow mixed PHP/Smarty your developers will use PHP every time.
A lot of the methods have copied and pasted code ... I don't know how much I want to mess with it.
I don't know of any code analysis tools that will do this out of the box, but it should be possible to whip something up with the tokenizer functions.
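As a rough illustration of what "whipping something up with the tokenizer functions" could mean: the sketch below normalizes a source string with `token_get_all` and reports places where the same run of tokens starts more than once. The function names, the window size of 12 tokens and the choice to unify variable names are all assumptions of this example, and overlapping matches are reported as separate groups.

```php
<?php
/**
 * Reduces PHP source to a list of [normalized token text, line] pairs.
 * Whitespace and comments are dropped; variable names are unified so that
 * copies with renamed variables still match.
 */
function normalizedTokens(string $source): array
{
    $out = [];
    $currentLine = 1;
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            $out[] = [$token, $currentLine];   // single-character token such as ";" or "{"
            continue;
        }
        [$id, $text, $line] = $token;
        $currentLine = $line + substr_count($text, "\n"); // line on which this token ends
        if (in_array($id, [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT, T_OPEN_TAG], true)) {
            continue;
        }
        $out[] = [$id === T_VARIABLE ? '$v' : $text, $line];
    }
    return $out;
}

/**
 * Returns groups of line numbers at which an identical run of $window
 * normalized tokens starts. Each group is one duplication candidate.
 */
function findRepeatedRuns(string $source, int $window = 12): array
{
    $tokens = normalizedTokens($source);
    $seen = [];
    for ($i = 0; $i + $window <= count($tokens); $i++) {
        $slice = array_slice($tokens, $i, $window);
        $key = md5(implode(' ', array_column($slice, 0)));
        $seen[$key][] = $tokens[$i][1];
    }
    return array_values(array_filter($seen, fn($lines) => count($lines) > 1));
}
```

Calling `findRepeatedRuns(file_get_contents($path))` on a single file gives raw candidates only; a dedicated clone detector does far more filtering and works across files.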
What You Should Really Do
I don't want to dissuade or demoralize you, but why do you want to clean up this code? Right now it's doing what it's supposed to do. Stupidly, but it's doing it. Every re-factoring project is going to put current, undocumented, possibly business-critical functionality at risk, and at the end of that work you have an application that's doing the exact same thing. It's 70k lines of what sounds like shoddy code that only you care about fixing, no matter what other people are telling you their priorities are. If their priority was clean code, their code would already be clean. One person can't change a culture. Unless there's a straightforward business case for that code to be cleaned (open sourcing the project as a business strategy?), that legacy code isn't going anywhere.
Here's a different set of priorities to consider with legacy PHP applications
Is there a singleton database object or pair of objects that allows developers to easily set up separate connections for read (slave) and write (master)? A lot of legacy PHP applications will instantiate multiple connections to the same database in a single page call, which is a performance nightmare.
Is there a straight forward way for developers to avoid SQL injection? Give this to them for new code (parameterized SQL), and consider fixing legacy SQL to use this new method, but also consider security steps you can take on the network level.
Get a test framework of some kind wrapped around all the legacy code and treat it as a black-box. Use those tests to create a centralized API developers can use in place of the myriad function calls and copy/paste code they've been using.
Develop a centralized system for configuration values, most legacy PHP code is some awful combination of defines and class constants, which means any config changes mean a code push, which means potential DOOM.
Develop a lint that's hooked into the source control system to enforce code sanity for all new code, not just for style, but to make sure that business logic stays out of the view, that the SQL is being constructed in a safe way, that those old copy/paste libraries aren't being used, etc.
Develop a sane, trackable build and/or push system and stop people from hacking on code live in production.

I don't know of any specific tools, but I have worked on re-factoring some fairly large PHP projects.
I would recommend a templating system, either Smarty or a strict PHP system that is clearly explained to anybody working on the project.
Take discrete, manageable sections and re-factor on a regular basis (e.g., this week, I'm going to re-write this). Don't bite off more than you can chew and don't plan to do a full rewrite.
Also, I do regular code searches (I use Eclipse and search through the files in my project) on suspect functions and files. Some people are too scared to make big changes, but I would rather err on the bold side rather than accept messy and poorly organized code. Just be prepared to test, test, test!

You need to identify a solid reason for refactoring. Removing duplicate code is not really a very good one; it needs to be coupled with a real desired improvement, such as reducing memory footprint (useful if the webservers are struggling).
Once you have that in mind, now you can start refactoring. And make sure you have a version-control repository, too. Just don't check in broken code.
Don't be too hasty about single-use classes. A lot of small PHP frameworks work like that. Often they could be abstracted better, though. Also, A lot of PHP code also doesn't understand data layer abstraction with the result that there is SQL code littered through the business logic or even the display code. This problem is often coupled with no custom database handler, which is a problem if you suddenly have to teach it about replication, or caching. This is the same abstraction problem from the other direction.
One very practical step: once you start abstracting repeated code away, you'll find reasons to have multiple files open. If you're using a shell and a Unix editor, then screen will help you immensely.
