Identifying repeated code within PHP project

Identifying repeated code within PHP project - php

I have a single PHP file within a legacy project that is at least a few thousand lines long. It is predominantly separated up into a number of different conditional blocks by a switch statement with about 10 cases. Within each case there is what appears to be a very similar - if not exact duplicate - block of code. What methods are available for me identifying these blocks of code as being either the same - or close to the same - so I can abstract that code out and begin to refactor the entire file? I know this is possible in very manual terms (separate each case statement in the code into individual files and Diff) but i'm interested in what tools i could be using to speed this process up.
Thanks.

You can use phpcpd.
phpcpd is a Copy/Paste Detector (CPD) for PHP code. It scans a PHP project for duplicated code.
Further resources:
http://qualityassuranceinphpprojects.com/pages/tools.html

You can use phpunit PMD (Project Mess Detector) to detect duplicated blocks of code.
It also can compute the Cyclomatic complexity of your code.
Here is a screenshot of the pmd tab in phpuc:

See our PHP Clone Detector tool.
This finds both exact copies and near misses, in spite of reformatting, insertion/deletion of comments, replacement of variable names, addition/replacments of subblocks etc.
PHPCPD as far as I can tell finds only (token) sequences which are exactly the same. That misses a lot of clones, since the most common operation after copy-paste is edit-to-customize. So it would miss the very clones the OP is trying to find.

You could put the blocks in separate files and just run diff on them?
However, I think in the end you will need to go through everything manually anyway, since it sounds like this code requires a lot of refactoring, and even if there are differences you will probably need to evaluate whether this is intentional or a bug.

Related

List or single line array/parameters: Does one or the other perform better?

A little bit of a generic question but it has been playing on my mind for a while.
Whilst learning php coding, to help me create a WordPress Theme from scratch, I have noticed that some arrays/parameters are kept to a single line whilst others are listed underneath one an other. Personally, I prefer listing the arrays underneath one and other as I feel this helps with readability and generally just looks tidier - Especially, if the array is long.
Does anyone know if listing arrays/parameters have any performance 'ill effects' such as slowing down the page load speed etc? As far as I can see, it is just a coder's preference. Is this a correct assumption?

Code formatting has no effect on performance.
Even if you claim that a larger file takes longer to read, if you are using at least PHP 5.5 then PHP will use an opcode cache - it will cache how it parsed your files for subsequent requests, eliminating any formatting that you have in your file.

php speed with if statements

So I'm working on a project written in old-style (no OOP) PHP with no full rewrite in the near future. One of the problems with it currently is that its slow—much of the time is spent requireing over 100 files based on where it is in the boot process.
I was wondering if we could condense this (on deployment, not development of course) into a single file or two with all the require'd text just built in. However, since there are so many lines of code that aren't used for each page, I'm wondering if doing this would backfire.
At its core, I think, it's a question of whether:
<?php
echo 'hello world!';
?>
is any faster than
<?php
if(FALSE) {
// thousands of lines of code here
}
echo 'hello world!';
?>
And if so, how much slower?
(Also, if what I've outlined above is a bad idea for some other reasons, please let me know.)

The difference between the two will be negligible. If most of the execution time is currently spent requiring files you're likely to see a significant boost by using an optcode cache like APC, if you are not already.
Other than that - benchmark, find out exactly where the bottlenecks are. In my experience requires are often the slowest part of an old-style procedural PHP app, but even with many included files I'd be surprised if these all added up to a 'slow' app.
Edit: ok, a quick 'n dirty benchmark. I created three 'hello world' PHP scripts like the example. The first (basic.php) was just echoing the string. The second (complex.php) included an if false statement that contained ~5000 lines of PHP code pasted in from another app. The third (require.php) included the same if statement but required in the ~5000 lines of code from another file.
Page generation time (as measured by microtime()) between basic.php and complex.php was around ~0.000004 seconds, so really not significant. Some more comprehensive results from apache bench:
without APC with APC
req/sec avg (ms) req/sec avg (ms)
basic.php: 7819.87 1.277 6960.49 1.437
complex.php: 346.82 2.883 352.12 2.840
require.php: 6819.24 1.446 5995.49 1.668
APC's not doing a lot here but using up memory, but it's likely to be a different picture in a real world app.

require does have some overhead. 100 requires is probably a lot. Parsing an entire file that has the 100 includes is probably slow too. The overhead from require might cost you more, but it is hard to say. It might not cost you enough.
All benchmarks are evil, but here is what I did:
ran a single include of a file that was about 8000 lines (didn't do anything useful each line, just declares a variable). Compared to the time it takes to run an include of an 80 line file (same declarations) 100 times. Results were inconclusive.
Is the including of the files really causing the problem? Is there not something in the script execution that can be optimized? Caching may be an option..

Keep in mind that PHP will parse all the code it sees, even if it's not run.
It will still take relatively long to process the a file too, and from experience, lots of code will eat up considerable amounts of memory even though they're not executed.
Opcode caching as suggested by #Tim should be your first port of call.
If that is out of the question (e.g. due to server limitations): If the functions are somehow separable into categories, one possibility to make things a bit faster and lighter could be (ab)using PHP's Autoloading by putting the functions into separate files as methods of static classes.
function xyz() { ... }
would become
class generic_tools
{
public static function xyz() { ... }
}
and any call to xyz() is replaced by generic_tools::xyz();
The call would then trigger the inclusion of (e.g.) generic_tools.class.php on demand, instead of including everything at once.
This would require rewriting the function calls to static method calls, which may be dead easy or a bit more difficult (if function calls are cooked up dynamically or something). But beyond that, no refactoring would be needed, because you're not really using any OOP mechanisms.
How much this will actually help strongly depends on the app's architecture and how intertwined the functions are with each other.

Which is better performance in PHP?

I generally include 1 functions file into the hader of my site, now this site is pretty high traffic and I just like to make every little thing the best that I can, so my question here is,
Is it better to include multiple smaller function type files with just the code that's needed for that page or does it really make no difference to just load it all as 1 big file, my current functions file has all the functions for my whole site, it's about 4,000 lines long and is loaded on every single page load sitewide, is that bad?

It's difficult to say. 4,000 lines isn't that large in the realms of file parsing. In terms of code management, that's starting to get on the unwieldy side, but you're not likely to see much of a measurable performance difference by breaking it up into 2, 5 or 10 files, and having pages include only the few they need (it's better coding practice, but that's a separate issue). Your differential in number-of-lines read vs. number-of-files that the parser needs to open doesn't seem large enough to warrant anything significant. My initial reaction is that this is probably not an issue you need to worry about.
On the opposite side of the coin, I worked on an enterprise-level project where some operations had an include() tree that often extended into the hundreds of files. Profiling these operations indicated that the time taken by the include() calls alone made up 2-3 seconds of a 10 second load operation (this was PHP4).

If you can install extensions on your server, you should take a look at APC (see also).
It is free, by the way ;-) ; but you must be admin of your server to install it ; so it's generally not provided on shared hosting...
It is what is called an "opcode cache".
Basically, when a PHP script is called, two things happen :
the script is "compiled" into opcodes
the opcodes are executed
APC keeps the opcodes in RAM ; so the file doesn't have to be re-compiled each time it is called -- and that's a great thing for both CPU-load and performances.
To answer the question a bit more :
4,000 lines is not that much, speaking of performances ; Open a couple of files of any big application / Framework, and you'll rapidly get to a couple thousand of lines
a really important thing to take into account is maintenability : what will be easier to work with for you and your team ?
loading many small files might imply many system calls, which are slow ; but those would probably be cached by the OS... So probably not that relevant
If you are doing even 1 database query, this one (including network round-trip between PHP server and DB server) will probably take more time than the parsing of a couple thousand lines ;-)

I think it would be better if you could split the functions file up into components that is appropriate for each page; and call for those components in the appropriate pages. Just my 2 cents!
p/s: I'm a PHP amateur and I'm trying my hands on making a PHP site; I'm not using any functions. So can you enlighten me on what functions would you need for a site?

In my experience having a large include file which gets included everywhere can actually kill performance. I worked on a browser game where we had all game rules as dynamically generated PHP (among others) and the file weighed in at around 500 KiB. It definitely affected performance and we considered generating a PHP extension instead.
However, as usual, I'd say you should do what you're doing now until it is a performance problem and then optimize as needed.

If you load a 4000 line file and use maybe 1 function that is 10 lines, then yes I would say it is inefficient. Even if you used lots of functions of a combined 1000 lines, it is still inefficient.
My suggestion would be to group related functions together and store them in separate files. That way if a page only deals with, for example, database functions you can load just your database functions file/library.
Anothe reason for splitting the functions up is maintainability. If you need to change a function you need to find it in your monalithic include file. You may also have functions that are very, very similar but don't even realise it. Sorting functions by what they do allows you to compare them and get rid of things you don't need or merge two functions into one more general purpose function.

Most of the time Disc IO is what will kill your server so I think the lesser files you fetch from disc the better. Furthermore if it is possible to install APC then the file will be stored compiled into memory which is a big win.

Generally it is better, file management wise, to break stuff down into smaller files because you only need to load the files that you actually use. But, at 4,000 lines, it probably won't make too much of a difference.
I'd suggest a solution similar to this
function inc_lib($name)
{
include("/path/to/lib".$name.".lib.php");
}
function inc_class($name)
{
include("/path/to/lib".$name.".class.php");
}

How to debug, and protect against, infinite loops in PHP?

I recently ran up against a problem that challenged my programming abilities, and it was a very accidental infinite loop. I had rewritten some code to dry it up and changed a function that was being repeatedly called by the exact methods it called; an elementary issue, certainly. Apache decided to solve the problem by crashing, and the log noted nothing but "spawned child process". The problem was that I never actually finished debugging the issue that day, it cropped up in the afternoon, and had to solve it today.
In the end, my solution was simple: log the logic manually and see what happened. The problem was immediately apparent when I had a log file consisting of two unique lines, followed by two lines that were repeated some two hundred times apiece.
What are some ways to protect ourselves against infinite loops? And, when that fails (and it will), what's the fastest way to track it down? Is it indeed the log file that is most effective, or something else?
Your answer could be language agnostic if it were a best practice, but I'd rather stick with PHP specific techniques and code.

You could use a debugger such as xdebug, and walk through your code that you suspect contains the infinite loop.
Xdebug - Debugger and Profiler Tool for PHP
You can also set
max_execution_time
to limit the time the infinite loop will burn before crashing.

I sometimes find the safest method is to incorporate a limit check in the loop. The type of loop construct doesn't matter. In this example I chose a 'while' statement:
$max_loop_iterations = 10000;
$i=0;
$test=true;
while ($test) {
if ($i++ == $max_loop_iterations) {
echo "too many iterations...";
break;
}
...
}
A reasonable value for $max_loop_iterations might be based upon:
a constant value set at runtime
a computed value based upon the size of an input
or perhaps a computed value based upon relative runtime speed
Hope this helps,
- N

Unit tests might be a good idea, too. You might want to try PHPUnit.

can you trace execution and dump out a call graph? infinite loops will be obvious, and you can easily pick out the ones that are on purpose (a huge mainloop) vs an accident (local abababa... loop or loop that never returns back to mainloop)
i've heard of software that does this for large complex programs, but i couldn't find a screenshot for you. Someone traced a program like 3dsmax and dumped out a really pretty call graph.

write simple code, so bugs will be more apparent? I wonder if your infinite loop is due to some ridiculous yarn-ball of control structures that no human being can possibly understand. Thus of course you f'ed it up.
all infinite loops aren't bad (e.g. while (!quit)).

Singular Value Decomposition (SVD) in PHP

I would like to implement Singular Value Decomposition (SVD) in PHP. I know that there are several external libraries which could do this for me. But I have two questions concerning PHP, though:
1) Do you think it's possible and/or reasonable to code the SVD in PHP?
2) If (1) is yes: Can you help me to code it in PHP?
I've already coded some parts of SVD by myself. Here's the code which I made comments to the course of action in. Some parts of this code aren't completely correct.
It would be great if you could help me. Thank you very much in advance!

SVD-python
Is a very clear, parsimonious implementation of the SVD.
It's practically psuedocode and should be fairly easy to understand
and compare/draw on for your php implementation, even if you don't know much python.
SVD-python
That said, as others have mentioned I wouldn't expect to be able to do very heavy-duty LSA with php implementation what sounds like a pretty limited web-host.
Cheers
Edit:
The module above doesn't do anything all by itself, but there is an example included in the
opening comments. Assuming you downloaded the python module, and it was accessible (e.g. in the same folder), you
could implement a trivial example as follow,
#!/usr/bin/python
import svd
import math
a = [[22.,10., 2., 3., 7.],
[14., 7.,10., 0., 8.],
[-1.,13.,-1.,-11., 3.],
[-3.,-2.,13., -2., 4.],
[ 9., 8., 1., -2., 4.],
[ 9., 1.,-7., 5.,-1.],
[ 2.,-6., 6., 5., 1.],
[ 4., 5., 0., -2., 2.]]
u,w,vt = svd.svd(a)
print w
Here 'w' contains your list of singular values.
Of course this only gets you part of the way to latent semantic analysis and its relatives.
You usually want to reduce the number of singular values, then employ some appropriate distance
metric to measure the similarity between your documents, or words, or documents and words, etc.
The cosine of the angle between your resultant vectors is pretty popular.
Latent Semantic Mapping (pdf)
is by far the clearest, most concise and informative paper I've read on the remaining steps you
need to work out following the SVD.
Edit2: also note that if you're working with very large term-document matrices (I'm assuming this
is what you are doing) it is almost certainly going to be far more efficient to perform the decomposition
in an offline mode, and then perform only the comparisons in a live fashion in response to requests.
while svd-python is great for learning, the svdlibc is more what you would want for such heavy
computation.
finally as mentioned in the bellegarda paper above, remember that you don't have to recompute the
svd every single time you get a new document or request. depending on what you are trying to do you could
probably get away with performing the svd once every week or so, in an offline mode, a local machine,
and then uploading the results (size/bandwidth concerns notwithstanding).
anyway good luck!

Be careful when you say "I don't care what the time limits are". SVD is an O(N^3) operation (or O(MN^2) if it's a rectangular m*n matrix) which means that you could very easily be in a situation where your problem can take a very long time. If the 100*100 case takes one minute, the 1000*1000 case would 10^3 minutes, or nearly 17 hours (and probably worse, realistically, as you're likely to be out of cache). With something like PHP, the prefactor -- the number multiplying the N^3 in order to calculate the required FLOP count, could be very, very large.
Having said that, of course it's possible to code it in PHP -- the language has the required data structures and operations.

I know this is an old Q, but here's my 2-bits:
1) A true SVD is much slower than the calculus-inspired approximations used, eg, in the Netflix prize. See: http://www.sifter.org/~simon/journal/20061211.html
There's an implementation (in C) here:
http://www.timelydevelopment.com/demos/NetflixPrize.aspx
2) C would be faster but PHP can certainly do it.
PHP Architect author Cal Evans: "PHP is a web scripting language... [but] I’ve used PHP as a scripting language for writing the DOS equivalent of BATCH files or the Linux equivalent of shell scripts. I’ve found that most of what I need to do can be accomplished from within PHP. There is even a project to allow you to build desktop applications via PHP, the PHP-GTK project."

Regarding question 1: It definitely is possible. Whether it's reasonable depends on your scenario: How big are your matrices? How often do you intend to run the code? Is it run in a web site or from the command line?
If you do care about speed, I would suggest writing a simple extension that wraps calls to the GNU Scientific Library.

Yes it's posible, but implementing SVD in php ins't the optimal approach. As you can see here PHP is slower than C and also slower than C++, so maybe it was better if you could do it in one of this languages and call them as a function to get your results. You can find an implementation of the algorithm here, so you can guide yourself trough it.
About the function calling can use:
The exec() Function
The system function is quite useful and powerful, but one of the biggest problems with it is that all resulting text from the program goes directly to the output stream. There will be situations where you might like to format the resulting text and display it in some different way, or not display it at all.
The system() Function
The system function in PHP takes a string argument with the command to execute as well as any arguments you wish passed to that command. This function executes the specified command, and dumps any resulting text to the output stream (either the HTTP output in a web server situation, or the console if you are running PHP as a command line tool). The return of this function is the last line of output from the program, if it emits text output.
The passthru() Function
One fascinating function that PHP provides similar to those we have seen so far is the passthru function. This function, like the others, executes the program you tell it to. However, it then proceeds to immediately send the raw output from this program to the output stream with which PHP is currently working (i.e. either HTTP in a web server scenario, or the shell in a command line version of PHP).

Yes. this is perfectly possible to be implemented in PHP.
I don't know what the reasonable time frame for execution and how large it can compute.
I would probably have to implement the algorithm to get a rought idea.
Yes I can help you code it. But why do you need help? Doesn't the code you wrote work?
Just as an aside question. What version of PHP do you use?

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.