I have a piece of code here about which I need either reassurance or a firm "no no no!": am I thinking about this in the right way, or in entirely the wrong way?
It deals with cutting a variable of binary data at a specific spot, while also dealing with multibyte-overloaded functions. For example, substr is actually mb_substr, strlen is mb_strlen, and so on.
Our server's internal encoding is set to UTF-8, so there's this weird little thing I do to circumvent it for this binary data manipulation:
// $binary_data is the incoming variable with binary
// $clip_size is generally 16, 32 or 64 etc.
$curenc = mb_internal_encoding(); // this should be "UTF-8"
mb_internal_encoding('ISO-8859-1'); // change so mb_ overloading doesn't screw this up
if (strlen($binary_data) >= $clip_size) {
    $first_hunk = substr($binary_data, 0, $clip_size);
    $rest_of_it = substr($binary_data, $clip_size);
} else {
    // skip, since it's shorter than expected
}
mb_internal_encoding($curenc); // put this back now
I can't really show input and output results, since it's binary data. But tests using the above appear to be working just fine, and nothing is breaking...
However, parts of my brain are screaming "what are you doing... this can't be the way to handle this"!
Notes:
The binary data coming in is a concatenation of those two parts to begin with.
The first part's size is always known (but changes).
The second part's size is entirely unknown.
This is pretty darn close to encryption, where you stuff the IV on the front and rip it off again (and oddly, I found some old code which does this same thing, lol ugh).
So, I guess my question is:
Is this actually fine to be doing?
Or is there something super obvious I'm overlooking?
However, parts of my brain are screaming "what are you doing... this can't be the way to handle this"!
Your brain is right, you shouldn't be doing that in PHP in the first place. :)
Is this actually fine to be doing?
It depends on the purpose of your code.
I can't see any reason off the top of my head to cut binary data like that. So my first instinct would be "no no no!": use unpack() to properly parse the binary into usable variables.
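A minimal sketch of the unpack() route, assuming a fixed $clip_size of 16 (the format string wants a literal length; "a" reads raw bytes unmodified on PHP 5.5+, and "a*" takes the remainder):
$parts = unpack('a16first_hunk/a*rest_of_it', $binary_data);
$first_hunk = $parts['first_hunk']; // the known-size lead part
$rest_of_it = $parts['rest_of_it']; // everything that is left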
That being said, if you just need to split your binary for whatever reason, then I guess this is fine. As long as your tests confirm that the code is working for you, I can't see any problem.
As a side note, this kind of use case is exactly why I don't use mbstring overloading: sooner or later you need the default string functions.
MY SOLUTION TO THE WORRY
I dislike answering my own questions... but I wanted to share what I decided on nonetheless.
Although what I had "worked", I still wanted to replace the hack job of altering the charset encoding. It is old code, I admit, but for some reason I had never looked at bin2hex/hex2bin for doing this. So I decided to change it to use those.
The resulting new code:
// $clip_size remains the same value for continuity later,
// only spot-adjusted here... which is why the *2.
$hex_data   = bin2hex($binary_data);
$first_hunk = hex2bin(substr($hex_data, 0, $clip_size * 2));
$rest_of_it = hex2bin(substr($hex_data, $clip_size * 2));
if (!empty($rest_of_it)) { /* process the result for reasons */ }
Using the hex functions turns the mess into something mb_ will not screw with either way. A one-million-iteration benchmark loop showed the process isn't anything to be worried about (and it's safer to run in parallel with itself than the mb_internal_encoding mangle method).
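For the curious, the benchmark can be as simple as this sketch (not the exact harness; timings vary by machine):
$start = microtime(true);
for ($i = 0; $i < 1000000; $i++) {
    $hex_data   = bin2hex($binary_data);
    $first_hunk = hex2bin(substr($hex_data, 0, $clip_size * 2));
    $rest_of_it = hex2bin(substr($hex_data, $clip_size * 2));
}
printf("%.3f seconds\n", microtime(true) - $start);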
So I'm going with this. It sits better in my mind, and resolves my question for now... until I revisit this old code again in a few years and go "what was I thinking?!".
Related
Is there any reason for this behavior/implementation? Example:
$array = array("index_of_an_array" => "value");
class Foo {
private $index_of_an_array;
function __construct() {}
}
$foo = new Foo();
$array = (array)$foo;
$key = str_replace("Foo", "", array_keys($array)[0]);
echo $array[$key];
Gives us an error in which the index name is missing entirely:
NOTICE Undefined index: on line number 9
Example #2:
echo date("Y\0/m/d");
Outputs:
2016
BUT! echo or var_dump(), for example, and some other functions, will output the string "as is"; the \0 bytes are merely being hidden by browsers.
$string = "index-of\0-an-array";
$string2 = "Y\0/m/d";
echo $string;
echo $string2;
var_dump($string);
var_dump($string2);
Outputs:
index-of-an-array
"Y/m/d"
string(18) "index-of-an-array"
string(6) "Y/m/d"
Notice that $string's length is 18, but only 17 characters are shown.
EDIT
From possible duplicate and php manual:
The key can either be an integer or a string. The value can be of any type.
Strings containing valid integers will be cast to the integer type. E.g. the key "8" will actually be stored under 8. On the other hand "08" will not be cast, as it isn't a valid decimal integer. So in short, any string can be a key. And a string can contain any binary data (up to 2GB). Therefore, a key can be any binary data (since a string can be any binary data).
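A quick sketch demonstrating that (any byte sequence, \0 included, round-trips as a key):
$key = "any\0binary\x80data";
$array = array($key => "value");
var_dump(isset($array[$key]));           // bool(true)
var_dump(strlen(array_keys($array)[0])); // int(15), all bytes kept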
From php string details:
There are no limitations on the values the string can be composed of;
in particular, bytes with value 0 (“NUL bytes”) are allowed anywhere
in the string (however, a few functions, said in this manual not to be
“binary safe”, may hand off the strings to libraries that ignore data
after a NUL byte.)
But I still do not understand why the language is designed this way. Are there reasons for this behavior/implementation? Why doesn't PHP handle input as binary-safe everywhere, rather than just in some functions?
From comment:
The reason is simply that many PHP functions like printf use the C library's implementation behind the scenes, because the PHP developers were lazy.
Aren't those such as echo, var_dump, and print_r? In other words, functions that output something. They are in fact binary-safe, if we take a look at my first example. It makes no sense to me to implement some binary-safe and some binary-unsafe functions for output. Or to just use some as they are from the C standard library and write some completely new functions.
The short answer to "why" is simply history.
PHP was originally written as a way to script C functions so they could be called easily while generating HTML. Therefore PHP strings were just C strings, i.e. an arbitrary sequence of bytes. So in modern PHP terms we would say nothing was binary-safe, simply because it wasn't planned to be anything else.
Early PHP was not intended to be a new programming language, and grew organically, with Lerdorf noting in retrospect: "I don’t know how to stop it, there was never any intent to write a programming language […] I have absolutely no idea how to write a programming language, I just kept adding the next logical step on the way."
Over time the language grew to support more elaborate string-processing functions, many taking the string's specific bytes into account and becoming "binary-safe". According to the recently written formal PHP specification:
As to how the bytes in a string translate into characters is unspecified. Although a user of a string might choose to ascribe special semantics to bytes having the value \0, from PHP's perspective, such null bytes have no special meaning. PHP does not assume strings contain any specific data or assign special values to any bytes or sequences.
As a language that has grown organically, there hasn't been a move to universally treat strings in a manner different from C. Therefore functions and libraries are binary-safe on a case-by-case basis.
First Example from the Question
Your first example is confusing because it's the error message that is terminating at the null character, not the string being handled incorrectly by the array. The original code you posted, with the error message, follows:
$array = array("index-of-an-array" => "value");
$string = "index-of\0-an-array";
echo $array[$string];
Notice: Undefined index: index-of in
Note that the error message above has been truncated to index-of due to the null character. The array is working as expected, because if you try it this way it will work just fine:
$array = array("index-of\0-an-array" => "value");
$string = "index-of\0-an-array";
echo $array[$string];
The error message correctly identified that the two keys differ, which they do:
"index-of\0-an-array" != "index-of-an-array"
The problem is just that the error message printed everything only up to the null character. That, arguably, might be considered a bug by some.
The second example starts to plumb the depths of PHP :)
I've added some code to it so we can see what's happening
<?php
class Foo {
    public $index_public;
    protected $index_prot;
    private $index_priv;
    function __construct() {
        $this->index_public = 0;
        $this->index_prot = 1;
        $this->index_priv = 2;
    }
}

$foo = new Foo();
$array = (array)$foo;
print_r($foo);
print_r($array);
//echo $array["\0Foo\0index_priv"]; // This prints 2
//echo $foo->{"\0Foo\0index_priv"}; // This fails
var_dump($array);
echo array_keys($array)[0] . "\n";
echo $array["\0Foo\0index_priv"] . "\n";
echo $array["\0*\0index_prot"] . "\n";
The above code's output is:
Foo Object
(
[index_public] => 0
[index_prot:protected] => 1
[index_priv:Foo:private] => 2
)
Array
(
[index_public] => 0
[*index_prot] => 1
[Fooindex_priv] => 2
)
array(3) {
'index_public' =>
int(0)
'\0*\0index_prot' =>
int(1)
'\0Foo\0index_priv' =>
int(2)
}
index_public
2
1
The PHP developers chose to use the \0 character as a way to mark member-variable visibility. Note that protected fields use a * to indicate that the member variable may actually belong to many classes. It's also used to protect private access, i.e. this code would not work:
echo $foo->{"\0Foo\0index_priv"}; //This fails
but once you cast it to an array there is no such protection, i.e. this works:
echo $array["\0Foo\0index_priv"]; //This prints 2
Is there any reason for this behavior/implementation?
Yes. On any system that you need to interface with, you have to make system calls. If you want the current time, or to convert a date, etc., you need to talk to the operating system, and that means calling the OS API, which in the case of Linux is in C.
PHP was originally developed as a thin wrapper around C. Quite a few languages start out this way and evolve; PHP is no exception.
Is there any reason for this behavior/implementation?
In the absence of any backwards-compatibility issues I'd say some of the choices are less than optimal, but my suspicion is that backwards compatibility is a large factor.
But I still do not understand why the language is designed this way?
Backwards compatibility is almost always the reason why features that people don't like remain in a language. Over time languages evolve and remove things, but it's incremental and prioritized. If you had asked all the PHP developers whether they wanted better binary string handling in some functions or a JIT compiler, I think the JIT might win, much as the performance-focused engine rewrite won out in PHP 7. Note that the people doing the actual work ultimately decide what they work on, and working on a JIT compiler is more fun than fixing libraries that do things in seemingly odd ways.
I'm not aware of any language implementer who doesn't wish they'd done some things differently from the outset. Anyone implementing a compiler before a language is popular is under a lot of pressure to get something that works for them, and that means cutting corners. Not all languages in existence today had a huge company backing them; most often it was a small, dedicated team, and they made mistakes. Some were lucky enough to get paid to do it. Calling them lazy is a bit unfair.
All languages have dark corners, warts, boils, and features you'll eventually hate, some more than others, and PHP has a bad rep because it has/had a lot more than most. Note that PHP 5 was a vast leap forward from PHP 4, and I'd imagine that PHP 7 will improve things even more.
Anyone who thinks their favorite language is free from problems is delusional, and has almost certainly not plumbed the depths of the tool they're using to any great degree.
Functions in PHP which internally operate on C strings are "not binary safe" in PHP terminology. A C string is an array of bytes ending with byte 0. When a PHP function internally uses C strings, it reads characters one by one, and when it encounters byte 0 it considers it the end of the string. Byte 0 tells the C string functions where the string ends, since a C string carries no information about its own length.
"Not binary safe" means that if such a function is somehow handed a C string not terminated with byte 0, the behavior is unpredictable: the function will read/write bytes beyond the end of the string, adding garbage to the string and/or potentially crashing PHP.
In C++, for example, we have the string object. It also contains an array of characters, but it additionally keeps a length field which it updates on any length change, so it does not need byte 0 to mark the end. That is why a string object can contain any number of 0 bytes, although this is generally not valid text, since text should contain only valid characters.
For this to be corrected, the whole PHP core, including any modules which operate on C strings, would need to be rewritten in order to send "non binary safe" functions into history. The amount of work needed is huge, and all the module creators would need to produce new code for their modules. This could introduce new bugs and instabilities into the whole story.
The issue with byte 0 and "non binary safe" functions is not critical enough to justify rewriting PHP and PHP module code. Maybe in some newer PHP version, where some things need to be coded from scratch anyway, it would make sense to correct this.
Until then, you just need to know that any arbitrary binary data you put into a string can be silently cut off at a 0 byte when it passes through a non-binary-safe function. Usually you will notice this when there is unexpected garbage at the end of your string, or PHP crashes.
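To make the distinction concrete, a minimal sketch (assuming strcoll() hands the bytes to the C library, so it stops at the NUL, while strcmp() is documented as binary-safe):
$a = "abc\0def";
$b = "abc\0xyz";
var_dump(strcmp($a, $b) === 0);  // false: binary-safe, compares past the NUL
var_dump(strcoll($a, $b) === 0); // true: not binary-safe, stops at the NUL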
I wrote a function in PHP that generates a Croatian IBAN for a given bank account. I could easily rewrite it to return any IBAN. The problem is that I think it is neither optimized nor elegant. This is the function:
function IBAN_generator($acc){
if(strlen($acc)!=23)
return;
$temp_str=substr($acc,0,3);
$remainder =$temp_str % 97;
for($i=3;$i<=22;$i++)
{
$remainder =$remainder .substr($acc,$i,1);
$remainder = $remainder % 97;
}
$con_num = 98 - $remainder;
if ($con_num<10)
{
$con_num="0".$con_num;
}
$IBAN="HR".$con_num.substr($acc,0,17);
return $IBAN;
}
Is there a better way to generate IBAN?
At first glance it doesn't seem you can make it much faster; it's just a simple sequence of string appends.
Unless you have to run it thousands of times and it represents a bottleneck for your application, I wouldn't waste time making it better. It probably takes a few microseconds, and just upgrading the PHP version would probably yield a bigger improvement than any code changes you'd implement.
If you really have to make it faster, possible solutions are:
- rewriting the function as an extension
- APC opcode caching (it generally speeds up interpreting the code, so it increases speed globally)
- caching results in memory (only if your application runs the same input many times, which is probably not a common case for a simple algorithm like this one)
If you want to play with it and try to make it faster, be careful: you could alter the logic and introduce a bug. Always use a unit test, or write some test cases before changing it. That's always a good practice.
You might have a look at this repo:
https://github.com/jschaedl/Iban
You'd have to add the rules for Croatia yourself.
After that you should be able to use it for your country.
Greetz
To micro-optimize: substr() has O(n) time complexity, so your loop is O(n²). To avoid that, use str_split() and access the characters of the resulting array by index.
Then, to make it more elegant, the for loop can be replaced with array_reduce() (see the sketch after the footnote below).
Also, as a general rule, avoid string concatenation in a loop; it has O(n²) time and memory complexity. An IBAN is short and PHP does not run on microcomputers, so it is not an issue here. But if you ever work with longer strings, generate an array and then implode it*.
And, of course, if you are consistent with spacing around = and after for/if, it is also more elegant ;-)
* In JavaScript I once tried to generate an HTML table of the Hangul alphabet by iterative string concatenation. It crashed the browser, consuming 1 GB+ of memory.
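For illustration, here is a sketch of the same mod-97 logic using str_split() and array_reduce(). It is a hypothetical rewrite, not a tested drop-in; it should behave like the original for 23-character numeric input:
function IBAN_generator($acc) {
    if (strlen($acc) != 23) {
        return;
    }
    // Fold the digits after the first three into the running remainder.
    $remainder = array_reduce(
        str_split(substr($acc, 3)),
        function ($carry, $digit) {
            return ($carry . $digit) % 97;
        },
        substr($acc, 0, 3) % 97
    );
    // str_pad() replaces the if ($con_num < 10) branch.
    $con_num = str_pad(98 - $remainder, 2, '0', STR_PAD_LEFT);
    return 'HR' . $con_num . substr($acc, 0, 17);
}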
I'm currently using the following methods in my class to get the job done:
function xseek($h,$pos){
rewind($h);
if($pos>0)
fread($h,$pos);
}
function find($str){
return $this->startingindex($this->name,$str);
}
function startingindex($a,$b){
$lim = 1 + filesize($a) - strlen($b)/2;
$h = fopen($a,"rb");
rewind($h);
for($i=0;$i<$lim;$i++){
$this->xseek($h,$i);
if($b==strtoupper(bin2hex(fread($h,strlen($b)/2)))){
fclose($h);
return $i;
}
}
fclose($h);
return -1;
}
I realize this is quite inefficient, especially for PHP, but I'm not allowed any other language on my hosting plan.
I ran a couple of tests, and when the hex string is toward the beginning of the file, it runs quickly and returns the offset. When the hex string isn't found, however, the page hangs for a while. This kills me inside, because the last time I tested with PHP and had hanging pages, my webhost shut my site down for 24 hours due to too much CPU time.
Is there a better way to accomplish this (finding a hex string's offset in a file)? Are there certain aspects of it that could be improved to speed up execution?
I would read the entire contents of the file into one hex string and use strrpos, but I was getting errors about the maximum memory being exceeded. Would it be a better method if I chopped the file up and searched large pieces with strrpos?
edit:
To specify, I'm dealing with a settings file for a game. The settings and their values are in a block where there is a 32-bit int before the setting, then the setting, a 32-bit int before the value, and then the value. Both ints represent the lengths of the following strings. For example, if the setting was "test" and the value was "0", it would look like (in hex): 00000004746573740000000130. Now that you mention it, this does seem like a bad way to go about it. What would you recommend?
edit 2:
I tried a file that was below the maximum memory I'm allowed and tried strrpos, but it was very much slower than the way I've been trying.
edit 3: in reply to Charles:
What's unknown is the length of the settings block and where it starts. What I do know is what the first and last settings USUALLY are. I've been using these searching methods to find the location of the first and last setting and determine the length of the settings block. I also know where the parent block starts. The settings block is generally no more than 50 bytes into its parent, so I could start the search for the first setting there and limit how far it will search. The problem is that I also need to find the last setting. The length of the settings block is variable and could be any length. I could read the file the way I assume the game does, by reading the size of the setting, reading the setting, reading the size of the value, reading the value, etc. until I reached a byte with value -1, or FF in hex. Would a combination of limiting the search for the first setting and reading the settings properly make this much more efficient?
You have a lot of garbage code. For example, this code is doing nearly nothing:
function xseek($h,$pos){
rewind($h);
if($pos>0)
fread($h,$pos);
}
because it reads from the beginning of the file every time. Furthermore, why do you need to read something if you are not returning it? Maybe you were looking for fseek()?
If you need to find a hex string in a binary file, it may be better to use something like this: http://pastebin.com/fpDBdsvV (tell me if there are bugs/problems).
But if you are parsing a game's settings file, I'd advise you to use fseek(), fread() and unpack(): seek to the place where the setting is, read a chunk of bytes, and unpack it into PHP variable types.
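A minimal sketch of that idea, based on the length-prefixed format described in the question (a hypothetical helper: it assumes the 4-byte lengths are big-endian, hence unpack('N', ...), and that $offset points at the first setting):
function read_settings($path, $offset) {
    $h = fopen($path, 'rb');
    fseek($h, $offset);
    $settings = array();
    // Stop at EOF or at the 0xFF terminator byte mentioned in the question.
    while (($first = fread($h, 1)) !== false && $first !== '' && ord($first) !== 0xFF) {
        // 4-byte length, then the setting name.
        $len = unpack('N', $first . fread($h, 3));
        $name = fread($h, $len[1]);
        // 4-byte length, then the value.
        $len = unpack('N', fread($h, 4));
        $settings[$name] = fread($h, $len[1]);
    }
    fclose($h);
    return $settings;
}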
I want to synchronize two directories, and I use
file_get_contents($source) === file_get_contents($dest)
to compare two files. Is there any problem with doing this?
I would rather do something like this:
function files_are_equal($a, $b)
{
// Check if filesize is different
if(filesize($a) !== filesize($b))
return false;
// Check if content is different
$ah = fopen($a, 'rb');
$bh = fopen($b, 'rb');
$result = true;
while(!feof($ah))
{
if(fread($ah, 8192) != fread($bh, 8192))
{
$result = false;
break;
}
}
fclose($ah);
fclose($bh);
return $result;
}
This checks whether the filesize is the same, and if it is, it goes through the file step by step.
Checking the modified time can be a quick check in some cases, but it doesn't really tell you anything other than that the files were modified at different times. They might still have the same content.
Using sha1 or md5 might be a good idea, but it requires going through the whole file to create the hash. If the hash is something that could be stored and reused later, that's probably a different story, but yeah...
Use sha1_file() instead. It's faster and works fine if you just need to see whether the files differ. If the files are large, comparing the whole strings to each other can be very heavy. As sha1_file() returns a 40-character representation of the file, comparing the hashes will be very fast.
You can also consider other methods like comparing filemtime or filesize, but hashing will give you guaranteed results even if just one bit has changed.
Memory: e.g. you have a 32 MB memory limit and the files are 20 MB each: unrecoverable fatal error while trying to allocate memory. This can be solved by checking the files in smaller parts.
Speed: string comparisons are not the fastest thing in the world; calculating a sha1 hash should be faster. (If you want to be 110% sure, you can compare the files byte-by-byte when the hashes match, but the hash already rules out all the cases where the content, and hence the hash, changed: 99%+ of cases.)
Efficiency: do some preliminary checks - e.g. there's no point comparing two files if their size differs.
This will work, but it is inherently more inefficient than calculating a checksum for both files and comparing those. Good candidates for checksum algorithms are SHA1 and MD5.
http://php.net/sha1_file
http://php.net/md5_file
if (sha1_file($source) == sha1_file($dest)) {
/* ... */
}
Seems a bit heavy. This will load both files completely as strings and then compare.
I think you might be better off opening both files manually and ticking through them, perhaps just doing a filesize check first.
There isn't anything wrong with what you are doing here, except that it is a little inefficient. Getting the contents of each file and comparing them, especially with larger files or binary data, may run into problems.
I would take a look at filemtime (last modified) and filesize, and run some tests to see if that works for you. It should be all you need, at a fraction of the computational cost.
Check first for the obvious:
Compare size
Compare file type (mime-type).
Compare content.
(add comparison of date, file name and other metadata to this obvious list if those are also not supposed to be similar).
When comparing content, hashing sounds not very efficient, as @Oli says in his comment. If the files are different, they will most likely already differ near the beginning. Calculating hashes of two 50 MB files and then comparing them sounds like a waste of time if the second bit already differs...
Check this post on php.net. It looks very similar to @Svish's approach, but it also compares the files' mime-types. A smart addition if you ask me.
Something I noticed is that there is a lack of the N! factor. In other words, with the filesize() approach you would have to check every file against every other file. Why? What if the first file and the second file are different sizes, but the third file is the same size as the first?
So first you need to get a list of all of the files you are going to work with. If you want to do the filesize check, use the complete path string as the key of an array and store the filesize() information as the value. Then sort the array so all files of the same size are lined up, and check the sizes. Note, however, that equal size does not mean the files really are the same, only that they might be.
Next you need something like sha1_file(): as above, build an array where the path names are the keys and the returned hashes are the values. Sort it, then simply walk through the array, testing each sha1_file() value against the previous one. So, is A == B? Yes. Do any additional tests, then get rid of the SECOND file and continue.
Why am I commenting? I'm working on this same problem and I just found out my program did not work correctly. So now I'm going to go correct it using the sha1_file() function. :-)
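A minimal sketch of that size-first, hash-second pass (a hypothetical helper; assumes $paths holds the full paths of the files to compare):
function find_duplicates(array $paths) {
    // Group candidates by size; a unique size can't be a duplicate.
    $by_size = array();
    foreach ($paths as $path) {
        $by_size[filesize($path)][] = $path;
    }
    // Hash only the files whose sizes collide.
    $by_hash = array();
    foreach ($by_size as $group) {
        if (count($group) < 2) continue;
        foreach ($group as $path) {
            $by_hash[sha1_file($path)][] = $path;
        }
    }
    // Keep only the hash groups that actually collide.
    return array_filter($by_hash, function ($g) { return count($g) > 1; });
}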
First, please note that I am interested in how something like this would work, and I am not intending to build it for a client, etc., as I'm sure there may already be open-source implementations.
How do the algorithms which detect plagiarism in uploaded text work? Do they use regex to send all words to an index, strip out known words like 'the', 'a', etc., and then see how many words are the same in different essays? Do they then have a magic number of identical words which flags a possible duplicate? Do they use levenshtein()?
My language of choice is PHP.
UPDATE
I'm thinking of not checking for plagiarism globally, but rather within, say, 30 uploaded essays from a class, in case students have gotten together on a strictly one-person assignment.
Here is an online site that claims to do so: http://www.plagiarism.org/
Good plagiarism detection will apply heuristics based on the type of document (e.g. an essay or program code in a specific language).
However, you can also apply a general solution. Have a look at the Normalized Compression Distance (NCD). Obviously you cannot exactly calculate a text's Kolmogorov complexity, but you can approximate it by simply compressing the text.
A smaller NCD indicates that two texts are more similar. Some compression algorithms will give better results than others. Luckily PHP provides support for several compression algorithms, so you can have your NCD-driven plagiarism detection code running in no time. Below I'll give example code which uses Zlib:
PHP:
function ncd($x, $y) {
    $cx = strlen(gzcompress($x));
    $cy = strlen(gzcompress($y));
    return (strlen(gzcompress($x . $y)) - min($cx, $cy)) / max($cx, $cy);
}
print(ncd('this is a test', 'this was a test'));
print(ncd('this is a test', 'this text is completely different'));
Python:
>>> from __future__ import division  # so / is float division on Python 2
>>> from zlib import compress as c
>>> def ncd(x, y):
... cx, cy = len(c(x)), len(c(y))
... return (len(c(x + y)) - min(cx, cy)) / max(cx, cy)
...
>>> ncd('this is a test', 'this was a test')
0.30434782608695654
>>> ncd('this is a test', 'this text is completely different')
0.74358974358974361
Note that for larger texts (read: actual files) the results will be much more pronounced. Give it a try and report your experiences!
I think that this problem is complicated, and doesn't have one best solution.
You can detect exact duplication of words at the whole-document level (i.e. someone downloads an entire essay from the web) all the way down to the phrase level. Doing this at the document level is pretty easy; the most trivial solution would take the checksum of each document submitted and compare it against a list of checksums of known documents (sketched below). After that you could try to detect plagiarism of ideas, or find sentences that were copied directly and then changed slightly in order to throw off software like this.
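That trivial document-level check might look like this (a sketch; $known_checksums is a hypothetical lookup table mapping hashes of known documents to their sources):
$known_checksums = array(/* sha1 hash => source document */);
$hash = sha1_file($uploaded_essay);
if (isset($known_checksums[$hash])) {
    echo 'Exact copy of ' . $known_checksums[$hash];
}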
To get something that works at the phrase level, you might need to get more sophisticated if you want any level of efficiency. For example, you could look for differences in writing style between paragraphs, and focus your attention on paragraphs that feel "out of place" compared to the rest of a paper.
There are lots of papers on this subject out there, so I suspect there is no one perfect solution yet. For example, these two papers give introductions to some of the general issues with this kind of software, and have plenty of references that you could dig deeper into if you'd like.
http://ir.shef.ac.uk/cloughie/papers/pas_plagiarism.pdf
http://proceedings.informingscience.org/InSITE2007/IISITv4p601-614Dreh383.pdf
Well, you first of all have to understand what you're up against.
Word-for-word plagiarism should be ridiculously easy to spot. The most naive approach would be to take word tuples of sufficient length and compare them against your corpus. The sufficient length can be incredibly low. Compare Google results:
"I think" => 454,000,000
"I think this" => 329,000,000
"I think this is" => 227,000,000
"I think this is plagiarism" => 5
So even with that approach you have a very high chance of finding a good match or two (fun fact: most criminals are really dumb).
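A sketch of that naive tuple matching (hypothetical helpers; 5-word shingles, lowercased, punctuation stripped):
function shingles($text, $n = 5) {
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $set = array();
    for ($i = 0; $i + $n <= count($words); $i++) {
        $set[implode(' ', array_slice($words, $i, $n))] = true;
    }
    return $set;
}
function shingle_overlap($a, $b) {
    $sa = shingles($a);
    $sb = shingles($b);
    if (count($sa) == 0) return 0.0;
    // Fraction of $a's tuples that also occur in $b.
    return count(array_intersect_key($sa, $sb)) / count($sa);
}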
If the plagiarist used synonyms, changed word ordering and so on, obviously it gets a bit more difficult. You would have to store synonyms as well and try to normalise grammatical structure a bit to keep the same approach working. The same goes for spelling, of course (i.e. try to match by normalisation or try to account for the deviations in your matching, as in the NCD approaches posted in the other answers).
However the biggest problem is conceptual plagiarism. That is really hard and there are no obvious solutions without parsing the semantics of each sentence (i.e. sufficiently complex AI).
The truth is, though, that you only need to find SOME kind of match. You don't need to find an exact match in order to find a relevant text in your corpus. The final assessment should always be made by a human anyway, so it's okay if you find an inexact match.
Plagiarists are mostly stupid and lazy, so their copies will be stupid and lazy, too. Some put an incredible amount of effort into their work, but those works are often non-obvious plagiarism in the first place, so it's hard to track down programmatically (i.e. if a human has trouble recognising plagiarism with both texts presented side-by-side, a computer most likely will, too). For all the other 80%-or-so, the dumb approach is good enough.
It really depends on "plagiarised from where".
If you are talking about within the context of a single site, that's vastly different from across the web, or the Library of Congress, or...
http://www.copyscape.com/ pretty much proves it can be done.
Basic concept seems to be:
- do a Google search for some uncommon word sequences
- for each result, do a detailed analysis
The detailed analysis portion can certainly be similar, since it is a 1 to 1 comparison, but locating and obtaining source documents is the key factor.
(This is a Wiki! Please edit here with corrections or enhancements.)
For better results on not-so-big strings:
There are problems with the direct use of the NCD formula on strings or short texts: NCD(X,X) is not zero (!). To remove this artifact, subtract the self-comparison.
See the similar_NCD_gzip() demo at http://leis.saocarlos.sp.gov.br/SIMILAR.php
function similar_NCD_gzip($sx, $sy, $prec=0, $MAXLEN=90000) {
    # NCD with gzip artifact correction and percentage return.
    # sx, sy = strings to compare.
    # Use $prec=-1 for result range [0-1], $prec=0 for a percentage,
    # $prec=1 or =2,3... for better precision (not reliable).
    # Use $MAXLEN=-1 or an approximate compressed length.
    # For the NCD definition see http://arxiv.org/abs/0809.2553
    # (c) Krauss (2010).
    $x = $min = strlen(gzcompress($sx));
    $y = $max = strlen(gzcompress($sy));
    $xy = strlen(gzcompress($sx . $sy));
    $a = $sx;
    if ($x > $y) { # swap min/max
        $min = $y;
        $max = $x;
        $a = $sy;
    }
    $res = ($xy - $min) / $max; # NCD definition.
    # Optional correction (for short strings):
    if ($MAXLEN < 0 || $xy < $MAXLEN) {
        $aa = strlen(gzcompress($a . $a));
        $ref = ($aa - $min) / $min;
        $res = $res - $ref; # correction
    }
    return ($prec < 0) ? $res : 100 * round($res, 2 + $prec);
}