I need to compare names which can be written in several ways. For example, a name like St. Thomas is sometimes written as St-Thomas or Sant Thomas. Ideally, I'm looking to build a function that gives a percentage of 'equalness' for a comparison, like some forums do (e.g. "this post is 5% edited").
PHP has two (main) built-in functions for this.
levenshtein, which counts how many changes (removals/additions/replacements) are needed to produce string2 from string1 (lower is better)
and
similar_text, which returns the number of matching characters (higher is better). Note that you can pass a variable by reference as the third parameter and it will be set to the match percentage.
<?php
$originalPost = "Here's my question to stack overflou. Thanks /h2ooooooo";
$editedPost = "Question to stack overflow.";
$matchingCharacters = similar_text($originalPost, $editedPost, $matchingPercentage);
var_dump($matchingCharacters); //int(25)
var_dump($matchingPercentage); //float(60.975609756098) (hence edited 40%)
?>
The edit distance between two strings of characters generally refers to the Levenshtein distance.
http://php.net/manual/en/function.levenshtein.php
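For instance, a minimal sketch of levenshtein() on the name variants from the original question (normalising the distance by the longer string length is my own convention, not part of the function):

$reference = 'St. Thomas';
foreach (['St-Thomas', 'Sant Thomas'] as $candidate) {
    $distance = levenshtein($reference, $candidate);
    // Normalising by the longer length gives a rough similarity percentage
    $percent = (1 - $distance / max(strlen($reference), strlen($candidate))) * 100;
    printf("%s: distance %d, ~%.0f%% similar\n", $candidate, $distance, $percent);
}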
$v1 = 'pupil';
$v2 = 'people';
# TRUE if $v1 & $v2 have similar pronunciation
soundex($v1) == soundex($v2);
# Same, but using a more accurate comparison algorithm
metaphone($v1) == metaphone($v2);
# Count the characters the two strings have in common;
# $percent stores the percentage of common chars
$common = similar_text($v1, $v2, $percent);
# Compute the edit distance between the two strings
$diff = levenshtein($v1, $v2);
So, either levenshtein($v1, $v2) or similar_text($v1, $v2, $percent) will do it for you, but there is still a tradeoff. The complexity of the levenshtein() algorithm is O(m*n), where n and m are the lengths of v1 and v2 (rather good compared to similar_text(), which is O(max(n,m)**3), but still expensive).
Check out levenshtein(), which does what you want and is comparatively efficient (but not extremely efficient):
http://www.php.net/manual/en/function.levenshtein.php
You can use different approaches.
You can use the similar_text() function to check for similarity.
OR
You can use the levenshtein() function to find out...
The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2.
Then check the result against a reasonable threshold, as in the sketch below.
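A minimal sketch of such a threshold check (the cutoff of 3 edits is arbitrary and would need tuning for your data):

function namesRoughlyEqual($a, $b, $maxDistance = 3) {
    // Treat the names as equal when only a few edits separate them
    return levenshtein(strtolower($a), strtolower($b)) <= $maxDistance;
}

var_dump(namesRoughlyEqual('St. Thomas', 'St-Thomas'));  // bool(true)
var_dump(namesRoughlyEqual('St. Thomas', 'Manchester')); // bool(false)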
It seems easy to find the similarity percentage between two strings using PHP code; I just use
int similar_text ( string $first , string $second [, float &$percent ] )
but assume that I have two strings for example:
1- Sponsors back away from Sharapova after failed drug test
2- Maria Sharapova failed drugs test at Australian Open
With similar_text I got 53.7%, but that doesn't make sense: both strings are talking about a "failed drug test" for "Sharapova", so the percentage should be higher than 53.7%.
My question is: is there any way to find the real similarity percent between two strings?
I have implemented several algorithms that search for duplicates, and they tend to be quite similar.
The approach I am usually using is the following:
normalize the strings
use a comparison algorithm (e.g. similar_text, levenshtein, etc.)
It appears to me that by implementing step 1) you will be able to improve your results drastically.
Example of normalization algorithm (I use "Sponsors back away from Sharapova after failed drug test" for the details):
1) lowercase the string
-> "sponsors back away from sharapova after failed drug test"
2) explode string in words
-> [sponsors, back, away, from, sharapova, after, failed, drug, test]
3) remove noisy words (like prepositions, e.g. in, for, that, this, etc.). This step can be customized to your needs
-> [sponsors, sharapova, failed, drug, test]
4) sort the array alphabetically (optional, but this can help implementing the algorithm...)
-> [drug, failed, sharapova, sponsors, test]
Applying the very same algorithm to your other string, you would obtain:
[australian, drugs, failed, maria, open, sharapova, test]
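Putting steps 1-4 into code, a minimal sketch (the stopword list here is an illustrative stub tuned to reproduce the example above; a real list would be much longer):

function normalizeQuestion($sentence) {
    $noise = ['a', 'an', 'the', 'of', 'in', 'for', 'from', 'after', 'at', 'back', 'away']; // illustrative stub
    $words = preg_split('/\W+/', strtolower($sentence), -1, PREG_SPLIT_NO_EMPTY); // steps 1 + 2
    $words = array_diff($words, $noise);                                          // step 3
    sort($words);                                                                 // step 4
    return $words;
}

print_r(normalizeQuestion('Sponsors back away from Sharapova after failed drug test'));
// [drug, failed, sharapova, sponsors, test]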
This will help you elaborate a clever algorithm. For example:
for each word in the first string, search the highest similarity in the words of the second string
accumulate the highest similarity
divide the accumulated similarity by the number of words
$words1 = ['drug', 'failed', 'sharapova', 'sponsors', 'test'];
$words2 = ['australian', 'drugs', 'failed', 'maria', 'open', 'sharapova', 'test'];
$nbWords1 = count($words1);
$stringSimilarity = 0;
foreach ($words1 as $word1) {
    $max = null;
    $similarity = null;
    foreach ($words2 as $word2) {
        similar_text($word1, $word2, $similarity);
        if ($similarity > $max) { //1)
            $max = $similarity;
        }
    }
    $stringSimilarity += $max; //2)
}
var_dump($stringSimilarity / $nbWords1); //3)
Running this code will give you 84.83660130719. Not bad, I think ^^. I am sure this algorithm can be further refined, but this is a good start... Also, here we are basically computing the average similarity percentage for each word; you may want a different final approach... tune it for your needs ;-)
I've been googling for the past 2 hours, and I cannot find a list of the time and space complexity of PHP's built-in functions. I have the isAnagramOfPalindrome problem to solve with the following maximum allowed complexity:
expected worst-case time complexity is O(N)
expected worst-case space complexity is O(1) (not counting the storage required for input arguments).
where N is the input string length. Here is my simplest solution, but I don't know if it is within the complexity limits.
class Solution {
    // Determine if the input string can make a palindrome by rearranging it
    static public function isAnagramOfPalindrome($S) {
        // Count how many characters have an odd number of occurrences
        $odds = count(array_filter(count_chars($S, 1), function($var) {
            return $var & 1;
        }));
        // If the string length is odd, a palindrome has exactly 1 character with an odd count;
        // if the string length is even, all characters must have even counts
        return (int)($odds == (strlen($S) & 1));
    }
}

echo Solution::isAnagramOfPalindrome($_POST['input']);
Anyone have an idea where to find this kind of information?
EDIT
I found out that array_filter has O(N) complexity, and count has O(1) complexity. Now I need to find info on count_chars, but a full list would be very convenient for future problems.
EDIT 2
After some research on space and time complexity in general, I found out that this code has O(N) time complexity and O(1) space complexity because:
count_chars will loop N times (the full length of the input string), giving it a starting complexity of O(N). It generates an array with a limited maximum number of fields (26 precisely, the number of different characters), and the filter is then applied to this array, which means the filter will loop 26 times at most. As the input length is pushed towards infinity, this loop is insignificant and is treated as a constant. count also applies to this generated constant-size array and is likewise insignificant, because its complexity is O(1). Hence, the time complexity of the algorithm is O(N).
The same goes for space complexity. When calculating space complexity, we do not count the input, only the objects generated in the process. These objects are the 26-element array and the count variable, and both are treated as constants because their size cannot grow past this point, no matter how big the input is. So we can say that the algorithm has a space complexity of O(1).
Anyway, that list would still be valuable so we do not have to look inside the PHP source code. :)
A probable reason for not including this information is that it is likely to change from release to release, as improvements and optimizations for the general case are made.
PHP is built on C, and some of the functions are simply wrappers around their C counterparts; take hypot, for example. A Google search, a look at man hypot, and the docs for the math library all lead to:
http://www.gnu.org/software/libc/manual/html_node/Exponents-and-Logarithms.html#Exponents-and-Logarithms
The source actually provides no better info
https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/math/w_hypot.c (Not official, Just easy to link to)
Not to mention, this is only glibc; Windows will have a different implementation. So there may even be a different big O per OS that PHP is compiled on.
Another reason could be because it would confuse most developers.
Most developers I know would simply choose a function with the "best" big O
a worst-case maximum doesn't always mean a function is slower in practice
http://www.sorting-algorithms.com/
has a good visual demonstration of what happens with some algorithms, e.g. bubble sort is a "slow" sort, yet it's one of the fastest for nearly sorted data.
Quick sort is what many will use, and it is actually very slow for nearly sorted data.
Big O is the worst case; between releases, PHP may decide to optimize a function for a certain condition, which would change its big O, and there's no easy way to document that.
There is a partial list here (which I guess you have seen)
List of Big-O for PHP functions
Which does list some of the more common PHP functions.
For this particular example....
It's fairly easy to solve without using the built-in functions.
Example code
function isPalAnagram($string) {
    $string = str_replace(" ", "", $string);
    $len = strlen($string);
    $oddCount = $len & 1;
    $string = str_split($string);
    while ($len > 0 && $oddCount >= 0) {
        $current = reset($string);
        $replace_count = 0;
        foreach ($string as $key => &$char) {
            if ($char === $current) {
                unset($string[$key]);
                $len--;
                $replace_count++;
                continue;
            }
        }
        $oddCount -= ($replace_count & 1);
    }
    return ($len - $oddCount) === 0;
}
Using the fact that there cannot be more than 1 odd count, you can return early from the array.
I think mine is also O(N) time, because its worst case is O(N) as far as I can tell.
Test
$a = microtime(true);
for ($i = 1; $i < 100000; $i++) {
    testMethod("the quick brown fox jumped over the lazy dog");
    testMethod("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
    testMethod("testest");
}
printf("Took %s seconds, %s memory", microtime(true) - $a, memory_get_peak_usage(true));
Tests run using really old hardware
My way
Took 64.125452041626 seconds, 262144 memory
Your way
Took 112.96145009995 seconds, 262144 memory
I'm fairly sure that my way is not the quickest way either.
I actually can't see much info for languages other than PHP either (Java, for example).
I know a lot of this post speculates about why it's not there, without drawing much from credible sources, but I hope it at least partially explains why big O isn't listed in the documentation.
I have a unique situation in the sense that what I am asking is for my own convenience rather than the end user of my application.
I am attempting to create an application that tests peoples' IQ scores (I know they are irrelevant and not much use to anyone), nothing too serious, just a project of mine to keep me busy in between assignments.
I am writing it locally in WAMP with PHP. I have found that there are a lot of available IQ questions and answers on the internet that I can use for my project. I have also noticed that there are a lot of the same questions but they are worded slightly differently.
Is there any third party PHP library that I can utilize in order to stop me from including "two" of the same questions in my application?
Some examples of questions that are the "same" but programmatically are considered different:
The average of 20 numbers is zero. Of them, at the most, how many may be greater than zero?
The average of 20 numbers is zero. Of them how many may be greater than zero?
The average of 20 numbers is zero. Of them how many may be greater than zero, at the most?
Obviously PHP's comparison operators alone cannot accomplish this, and differentiating between the similarities in the questions myself is far beyond my programming skill.
I have looked into plagiarism software but have not found any open source PHP projects.
Is there a simpler solution?
Thanks
** EDIT **
One idea I had was, before inserting a question, to explode it at every space and then match the resulting array against other questions that have had the same function applied. The more matches, the more equal the questions are?
I am a newcomer to PHP; does this sound feasible?
As acfrancis has already answered: it doesn't get much simpler than using the built-in levenshtein function.
However, to answer your final question: yes, doing it the way that you suggest is feasible and not too difficult.
Code
function checkQuestions($para1, $para2) {
    $arr1 = array_unique(array_filter(explode(' ', preg_replace('/[^a-zA-Z0-9]/', ' ', strtolower($para1)))));
    $arr2 = array_unique(array_filter(explode(' ', preg_replace('/[^a-zA-Z0-9]/', ' ', strtolower($para2)))));
    $intersect = array_intersect($arr1, $arr2);
    $p1 = count($arr1);      // Number of words in para1
    $p2 = count($arr2);      // Number of words in para2
    $in = count($intersect); // Number of words in intersect
    $lowest = ($p1 < $p2) ? $p1 : $p2; // Which is smaller, p1 or p2?
    return array(
        'Average'  => number_format((100 / (($p1 + $p2) / 2)) * $in, 2), // Percentage the same compared to average length of questions
        'Smallest' => number_format((100 / $lowest) * $in, 2)            // Percentage the same compared to shortest question
    );
}
Explanation
We define a function that accepts two arguments (the arguments being the questions that we're comparing).
We filter the input and convert to an array
Make the input lowercase (strtolower)
Filter out non-alphanumeric characters (preg_replace)
We explode the filtered string on spaces
We filter the created array
Remove blanks (array_filter)
Remove duplicates (array_unique)
Repeat 2-4 for the second question
Find matching words in both arrays and move to new array $intersect
Count number of words in each of the three arrays $p1, $p2, and $in
Calculate percentage similarity and return
You'd then need to set a threshold for how similar the questions had to be before being deemed the same e.g. 80%.
N.B.
The function returns an array of two values: the first compares the length to the average of the two input questions, the second only to the shortest. You could modify it to return a single value.
I used number_format for the percentages, but you'd probably be fine returning an int.
Examples
Example 1
$question1 = 'The average of 20 numbers is zero. Of them, at the most, how many may be greater than zero?';
$question2 = 'The average of 20 numbers is zero. Of them how many may be greater than zero?';
if (checkQuestions($question1, $question2)['Average'] >= 80) {
    echo "Questions are the same...";
} else {
    echo "Questions are not the same...";
}
//Output: Questions are the same...
Example 2
$para1 = 'The average of 20 numbers is zero. Of them, at the most, how many may be greater than zero?';
$para2 = 'The average of 20 numbers is zero. Of them how many may be greater than zero?';
$para3 = 'The average of 20 numbers is zero. Of them how many may be greater than zero, at the most?';
var_dump(checkQuestions($para1, $para2));
var_dump(checkQuestions($para1, $para3));
var_dump(checkQuestions($para2, $para3));
/**
Output:
array(2) {
  ["Average"]=>
  string(5) "93.33"
  ["Smallest"]=>
  string(6) "100.00"
}
array(2) {
  ["Average"]=>
  string(6) "100.00"
  ["Smallest"]=>
  string(6) "100.00"
}
array(2) {
  ["Average"]=>
  string(5) "93.33"
  ["Smallest"]=>
  string(6) "100.00"
}
*/
Try using the Levenshtein distance algorithm:
http://php.net/manual/en/function.levenshtein.php
I've used it (in C#, not PHP) for a similar problem and it works quite well. The trick I found is to divide the Levenshtein distance by the length of the first sentence (in characters). That will give you a rough percentage of change required to convert question 1 into question 2 (for example).
In my experience, if you get anything less than 50-60% (i.e., less than 0.5 or 0.6), the sentences are the same. It might seem high, but notice that 100% isn't the maximum. For example, to convert the string "z" into "abcdefghi" requires around 10 character changes (that's the Levenshtein distance: remove z, then add abcdefghi), or a change of 1,000% as per the calculation above. With big enough changes, you can convert any random string into any other random string.
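In PHP, that heuristic might look like this (the 0.6 cutoff is the one suggested above, not a universal constant; note that older PHP versions cap levenshtein() inputs at 255 characters):

function probablySame($s1, $s2, $cutoff = 0.6) {
    // Fraction of question 1 that must change to become question 2
    $change = levenshtein($s1, $s2) / strlen($s1);
    return $change < $cutoff;
}

$q1 = 'The average of 20 numbers is zero. Of them, at the most, how many may be greater than zero?';
$q2 = 'The average of 20 numbers is zero. Of them how many may be greater than zero?';
var_dump(probablySame($q1, $q2)); // bool(true)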
In PHP I have a 64-bit number which represents tasks that must be completed. A second 64-bit number represents the tasks which have been completed:
$pack_code = 1001111100100000000000000011111101001111100100000000000000011111
$veri_code = 0000000000000000000000000001110000000000000000000000000000111110
I need to compare the two and provide a percentage of tasks completed figure. I could loop through both and find how many bits are set, but I don't know if this is the fastest way?
Assuming that these are actually strings, perhaps something like:
$pack_code = '1001111100100000000000000011111101001111100100000000000000011111';
$veri_code = '0000000000000000000000000001110000000000000000000000000000111110';
$matches = array_intersect_assoc(str_split($pack_code), str_split($veri_code));
$finished_matches = array_intersect($matches, array(1));
$percentage = (count($finished_matches) / 64) * 100;
Because you're getting the numbers as hex strings instead of ones and zeros, you'll need to do a bit of extra work.
PHP does not reliably support numbers over 32 bits as integers. 64-bit support requires being compiled and running on a 64-bit machine. This means that attempts to represent a 64-bit integer may fail depending on your environment. For this reason, it will be important to ensure that PHP only ever deals with these numbers as strings. This won't be hard, as hex strings coming out of the database will be, well, strings, not ints.
There are a few options here. The first would be using the GMP extension's gmp_xor function, which performs a bitwise-XOR operation on two numbers. The resulting number will have bits turned on when the two numbers have opposing bits in that location, and off when the two numbers have identical bits in that location. Then it's just a matter of counting the bits to get the remaining task count.
Another option would be transforming the number-as-a-string into a string of ones and zeros, as you've represented in your question. If you have GMP, you can use gmp_init to read it as a base-16 number, and use gmp_strval to return it as a base-2 number.
If you don't have GMP, this function provided in another answer (scroll to "Step 2") can accurately transform a string-as-number into anything between base-2 and 36. It will be slower than using GMP.
In both of these cases, you'd end up with a string of ones and zeros and can use code like that posted by @Mark Baker to get the difference.
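A sketch of that GMP pipeline (the hex strings below are hypothetical hex forms of the codes from the question; str_pad keeps the leading zero bits that gmp_strval drops):

function hexToBits($hex, $bits = 64) {
    // Read base-16 with GMP, write base-2, left-pad to the full width
    return str_pad(gmp_strval(gmp_init($hex, 16), 2), $bits, '0', STR_PAD_LEFT);
}

$packBits = hexToBits('9F20003F4F90001F');
$veriBits = hexToBits('0000001C0000003E');
echo substr_count($packBits, '1'), " tasks required, ", substr_count($veriBits, '1'), " completed\n";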
Optimization in this case is not worth considering. I'm 100% sure that you don't really care whether your script will be generated 0.00000014 sec faster, am I right?
Just loop through each bit of that number, compare it with another and you're done.
Remember words of Donald Knuth:
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
This code utilizes the GNU Multi Precision library, which is supported by PHP, and since it is implemented in C, should be fast enough, and supports arbitrary precision.
$pack_code = gmp_init("1001111100100000000000000011111101001111100100000000000000011111", 2);
$veri_code = gmp_init("0000000000000000000000000001110000000000000000000000000000111110", 2);
$number_of_different_bits = gmp_popcount(gmp_xor($pack_code, $veri_code));
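To get the completion percentage the question actually asks for, one possibility (my extension, not part of the snippet above) is to count only the required bits that are also set in the verification code:

$completed = gmp_popcount(gmp_and($pack_code, $veri_code)); // bits both required and done
$required = gmp_popcount($pack_code);
printf("%.1f%% of tasks completed\n", 100 * $completed / $required);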
$a = 11111;
echo sprintf('%032b',$a)."\n";
$b = 12345;
echo sprintf('%032b',$b)."\n";
$c = $a & $b;
echo sprintf('%032b',$c)."\n";
$n = 0;
while ($c) {
    $n += $c & 1;
    $c = $c >> 1;
}
echo $n . "\n";
Output:
00000000000000000010101101100111
00000000000000000011000000111001
00000000000000000010000000100001
3
Given that your PHP setup can handle 64-bit integers, this can be easily extended.
If not, you can sidestep this restriction using GNU Multiple Precision.
You could also split up the hex representation and then operate on the corresponding parts instead, as you only need the local fact of 1 or 0, not which number is actually represented. I think that would solve your problem best.
For example:
0xF1A35C and 0xD546C1
you just compare the binary versions of F and D, 1 and 5, A and 4, ...
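A minimal sketch of that nibble-by-nibble idea (the function name and six-digit inputs are just for illustration; equal-length hex strings are assumed):

function countMatchingSetBits($hexA, $hexB) {
    $matches = 0;
    // Compare one hex digit (4 bits) at a time; no big integers needed
    for ($i = 0, $n = strlen($hexA); $i < $n; $i++) {
        $and = hexdec($hexA[$i]) & hexdec($hexB[$i]); // bits set in both nibbles
        $matches += substr_count(decbin($and), '1');
    }
    return $matches;
}

echo countMatchingSetBits('F1A35C', 'D546C1'); // 6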
First, please note, that I am interested in how something like this would work, and am not intending to build it for a client etc, as I'm sure there may already be open source implementations.
How do the algorithms work which detect plagiarism in uploaded text? Does it use regex to send all words to an index, strip out known words like 'the', 'a', etc. and then see how many words are the same in different essays? Does it then have a magic number of identical words which flags it as a possible duplicate? Does it use levenshtein()?
My language of choice is PHP.
UPDATE
I'm thinking of not checking for plagiarism globally, but rather within, say, 30 uploaded essays from a class, in case students have gotten together on a strictly one-person assignment.
Here is an online site that claims to do so: http://www.plagiarism.org/
Good plagiarism detection will apply heuristics based on the type of document (e.g. an essay or program code in a specific language).
However, you can also apply a general solution. Have a look at the Normalized Compression Distance (NCD). Obviously you cannot exactly calculate a text's Kolmogorov complexity, but you can approach it by simply compressing the text.
A smaller NCD indicates that two texts are more similar. Some compression algorithms will give better results than others. Luckily, PHP provides support for several compression algorithms, so you can have your NCD-driven plagiarism detection code running in no time. Below I'll give example code which uses Zlib:
PHP:
function ncd($x, $y) {
    $cx = strlen(gzcompress($x));
    $cy = strlen(gzcompress($y));
    return (strlen(gzcompress($x . $y)) - min($cx, $cy)) / max($cx, $cy);
}
print(ncd('this is a test', 'this was a test'));
print(ncd('this is a test', 'this text is completely different'));
Python:
>>> from zlib import compress as c
>>> def ncd(x, y):
... cx, cy = len(c(x)), len(c(y))
... return (len(c(x + y)) - min(cx, cy)) / max(cx, cy)
...
>>> ncd('this is a test', 'this was a test')
0.30434782608695654
>>> ncd('this is a test', 'this text is completely different')
0.74358974358974361
Note that for larger texts (read: actual files) the results will be much more
pronounced. Give it a try and report your experiences!
I think that this problem is complicated, and doesn't have one best solution.
You can detect exact duplication of words at the whole-document level (i.e. someone downloads an entire essay from the web) all the way down to the phrase level. Doing this at the document level is pretty easy: the most trivial solution would take the checksum of each document submitted and compare it against a list of checksums of known documents, as in the sketch below. After that you could try to detect plagiarism of ideas, or find sentences that were copied directly and then changed slightly in order to throw off software like this.
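A sketch of that trivial checksum solution (the normalisation step is my own addition, so trivial whitespace or case edits don't defeat the hash; md5 here is a fingerprint, not a security measure):

function fingerprint($document) {
    // Collapse whitespace and case so trivial formatting edits don't change the hash
    $normalized = strtolower(preg_replace('/\s+/', ' ', trim($document)));
    return md5($normalized);
}

$knownDocuments = ["An essay downloaded from the web."]; // hypothetical corpus
$knownHashes = array_map('fingerprint', $knownDocuments);

$submission = "An essay   downloaded from the   WEB.";
if (in_array(fingerprint($submission), $knownHashes, true)) {
    echo "Exact (whole-document) duplicate detected\n";
}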
To get something that works at the phrase level you might need to get more sophisticated if want any level of efficiency. For example, you could look for differences in style of writing between paragraphs, and focus your attention to paragraphs that feel "out of place" compared to the rest of a paper.
There are lots of papers on this subject out there, so I suspect there is no one perfect solution yet. For example, these 2 papers give introductions to some of the general issues with this kind of software, and have plenty of references that you could dig deeper into if you'd like.
http://ir.shef.ac.uk/cloughie/papers/pas_plagiarism.pdf
http://proceedings.informingscience.org/InSITE2007/IISITv4p601-614Dreh383.pdf
Well, you first of all have to understand what you're up against.
Word-for-word plagiarism should be ridiculously easy to spot. The most naive approach would be to take word tuples of sufficient length and compare them against your corpus. The sufficient length can be incredibly low. Compare Google results:
"I think" => 454,000,000
"I think this" => 329,000,000
"I think this is" => 227,000,000
"I think this is plagiarism" => 5
So even with that approach you have a very high chance of finding a good match or two (fun fact: most criminals are really dumb).
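A sketch of that tuple approach in PHP (the tuple length of 4 and the sample strings are just for illustration; a real corpus lookup would replace the in-memory comparison):

function wordTuples($text, $n = 4) {
    // Lowercase, split into words, then slide an $n-word window across them
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $tuples = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        $tuples[] = implode(' ', array_slice($words, $i, $n));
    }
    return $tuples;
}

$essayA = 'I think this is plagiarism, to be honest.';
$essayB = 'Frankly, I think this is plagiarism.';
$shared = array_intersect(wordTuples($essayA), wordTuples($essayB));
echo count($shared) . " shared 4-word sequences\n"; // 2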
If the plagiarist used synonyms, changed word ordering and so on, obviously it gets a bit more difficult. You would have to store synonyms as well and try to normalise grammatical structure a bit to keep the same approach working. The same goes for spelling, of course (i.e. try to match by normalisation or try to account for the deviations in your matching, as in the NCD approaches posted in the other answers).
However the biggest problem is conceptual plagiarism. That is really hard and there are no obvious solutions without parsing the semantics of each sentence (i.e. sufficiently complex AI).
The truth is, though, that you only need to find SOME kind of match. You don't need to find an exact match in order to find a relevant text in your corpus. The final assessment should always be made by a human anyway, so it's okay if you find an inexact match.
Plagiarists are mostly stupid and lazy, so their copies will be stupid and lazy, too. Some put an incredible amount of effort into their work, but those works are often non-obvious plagiarism in the first place, so it's hard to track down programmatically (i.e. if a human has trouble recognising plagiarism with both texts presented side-by-side, a computer most likely will, too). For all the other 80%-or-so, the dumb approach is good enough.
It really depends on "plagarised from where".
If you are talking about within the context of a single site, that's vastly different from across the web, or the Library of Congress, or ...
http://www.copyscape.com/ pretty much proves it can be done.
Basic concept seems to be:
do a Google search for some uncommon word sequences
for each result, do a detailed analysis
The detailed analysis portion can certainly be similar, since it is a 1 to 1 comparison, but locating and obtaining source documents is the key factor.
(This is a Wiki! Please edit here with corrections or enhancements)
For better results on not-so-big strings:
There are problems with the direct use of the NCD formula on strings or little texts: NCD(X,X) is not zero (!). To remove this artifact, subtract the self-comparison.
See similar_NCD_gzip() demo at http://leis.saocarlos.sp.gov.br/SIMILAR.php
function similar_NCD_gzip($sx, $sy, $prec=0, $MAXLEN=90000) {
    # NCD with gzip artifact correction and percentual return.
    # $sx, $sy = strings to compare.
    # Use $prec=-1 for a result in the range [0-1], $prec=0 for a percentage,
    # $prec=1 or 2, 3... for better precision (not reliable).
    # Use $MAXLEN=-1 or an approximate compressed length.
    # For the NCD definition see http://arxiv.org/abs/0809.2553
    # (c) Krauss (2010).
    $x = $min = strlen(gzcompress($sx));
    $y = $max = strlen(gzcompress($sy));
    $xy = strlen(gzcompress($sx.$sy));
    $a = $sx;
    if ($x > $y) { # swap min/max
        $min = $y;
        $max = $x;
        $a = $sy;
    }
    $res = ($xy - $min) / $max; # NCD definition.
    # Optional correction (for little strings):
    if ($MAXLEN < 0 || $xy < $MAXLEN) {
        $aa = strlen(gzcompress($a.$a));
        $ref = ($aa - $min) / $min;
        $res = $res - $ref; # correction
    }
    return ($prec < 0) ? $res : 100 * round($res, 2 + $prec);
}
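A quick usage sketch (exact values depend on your zlib build, so treat the outputs as indicative only):

echo similar_NCD_gzip('this is a test', 'this is a test'), "\n";  // 0 after the self-comparison correction
echo similar_NCD_gzip('this is a test', 'this was a test'), "\n"; // small percentage
echo similar_NCD_gzip('this is a test', 'something else entirely'), "\n"; // larger percentage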