PHP spellcheck iteration optimization

Having recently begun working on a project which might need to scale well, I've come up with the following question:
Leaving aside the Levenshtein algorithm itself (I'm working with/on different variations), I iterate through each dictionary word and calculate the Levenshtein distance between the dictionary word and each of the words in my input string. Something along the lines of:
<?php
$input_words = array("this", "is", "a", "test");
$distances = array();
foreach ($dictionary_words as $dictionary_word) {
    foreach ($input_words as $input_word) {
        $ld = levenshtein($input_word, $dictionary_word);
        // keep the smallest distance seen so far for this input word
        if (!isset($distances[$input_word]) || $ld < $distances[$input_word]) {
            $distances[$input_word] = $ld;
            if ($ld == 0)
                continue; // exact match, move on to the next input word
        }
    }
}
?>
My question is about best practice: execution time is ~1-2 seconds.
I'm thinking of running a "dictionary server" which, upon startup, loads the dictionary words into memory and then iterates as part of the spell check (as described above) when a request is received. Will this decrease execution time, or is the slow part the iteration (the for loops)? If so, is there anything I can do to optimize properly?
Google's "Did you mean: ?" doesn't take several seconds to check the same input string ;)
Thanks in advance, and happy New Year.

Read Norvig's How to Write a Spelling Corrector. Although the article uses Python, others have implemented it in PHP here and here.

You'd do well to implement your dictionary as a binary tree or another more efficient data structure; the tree will reduce lookup times enormously.
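One structure particularly suited to Levenshtein lookups is a BK-tree (my suggestion, not part of the answer above). A minimal PHP sketch, with illustrative class and method names: lookups only descend into children whose edge distance is within the tolerance, so most of the dictionary is never compared.
class BKTree {
    // Each node: array('word' => string, 'children' => array(distance => node index))
    private $nodes = array();

    public function add($word) {
        if (!$this->nodes) {
            $this->nodes[] = array('word' => $word, 'children' => array());
            return;
        }
        $i = 0;
        while (true) {
            $d = levenshtein($word, $this->nodes[$i]['word']);
            if ($d === 0) return; // already present
            if (!isset($this->nodes[$i]['children'][$d])) {
                $this->nodes[] = array('word' => $word, 'children' => array());
                $this->nodes[$i]['children'][$d] = count($this->nodes) - 1;
                return;
            }
            $i = $this->nodes[$i]['children'][$d];
        }
    }

    public function search($word, $tolerance) {
        $matches = array();
        if (!$this->nodes) return $matches;
        $stack = array(0);
        while ($stack) {
            $i = array_pop($stack);
            $d = levenshtein($word, $this->nodes[$i]['word']);
            if ($d <= $tolerance) $matches[$this->nodes[$i]['word']] = $d;
            // Triangle inequality: only children with |edge - d| <= tolerance can match
            foreach ($this->nodes[$i]['children'] as $edge => $child) {
                if (abs($edge - $d) <= $tolerance) $stack[] = $child;
            }
        }
        return $matches;
    }
}
Built once at startup (for example in the "dictionary server" you describe), each lookup then touches only a small fraction of the dictionary instead of all of it.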

Related

PHP built in functions complexity (isAnagramOfPalindrome function)

I've been googling for the past 2 hours, and I cannot find a list of the time and space complexity of PHP's built-in functions. I have the isAnagramOfPalindrome problem to solve with the following maximum allowed complexity:
expected worst-case time complexity is O(N)
expected worst-case space complexity is O(1) (not counting the storage required for input arguments).
where N is the input string length. Here is my simplest solution, but I don't know if it is within the complexity limits.
class Solution {
    // Function to determine if the input string can make a palindrome by rearranging it
    static public function isAnagramOfPalindrome($S) {
        // here I am counting how many characters have an odd number of occurrences
        $odds = count(array_filter(count_chars($S, 1), function($var) {
            return ($var & 1);
        }));
        // If the string length is odd, then a palindrome would have 1 character with an odd number of occurrences
        // If the string length is even, all characters should have an even number of occurrences
        return (int)($odds == (strlen($S) & 1));
    }
}
echo Solution::isAnagramOfPalindrome($_POST['input']);
Anyone have an idea where to find this kind of information?
EDIT
I found out that array_filter has O(N) complexity, and count has O(1) complexity. Now I need to find info on count_chars, but a full list would be very convenient for future problems.
EDIT 2
After some research on space and time complexity in general, I found out that this code has O(N) time complexity and O(1) space complexity because:
The count_chars will loop N times (the full length of the input string), giving it a starting complexity of O(N). This generates an array with a limited maximum number of fields (26 precisely, the number of different characters), and then the filter is applied to this array, which means the filter will loop 26 times at most. As the input length goes towards infinity, this loop is insignificant and is treated as a constant. count also applies to this generated constant-size array, and besides, it is insignificant because the count function's complexity is O(1). Hence, the time complexity of the algorithm is O(N).
It goes the same with space complexity. When calculating space complexity, we do not count the input, only the objects generated in the process. These objects are the 26-element array and the count variable, and both are treated as constants because their size cannot grow beyond this point, no matter how big the input is. So we can say that the algorithm has a space complexity of O(1).
Anyway, that list would still be valuable so we do not have to look inside the PHP source code. :)
A probable reason for not including this information is that it is likely to change per release, as improvements are made and optimizations are applied for the general case.
PHP is built on C, and some of the functions are simply wrappers around their C counterparts, for example hypot. A Google search, a look at man hypot, or the docs for the math lib:
http://www.gnu.org/software/libc/manual/html_node/Exponents-and-Logarithms.html#Exponents-and-Logarithms
The source actually provides no better info
https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/math/w_hypot.c (Not official, Just easy to link to)
Not to mention, this is only glibc; Windows will have a different implementation. So there may even be a different big O per OS that PHP is compiled on.
Another reason could be that it would confuse most developers.
Most developers I know would simply choose the function with the "best" big O, but a worse maximum doesn't always mean it's slower.
http://www.sorting-algorithms.com/
has a good visual comparison of what's happening with some algorithms, e.g. bubble sort is a "slow" sort, yet it's one of the fastest for nearly sorted data.
Quicksort is what many will use, which is actually very slow for nearly sorted data.
Big O is worst case - PHP may decide between releases that they should optimize for a certain condition, and that will change the big O of the function, and there's no easy way to document that.
There is a partial list here (which I guess you have seen)
List of Big-O for PHP functions
Which does list some of the more common PHP functions.
For this particular example....
It's fairly easy to solve without using the built-in functions.
Example code
function isPalAnagram($string) {
    $string = str_replace(" ", "", $string);
    $len = strlen($string);
    $oddCount = $len & 1;
    $string = str_split($string);
    while ($len > 0 && $oddCount >= 0) {
        $current = reset($string);
        $replace_count = 0;
        foreach ($string as $key => &$char) {
            if ($char === $current) {
                unset($string[$key]);
                $len--;
                $replace_count++;
                continue;
            }
        }
        $oddCount -= ($replace_count & 1);
    }
    return ($len - $oddCount) === 0;
}
Using the fact that there cannot be more than 1 odd count, you can return early from the loop.
I think mine is also O(N) time, because its worst case is O(N) as far as I can tell.
Test
$a = microtime(true);
for ($i = 1; $i < 100000; $i++) {
    // testMethod is a placeholder for whichever implementation is being timed
    testMethod("the quick brown fox jumped over the lazy dog");
    testMethod("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
    testMethod("testest");
}
printf("Took %s seconds, %s memory", microtime(true) - $a, memory_get_peak_usage(true));
Tests run using really old hardware
My way
Took 64.125452041626 seconds, 262144 memory
Your way
Took 112.96145009995 seconds, 262144 memory
I'm fairly sure that my way is not the quickest way either.
I actually can't see much info on this for languages other than PHP either (Java, for example).
I know a lot of this post is speculation about why it's not there, and there's not a lot drawing from credible sources, but I hope it at least partially explains why big O isn't listed in the documentation pages.

PHP - Checking Character Type and Length and Trimming Leading Zeros

I have this code that works well, just inquiring to see if there is a better, shorter way to do it:
$sString = $_GET["s"];
if (ctype_digit($sString) == true) {
    if (strlen($sString) == 5) {
        $sString = ltrim($sString, '0');
    }
}
The purpose of this code is to remove the leading zeros from a search query before passing it to a search engine if it is 5 digits long. The exact scenario is A) all digits, B) length equals 5, and C) it has leading zeros.
Better/Shorter means in execution time...
$sString = $_GET["s"];
if (preg_match('/^\d{5}$/', $sString)) {
    $sString = ltrim($sString, '0');
}
This is shorter (and makes some assumptions you're not clear about in your question). However, shorter is not always better. Readability is sacrificed here, as well as some performance I guess (which will be negligible in a single execution, but could be "costly" when run in tight loops), etc. You ask for a "better, shorter way to do it"; then you need to define "better" and "shorter". Shorter in terms of lines/characters of code? Shorter in execution time? Better as in more readable? Better as in "most best practices applied"?
Edit:
So you've added the following:
The purpose of this code is to remove the leading zeros from a search query before passing it to a search engine if it is 5 digits long. The exact scenario is A) all digits, B) length equals 5, and C) it has leading zeros.
A) OK, that clears things up. (Are you sure characters like '+' or '-' won't be required? I assume these inputs will be things like (5-digit) invoice numbers, product numbers etc.?)
B) What to do on inputs of length 1, 2, 3, 4, 6, 7, ... 28? 97? Nothing? Keep the input intact?
Better/Shorter means in execution time...
Could you explain why you'd want to "optimize" this tiny piece of code? Unless you run it thousands of times in a tight loop, the effects of optimizing it will be negligible. (Mumbles something about premature optimization being the root of all evil.) What is it that you're hoping to "gain" by "optimizing" this piece of code?
I haven't profiled/measured it yet, but my (educated) guess is that my preg_match is more 'expensive' in terms of execution time than your original code. It is "shorter" in terms of code though (but see above about sacrificing readability etc.).
Long story short: this code is not worth optimizing*; any I/O, database queries etc. will "cost" hundreds, maybe even thousands of times more. Go optimize those first.
* Unless you have "optimized" everything else as far as possible (and even then, a database query might be more efficient when written in another way so the execution plan is more efficient, or I/O might be saved using (more) caching etc. etc.). The fraction of a millisecond (at best) you're about to save optimizing this code had better be worth it! And even then you'd probably need to consider switching to better hardware, another platform or another programming language, for example.
Edit 2:
I have quick'n'dirty "profiled" your code:
$start = microtime(true);
for ($i = 0; $i < 1000000; $i++) {
    $sString = '01234';
    if (ctype_digit($sString) == true) {
        if (strlen($sString) == 5) {
            $sString = ltrim($sString, '0');
        }
    }
}
echo microtime(true) - $start;
Output: 0.806390047073
So, that's 0.8 seconds for 1 million(!) iterations. My code, "profiled" the same way:
$start = microtime(true);
for ($i = 0; $i < 1000000; $i++) {
    $sString = '01234';
    if (preg_match('/^\d{5}$/', $sString)) {
        $sString = ltrim($sString, '0');
    }
}
echo microtime(true) - $start;
Output: 1.09024000168
So, it's slower. As I guessed/predicted.
Note my explicitly mentioning "quick'n'dirty": for good/accurate profiling you need better tools and if you do use a poor-man's profiling method like I demonstrated above then at least make sure you average your numbers over a few dozen runs etc. to make sure your numbers are consistent and reliable.
But either solution takes, worst case, less than 0.0000011 seconds to run. If this code runs "tens of thousands of times a day" (say, 100,000 times) you'll save at most 0.11 seconds over an entire day, and that's if you got it down to 0! If this 0.11 seconds is a problem you have a bigger problem at hand, not these few lines of code. It's just not worth "optimizing" (hell, it's not even worth discussing; the time you and I have taken going back-and-forth over this will not be "earned back" in at least 50 years).
Lesson learned here:
Measure / Profile before optimizing!
A little shorter:
if (ctype_digit($sString) && strlen($sString) == 5) {
    $sString = ltrim($sString, '0');
}
It also matters if you want "digits" 0-9 or a "numeric value", which may be -1, +0123.45e6, 0xf4c3b00c, 0b10100111001 etc.. in which case use is_numeric.
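To illustrate the distinction, a quick sketch (the sample values are the ones mentioned above; the PHP version note reflects my understanding and is worth verifying):
var_dump(ctype_digit('01234'));      // true  - characters 0-9 only
var_dump(ctype_digit('+0123.45e6')); // false - sign, dot and 'e' are not digits
var_dump(is_numeric('+0123.45e6'));  // true  - a numeric value
var_dump(is_numeric('0xf4c3b00c'));  // true on PHP 5, false on PHP 7+ (hex strings)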
I guess you could also just do this to remove 0s:
$sString = (int)$sString;

Optimize PHP script for generating IBAN

I wrote a function in PHP that generates a Croatian IBAN for a given bank account. I can easily rewrite it to return any IBAN. The problem is that I think it is not optimized nor elegant. This is the function:
function IBAN_generator($acc) {
    if (strlen($acc) != 23)
        return;
    $temp_str = substr($acc, 0, 3);
    $remainder = $temp_str % 97;
    for ($i = 3; $i <= 22; $i++) {
        $remainder = $remainder . substr($acc, $i, 1);
        $remainder = $remainder % 97;
    }
    $con_num = 98 - $remainder;
    if ($con_num < 10) {
        $con_num = "0" . $con_num;
    }
    $IBAN = "HR" . $con_num . substr($acc, 0, 17);
    return $IBAN;
}
Is there a better way to generate IBAN?
At first glance it doesn't seem you can make it much faster; it's just a simple sequence of string appends.
Unless you have to use it thousands of times and it represents a bottleneck for your application, I'd not waste time making it better. It probably takes a few microseconds, and just upgrading the PHP version would probably yield improvements much better than any code changes you'd implement.
If you really have to make it faster, possible solutions are:
- rewriting the function as a C extension
- APC opcode caching (it makes interpreting the code generally faster, so it globally increases speed)
- caching results in memory (only if your application runs the same input many times, which is probably not a common case for a simple algorithm like this one)
If you want to play with it and try to make it faster, be careful: you could alter the logic and introduce a bug. Always use a unit test, or write some test cases before changing it; it's always good practice.
You might have a look at this repo:
https://github.com/jschaedl/Iban
It's up to you to add the rules for Croatia.
After that you should be able to use it for your country.
Greetz
To micro-optimize: substr has O(n) time complexity, so your loop is O(n²). To avoid that, use str_split and access the characters of the generated array by index.
Then, to make it more elegant, the for loop can be replaced with array_reduce.
Also, generally avoid string concatenation in a loop; it has O(n²) time and memory complexity. An IBAN is short and PHP is not run on microcomputers, so it is not an issue here. But if you ever work with longer strings, generate an array and then implode it*.
And, of course, if you are consistent with spacing around = and after for/if, it is also more elegant ;-)
* In JavaScript I once tried to generate an HTML table of the Hangul alphabet by iterative string concatenation. It crashed the browser, consuming 1 GB+ of memory.
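A minimal sketch of the same loop rewritten with str_split, as suggested above (the function name is illustrative, and the 23-character input assumption is carried over from the original code):
function IBAN_generator_split($acc) {
    if (strlen($acc) != 23)
        return;
    $digits = str_split($acc); // single pass over the string
    $remainder = ($digits[0] . $digits[1] . $digits[2]) % 97;
    for ($i = 3; $i <= 22; $i++) {
        // append one digit and reduce mod 97, indexing the array instead of calling substr
        $remainder = ($remainder . $digits[$i]) % 97;
    }
    $con_num = str_pad(98 - $remainder, 2, "0", STR_PAD_LEFT);
    return "HR" . $con_num . substr($acc, 0, 17);
}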

Is any solution the correct solution?

I always think to myself after solving a programming challenge that I have been tied up with for some time, "It works, that's good enough".
I don't think this is really the correct mindset; in my opinion I should always be trying to code for the greatest performance.
Anyway, with this said, I just tried a ProjectEuler question. Specifically question #2.
How could I have improved this solution? I feel like it's really verbose, e.g. passing the previous number along in the recursion.
<?php
/* Each new term in the Fibonacci sequence is generated by adding the previous two
   terms. By starting with 1 and 2, the first 10 terms will be:
   1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
   Find the sum of all the even-valued terms in the sequence which do not exceed
   four million.
*/
function fibonacci($number, $previous = 1) {
    global $answer;
    $fibonacci = $number + $previous;
    if ($fibonacci > 4000000) return;
    if ($fibonacci % 2 == 0) {
        $answer = is_numeric($answer) ? $answer + $fibonacci : $fibonacci;
    }
    return fibonacci($fibonacci, $number);
}
fibonacci(1);
echo $answer;
?>
Note this isn't homework. I left school hundreds of years ago. I am just feeling bored and going through the Project Euler questions
I always think to myself after solving a programming challenge that I have been tied up with for some time, "It works, that's good enough".
I don't think this is really the correct mindset, and I think I should always be trying to code for the greatest performance.
One of the classic points made in Code Complete is that programmers, given a goal, can create an "optimum" computer program according to one of many metrics, but it's impossible to optimize for all of the parameters at once. Parameters such as:
Code readability
Understandability of code output
Length of code (lines)
Speed of code execution (performance)
Speed of writing code
Feel free to optimize for any one of these parameters, but keep in mind that optimizing for all of them at the same time can be an exercise in frustration, or result in an overdesigned system.
You should ask yourself: what are your goals? What is "good enough" in this situation? If you're just learning and want to make things more optimized, by all means go for it, just be aware that a perfect program takes infinite time to build, and time is valuable in and of itself.
You can avoid the mod 2 check by advancing three terms at a time (every third element is even), so that it reads:
$fibonacci = 3*$number + 2*$previous;
and the new input to fibonacci is ($fibonacci, 2*$number + $previous).
I'm not familiar with PHP, so this is just general algorithm advice; I don't know if it's the right syntax. It's practically the same operation, it just substitutes a few multiplications for the modulus operations and additions.
Also, make sure that you start with $number as the even term and $previous as the odd one that precedes it in the sequence (you could start with $number as 2, $previous as 1, and have the sum also start at 2).
Forget about Fibonacci (Problem 2); I say just advance in Euler. Don't waste time finding the optimal code for every question.
If your answer achieves the one-minute rule then you are good to try the next one. After a few problems, things will get harder and you will find yourself optimizing the code as you write in order to achieve that goal.
Others on here have said it as well: "This is part of the problem with example questions vs real business problems."
That question is very difficult to answer, for a number of reasons:
Language plays a huge role. Some languages are much more suited to some problems, so if you are faced with a mismatch you are going to find your solution "less than eloquent".
It depends on how much time you have to solve the problem; the more time you have, the more likely it is you will arrive at a solution you like (though the reverse is occasionally true as well: too much time makes you overthink).
It depends on your level of satisfaction overall. I have worked on several projects where I thought parts were great and coded beautifully, and other parts were utter garbage, but they were outside of what I had time to address.
I guess the bottom line is: if you think it's a good solution, and your customer/purchaser/team/etc. agree, then it's a good solution for the time being. You might change your mind in the future, but for now it's a good solution.
Use the guideline that the code to solve the problem shouldn't take more than about a minute to execute. That's the most important thing for Euler problems, IMO.
Beyond that, just make sure it's readable - make sure that you can easily see how the code works. This way, you can more easily see how things worked if you ever get a problem like one of the Euler problems you solved, which in turn lets you solve that problem more quickly - because you already know how you should solve it.
You can set other criteria for yourself, but I think that's going above and beyond the intention of Euler problems - to me, the context of the problems seems far more suited to focusing on efficiency and readability than anything else.
I didn't actually test this... but there was something I personally would have attempted to address in this solution before calling it "done":
avoiding globals as much as possible by implementing the recursion with a sum argument.
EDIT: updated according to nnythm's algorithm recommendation (cool!)
function fibonacci($number, $previous, $sum) {
    // every third Fibonacci term is even, so jump straight to the next even term
    $fibonacci = 3 * $number + 2 * $previous;
    if ($fibonacci > 4000000) {
        return $sum;
    }
    return fibonacci($fibonacci, 2 * $number + $previous, $sum + $fibonacci);
}
echo fibonacci(2, 1, 2);
[shrug]
A solution should be evaluated against the requirements. If all requirements are satisfied, then everything else is moxy. If all requirements are met and you are personally dissatisfied with the solution, then perhaps the requirements need re-evaluation. That's about as far as you can take this metaphysical question, because we start getting into things like project management and business :S
Ahem, regarding your Euler-Project question, just my two-pence:
Consider refactoring to iterative, as opposed to recursive
Notice every third term in the series is even? There is no need for the modulo check once you choose your starting term accordingly
For example
public const ulong TermLimit = 4000000;

public static ulong CalculateSumOfEvenTermsTo (ulong termLimit)
{
    // sum!
    ulong sum = 0;
    // initial conditions
    ulong prevTerm = 1;
    ulong currTerm = 1;
    ulong swapTerm = 0;
    // unroll first even term, [odd + odd = even]
    swapTerm = currTerm + prevTerm;
    prevTerm = currTerm;
    currTerm = swapTerm;
    // begin iterative sum,
    for (; currTerm < termLimit;)
    {
        // we have ensured currTerm is even,
        // and loop condition ensures it is
        // less than limit
        sum += currTerm;
        // next odd term, [odd + even = odd]
        swapTerm = currTerm + prevTerm;
        prevTerm = currTerm;
        currTerm = swapTerm;
        // next odd term, [even + odd = odd]
        swapTerm = currTerm + prevTerm;
        prevTerm = currTerm;
        currTerm = swapTerm;
        // next even term, [odd + odd = even]
        swapTerm = currTerm + prevTerm;
        prevTerm = currTerm;
        currTerm = swapTerm;
    }
    return sum;
}
So, perhaps more lines of code, but [practically] guaranteed to be faster. An iterative approach is not as "elegant", but it saves recursive method calls and stack space. Second, unrolling term generation [that is, explicitly expanding the loop] reduces the number of times you would have had to perform the modulus operation and test the "is even" conditional. Expanding also reduces the number of times your end conditional [whether the current term is less than the limit] is evaluated.
Is it "better", no, it's just "another" solution.
Apologies for the C#, not familiar with php, but I am sure you could translate it fairly well.
Hope this helps, :)
Cheers
It is completely your choice, whether you are happy with a solution or whether you want to improve it further. There are many project Euler problems where a brute force solution would take too long, and where you will have to look for a clever algorithm.
Problem 2 doesn't require any optimisation. Your solution is already more than fast enough.
Still let me explain what kind of optimisation is possible. Often it helps to do some research on the subject. E.g. the wiki page on Fibonacci numbers contains this formula
fib(n) = (phi^n - (1-phi)^n)/sqrt(5)
where phi is the golden ratio. I.e.
phi = (sqrt(5)+1)/2.
If you use that fib(n) is approximately phi^n/sqrt(5) then you can find the index of the largest Fibonacci number smaller than M by
n = floor(log(M * sqrt(5)) / log(phi)).
E.g. for M=4000000, we get n=33, hence fib(33) is the largest Fibonacci number smaller than 4000000. It can be observed that fib(n) is even if n is a multiple of 3. Hence the sum of the even Fibonacci numbers is
fib(0) + fib(3) + fib(6) + ... + fib(3k)
To find a closed form we use the formula above from the Wikipedia page and notice that the sum is essentially just two geometric series. The math isn't completely trivial, but using these ideas it can be shown that
fib(0) + fib(3) + fib(6) + ... + fib(3k) = (fib(3k + 2) - 1) /2 .
Since fib(n) has size O(n), the straightforward solution has a complexity of O(n^2). Using the closed formula above together with a fast method to evaluate Fibonacci numbers has a complexity of O(n log(n)^(1+epsilon)). For small numbers either solution is of course fine.
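To make that concrete, here is a minimal PHP sketch of the approach, assuming the index approximation and the sum formula fib(0) + fib(3) + ... + fib(3k) = (fib(3k+2) - 1)/2 quoted above (the function name is illustrative):
function sumEvenFibonacciBelow($limit) {
    // index of the largest Fibonacci number below the limit, via fib(n) ~ phi^n / sqrt(5)
    $phi = (sqrt(5) + 1) / 2;
    $n = (int)floor(log($limit * sqrt(5)) / log($phi));
    $n -= $n % 3; // round down to a multiple of 3, i.e. the index of the largest even-valued term

    // compute fib(n + 2) iteratively; for Euler-sized limits this fits in an integer
    $a = 0; // fib(0)
    $b = 1; // fib(1)
    for ($i = 0; $i < $n + 2; $i++) {
        list($a, $b) = array($b, $a + $b);
    }
    return ($a - 1) / 2; // $a now holds fib(n + 2)
}

echo sumEvenFibonacciBelow(4000000);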

How would you code an anti plagiarism site?

First, please note that I am interested in how something like this would work, and am not intending to build it for a client etc., as I'm sure there may already be open source implementations.
How do the algorithms which detect plagiarism in uploaded text work? Do they use regex to send all words to an index, strip out known words like 'the', 'a', etc., and then see how many words are the same in different essays? Do they then have a magic number of identical words which flags an essay as a possible duplicate? Do they use levenshtein()?
My language of choice is PHP.
UPDATE
I'm thinking of not checking for plagiarism globally, but more say in 30 uploaded essays from a class. In case students have gotten together on a strictly one person assignment.
Here is an online site that claims to do so: http://www.plagiarism.org/
Good plagiarism detection will apply heuristics based on the type of document (e.g. an essay or program code in a specific language).
However, you can also apply a general solution. Have a look at the Normalized Compression Distance (NCD). Obviously you cannot calculate a text's Kolmogorov complexity exactly, but you can approximate it by simply compressing the text.
A smaller NCD indicates that two texts are more similar. Some compression algorithms will give better results than others. Luckily PHP provides support for several compression algorithms, so you can have your NCD-driven plagiarism detection code running in no time. Below I'll give example code which uses Zlib:
PHP:
function ncd($x, $y) {
    $cx = strlen(gzcompress($x));
    $cy = strlen(gzcompress($y));
    return (strlen(gzcompress($x . $y)) - min($cx, $cy)) / max($cx, $cy);
}
print(ncd('this is a test', 'this was a test'));
print(ncd('this is a test', 'this text is completely different'));
Python:
>>> from zlib import compress as c
>>> def ncd(x, y):
... cx, cy = len(c(x)), len(c(y))
... return (len(c(x + y)) - min(cx, cy)) / max(cx, cy)
...
>>> ncd('this is a test', 'this was a test')
0.30434782608695654
>>> ncd('this is a test', 'this text is completely different')
0.74358974358974361
Note that for larger texts (read: actual files) the results will be much more pronounced. Give it a try and report your experiences!
I think that this problem is complicated, and doesn't have one best solution.
You can detect exact duplication of words at the whole document level (ie someone downloads an entire essay from the web) all the way down to the phrase level. Doing this at the document level is pretty easy - the most trivial solution would take the checksum of each document submitted and compare it against a list of checksums of known documents. After that you could try to detect plagiarism of ideas, or find sentences that were copied directly then changed slightly in order to throw off software like this.
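A trivial sketch of that document-level checksum idea (function and variable names are hypothetical; $knownChecksums would be built from previously submitted documents):
function isExactDuplicate($submittedText, array $knownChecksums) {
    // normalise whitespace and case so trivial reformatting doesn't defeat the check
    $normalised = strtolower(preg_replace('/\s+/', ' ', trim($submittedText)));
    return in_array(sha1($normalised), $knownChecksums, true);
}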
To get something that works at the phrase level you might need to get more sophisticated if you want any level of efficiency. For example, you could look for differences in style of writing between paragraphs, and focus your attention on paragraphs that feel "out of place" compared to the rest of a paper.
There are lots of papers on this subject out there, so I suspect there is no one perfect solution yet. For example, these two papers give introductions to some of the general issues with this kind of software, and have plenty of references that you could dig deeper into if you'd like.
http://ir.shef.ac.uk/cloughie/papers/pas_plagiarism.pdf
http://proceedings.informingscience.org/InSITE2007/IISITv4p601-614Dreh383.pdf
Well, you first of all have to understand what you're up against.
Word-for-word plagiarism should be ridiculously easy to spot. The most naive approach would be to take word tuples of sufficient length and compare them against your corpus. The sufficient length can be incredibly low. Compare Google results:
"I think" => 454,000,000
"I think this" => 329,000,000
"I think this is" => 227,000,000
"I think this is plagiarism" => 5
So even with that approach you have a very high chance to find a good match or two (fun fact: most criminals are really dumb).
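A minimal sketch of that word-tuple matching in PHP (the function names and the tuple length of 4 are illustrative; the corpus is assumed to be plain strings in memory):
// build the set of n-word tuples ("shingles") contained in a text
function wordTuples($text, $n = 4) {
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $tuples = array();
    for ($i = 0; $i + $n <= count($words); $i++) {
        $tuples[implode(' ', array_slice($words, $i, $n))] = true;
    }
    return $tuples;
}

// fraction of the submission's tuples that also occur in a corpus document
function tupleOverlap($submission, $corpusDoc, $n = 4) {
    $a = wordTuples($submission, $n);
    $b = wordTuples($corpusDoc, $n);
    if (!$a) return 0.0;
    return count(array_intersect_key($a, $b)) / count($a);
}
A high overlap on tuples of four or five words is usually already worth flagging for human review, which fits the point made further down that the final assessment is made by a human anyway.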
If the plagiarist used synonyms, changed word ordering and so on, obviously it gets a bit more difficult. You would have to store synonyms as well and try to normalise grammatical structure a bit to keep the same approach working. The same goes for spelling, of course (i.e. try to match by normalisation or try to account for the deviations in your matching, as in the NCD approaches posted in the other answers).
However the biggest problem is conceptual plagiarism. That is really hard and there are no obvious solutions without parsing the semantics of each sentence (i.e. sufficiently complex AI).
The truth is, though, that you only need to find SOME kind of match. You don't need to find an exact match in order to find a relevant text in your corpus. The final assessment should always be made by a human anyway, so it's okay if you find an inexact match.
Plagiarists are mostly stupid and lazy, so their copies will be stupid and lazy, too. Some put an incredible amount of effort into their work, but those works are often non-obvious plagiarism in the first place, so it's hard to track down programmatically (i.e. if a human has trouble recognising plagiarism with both texts presented side-by-side, a computer most likely will, too). For all the other 80%-or-so, the dumb approach is good enough.
It really depends on "plagarised from where".
If you are talking about within the context of a single site, that's vastly different from across the web, or the Library of Congress, or ...
http://www.copyscape.com/ pretty much proves it can be done.
The basic concept seems to be:
do a Google search for some uncommon word sequences
for each result, do a detailed analysis
The detailed analysis portion can certainly be similar, since it is a 1 to 1 comparison, but locating and obtaining source documents is the key factor.
For better results on not-so-big strings: there are problems with the direct use of the NCD formula on strings or little texts, because NCD(X,X) is not zero (!). To remove this artifact, subtract the self-comparison.
See the similar_NCD_gzip() demo at http://leis.saocarlos.sp.gov.br/SIMILAR.php
function similar_NCD_gzip($sx, $sy, $prec = 0, $MAXLEN = 90000) {
    # NCD with gzip artifact correction and percentual return.
    # $sx, $sy = strings to compare.
    # Use $prec=-1 for a result in the range [0-1], $prec=0 for a percentage,
    # $prec=1 or =2,3... for more decimal places (not reliable).
    # Use $MAXLEN=-1 or an approximate compressed length.
    # For the NCD definition see http://arxiv.org/abs/0809.2553
    # (c) Krauss (2010).
    $x = $min = strlen(gzcompress($sx));
    $y = $max = strlen(gzcompress($sy));
    $xy = strlen(gzcompress($sx . $sy));
    $a = $sx;
    if ($x > $y) { # swap min/max
        $min = $y;
        $max = $x;
        $a = $sy;
    }
    $res = ($xy - $min) / $max; # NCD definition.
    # Optional correction (for little strings):
    if ($MAXLEN < 0 || $xy < $MAXLEN) {
        $aa = strlen(gzcompress($a . $a));
        $ref = ($aa - $min) / $min;
        $res = $res - $ref; # correction
    }
    return ($prec < 0) ? $res : 100 * round($res, 2 + $prec);
}
