Forgive me if this isn't a programming oriented question.
Let's say we have two sentences:
[1]=This is a test idea
[2]=This is an experimental idea
If I jumble up [1]
[1]= a This idea test is
Would this count as plagiarism? What sort of logic do I have to apply to detect plagiarism?
I'm not building a complex plagiarism service, just a simple one that can catch obvious plagiarism.
My logic is somewhat like this:
<?php
$str1= "This is a test idea.";
$str2= "This is an experimental idea.";
echo "$str1<br>$str2<br>";
$str1Array = explode(" ",$str1);
$str2Array = explode(" ",$str2);
if(count($str1Array) > count($str2Array))
    $max=count($str1Array);
else
    $max=count($str2Array);
$word_seq = array();
$word_seq_history = array();
$c=0;
$plag_count=0;
for ($i = 0; $i < $max; $i++) {
    $lev = levenshtein($str1Array[$i], $str2Array[$i]); // check for an exact match
    if ($lev == 0) {
        $c+=1;// (exact match)
        //echo "<br>$c";
        $word = $str1Array[$i];
        array_push($word_seq,$word);
    }
    else
    {
        if($lev != 0){
            if($c>=2)
                $plag_count+= count($word_seq);
            $current_seq = implode(" ", $word_seq);
            array_push($word_seq_history,$current_seq);
            echo $current_seq;
            $c=0;
            $word_seq= array();
        }
    }
}
echo "plag_count:";
echo $plag_count;
echo "max:";
echo $max;
echo "<br>" ;
echo ($plag_count/$max)*100;
?>
Output:
String 1: "This is a test idea."
String 2: "This is an experimental idea."
Words_Same:2 max:5
Plagiarism: 40%
Do I need to change it or is it fine the way it is?
What I would do to detect plagiarism in a very basic way is to first calibrate my system, i.e. first run a lot of comparisons on files that you are sure are not plagiarized:
1) Compare a bunch of files with each other and measure the plagiarism rate with your function. Pull out the most commonly used words (say, enough of them to bring your rate down to XX%; this takes trial and error), put these words in your database, and give them a weight of 0. Do this again without these words (you can filter them out with regular expressions) until the rate drops below XX% again, and give that next batch of words a weight of 1. And so on, until you reach a plagiarism rate of nearly zero.
2) Calculate the 'new' percentage as your rate = sum(weight of the words in your db that appear in the text) / (the total weight of all your words), giving words that are not yet in your database a weight of 10.
3) Test it with plagiarized material; if the results are not OK, change a few parameters (weights).
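A minimal sketch of step 2, under one possible reading of the formula: the rate is the weight of the words the suspect text shares with the source, divided by the total weight of the suspect text's words. The function name weightedOverlap, the $weights table and the default weight of 10 for unknown words are illustrative assumptions, not an existing API.

```php
<?php
// Hypothetical helper: weighted share of the suspect text's words that also occur in the source.
// $weights maps a lowercase word to its calibrated weight; unknown words get $defaultWeight.
function weightedOverlap(string $suspect, string $source, array $weights, int $defaultWeight = 10): float
{
    $split = function (string $text): array {
        return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    };
    $suspectWords = array_unique($split($suspect)); // count each distinct word once
    $sourceWords = array_flip($split($source));     // word => index, for fast lookups

    $shared = 0;
    $total = 0;
    foreach ($suspectWords as $word) {
        $weight = $weights[$word] ?? $defaultWeight;
        $total += $weight;
        if (isset($sourceWords[$word])) {
            $shared += $weight;
        }
    }
    return $total > 0 ? $shared / $total : 0.0;
}

// Common words calibrated down to weight 0 (assumed values).
$weights = array('this' => 0, 'is' => 0, 'a' => 0, 'an' => 0);
echo weightedOverlap('This is a test idea', 'This is an experimental idea', $weights); // 0.5
```

Because the comparison works on sets of words, the jumbled sentence "a This idea test is" gets the same score as the original ordering.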
I think this method, if used to check longer passages, will show a high level of correlation just because of common words, especially articles, prepositions, "be" verbs, and other common/overused words. If you're writing about a variety of subjects, be it code or Shakespeare, you're likely to run across jargon sets that are common to many genuinely unique papers. I think you may need to look at an alternate approach. Have you done any research into plagiarism and its detection?
Related
I am looking for the most efficient algorithm in PHP to check if a string was made from dictionary words only or not.
Example:
thissentencewasmadefromenglishwords
thisonecontainsyxxxyxsomegarbagexaatoo
pure
thisisalsobadxyyyaazzz
Output:
thissentencewasmadefromenglishwords
pure
a.txt contains the dictionary words
b.txt contains the strings: one per line, without spaces, made from a..z chars only
Another way to do this is to employ the Aho-Corasick string matching algorithm. The basic idea is to read in your dictionary of words and from that create the Aho-Corasick tree structure. Then, you run each string you want to split into words through the search function.
The beauty of this approach is that creating the tree is a one time cost. You can then use it for all of the strings you're testing. The search function runs in O(n) (n being the length of the string), plus the number of matches found. It's really quite efficient.
Output from the search function will be a list of string matches, telling you which words match at what positions.
The Wikipedia article does not give a great explanation of the Aho-Corasick algorithm. I prefer the original paper, which is quite approachable. See Efficient String Matching: An Aid to Bibliographic Search.
So, for example, given your first string:
thissentencewasmadefromenglishwords
You would get (in part):
this, 0
his, 1
sent, 4
ten, 7
etc.
Now, sort the list of matches by position. It will be almost sorted when you get it from the string matcher, but not quite.
Once the list is sorted by position, the first thing you do is make sure that there is a match at position 0. If there is not, then the string fails the test. If there is (and there might be multiple matches at position 0), you take the length of the matched string and see if there's a string match at that position. Add that match's length and see if there's a match at the next position, etc.
If the strings you're testing aren't very long, then you can use a brute force algorithm like that. It would be more efficient, though, to construct a hash map of the matches, indexed by position. Of course, there could be multiple matches for a particular position, so you have to take that into account. But looking to see if there is a match at a position would be very fast.
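A minimal sketch of that position-indexed check, assuming the string matcher has already produced the matches as [word, position] pairs. The function name coversWholeString and the input format are assumptions made for illustration; the Aho-Corasick search itself is not shown.

```php
<?php
// Hypothetical helper: can the matches be chained from position 0 to the end of the string?
// $matches is a list of [word, position] pairs as reported by the string matcher.
function coversWholeString(array $matches, int $length): bool
{
    // Index the match lengths by starting position (several matches may share a position).
    $byPosition = array();
    foreach ($matches as [$word, $pos]) {
        $byPosition[$pos][] = strlen($word);
    }

    // $reachable[$p] is true when the first $p characters can be split into matched words.
    $reachable = array_fill(0, $length + 1, false);
    $reachable[0] = true;
    for ($pos = 0; $pos < $length; $pos++) {
        if (!$reachable[$pos] || !isset($byPosition[$pos])) {
            continue;
        }
        foreach ($byPosition[$pos] as $len) {
            if ($pos + $len <= $length) {
                $reachable[$pos + $len] = true;
            }
        }
    }
    return $reachable[$length];
}

// Matches for the 9-character string "thiswords".
$matches = array(array('this', 0), array('his', 1), array('word', 4), array('words', 4));
var_dump(coversWholeString($matches, 9)); // bool(true)
```

Tracking every reachable position, rather than following only the first match at each position, avoids failing on strings where a shorter word hides a longer one that starts at the same place.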
It's some work, of course, to implement the Aho-Corasick algorithm. A quick Google search shows that there are PHP implementations available. How well they work, I don't know.
In the average case, this should be very quick. Again, it depends on how long your strings are. But you're helped by there being relatively few matches at any one position. You could probably construct strings that would exhibit pathologically poor runtimes, but you'd probably have to try real hard. And again, even a pathological case isn't going to take too terribly long if the string is short.
This is a problem that can be solved using dynamic programming, based on the following formulas:
f(0) = true
f(i) = OR { f(i-j) AND Dictionary.contains(s.substring(i-j, i)) } for each j=1,...,i
First, load your file into a dictionary, then use the DP solution for the above formula.
Pseudocode is something like this (substring(a, b) takes the characters from index a up to, but not including, index b):
check(word):
    f = new boolean[word.length() + 1]
    f[0] = true
    for i from 1 to word.length():
        f[i] = false
        for j from 0 to i-1:
            if f[j] AND dictionary.contains(word.substring(j, i)):
                f[i] = true
    return f[word.length()]
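The same recurrence can be sketched in PHP as follows. The function name madeOfDictionaryWords is hypothetical, and the dictionary is assumed to be passed in as a plain array of words (for example the lines of a.txt).

```php
<?php
// Hypothetical helper: true when $s can be split entirely into words from $dictionary.
function madeOfDictionaryWords(string $s, array $dictionary): bool
{
    $lookup = array_flip($dictionary); // word => index, so isset() gives fast lookups
    $n = strlen($s);

    // $f[$i] is true when the first $i characters of $s can be split into dictionary words.
    $f = array_fill(0, $n + 1, false);
    $f[0] = true;
    for ($i = 1; $i <= $n; $i++) {
        for ($j = 0; $j < $i; $j++) {
            if ($f[$j] && isset($lookup[substr($s, $j, $i - $j)])) {
                $f[$i] = true;
                break; // one valid split ending at $i is enough
            }
        }
    }
    return $f[$n];
}

$dictionary = array('this', 'sentence', 'was', 'made', 'from', 'english', 'words', 'pure');
var_dump(madeOfDictionaryWords('thissentencewasmadefromenglishwords', $dictionary)); // bool(true)
var_dump(madeOfDictionaryWords('thisisalsobadxyyyaazzz', $dictionary));              // bool(false)
```

Note that with this definition the empty string counts as valid, since f(0) is true; add an explicit check if that is not what you want.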
I recommend a recursive approach. Something like this:
<?php
$wordsToCheck = array(
'otherword',
'word1andother',
'word1',
'word1word2',
'word1word3',
'word1word2word3'
);
$wordList = array(
'word1',
'word2',
'word3'
);
$results = array();
function onlyListedWords($word, $wordList) {
    if (in_array($word, $wordList)) {
        return true;
    }
    $length = strlen($word);
    // Try every proper prefix that is a listed word; if the rest of the
    // string cannot be split, fall through and try a longer prefix.
    for ($i = 1; $i < $length; $i++) {
        if (in_array(substr($word, 0, $i), $wordList)
            && onlyListedWords(substr($word, $i), $wordList)) {
            return true;
        }
    }
    return false; // no split into listed words exists
}
foreach ($wordsToCheck as $word) {
if (onlyListedWords($word, $wordList))
$results[] = $word;
}
var_dump($results);
?>
Can anyone suggest a better (or the preferred) method to find the match percentage between two strings, i.e. how closely two strings (e.g. names) are related, expressed as a percentage, using fuzzy logic? Can anyone help me write the code? I'm really wondering where to start.
$str1 = 'Hello';
$str2 = 'Hello, World!';
similar_text($str1, $str2, $percentage); // $percentage is filled in by reference
echo $percentage; // about 55.56 for these two strings
http://php.net/manual/en/function.similar-text.php
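As an alternative to similar_text, here is a minimal sketch that turns PHP's built-in levenshtein() edit distance into a percentage. The function name similarityPercent and the normalization by the longer string's length are assumptions chosen for illustration; other normalizations are possible.

```php
<?php
// Hypothetical helper: 100 means identical, 0 means every character has to change.
// Note: before PHP 8.0, levenshtein() only accepts strings of up to 255 characters.
function similarityPercent(string $a, string $b): float
{
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0; // two empty strings are identical
    }
    return (1 - levenshtein($a, $b) / $maxLen) * 100;
}

echo round(similarityPercent('kitten', 'sitting'), 2); // 57.14 (3 edits over 7 characters)
```

Edit distance tends to suit short strings such as names, where a typo or a swapped letter should only cost a few percent.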
Word Comparator
Here's a comparison based on words - it's a lot faster than character-based ones, plus it often makes more sense to compare human text by words. However, word lengths do matter; this algorithm takes this into consideration, for better results. Check test results at the end; I think they're pretty much what a human would say.
function wordSimilarity($s1,$s2) {
    $words1 = preg_split('/\s+/',$s1);
    $words2 = preg_split('/\s+/',$s2);
    $diffs1 = array_diff($words2,$words1);
    $diffs2 = array_diff($words1,$words2);
    $diffsLength = strlen(join("",$diffs1).join("",$diffs2));
    $wordsLength = strlen(join("",$words1).join("",$words2));
    if(!$wordsLength) return 0;
    $differenceRate = ( $diffsLength / $wordsLength );
    $similarityRate = 1 - $differenceRate;
    return $similarityRate;
}
This function gives you a floating point value between 0 and 1 where 1 is total similarity.
Let's see some tests
$test = "this is something you've never done before";
wordSimilarity($test,"this is something you've never done before"); // 1.000
wordSimilarity($test,"this is something"); // 0.588
wordSimilarity($test,"this is nothing you have ever done"); // 0.312
wordSimilarity($test,"leave me alone with lorem ipsum"); // 0.000
wordSimilarity($test,"before you do something you've never done"); // 0.845
wordSimilarity($test,"never have i ever done this"); // 0.448
I'm developing a document system that, each time a new document is created, has to detect and discard duplicates in a database of about 500,000 records.
For now, I'm using a search engine to retrieve the 20 most similar documents and compare them with the new one we're trying to create. The problem is that I have to check whether the new document is similar (that's easy with similar_text), or even whether it's contained inside the other text, all while taking into account that the text may have been partly changed by the user (this is the hard part). How can I do that?
For example:
<?php
$new = "the wild lion";
$candidates = array(
'the dangerous lion lives in Africa',//$new is contained into this one, but has changed 'wild' to 'dangerous', it has to be detected as duplicate
'rhinoceros are native to Africa and three to southern Asia.'
);
foreach ( $candidates as $candidate ) {
if( $candidate is similar or $new is contained in it) {
//Duplicated!!
}
}
Of course, in my system the documents are longer than 3 words :)
This is the temporary solution I'm using:
function contained($text1, $text2, $factor = 0.9) {
    //Split into words
    $pattern= '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    $words1 = preg_split($pattern, mb_strtolower($text1), -1, PREG_SPLIT_NO_EMPTY);
    $words2 = preg_split($pattern, mb_strtolower($text2), -1, PREG_SPLIT_NO_EMPTY);
    //Set long and short text
    if (count($words1) > count($words2)) {
        $long = $words1;
        $short = $words2;
    } else {
        $long = $words2;
        $short = $words1;
    }
    //Count the number of words of the short text that also are in the long
    $count = 0;
    foreach ($short as $word) {
        if (in_array($word, $long)) {
            $count++;
        }
    }
    return ($count / count($short)) > $factor;
}
A few ideas that you could investigate further:
Indexing the documents and then searching for similar documents. Open-source indexing/search systems such as Solr, Sphinx or Zend Search Lucene could come in handy.
You could use the simhash algorithm or shingling. Briefly, the simhash algorithm lets you compute similar hash values for similar documents, so you can store this value against each document and compare the values to check how similar various documents are.
Other algorithms that you may find helpful to get some ideas from are:
1. Levenshtein distance
2. Bayesian filtering (see the SO questions on Bayesian filtering). It is best known from the Bayesian spam filtering article on Wikipedia, but this algorithm can be adapted to what you are trying to do.
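To make the shingling idea concrete, here is a minimal sketch that builds word shingles (overlapping runs of k words) for two texts and compares them with the Jaccard index. The function names shingles and jaccard, the shingle size and the tokenization are illustrative assumptions; simhash itself is not shown.

```php
<?php
// Hypothetical helper: the set of k-word shingles of a text, stored as array keys.
function shingles(string $text, int $k = 3): array
{
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $set = array();
    for ($i = 0; $i + $k <= count($words); $i++) {
        $set[implode(' ', array_slice($words, $i, $k))] = true;
    }
    return $set;
}

// Jaccard index of two shingle sets: size of the intersection over size of the union.
function jaccard(array $a, array $b): float
{
    $intersection = count(array_intersect_key($a, $b));
    $union = count($a) + count($b) - $intersection;
    return $union > 0 ? $intersection / $union : 0.0;
}

$a = shingles('the wild lion sleeps', 2);
$b = shingles('the wild lion roars', 2);
echo jaccard($a, $b); // 0.5 (2 shared shingles out of 4 distinct ones)
```

For a partly edited text, a high share of common shingles suggests a duplicate even when individual words were replaced. The threshold has to be tuned on real documents.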
I want to parse a sentence into words but some sentences have two words that can be combined into one and result in a different meaning.
For example:
Eminem is a hip hop star.
If I parse it by splitting the words by space I will get
Eminem
is
a
**hip**
**hop**
star
but I want something like this:
Eminem
is
a
**hip hop**
star
This is just an example; there might be some other word combinations listed as a word in a dictionary.
How can I parse this easily?
I have a dictionary in a MySQL database. Is there any API to do this?
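No APIs that I know of. However, you could try the SQL LIKE clause.
// Note: the mysql_* functions were removed in PHP 7; use mysqli or PDO in new code.
$words = explode(' ', 'Eminem is a hip hop star');
$len = count($words);
$fixed = array();
for($x = 0; $x < $len; $x++) {
    $merged = false;
    if($x + 1 < $len) { //There is a next word to combine with
        //Combine current and next word
        $combined = $words[$x].' '.$words[$x + 1];
        //LIKE 'hip %' will match hip hop
        $q = mysql_query("SELECT word FROM dict WHERE word LIKE '".mysql_real_escape_string($words[$x])." %'");
        while( $result = mysql_fetch_array($q)) {
            if($result['word'] == $combined) { //Combined word is in dictionary
                $merged = true;
                break;
            }
        }
    }
    if($merged) {
        $fixed[] = $combined;
        $x++; //Skip the next word, it has been merged
    } else { //Combined word isn't in dictionary
        $fixed[] = $words[$x];
    }
}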
*Please excuse my lack of PDO. I'm lazy right now.
EDIT: I've done some thinking. While the code above isn't optimal, the optimized version I've come up with probably can't do very much better. The fact of the matter is regardless of how you approach the problem, you will need to compare every word in your input sentence to your dictionary and perform additional computations. I see two approaches you can take depending on hardware limits.
Both of these methods assume a dict table with (example) structure:
+--+-----+------+
|id|first|second|
+--+-----+------+
|01|hip |hop |
+--+-----+------+
|02|grade|school|
+--+-----+------+
Option 1: Your webserver has lots of available RAM (and a decent processor)
The idea here is to completely bypass the database layer by caching the dictionary in PHP's memory (with APC or memcache, the latter if you plan to run on several servers). This will place all the load on your webserver; however, it could be significantly faster, since accessing cached data in RAM is much faster than querying your DB.
(Again, I've left out PDO and Sanitization for simplicity's sake)
// Step One: Cache Dictionary..the entire dictionary
// This could be run on server start-up or before every user input
if(!apc_exists('words')) {
    $words = array();
    $q = mysql_query('SELECT first, second FROM dict');
    while($res = mysql_fetch_row($q)) { //Numeric keys only: array(first, second)
        $words[] = $res;
    }
    apc_store('words', $words); //APC serializes arrays itself; you could use memcache if you want
}
// Step Two: Compare cached dictionary to user input
$data = explode(' ', 'Eminem is a hip hop star');
$words = apc_fetch('words');
$count = count($data);
for($x = 0; $x < $count - 1; $x++) { //Stop one early: each test looks at $x + 1
    foreach($words as $word) { //Match against each word pair
        if($data[$x] == $word[0] && $data[$x+1] == $word[1]) {
            $data[$x] .= ' '.$word[1];
            array_splice($data, $x + 1, 1); //Remove the word that was merged in
            $count--;
            break;
        }
    }
}
Option 2: Fast SQL Server
The second option involves querying each of the words in the input text from the SQL server. For example, for the sentence "Eminem is hip hop" you would create a query that looked like SELECT * FROM dict WHERE (first = 'Eminem' && second = 'is') || (first = 'is' && second = 'hip') || (first = 'hip' && second = 'hop'). Then to fix the array of words you would simply loop through MySQL's results and fuse the appropriate words together. If you are willing to take this route, it might be more efficient to cache commonly used words and fix them before querying the database. This way you can eliminate conditions from your query.
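Whichever option supplies the word pairs, the fusing step can be sketched without any database access. The function name mergePhrases and the [first, second] pair format, mirroring the example dict table, are assumptions for illustration.

```php
<?php
// Hypothetical helper: fuse adjacent words that form a known two-word phrase.
// $pairs is a list of [first, second] rows, as they would come out of the dict table.
function mergePhrases(array $words, array $pairs): array
{
    // Build a lookup keyed by "first second" so each test is a single isset().
    $lookup = array();
    foreach ($pairs as [$first, $second]) {
        $lookup[$first . ' ' . $second] = true;
    }

    $result = array();
    $n = count($words);
    for ($i = 0; $i < $n; $i++) {
        if ($i + 1 < $n && isset($lookup[$words[$i] . ' ' . $words[$i + 1]])) {
            $result[] = $words[$i] . ' ' . $words[$i + 1];
            $i++; // skip the word that was merged in
        } else {
            $result[] = $words[$i];
        }
    }
    return $result;
}

$pairs = array(array('hip', 'hop'), array('grade', 'school'));
print_r(mergePhrases(explode(' ', 'Eminem is a hip hop star'), $pairs));
// Eminem, is, a, hip hop, star
```

Keying the lookup by the combined phrase keeps the work per input word constant, instead of scanning the whole dictionary for every word.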
rand(1,N) but excluding array(a,b,c,..):
Is there already a built-in function that I don't know of, or do I have to implement it myself (and how)?
UPDATE
The qualified solution should have good performance whether the excluded array is big or not.
No built-in function, but you could do this:
function randWithout($from, $to, array $exceptions) {
    sort($exceptions); // lets us use break; in the foreach reliably
    $number = rand($from, $to - count($exceptions)); // or mt_rand()
    foreach ($exceptions as $exception) {
        if ($number >= $exception) {
            $number++; // make up for the gap
        } else /*if ($number < $exception)*/ {
            break;
        }
    }
    return $number;
}
That's off the top of my head, so it could use polishing - but at least you can't end up in an infinite-loop scenario, even hypothetically.
Note: The function breaks if $exceptions exhausts your range - e.g. calling randWithout(1, 2, array(1,2)) or randWithout(1, 2, array(0,1,2,3)) will not yield anything sensible (obviously), but in that case, the returned number will be outside the $from-$to range, so it's easy to catch.
If $exceptions is guaranteed to be sorted already, sort($exceptions); can be removed.
I don't think there's such a built-in function; you'll probably have to code it yourself.
To code this, you have two solutions:
1) Use a loop to call rand() or mt_rand() until it returns a correct value. This means calling rand() several times in the worst case, but it should work OK if N is big and you don't have many forbidden values.
2) Build an array that contains only legal values, and use array_rand to pick one value from it. This will work fine if N is small.
Depending on exactly what you need, and why, this approach might be an interesting alternative.
$numbers = array_diff(range(1, N), array(a, b, c));
// Either (not a real answer, but could be useful, depending on your circumstances)
shuffle($numbers); // $numbers is now a randomly-sorted array containing all the numbers that interest you
// Or:
$x = $numbers[array_rand($numbers)]; // $x is now a random number selected from the set of numbers you're interested in
So, if you don't need to generate the set of potential numbers each time, but are generating the set once and then picking a bunch of random numbers from the same set, this could be a good way to go.
The simplest way...
<?php
function rand_except($min, $max, $excepting = array()) {
    $num = mt_rand($min, $max);
    return in_array($num, $excepting) ? rand_except($min, $max, $excepting) : $num;
}
?>
What you need to do is calculate an array of skipped locations, so you can pick a random position in a continuous array of length M = N - (number of exceptions) and easily map it back to the original array with holes. This will require time and space proportional to the size of the exceptions array. I don't know PHP from a hole in the ground, so forgive the textual semi-pseudocode example.
Make a new array Offset[] the same length as the Exceptions array.
In Offset[i], store the first index in the imagined non-holey array that would have skipped i elements in the original array.
Now, to pick a random element, select a random number r in 0..M-1, where M is the number of remaining elements.
Find i such that Offset[i] <= r < Offset[i+1]; this is easy with a binary search.
Return r + i.
Now, that is just a sketch; you will need to deal with the ends of the arrays, whether things are indexed from 0 or 1, and all that jazz. If you are clever, you can actually compute the Offset array on the fly from the original; it is a bit less clear that way, though.
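One way to pin down the indexing details of this sketch in PHP is shown below. The function names buildOffsets, nthAllowed and randExcluding are hypothetical, and the code assumes every exception lies inside the range [$from, $to]. Here $offsets[$j] is the 0-based index, in the hole-free sequence, of the first allowed value after the j-th smallest exception.

```php
<?php
// Hypothetical helper: offsets of the sorted, de-duplicated exceptions.
function buildOffsets(int $from, array $exceptions): array
{
    $exceptions = array_values(array_unique($exceptions));
    sort($exceptions);
    $offsets = array();
    foreach ($exceptions as $j => $exception) {
        $offsets[] = $exception - $from - $j;
    }
    return $offsets;
}

// Map the 0-based index $r in the hole-free sequence back to a real value.
function nthAllowed(int $from, array $offsets, int $r): int
{
    // Binary search for the number of offsets that are <= $r,
    // i.e. how many excluded values have to be skipped.
    $lo = 0;
    $hi = count($offsets);
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        if ($offsets[$mid] <= $r) {
            $lo = $mid + 1;
        } else {
            $hi = $mid;
        }
    }
    return $from + $r + $lo;
}

function randExcluding(int $from, int $to, array $exceptions): int
{
    $offsets = buildOffsets($from, $exceptions);
    $remaining = ($to - $from + 1) - count($offsets); // M in the description above
    if ($remaining <= 0) {
        throw new InvalidArgumentException('Every value in the range is excluded.');
    }
    return nthAllowed($from, $offsets, mt_rand(0, $remaining - 1));
}

echo randExcluding(1, 10, array(2, 3, 7)); // one of 1, 4, 5, 6, 8, 9, 10
```

If many numbers are drawn against the same exclusion list, buildOffsets only needs to run once. Each draw then costs a single binary search, regardless of how large the range is.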
Maybe it's too late for an answer, but I came up with this piece of code when trying to get random data from a database based on a random ID while excluding some numbers.
$excludedData = array(); // These are your excluded numbers
$maxVal = $this->db->count_all_results("game_pertanyaan"); // Get the maximum number based on my database
$randomNum = rand(1, $maxVal); // First draw; you could also move this into the while condition, it's up to you
while (in_array($randomNum, $excludedData)) {
    $randomNum = rand(1, $maxVal);
}
$randomNum; // Your random number, excluding the numbers you chose
This is a quick way to do it, and it also lets you pick several distinct values ($quantity) at once:
$all = range($Min,$Max);
$diff = array_diff($all,$Exclude);
shuffle($diff );
$data = array_slice($diff,0,$quantity);