I am trying to build unique random phrases from text for detecting plagiarism. The idea is that an author will submit an article, and PHP will then build phrases from the text to be used for plagiarism detection.
Consider the following sentence:
This is a very long and boring article and this article is plagiarized.
Based on the above text, the system will determine how many phrases to generate; e.g. a 20-word article will have 3 phrases. Each generated phrase must be at least 2 and at most 3 words long. The returned output will look like this:
very long
article is plagiarized
I wrote the following code:
$words = str_word_count($text, 1);
$total_phrases_required = count($words) /2;
//build phrases
I need a hint on how to complete the rest.
You could break the two texts up into arrays of sentences and then use a function like similar_text() to check each pair of sentences for similar strings.
Another idea, for finding outright plagiarism: break the text down into sentences again, but then put them into a database and run a query that selects the count of the index column, grouped by the sentence column. If any result comes back with a count greater than 1, you have an exact match for that sentence.
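The exact-match idea can also be tried without a database. Below is a minimal Python sketch, assuming a naive sentence splitter on ".", "!" and "?"; the helper names split_sentences and duplicate_sentences are hypothetical, and a Counter stands in for the GROUP BY query.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naively split text on ., ! and ? and normalize case and whitespace."""
    parts = re.split(r"[.!?]+", text)
    return [" ".join(p.lower().split()) for p in parts if p.strip()]


def duplicate_sentences(submitted, known):
    """Return the sentences that occur in both texts (exact match after normalizing)."""
    # Each text contributes a sentence at most once, so a count of 2 means "in both".
    counts = Counter(set(split_sentences(submitted)))
    counts.update(set(split_sentences(known)))
    # Plays the role of GROUP BY sentence ... HAVING COUNT(*) > 1.
    return sorted(s for s, n in counts.items() if n > 1)
```

This only catches sentences copied verbatim; reworded text would need a fuzzy comparison such as the similar_text idea.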
Related
Is there a predefined library/function in PHP or Python that would let me extract 2-3 paragraphs from a complete document based on the proximity of the keywords found in the document?
Let's say I have 5 keywords: A, B, C, D, E. And I have an essay containing multiple occurrences of all these keywords.
I would like to extract a few paragraphs from it which contain the closest occurrences of the keywords.
Probably split/explode on new lines and then enumerate through the ones you want.
text = "Test paragraph 1.\n\nTest paragraph 2.\n\nTest paragraph 3.\n\nTest paragraph 4."
paragraphs = text.split("\n\n")  # paragraphs are separated by a blank line
print(paragraphs[1:3])  # ['Test paragraph 2.', 'Test paragraph 3.']
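Splitting alone does not pick the paragraphs where the keywords cluster. One possible way to go further, as a rough sketch only: score each paragraph by how many distinct keywords it contains and keep the best ones. The function name rank_paragraphs and the scoring rule are assumptions, not an existing library call.

```python
def rank_paragraphs(text, keywords, top_n=2):
    """Return the top_n paragraphs that contain the most distinct keywords."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    wanted = {k.lower() for k in keywords}

    def score(paragraph):
        # Strip common punctuation so "alpha." still counts as "alpha".
        words = {w.strip(".,;:!?\"'()").lower() for w in paragraph.split()}
        return len(wanted & words)

    # sorted() is stable, so paragraphs with equal scores keep document order.
    return sorted(paragraphs, key=score, reverse=True)[:top_n]
```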
I need to create a script that searches for words with 'blanks', which are basically % in SQL.
$numberofblanks = 1; //max 13
$searchedword = "WORD";
$searchedwordsorted = "DORW";
Results given should be:
WORDY
WORLD
CROWD
SWORD
WORDS
DOWRY
ROWED
DROWN
DOWER
ROWDY
%word, w%ord, wo%rd, wor%d, word% would do, but what about more complicated queries with 2 or more blanks?
Also, I was wondering whether $searchedwordsorted is helpful at all, or whether it doesn't really matter and is just a waste of space in my table.
Thank you kindly for your help guys.
.mike
First I want to correct an error in your question. In your queries you mean _ not %. The % means any number of characters (zero or more). Use _ to mean exactly one character.
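The difference between the two wildcards is easy to check. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made for convenience; both follow the standard LIKE meaning of _ and %). The helper like_matches and the sample word list are made up for illustration.

```python
import sqlite3


def like_matches(pattern, words):
    """Return the words matching a SQL LIKE pattern, using an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dictionary (word TEXT)")
    conn.executemany("INSERT INTO dictionary VALUES (?)", [(w,) for w in words])
    rows = conn.execute(
        "SELECT word FROM dictionary WHERE word LIKE ? ORDER BY word", (pattern,)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


words = ["WORD", "WORDS", "WORDY", "SWORD", "WORDING"]
print(like_matches("WORD_", words))  # _ stands for exactly one character
print(like_matches("WORD%", words))  # % stands for zero or more characters
```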
Now on to the solution... you don't actually need the sorted word stored in the database. You could just do this:
SELECT word
FROM dictionary
WHERE CHAR_LENGTH(word) = 5
AND word LIKE '%W%'
AND word LIKE '%O%'
AND word LIKE '%R%'
AND word LIKE '%D%'
If you have duplicate letters in your input, you need to handle them correctly to ensure that all results contain every duplicated letter. For example, if the input is FOO__, you need to check that each word matches both %F% and %O%O%.
SELECT word
FROM dictionary
WHERE CHAR_LENGTH(word) = 5
AND word LIKE '%F%'
AND word LIKE '%O%O%'
Note that this approach will require a full scan of the table so it will not be particularly efficient. You could improve things slightly by storing the length of each word in a separate column and indexing that column.
If you have sortedword then you can improve performance by omitting the % between duplicated letters, since you know that they will appear consecutively in sortedword. This could improve performance because it reduces the amount of backtracking required for failed matches.
SELECT word
FROM dictionary
WHERE CHAR_LENGTH(word) = 5
AND sortedword LIKE '%F%'
AND sortedword LIKE '%OO%'
Another approach that requires sortedword to be present is as follows:
SELECT word
FROM dictionary
WHERE CHAR_LENGTH(word) = 5
AND sortedword LIKE '%D%O%R%W%'
As before, this requires a full scan of the table, and if you have repeated letters you don't need the % between them.
SELECT word
FROM dictionary
WHERE CHAR_LENGTH(word) = 5
AND sortedword LIKE '%F%OO%'
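The sortedword pattern can be generated from the known letters rather than written by hand. Here is a small sketch; the function name sorted_like_pattern is hypothetical, and it assumes sortedword stores the letters in ascending alphabetical order, as in the DORW example.

```python
def sorted_like_pattern(letters):
    """Build a LIKE pattern for the sortedword column from the known letters."""
    pattern = "%"
    previous = None
    for ch in sorted(letters.upper()):
        # Repeated letters sit next to each other in sortedword,
        # so a % is only needed when the letter changes.
        if previous is not None and ch != previous:
            pattern += "%"
        pattern += ch
        previous = ch
    return pattern + "%"
```

The CHAR_LENGTH condition would then be the number of known letters plus the number of blanks.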
I am in the process of learning MySQL and querying, and right now working with PHP to begin with.
For learning purposes I chose a small anagram solver kind of project to begin with.
I found a very old English language word list on the internet freely available to use as the DB.
I tried plain queries, FIND_IN_SET and full-text search matching, but failed.
How can I:
Match a result letter by letter?
For example, let's say that I have the letters S-L-A-O-G to match against the database entry.
Since I have a large database which surely contains many words, I want to have in return of the query:
lag
goal
goals
slag
log
... and so on.
Without any other results that use a letter more often than it appears in my set.
How would I solve this with SQL?
Thank you very much for your time.
$str_search = 'SLAOG';
SELECT word
FROM table_name
WHERE word REGEXP '^[{$str_search}]+$' # '^[SLAOG]+$'
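// Filter the results in PHP afterwards
// Loop START (run this for each $row returned by the query)
$keep = TRUE;
for ($i = 0; $i < strlen($row->word); $i++) {
    $h = substr($row->word, $i, 1); // current letter of the found word
    // Count the letter in the found word and in the search string (case-insensitive)
    preg_match_all("/{$h}/i", $row->word, $arr_matches);
    preg_match_all("/{$h}/i", $str_search, $arr_matches2);
    if (count($arr_matches[0]) > count($arr_matches2[0])) {
        $keep = FALSE; // Amount doesn't add up: letter is used too often
        break;
    }
}
// Loop END ($keep is TRUE if the word should stay in the result)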
Basically, run a REGEXP on the given letters and filter the result based on how many times each letter occurs in the found word compared with the search word.
The REGEXP checks every row, from the beginning to the end of the word, against any combination of the given letters. This may return more rows than you need, but it makes a good first filter nonetheless.
The loop part filters out words where a letter is used more times than in the search string. I run preg_match_all() on each letter of the found word, against both the found word and the search word, to get the number of occurrences, and compare them with count().
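The same per-letter count comparison can be written compactly with a multiset. This is a Python sketch of the filter step only, assuming the candidate words have already been fetched; the function name uses_only_available_letters is made up for illustration.

```python
from collections import Counter


def uses_only_available_letters(word, letters):
    """True if word can be spelled from letters without using any letter too often."""
    available = Counter(letters.upper())
    needed = Counter(word.upper())
    # A Counter returns 0 for missing letters, so unknown letters fail the check.
    return all(available[ch] >= count for ch, count in needed.items())
```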
If you want a quick and dirty solution....
Split the word you're trying to get anagrams for into individual letters. Assign each letter an individual prime number value, and multiply them all together; eg:
C - 2
A - 3
T - 5
For a total of 30
Then step through your dictionary list, and do the same operation on each word in that. If your target word's value is divisible exactly by the dictionary word's value, then you know that the dictionary word has only letters that occur in your target word.
You can speed it up by pre-calculating the dictionary values, and then querying for just the right values:
SELECT * FROM dictionary WHERE ($searchWordTotal % wordTotal) = 0
(searchWordTotal is the total for the word you're looking for, and wordTotal is the one from the database)
I should get around to writing this properly one of these days....
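A possible version of the prime-product check, as a sketch only. It assigns primes to A-Z in alphabetical order, which is an arbitrary choice (the C/A/T values above use a different assignment), and it assumes the words contain only the letters A-Z. The names first_primes, word_total and built_from are hypothetical.

```python
import string


def first_primes(count):
    """Return the first `count` prime numbers by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


# Arbitrary assignment: A=2, B=3, C=5, ...
LETTER_PRIMES = dict(zip(string.ascii_uppercase, first_primes(26)))


def word_total(word):
    """Multiply the primes of all letters in the word (repeats included)."""
    total = 1
    for ch in word.upper():
        total *= LETTER_PRIMES[ch]
    return total


def built_from(search_word, dictionary_word):
    """True if dictionary_word uses only letters available in search_word."""
    return word_total(search_word) % word_total(dictionary_word) == 0
```

Python integers do not overflow, so this works for any word length; a database column would have a size limit.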
Since you only want words made from the given letters and no others, but you don't need to use all of the letters, I suggest logic like this:
* take your candidate word,
* for each letter in your match set, do a string replace of its first occurrence in the candidate,
* replacing it with an empty string,
* then finally wrap all that in a string-length check to see if there are any characters left.
You can do all that in SQL, but a little procedure will probably look more familiar to most coders.
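The replace-and-measure logic could look roughly like this as a small procedure. It is a Python sketch with hypothetical function names, assuming case should be ignored.

```python
def leftover_after_removal(candidate, letters):
    """Remove the first occurrence of each available letter; return what is left."""
    remaining = candidate.upper()
    for ch in letters.upper():
        # The third argument limits the replacement to the first occurrence only.
        remaining = remaining.replace(ch, "", 1)
    return remaining


def is_match(candidate, letters):
    """A candidate matches when no characters are left over."""
    return len(leftover_after_removal(candidate, letters)) == 0
```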
So I have a database of words between 3 and 20 characters long. I want to code something in PHP that finds all of the smaller words that are contained within a larger word. For example, in the word "inward" there are the words "rain", "win", "rid", etc.
At first I thought about adding a field to the Words tables (Words3 through Words20, denoting the number of letters in the words), something like "LetterCount"... for example, "rally" would be represented as 10000000000200000100000010: 1 instance of the letter A, 0 instances of the letter B, ... 2 instances of the letter L, etc. Then I would go through all the words in each table (or one table, if the target length of the found words was specified) and compare the LetterCount of each word to the LetterCount of the source word ("inward" in the example above).
But then I started thinking that that would place too much of a load on the MySQL database as well as the PHP script, calling each and every word's LetterCount, comparing each and every digit to that of the source word, etc.
Is there an easier, perhaps more intuitive way of doing this? I'm open to using stored procedures if it will help with overhead in any way. Just some suggestions would be greatly appreciated. Thanks!
Here is a simple solution that should be pretty efficient, but it only works up to a certain word length. It will probably break down at about 15-20 characters, depending on whether the letters making up the word are high-frequency letters (which get lower values) or low-frequency letters (which get higher values):
Assign each letter a prime number according to its frequency. So e is 2, t = 3, a = 5, etc. using frequency values from here or some similar source.
Precalculate the value of each word in your word list by multiplying the prime values for the letters in the word, and store in the table in a bigint data type column. For instance, tea would have a value of 3*2*5=30. If a word has repeated letters, repeat the factor, so that teat should have a value of 3*2*5*3=90.
When checking if a word, such as rain, is contained inside of another word, such as inward, it's sufficient to check if the value for rain divides the value for inward. In this case, inward = 14213045, rain = 7315, and 14213045 is divisible by 7315, so the word rain is inside the word inward.
A bigint column maxes out at 9223372036854775807, which should be fine up to about 15-20 characters (depending on the frequencies of letters in the word). For instance, I picked up the first 20-letter word from here, which is antiinstitutionalism, and has a value of 6901041299724096525 which would just barely fit inside the bigint column. However, the 14-letter word xylopyrography has a value of 635285791503081662905, which is too big. You might have to handle the really large ones as special cases using an alternate method, but hopefully there are few enough of them that it would still be relatively efficient.
The query would work something like the demo I've prepared here: http://www.sqlfiddle.com/#!2/9bd27/8
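To see which words would overflow the column before loading them, the values can be precomputed outside the database. The Python sketch below assumes the frequency ordering "etaoinshrdlucmfwypvbgkjqxz", which is one common ordering and not necessarily the table used for the numbers above, so its values for longer words will differ from those quoted. All names are hypothetical.

```python
# Assumed ordering, most frequent letter first; other frequency tables differ.
FREQUENCY_ORDER = "etaoinshrdlucmfwypvbgkjqxz"
BIGINT_MAX = 9223372036854775807  # largest signed 64-bit value


def first_primes(count):
    """Return the first `count` prime numbers by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


PRIME_FOR = dict(zip(FREQUENCY_ORDER, first_primes(len(FREQUENCY_ORDER))))


def word_value(word):
    """Product of the letter primes, repeating the factor for repeated letters."""
    value = 1
    for ch in word.lower():
        value *= PRIME_FOR[ch]
    return value


def fits_in_bigint(word):
    """True if the word's value can be stored in a signed BIGINT column."""
    return word_value(word) <= BIGINT_MAX


def contains_word(source, candidate):
    """True if candidate can be formed from the letters of source."""
    return word_value(source) % word_value(candidate) == 0
```

Words for which fits_in_bigint returns False are the special cases that would need an alternate method.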
Basically - I want to calculate the "Proximity" of various terms.
By "proximity" I mean specifically the number of spaces/characters/words that sit between them.
Example:
Terms = Word1 / Word2
Chunk = "blah Word1 blah blah blah blah blah Word2 blah"
Proximity = Word1-Word2:5
The script would see the 2 terms, locate them, and then determine the distance based on the words that lie between them.
A more advanced version would be to examine the semantic structure - and identify whether the terms occur within the same semantic element, or a sibling, or a parent etc.
Thus proximity discovery of terms may be within the same paragraph, or in sequential paragraphs, or under the same "parent" (heading) but otherwise separate etc.
Further - introducing things like word stemming/relationships/soundings at a later date may be useful too.
I've looked around the net (Google, here, php forums, php script sites).
Not seeing anything like it.
I can see tools on some sites that do similar (limited) - usually SEO based tools.
I want to be able to apply this to "text" in general ... as I may apply it to uploaded word/txt files etc.
I'm not seeing any real examples, so I can only assume it's more than a trifle to code it.
The question is - how can I do this?
How would I handle variant order of the words (Word1+Word2 / Word2+Word1)?
How could I handle identifying proximity within/outside of the same element/structure?
Hoping someone can shed some light/make some suggestions.
If you need to do a lot of this kind of search on a given text, you could begin by indexing the whole text into a database containing the word, its position in the text, and the paragraph number (if needed). Then, you could select all the Word1 and Word2 positions, and it shouldn't be too hard to infer the minimal distance.
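The indexing idea can be prototyped in memory before committing to a database schema. In this Python sketch a dictionary of word positions stands in for the database table; the function names are hypothetical, the two terms are assumed to be different words, and paragraph numbers are left out.

```python
import re
from collections import defaultdict


def build_index(text):
    """Map each lowercase word to the list of its positions in the text."""
    index = defaultdict(list)
    for position, word in enumerate(re.findall(r"\w+", text.lower())):
        index[word].append(position)
    return index


def min_words_between(index, term1, term2):
    """Smallest number of words between any occurrence of term1 and term2, or None."""
    positions1 = index.get(term1.lower(), [])
    positions2 = index.get(term2.lower(), [])
    if not positions1 or not positions2:
        return None
    # abs() makes the result independent of which term comes first.
    return min(abs(a - b) for a in positions1 for b in positions2) - 1
```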
Edit:
Here is an attempt at a simple one-shot algorithm that doesn't use a database:
1. Remove any HTML and punctuation to keep only the words
2. Search for the first occurrence of Word1
3. Count the number of words (or chars, or spaces) until you reach the next occurrence of Word2
4. If you reach Word1 again before reaching Word2, restart the counter
5. Record the distance, then repeat steps 2-5 to get the other occurrences of Word1 and Word2
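A single-pass version of these steps might look like the Python sketch below. It is a slight variation rather than a literal transcription: it accepts the terms in either order (Word1 then Word2, or Word2 then Word1), it assumes the two terms are different words, and it assumes any HTML has already been stripped. The function name distances is hypothetical.

```python
import re


def distances(text, term1, term2):
    """Scan once and return the number of words between each adjacent term pair."""
    term1, term2 = term1.lower(), term2.lower()
    results = []
    last_term, last_pos = None, None
    # \w+ keeps only the words, which drops punctuation.
    for pos, word in enumerate(re.findall(r"\w+", text.lower())):
        if word not in (term1, term2):
            continue
        # Seeing the other term closes a pair; seeing the same term again
        # simply restarts the counter from the newer position.
        if last_term is not None and word != last_term:
            results.append(pos - last_pos - 1)
        last_term, last_pos = word, pos
    return results
```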