How to check if a text is contained in another? - php

I'm developing a document system that, each time a new document is created, has to detect and discard duplicates against a database of about 500,000 records.
For now, I'm using a search engine to retrieve the 20 most similar documents and compare them with the new one we're trying to create. The problem is that I have to check whether the new document is similar (that's easy with similar_text), or even whether it's contained inside the other text, and all of these checks have to allow for the text having been partly changed by the user (that's the hard part). How can I do that?
For example:
<?php
$new = "the wild lion";
$candidates = array(
    'the dangerous lion lives in Africa', // $new is contained in this one, but 'wild' was changed to 'dangerous'; it has to be detected as a duplicate
    'rhinoceros are native to Africa and three to southern Asia.'
);
foreach ($candidates as $candidate) {
    if ( /* $candidate is similar, or $new is contained in it */ ) {
        // Duplicate!!
    }
}
Of course, in my system the documents are longer than 3 words :)

This is the temporary solution I'm using:
function contained($text1, $text2, $factor = 0.9) {
    // Split into words
    $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    $words1 = preg_split($pattern, mb_strtolower($text1), -1, PREG_SPLIT_NO_EMPTY);
    $words2 = preg_split($pattern, mb_strtolower($text2), -1, PREG_SPLIT_NO_EMPTY);
    // Set long and short text
    if (count($words1) > count($words2)) {
        $long  = $words1;
        $short = $words2;
    } else {
        $long  = $words2;
        $short = $words1;
    }
    // Count the number of words of the short text that are also in the long one
    $count = 0;
    foreach ($short as $word) {
        if (in_array($word, $long)) {
            $count++;
        }
    }
    return ($count / count($short)) > $factor;
}
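For reference, a hypothetical call against the candidates from the example above (with the factor lowered, since these example texts are only a few words long):

$new = 'the wild lion';
foreach ($candidates as $candidate) {
    if (contained($new, $candidate, 0.6)) {
        // 2 of the 3 words of $new appear in the first candidate, so it is flagged;
        // the second candidate shares no words and is not.
    }
}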

A few ideas that you could potentially undertake or investigate further:
Indexing the documents and then searching for similar ones. Open-source indexing/search systems such as Solr, Sphinx or Zend Search Lucene could come in handy.
You could use the simhash algorithm or shingling. Briefly, simhash computes similar hash values for similar documents, so you could store this value against each document and check how similar various documents are (a rough sketch follows after this list).
Other algorithms that you may find helpful to get some ideas from are:
1. Levenshtein distance
2. Bayesian filtering - see the SO questions on Bayesian filtering and the Bayesian spam filtering article on Wikipedia; the algorithm can be adapted to what you are trying to do.
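As a rough illustration of the simhash idea above (a sketch, not from the answer; it hashes individual words with crc32, but any per-feature hash would do):

<?php
// Word-level simhash: documents that share most of their words end up with
// fingerprints that differ in only a few bits.
function simhash($text, $bits = 32) {
    $vector = array_fill(0, $bits, 0);
    $words = preg_split('/\W+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $hash = crc32($word);                          // per-feature hash
        for ($i = 0; $i < $bits; $i++) {
            $vector[$i] += (($hash >> $i) & 1) ? 1 : -1;
        }
    }
    $fingerprint = 0;
    for ($i = 0; $i < $bits; $i++) {
        if ($vector[$i] > 0) {
            $fingerprint |= (1 << $i);
        }
    }
    return $fingerprint;
}

// Near-duplicates show up as fingerprints with a small Hamming distance.
function hammingDistance($a, $b) {
    return substr_count(decbin($a ^ $b), '1');
}

Storing one fingerprint per document turns the near-duplicate check against the existing records into a cheap per-record integer comparison instead of a full text comparison.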

Related

Parsing a mixed-delimiter data set

I've got a source file that contains some data in a few formats that I need to parse. I'm writing an ETL process that will have to match other data.
Most of the data is in the format city, state (US standard, more or less). Some entries group multiple cities from heavier-population areas into one combined entry.
Most of the data looks like this (call this 1):
Elkhart, IN
Some places have multiple cities, delimited by a dash (call this 2):
Hickory-Lenoir-Morganton, NC
It's still not too complicated when the cities are in different states (call this 3):
Steubenville, OH-Weirton, WV
This one threw me for a loop; it makes sense but it breaks with the previous formats (call this 4):
Kingsport, TN-Johnson City, TN-Bristol, VA-TN
In that example, Bristol is in both VA and TN. Then there's this (call this 5):
Mayagüez/Aguadilla-Ponce, PR
I'm okay with replacing the slash with a dash and processing it the same as a previous example. That one also contains a diacritic, and the rest of my data is diacritic-free; I'm okay with stripping the diacritic off, which seems to be fairly straightforward in PHP.
Then there's my final example (call this 6):
Scranton--Wilkes-Barre--Hazleton, PA
The city name contains a dash so the delimiter between city names is a double dash.
What I'd like to produce is, given any of the above examples and a few hundred other lines that follow the same format, an array of [[city, state],...] for each so I can turn them into SQL. For example, parsing 4 would yield:
[
['Kingsport', 'TN'],
['Johnson City', 'TN'],
['Bristol', 'VA'],
['Bristol', 'TN']
]
I'm using a standard PHP install, I've got preg_match and so on but no PECL libraries. Order is unimportant.
Any thoughts on a good way to do this without a big pile of if-then statements?
I would split the input on '-'s and ','s, then delete the empty elements from the array. str_replace followed by explode, then array_diff($parts, array('')) to drop the empty elements, should do the trick.
Then identify the states, either by searching a list or by working on the principle that cities don't tend to have two-upper-case-letter names.
Now work through the array. If it's a city, save the name; if it's a state, apply it to the saved cities. Clear the list of cities when a city immediately follows a state.
Note any exceptions and reformat them by hand into a different input.
Hope this helps.
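A minimal sketch of that approach (a hypothetical helper, not from the answer; it deliberately ignores case 6, where city names themselves contain dashes):

// Split on dashes, slashes and commas; keep cities pending until their state appears.
function parseLine($line) {
    $tokens = array_values(array_filter(array_map('trim', preg_split('#[-/,]+#', $line)), 'strlen'));
    $pairs = array();
    $cities = array();
    $lastWasState = false;
    foreach ($tokens as $token) {
        if (preg_match('/^[A-Z]{2}$/', $token)) {
            // A state code: apply it to every saved city.
            foreach ($cities as $city) {
                $pairs[] = array($city, $token);
            }
            $lastWasState = true;
        } else {
            // A city: start a fresh list when it immediately follows a state.
            if ($lastWasState) {
                $cities = array();
            }
            $cities[] = $token;
            $lastWasState = false;
        }
    }
    return $pairs;
}

For cases 1 through 5 this yields the [[city, state], ...] pairs asked for, including the two Bristol entries in case 4.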
For anyone who's interested, I took the answer from @mike and came up with this:
function SplitLine($line) {
    // This is over-simplified, just to cover the given case.
    $line = str_replace('ü', 'u', $line);
    // Cover case 6.
    $delimiter = '-';
    if (false !== strpos($line, '--'))
        $delimiter = '--';
    $line = str_replace('/', $delimiter, $line);
    // Case 5 looks like case 2 now.
    $parts = explode($delimiter, $line);
    $table = array_map(function($part) { return array_map('trim', explode(',', $part)); }, $parts);
    // At this point, $table contains a grid with missing values.
    for ($i = 0; $i < count($table); $i++) {
        $row = $table[$i];
        // Trivial case (case 1 and 3), go on.
        if (2 == count($row))
            continue;
        if (preg_match('/^[A-Z]{2}$/', $row[0])) {
            // Missing city; seek backwards.
            $find = $i;
            while (2 != count($table[$find]))
                $find--;
            $table[$i] = [$table[$find][0], $row[0]];
        } else {
            // Missing state; seek forwards.
            $find = $i;
            while (2 != count($table[$find]))
                $find++;
            $table[$i][] = $table[$find][1];
        }
    }
    return $table;
}
It's not pretty and it's slow, but it does cover all my cases, and since I'm doing an ETL process the speed isn't paramount. There's also no error detection, which is acceptable in my particular case.
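For example, running it against case 4 returns the array from the question:

print_r(SplitLine('Kingsport, TN-Johnson City, TN-Bristol, VA-TN'));
// Kingsport/TN, Johnson City/TN, Bristol/VA, Bristol/TN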

Split strings into Dictionary words

I am looking for the most efficient algorithm in PHP to check if a string was made from dictionary words only or not.
Example:
thissentencewasmadefromenglishwords
thisonecontainsyxxxyxsomegarbagexaatoo
pure
thisisalsobadxyyyaazzz
Output:
thissentencewasmadefromenglishwords
pure
a.txt contains the dictionary words.
b.txt contains the strings: one per line, without spaces, made from a..z chars only.
Another way to do this is to employ the Aho-Corasick string matching algorithm. The basic idea is to read in your dictionary of words and from that create the Aho-Corasick tree structure. Then, you run each string you want to split into words through the search function.
The beauty of this approach is that creating the tree is a one time cost. You can then use it for all of the strings you're testing. The search function runs in O(n) (n being the length of the string), plus the number of matches found. It's really quite efficient.
Output from the search function will be a list of string matches, telling you which words match at what positions.
The Wikipedia article does not give a great explanation of the Aho-Corasick algorithm. I prefer the original paper, which is quite approachable. See Efficient String Matching: An Aid to Bibliographic Search.
So, for example, given your first string:
thissentencewasmadefromenglishwords
You would get (in part):
this, 0
his, 1
sent, 4
ten, 7
etc.
Now, sort the list of matches by position. It will be almost sorted when you get it from the string matcher, but not quite.
Once the list is sorted by position, the first thing you do is make sure that there is a match at position 0. If there is not, then the string fails the test. If there is (and there might be multiple matches at position 0), you take the length of the matched string and see if there's a string match at that position. Add that match's length and see if there's a match at the next position, etc.
If the strings you're testing aren't very long, then you can use a brute force algorithm like that. It would be more efficient, though, to construct a hash map of the matches, indexed by position. Of course, there could be multiple matches for a particular position, so you have to take that into account. But looking to see if there is a match at a position would be very fast.
It's some work, of course, to implement the Aho-Corasick algorithm. A quick Google search shows that there are php implementations available. How well they work, I don't know.
In the average case, this should be very quick. Again, it depends on how long your strings are. But you're helped by there being relatively few matches at any one position. You could probably construct strings that would exhibit pathologically poor runtimes, but you'd probably have to try real hard. And again, even a pathological case isn't going to take too terribly long if the string is short.
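As a rough illustration (not part of this answer), here is one way the position-indexed matches could be consumed in PHP, assuming $matchesByPos maps each start position to the lengths of the dictionary words matched there:

// Hypothetical helper: true when the matches cover the whole string from 0 to its end.
function coversString(array $matchesByPos, $length) {
    $reachable = array(0 => true);              // positions reachable by chaining matches
    for ($pos = 0; $pos < $length; $pos++) {
        if (empty($reachable[$pos]) || !isset($matchesByPos[$pos])) {
            continue;
        }
        foreach ($matchesByPos[$pos] as $len) {
            $reachable[$pos + $len] = true;     // a word starting at $pos reaches $pos + $len
        }
    }
    return !empty($reachable[$length]);
}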
This is a problem that can be solved using dynamic programming, based on the following recurrence:
f(0) = true
f(i) = OR over j = 1, ..., i of ( f(i-j) AND Dictionary.contains(s.substring(i-j, i)) )
First, load your file into a dictionary, then use the DP solution for the above formula.
Pseudo code is something like:
check(word):
    f = new boolean[word.length() + 1]
    f[0] = true
    for i from 1 to word.length():
        f[i] = false
        for j from 0 to i - 1:
            if f[j] AND dictionary.contains(word.substring(j, i)):
                f[i] = true
    return f[word.length()]
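For reference, a direct PHP translation of this DP (a minimal sketch, assuming the dictionary from a.txt has been loaded into an array keyed by word):

<?php
function isMadeOfWords($string, array $dictionary) {
    $n = strlen($string);
    $f = array_fill(0, $n + 1, false);
    $f[0] = true;                              // the empty prefix always decomposes
    for ($i = 1; $i <= $n; $i++) {
        for ($j = 0; $j < $i; $j++) {
            if ($f[$j] && isset($dictionary[substr($string, $j, $i - $j)])) {
                $f[$i] = true;                 // the prefix of length $i splits into words
                break;
            }
        }
    }
    return $f[$n];
}

// Example with a tiny hard-coded dictionary:
$dictionary = array_fill_keys(array('this', 'sentence', 'was', 'made', 'from', 'english', 'words'), true);
var_dump(isMadeOfWords('thissentencewasmadefromenglishwords', $dictionary)); // bool(true)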
I recommend a recursive approach. Something like this:
<?php
$wordsToCheck = array(
    'otherword',
    'word1andother',
    'word1',
    'word1word2',
    'word1word3',
    'word1word2word3'
);
$wordList = array(
    'word1',
    'word2',
    'word3'
);
$results = array();

function onlyListedWords($word, $wordList) {
    if (in_array($word, $wordList)) {
        return true;
    }
    $length = strlen($word);
    $part = '';
    for ($i = 0; $i < $length; $i++) {
        $part .= $word[$i];
        // If the prefix so far is a listed word, try to decompose the rest.
        // Keep looping so that longer prefixes are still tried if that fails.
        if (in_array($part, $wordList) && onlyListedWords(substr($word, $i + 1), $wordList)) {
            return true;
        }
    }
    return false;
}

foreach ($wordsToCheck as $word) {
    if (onlyListedWords($word, $wordList)) {
        $results[] = $word;
    }
}
var_dump($results);
?>

Implementing Cutting Stock Algorithm in PHP

I need to implement the Cutting Stock Problem with a php script.
As my math skills are not that great I am just trying to brute force it.
Starting with these parameters
$inventory is an array of lengths that are available to be cut.
$requestedPieces is an array of lengths that were requested by the
customer.
$solution is an empty array
I have currently worked out this recursive function to come up with all possible solutions:
function branch($inventory, $requestedPieces, $solution){
    // Loop through the requested pieces and find all inventory that can fulfill them
    foreach ($requestedPieces as $requestKey => $requestedPiece) {
        foreach ($inventory as $inventoryKey => $piece) {
            if ($requestedPiece <= $piece) {
                $solution2 = $solution;
                array_push($solution2, array($requestKey, $inventoryKey));
                $requestedPieces2 = $requestedPieces;
                unset($requestedPieces2[$requestKey]);
                $inventory2 = $inventory;
                $inventory2[$inventoryKey] = $piece - $requestedPiece;
                if (count($requestedPieces2) > 0) {
                    branch($inventory2, $requestedPieces2, $solution2);
                } else {
                    global $solutions;
                    array_push($solutions, $solution2);
                }
            }
        }
    }
}
The biggest inefficiency I have discovered with this is that it will find the same solution multiple times but with the steps in a different order.
For example:
$inventory = array(1.83, 20.66);
$requestedPieces = array(0.5, 0.25);
The function will come up with 8 solutions where it should come up with 4. What is a good way to resolve this?
This does not answer your question, but I thought it could be worth mentioning:
You have several other ways to solve your problem rather than brute-forcing it. The Wikipedia page on the topic is pretty thorough, but I'll just describe two other, simpler ideas. I will use the Wikipedia terminology for certain words, namely master for an inventory piece and cut for a requested piece. I will use set to denote a set of cuts pertaining to a given master.
The first one is based on the greedy algorithm and consists of filling a set with the largest available cut until no more cuts fit, repeating that same process for each master and yielding a set for each one of them.
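A minimal sketch of that greedy idea (not part of the original answer; it assumes $masters and $cuts are plain arrays of lengths):

// For each master, repeatedly take the largest remaining cut that still fits.
function greedyCut(array $masters, array $cuts) {
    rsort($cuts);                               // largest cuts first
    $sets = array();
    foreach ($masters as $masterKey => $remaining) {
        $set = array();
        foreach ($cuts as $cutKey => $cut) {
            if ($cut <= $remaining) {
                $set[] = $cut;
                $remaining -= $cut;
                unset($cuts[$cutKey]);          // this cut is used up
            }
        }
        $sets[$masterKey] = array('set' => $set, 'waste' => $remaining);
    }
    return array('sets' => $sets, 'unplaced' => array_values($cuts));
}

// Example: two masters, four requested cuts.
print_r(greedyCut(array(1.83, 20.66), array(0.5, 0.25, 1.0, 0.75)));

Greedy packing is fast but not always optimal, which is why the recursive variant described next tries to minimize the wasted length instead.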
The second one is more dynamic: it uses recursion (like yours) and looks for the best fit for the remaining length of the master and cuts at each step of the recursion, the goal being to minimize the wasted length when no more cuts can fit.
function branch($master, $cuts, $set){
    $goods = array_filter($cuts, function($v) use ($master) { return $v <= $master; });
    $res = array($master, $set, $cuts);
    if (empty($goods))
        return $res;
    $remaining = array_diff($cuts, $goods);
    foreach ($goods as $k => $g) {
        $t = $set;
        array_push($t, $g);
        $r = $remaining;
        $c = $goods;
        for ($i = 0; $i < $k; $i++)
            array_push($r, array_shift($c));
        array_shift($c);
        $t = branch($master - $g, $c, $t);
        array_walk($r, function($k, $v) use ($t) { array_push($t[2], $v); });
        if ($t[0] == 0) return $t;
        if ($t[0] < $res[0])
            $res = $t;
    }
    return $res;
}
The function above should give you the optimal set for a given master. It returns an array of 3 values:
the wasted length on master
the set
the remaining cuts
The parameters are
the master length,
the cuts to be performed (must be sorted in descending order),
the set of cuts already scheduled (a preexisting set, which would be empty for the first call for each master)
Caveats: it depends on the order of the masters; you could certainly write a function that tries all the relevant possibilities to find the best order of masters.

Counting plagiarism in PHP

Forgive me if this isn't a programming oriented question.
Let's say we have two sentences:
[1]=This is a test idea
[2]=This is an experimental idea
If I jumble up [1]
[1]= a This idea test is
Would this count as plagiarism? What sort of logic do I have to apply to detect plagiarism?
I'm not making a complex plagiarism service, just a rather simple one that can catch obvious plagiarism.
My logic is somewhat like this
<?php
$str1 = "This is a test idea.";
$str2 = "This is an experimental idea.";
echo "$str1<br>$str2<br>";
$str1Array = explode(" ", $str1);
$str2Array = explode(" ", $str2);
if (count($str1Array) > count($str2Array))
    $max = count($str1Array);
else
    $max = count($str2Array);
$word_seq = array();
$word_seq_history = array();
$c = 0;
$plag_count = 0;
for ($i = 0; $i < $max; $i++) {
    $lev = levenshtein($str1Array[$i], $str2Array[$i]); // check for an exact match
    if ($lev == 0) {
        $c += 1; // (exact match)
        //echo "<br>$c";
        $word = $str1Array[$i];
        array_push($word_seq, $word);
    } else {
        if ($lev != 0) {
            if ($c >= 2)
                $plag_count += count($word_seq);
            $current_seq = implode(" ", $word_seq);
            array_push($word_seq_history, $current_seq);
            echo $current_seq;
            $c = 0;
            $word_seq = array();
        }
    }
}
echo "plag_count:";
echo $plag_count;
echo "max:";
echo $max;
echo "<br>";
echo ($plag_count / $max) * 100;
?>
Output:
String 1: "This is a test idea."
String 2: "This is an experimental idea."
Words_Same:2 max:5
Plagiarism: 40%
Do I need to change it or is it fine the way it is?
What I would do to detect plagiarism in a very basic way is to first calibrate the system, i.e. start by doing a lot of comparisons with files which you're sure aren't plagiarized:
1) Compare a bunch of files with each other and measure the plagiarism rate with your function. Take out the words that are the most commonly used (say, the ones that drive your rate up to XX%; trial and error here), put these words in your database and give them a weight of 0. Run the comparison again without those words (with regular expressions you can filter them out) until you get below XX%, and give that next batch a weight of 1. And so on, until you reach a plagiarism rate of nearly zero.
2) Calculate the 'new' percentage as sum(weight of the words in your DB that appear in the text) / (total weight of all your words), giving words that do not already appear in your database a weight of 10. That is your rate.
3) Test it with plagiarized material; if the results aren't OK, change a few parameters (the weights).
I think this method, if used to check longer passages, will show a high level of correlation just because of common words, especially articles, prepositions, "be" verbs, and other common/overused words. If you're writing about a variety of subjects, be it code or Shakespeare, you're likely to run across jargon sets that are common to many genuinely unique papers. I think you may need to look at an alternate approach. Have you done any research into plagiarism and its detection?

How to parse a word/phrase with 2 words with dictionary database (in PHP)

I want to parse a sentence into words but some sentences have two words that can be combined into one and result in a different meaning.
For example:
Eminem is a hip hop star.
If I parse it by splitting the words by space I will get
Eminem
is
a
**hip**
**hop**
star
but I want something like this:
Eminem
is
a
**hip hop**
star
This is just an example; there might be some other word combinations listed as a word in a dictionary.
How can I parse this easily?
I have a dictionary in a MySQL database. Is there any API to do this?
No APIs that I know of. However, you could try the SQL LIKE clause.
$words = explode(' ', 'Eminem is a hip hop star');
$len = count($words);
$fixed = array();
for ($x = 0; $x < $len; $x++) {
    // LIKE 'hip %' will match "hip hop"
    $q = mysql_query("SELECT word FROM dict WHERE word LIKE '" . $words[$x] . " %'");
    // Combine current and next word (if there is one)
    $combined = isset($words[$x + 1]) ? $words[$x] . ' ' . $words[$x + 1] : null;
    $matched = false;
    while ($result = mysql_fetch_array($q)) {
        if ($result['word'] == $combined) { // Combined word is in the dictionary
            $fixed[] = $combined;
            $x++;               // skip the word we just merged
            $matched = true;
            break;
        }
    }
    if (!$matched) {            // No combined entry; keep the single word
        $fixed[] = $words[$x];
    }
}
*Please excuse my lack of PDO. I'm lazy right now.
EDIT: I've done some thinking. While the code above isn't optimal, the optimized version I've come up with probably can't do much better. The fact of the matter is that regardless of how you approach the problem, you will need to compare every word in your input sentence to your dictionary and perform additional computations. I see two approaches you can take, depending on your hardware limits.
Both of these methods assume a dict table with (example) structure:
+--+-----+------+
|id|first|second|
+--+-----+------+
|01|hip |hop |
+--+-----+------+
|02|grade|school|
+--+-----+------+
Option 1: Your webserver has lots of available RAM (and a decent processor)
The idea here is to completely bypass the database layer by caching the dictionary in PHP's memory (with APC or memcache, the latter if you plan to run on several servers). This will place all the load on your webserver; however, it could be significantly faster, since accessing cached data from RAM is much faster than querying your DB.
(Again, I've left out PDO and Sanitization for simplicity's sake)
// Step One: Cache the dictionary...the entire dictionary
// This could be run on server start-up or before every user input
if (!apc_exists('words')) {
    $words = array();
    $q = mysql_query('SELECT first, second FROM dict');
    while ($res = mysql_fetch_row($q)) { // numeric keys: [first, second]
        $words[] = $res;
    }
    apc_store('words', serialize($words)); // You could use memcache if you want
}

// Step Two: Compare cached dictionary to user input
$data = explode(' ', 'Eminem is a hip hop star');
$words = unserialize(apc_fetch('words'));
$count = count($data);
for ($x = 0; $x < $count; $x++) { // Simpler to use a for loop
    foreach ($words as $word) { // Match against each dictionary pair
        if (isset($data[$x + 1]) && $data[$x] == $word[0] && $data[$x + 1] == $word[1]) {
            $data[$x] .= ' ' . $word[1];    // fuse the pair into one element
            array_splice($data, $x + 1, 1); // drop the now-redundant next word
            $count--;
        }
    }
}
Option 2: Fast SQL Server
The second option involves querying each of the words in the input text from the SQL server. For example, for the sentence "Eminem is hip hop" you would create a query that looked like SELECT * FROM dict WHERE (first = 'Eminem' && second = 'is') || (first = 'is' && second = 'hip') || (first = 'hip' && second = 'hop'). Then to fix the array of words you would simply loop through MySQL's results and fuse the appropriate words together. If you are willing to take this route, it might be more efficient to cache commonly used words and fix them before querying the database. This way you can eliminate conditions from your query.
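A rough sketch of Option 2 (not from the original answer; it assumes the same dict(first, second) table and an existing PDO connection in $pdo):

$data = explode(' ', 'Eminem is a hip hop star');

// Build one query covering every adjacent word pair.
$conditions = array();
$params = array();
for ($x = 0; $x < count($data) - 1; $x++) {
    $conditions[] = '(first = ? AND second = ?)';
    $params[] = $data[$x];
    $params[] = $data[$x + 1];
}
$stmt = $pdo->prepare('SELECT first, second FROM dict WHERE ' . implode(' OR ', $conditions));
$stmt->execute($params);

// Fuse each matched pair back into the word list.
foreach ($stmt->fetchAll(PDO::FETCH_NUM) as $pair) {
    foreach ($data as $i => $word) {
        if ($word == $pair[0] && isset($data[$i + 1]) && $data[$i + 1] == $pair[1]) {
            $data[$i] = $pair[0] . ' ' . $pair[1];
            array_splice($data, $i + 1, 1);
            break;
        }
    }
}
print_r($data); // Eminem / is / a / hip hop / star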
