I am writing software to compare articles and am looking for an efficient, accurate algorithm to calculate the difference (variation) between two articles. The variation should depend entirely on words, not letters. I have tried levenshtein(), but its time complexity of O(n*m) is very expensive when applied to big texts such as articles. I have also tried similar_text(), which is even more expensive at roughly O(n*m*3). Moreover, levenshtein() and similar_text() calculate the number of operations needed to transform one string into another, which is not an accurate way to measure the difference between two big articles.
What other options do I have?
EDIT:
I am trying to calculate the variation approximately from the point of view of a search engine (Google).
PostgreSQL uses a tsvector for its full-text search feature. Maybe that's something that could come in quite handy for you.
If you can define how to measure text similarity based on words, you are halfway there. For example, you could count the occurrences of each word in both articles and then compute a simple difference of the two lists. However, this does not capture similarity of meaning.
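A minimal sketch of that word-count idea, assuming the splitting regex, the lower-casing and the function name word_count_difference are just illustrative choices (it returns a raw count of differing word occurrences, not a percentage):

function word_count_difference($a, $b)
{
    // Count each word in both articles (lower-cased, split on non-word characters)
    $countA = array_count_values(preg_split('#\W+#u', mb_strtolower($a), -1, PREG_SPLIT_NO_EMPTY));
    $countB = array_count_values(preg_split('#\W+#u', mb_strtolower($b), -1, PREG_SPLIT_NO_EMPTY));

    $diff = 0;
    // Sum, over every word seen in either article, how much its counts differ
    foreach (array_unique(array_merge(array_keys($countA), array_keys($countB))) as $word) {
        $countInA = isset($countA[$word]) ? $countA[$word] : 0;
        $countInB = isset($countB[$word]) ? $countB[$word] : 0;
        $diff += abs($countInA - $countInB);
    }
    return $diff;
}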
If you have a database, use their fulltext features. As mentioned before, PostGres offers such a feature. I work with MSSQL and you could simply call the FREETEXT function which will calculate a 'rank' indicating how similar two texts are.
I highly recommend using a mature product, instead of trying to write your own.
There is no direct way to compare two articles with these functions: levenshtein() and similar_text() are designed to compare two words, not whole articles.
The simplest algorithm is to explode your articles into words, compute a word-by-word similarity, and do some math on top of it depending on your task, like this:
// not tested!
function similar_articles($articleA, $articleB) {
    // Unique word lists for each article
    $wordsA = array_unique(preg_split('#[\W]+#', $articleA));
    $wordsB = array_unique(preg_split('#[\W]+#', $articleB));
    $resultSimilarity = 0;
    foreach ($wordsA as $wordA) {
        // Find the best-matching word in article B for this word
        $wordSimilarity = 0;
        foreach ($wordsB as $wordB) {
            similar_text($wordA, $wordB, $percent);
            $wordSimilarity = max($wordSimilarity, $percent);
        }
        $resultSimilarity += $wordSimilarity;
    }
    return $resultSimilarity / count($wordsA);
}
Note: similar_articles($articleA, $articleB) != similar_articles($articleB, $articleA), because similar_text($wordA, $wordB) != similar_text($wordB, $wordA).
One simple method of calculating a kind of distance is to compare references. Another is to select some keywords against a dictionary and calculate the distance in terms of social relevance.
Also, if you want to use Levenshtein distance, take a look at stringmetric.
In my case, I needed to calculate the variation between two articles, and I found that this very simple solution works well for me. It calculates the similarity as the number of words common to both articles divided by max(number of words in article A, number of words in article B). The variation is then obtained by subtracting the similarity from 100 to get a variation percentage. The code below explains it all.
function get_variation($article1, $article2) {
    // Unique word lists for each article
    $wordsA = array_unique(preg_split('#[\W]+#', $article1));
    $wordsB = array_unique(preg_split('#[\W]+#', $article2));
    // Words that appear in both articles
    $intersection = array_intersect($wordsA, $wordsB);
    $similarity = (count($intersection) / max(count($wordsA), count($wordsB))) * 100;
    $similarity = number_format($similarity, 2, '.', '');
    $variation = 100 - $similarity;
    return $variation;
}
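Quick usage example (the sample strings are made up for illustration):

$a = 'The quick brown fox jumps over the lazy dog';
$b = 'The quick brown fox walks past the lazy dog';
echo get_variation($a, $b); // prints 22.22 (7 of the 9 unique words are shared)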
Related
I have big database of articles and I'd like before adding new items to DB check if already similar items exist and if so - group them together, so that later I can easily display them as a group of similar items.
Currently we use PHP's similar_text(), which is very simple but surprisingly precise and fully satisfies our needs. The problem is that before we add an item to the DB, we first need to pull X items from the DB and then loop through every single one to check whether our new item is at least 75% similar to them, in order to group them together. This uses a lot of resources and time that we don't really have.
We use MySQL and Solr for all our queries. I've tried MySQL full-text search and Solr's MoreLikeThis. Compared to PHP's implementation they are super fast and efficient, but I just can't get a robust percentage score like the one similar_text() provides. It is crucial for our grouping to be accurate.
For example using this MySQL query:
SELECT id, body, ROUND(((MATCH(body) AGAINST ('ARTICLE TEXT')) / scores.max_score) * 100) as relevance
FROM natural_text_test,
(SELECT MAX(MATCH(body) AGAINST('ARTICLE TEXT')) as max_score FROM natural_text_test LIMIT 1) scores
HAVING relevance > 75
ORDER BY relevance DESC
I get that an article with 130 words is 85% similar to another article with 4700 words. In comparison, PHP's similar_text() returns only a 3% similarity score, which is well below our threshold and correct in our case.
I've also looked into Levenshtein distance algorithm, but it seems that the same problem as with MySQL and Solr arises.
There has to be a better way to handle similarity checks, maybe I'm using the algorithms incorrectly?
Based on some of the Comments, I might propose this...
It seems that 75%-similar documents would have a lot of the same sentences in the same order.
Break the doc into sentences
Take a crude hash of each sentence and map it to a visible ASCII character. This gives you a string that is perhaps 1/100th the size of the original doc.
Store that with the doc.
When searching, use levenshtein() on this string to find 'similar' documents.
Sure, hashing is imperfect, etc. But this is fast. And you could apply some other technique to double-check the few docs that are close.
For a hash, I might do
$md5  = md5($sentence);
$x    = hexdec(substr($md5, 0, 2)) % 64;   // crude: take 6 bits from the hex string
$hash = chr(ord('0') + $x);                // map them to a visible ASCII character
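Putting it together, a rough sketch of that signature idea (doc_signature, $docA and $docB are my own names, and the sentence-splitting regex is deliberately naive):

// Build a short per-document "signature": one visible ASCII char per sentence.
function doc_signature($text) {
    $sentences = preg_split('#(?<=[.!?])\s+#', $text, -1, PREG_SPLIT_NO_EMPTY);
    $signature = '';
    foreach ($sentences as $sentence) {
        $x = hexdec(substr(md5(trim($sentence)), 0, 2)) % 64;  // crude 6-bit hash
        $signature .= chr(ord('0') + $x);
    }
    return $signature;  // store this alongside the document
}

// When searching, compare the short signatures instead of the full texts.
$distance = levenshtein(doc_signature($docA), doc_signature($docB));

Note that older PHP versions limit levenshtein() to 255-character inputs, so very long documents may need their signatures chunked.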
I will be happy to get some help. I have the following problem:
I'm given a list of numbers and a target number.
subset_sum([11.96,1,15.04,7.8,20,10,11.13,9,11,1.07,8.04,9], 20)
I need to find an algorithm that will find all the numbers that, combined, sum to the target number, e.g. 20.
First, find all values equal to 20.
Next, for example, the best combinations here are:
11.96 + 8.04
1 + 10 + 9
11.13 + 7.8 + 1.07
9 + 11
Remaining value: 15.04.
I need an algorithm that uses each value only once and can use anywhere from 1 to n values to sum to the target number.
I tried some recursion in PHP but it runs out of memory really fast (50k values), so a solution in Python would help (time/memory wise).
I'd be glad for some guidance here.
One possible solution is this: Finding all possible combinations of numbers to reach a given sum
The only difference is that I need to flag elements already used so they won't be used twice, which also reduces the number of possible combinations.
Thanks for anyone willing to help.
There are many ways to think about this problem.
If you use recursion, make sure to identify your end cases first, then proceed with the rest of the program.
This is the first thing that comes to mind.
<?php
subset_sum([11.96,1,15.04,7.8,20,10,11.13,9,11,1.07,8.04,9], 20);

function subset_sum($a, $s, $c = array())
{
    if ($s < 0)                       // overshot the target: dead end
        return;
    if ($s != 0 && count($a) == 0)    // ran out of numbers without reaching the target
        return;
    if ($s != 0) {
        foreach ($a as $xd => $xdd) {
            unset($a[$xd]);           // each value may be used only once per branch
            subset_sum($a, $s - $xdd, array_merge($c, array($xdd)));
        }
    } else {
        print_r($c);                  // found a combination that sums to the target
    }
}
?>
This is a possible solution, but it's not pretty:
import itertools
import operator
from functools import reduce
def subset_num(array, num):
    subsets = reduce(operator.add, [list(itertools.combinations(array, r)) for r in range(1, 1 + len(array))])
    return [subset for subset in subsets if sum(subset) == num]
print(subset_num([11.96,1,15.04,7.8,20,10,11.13,9,11,1.07,8.04,9], 20))
Output:
[(20,), (11.96, 8.04), (9, 11), (11, 9), (1, 10, 9), (1, 10, 9), (7.8, 11.13, 1.07)]
DISCLAIMER: this is not a full solution, it is a way to just help you build the possible subsets. It does not help you to pick which ones go together (without using the same item more than once and getting the lowest remainder).
Using dynamic programming you can build all the subsets that add up to the given sum, then you will need to go through them and find which combination of subsets is best for you.
To build this archive you can (I'm assuming we're dealing with non-negative numbers only) put the items in a column and go from top to bottom; for each item you compute all the subsets that add up to the given sum or to something smaller, using only items at or above the position you are currently looking at. When you build a subset you put in its node both the sum of the subset (which may be the given sum or smaller) and the items it includes. So in order to compute the subsets for item [i] you only need to look at the subsets you've created for item [i-1]. For each of them there are 3 options:
1) The subset's sum is exactly the given sum ---> keep the subset as it is and move on to the next one.
2) The subset's sum is smaller than the given sum, but adding item [i] would push it over ---> keep the subset as it is and move on to the next one.
3) The subset's sum is smaller than the given sum and would still be smaller than or equal to it with item [i] added ---> keep one copy of the subset as it is and create another one with item [i] added to it (both as a member and added to the subset's sum).
When you're done with the last item (item [n]), look at the subsets you've created - each one has its sum in its node and you can see which ones are equal to the given sum (and which ones are smaller - you don't need those anymore).
As I wrote at the beginning - now you need to figure out how to take the best combination of subsets that do not have a shared member between any of them.
Basically you're left with a problem that resembles the classic knapsack problem but with another limitation (not every stone can be taken with every other stone). Maybe the limitation actually helps, I'm not sure.
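For illustration, a rough PHP sketch of that column-by-column construction (build_subsets is my own name; non-negative numbers are assumed and float sums are compared with a small tolerance; it keeps every subset whose running sum stays at or below the target, so it can still grow large):

function build_subsets($numbers, $target)
{
    $subsets = array(array('items' => array(), 'sum' => 0.0)); // start with the empty subset
    foreach ($numbers as $n) {
        $next = array();
        foreach ($subsets as $s) {
            $next[] = $s;                                      // option: keep the subset as it is
            if ($s['sum'] + $n <= $target + 1e-9) {
                // option: also keep a copy with item $n added (as a member and in the sum)
                $next[] = array(
                    'items' => array_merge($s['items'], array($n)),
                    'sum'   => $s['sum'] + $n,
                );
            }
        }
        $subsets = $next;
    }
    // Keep only the subsets whose sum equals the given sum
    return array_values(array_filter($subsets, function ($s) use ($target) {
        return abs($s['sum'] - $target) < 1e-9;
    }));
}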
A bit more about the advantage of dynamic programming in this case
The basic idea of dynamic programming, as opposed to recursion, is to trade redundant computation for memory. By that I mean that recursion on a complex problem (typically a backtracking, knapsack-like problem, as we have here) usually ends up calculating the same thing a fair number of times, because the different branches of the calculation have no knowledge of each other's operations and results. Dynamic programming saves those results and uses them along the way to build "bigger" results from the previous, "smaller" ones. Because the use of the stack is much more straightforward than in recursion, you don't get the memory problem recursion has with maintaining the function's state, but you do need to handle a great deal of stored results (which can sometimes be optimised).
For example, in our problem, when trying to build a subset that adds up to the required sum, the branch that starts with item A and the branch that starts with item B know nothing of each other's work. Let's assume items C and D together add up to the sum, that either of them added alone to A or B would not exceed the sum, and that A doesn't go with B in the solution (e.g. sum = 10, A = B = 4, C = D = 5, and there is no subset that sums to 2, so A and B can't be in the same group). The branch figuring out A's group would (after trying and rejecting B) add C (A + C = 9) and then add D, at which point it would reject this group and backtrack (A + C + D = 14 > sum = 10). The same would happen to B (since A = B), because the branch figuring out B's group has no information about what just happened to the branch dealing with A. So we've effectively calculated C + D twice without even using it yet (and we're about to calculate it a third time to realise C and D belong in a group of their own).
NOTE:
Looking around while writing this answer I came across a technique I was not familiar with which might be a better fit for you: memoization. From Wikipedia:
memoization is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again.
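As a tiny illustration in PHP (not a full solution; can_reach_sum and the 2-decimal cache keys are my own assumptions), memoization just means caching results keyed by the call's inputs:

// A minimal sketch of memoization applied to a subset-sum feasibility check.
function can_reach_sum($numbers, $target, $i = 0, &$memo = array())
{
    if (abs($target) < 1e-9) {
        return true;                                // exact sum reached
    }
    if ($target < -1e-9 || $i >= count($numbers)) {
        return false;                               // overshot the target or ran out of numbers
    }
    $key = $i . ':' . number_format($target, 2, '.', '');
    if (isset($memo[$key])) {
        return $memo[$key];                         // reuse a previously computed result
    }
    // Either skip $numbers[$i] or include it, and cache the answer.
    $memo[$key] = can_reach_sum($numbers, $target, $i + 1, $memo)
        || can_reach_sum($numbers, $target - $numbers[$i], $i + 1, $memo);
    return $memo[$key];
}

var_dump(can_reach_sum(array(11.96, 1, 15.04, 7.8, 20, 10, 11.13, 9, 11, 1.07, 8.04, 9), 20)); // bool(true)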
So I have a possible solution:
from collections import Counter


# compute the difference between 2 lists but keep duplicates
def list_difference(a, b):
    count = Counter(a)      # count items in a
    count.subtract(b)       # subtract items that are in b
    diff = []
    for x in a:
        if count[x] > 0:
            count[x] -= 1
            diff.append(x)
    return diff


# return a combination of numbers that matches the target
def subset_sum(numbers, target, partial=[]):
    s = sum(partial)
    # check if the partial sum equals the target
    if s == target:
        print("--------------------------------------------sum_is(%s)=%s" % (partial, target))
        return partial
    if s >= target:
        return  # if we passed the target, why bother to continue
    for i in range(len(numbers)):
        n = numbers[i]
        remaining = numbers[i + 1:]
        rest = subset_sum(remaining, target, partial + [n])
        if type(rest) is list:
            return rest


# repeat until the remaining sum is <= target or nothing changes
def repeatUntil(subset, target):
    currSubset = []
    while sum(subset) > target and currSubset != subset:
        diff = subset_sum(subset, target)
        currSubset = subset
        subset = list_difference(subset, diff)
    return subset
Output:
--------------------------------------------sum_is([11.96, 8.04])=20
--------------------------------------------sum_is([1, 10, 9])=20
--------------------------------------------sum_is([7.8, 11.13, 1.07])=20
--------------------------------------------sum_is([20])=20
--------------------------------------------sum_is([9, 11])=20
[15.04]
Unfortunately this solution only works for a small list. For a big list I'm still trying to break the list into small chunks and calculate, but the answer is not quite correct. You can see it in a new thread here:
Finding unique combinations of numbers to reach a given sum
I have a task where I have three arrays A, B, C. All of them contain the same data. For the sake of simplicity let's assume the data is the numbers 1 to 5. The data would be in different jumbled sequences. I want to find out which of B and C has data most similar to A.
Eg:
A = 1,2,3,4,5
B = 1,2,3,5,4
C = 4,1,2,3,5
In this case, it is easy to visually comprehend that B is more similar to A. But it gets more complicated for really jumbled sequences.
Eg:
A = 1,2,3,4,5
B = 5,3,1,4,2
C = 4,1,2,3,5
In this case, I would assume C to be closer to A. I am thinking this assumption can be quantified as: how many elements have the same sequence in both arrays? In the above example, the subsequence [1,2,3] is the same in both arrays. The second question would be: what is the offset difference between the matching subsequences? In this case it is 1, because the subsequence begins at index 0 in A and index 1 in C.
So the number of elements in a matching sequence and their offsets are what I am thinking of using. I plan on weighting these two quantities (the number of elements in a matching sequence, and the offset difference of their occurrence).
Does this make sense? I only need a rough approximation of similarity and the results do not need to be exact. Are there any formal mathematical or data-structure models that solve this problem?
BTW, the project where I need this implemented is in PHP. Does it have any built-in functions, like levenshtein() for string difference?
Any suggestions are very welcome!
Well, I suppose you could come up with your own algorithm (for instance, generate all suffixes, search for them, and then define a scoring procedure), or you could use a well-known algorithm such as
Smith-Waterman for local alignment or Needleman-Wunsch for global alignment. The advantage of these algorithms is that they are well understood and give you all the possible alignments (so you can choose the best one for your case).
NW in PHP
SW in PHP
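If full alignment is overkill, here is a rough PHP sketch of a related idea: scoring by the length of the longest common subsequence, which rewards elements appearing in the same relative order (lcs_length is my own name, not part of the linked libraries):

function lcs_length(array $a, array $b)
{
    $n = count($a);
    $m = count($b);
    // $dp[$i][$j] = LCS length of the first $i elements of $a and first $j of $b
    $dp = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = 1; $i <= $n; $i++) {
        for ($j = 1; $j <= $m; $j++) {
            $dp[$i][$j] = ($a[$i - 1] === $b[$j - 1])
                ? $dp[$i - 1][$j - 1] + 1
                : max($dp[$i - 1][$j], $dp[$i][$j - 1]);
        }
    }
    return $dp[$n][$m];
}

$A = array(1, 2, 3, 4, 5);
$B = array(5, 3, 1, 4, 2);
$C = array(4, 1, 2, 3, 5);
echo lcs_length($A, $B), "\n"; // 2  (e.g. [3,4])
echo lcs_length($A, $C), "\n"; // 4  ([1,2,3,5]) -> C is closer to A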
I am interested in using this ranking class, based on an article by Evan Miller, to rank a table I have that has upvotes and downvotes. I have a system very similar to Stack Overflow's up/down voting for an events site I am working on, and by using this ranking class I feel the results will be more accurate. My question is: how do I order by the function 'hotness'?
private function _hotness($upvotes = 0, $downvotes = 0, $posted = 0) {
    $s = $this->_score($upvotes, $downvotes);
    $order = log(max(abs($s), 1), 10);
    if ($s > 0) {
        $sign = 1;
    } elseif ($s < 0) {
        $sign = -1;
    } else {
        $sign = 0;
    }
    $seconds = $posted - 1134028003;
    return round($order + (($sign * $seconds) / 45000), 7);
}
I suppose that each time a user votes I could have a column in my table where the hotness is recalculated for the new vote, and order by that column on the main page. But I am interested in doing this more on the fly, incorporating the function above, and I am not sure whether that is possible.
From Evan Miller, he uses:
SELECT widget_id, ((positive + 1.9208) / (positive + negative) -
1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) /
(positive + negative)) / (1 + 3.8416 / (positive + negative))
AS ci_lower_bound FROM widgets WHERE positive + negative > 0
ORDER BY ci_lower_bound DESC;
But I would rather not do this calculation in the SQL, as I feel it is messy and difficult to change down the line if I use this code on multiple pages, etc.
Accessing the corresponding "Posts" table for anything (reading, writing, sorting, comparing, etc.) is extremely quick and thus relying on the database is the "most on-the-fly" alternative you have for non-temporary data storage (memory/sessions are still quicker but, logically, cannot be used to store this information).
You should be more worried about building a good ranking algorithm delivering the results you want (you are proposing two different systems, delivering different results) and working on making the whole code and the code-database communication as efficient as possible.
In principle, small pieces of code issuing simple, iterative queries offer the quickest and most reliable solution for this kind of situation. For example (see the sketch after this list):
1) A ranking function (like the first one you propose, or any other built on the ranking rules you want) called every time a vote is cast. It writes to the corresponding column(s) in the "Posts" table (the simpler the query, the better; you can create a ranking system as complex as you wish, but try to rely on PHP rather than on queries).
2) Every time a comparison between posts is required, the "Posts" table is read with a simple SELECT ordering the records by ranking (you can have various "assessing columns" (e.g., up-votes, down-votes, further considerations), but it is better to have one with the definitive ranking).
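A rough sketch of point 1). The table and column names here (posts, upvotes, downvotes, posted, hotness) are hypothetical, and hotness() stands in for a public wrapper around your _hotness() method:

function store_hotness(PDO $pdo, $ranking, $postId)
{
    // Read the current vote counts for this post
    $stmt = $pdo->prepare('SELECT upvotes, downvotes, posted FROM posts WHERE id = ?');
    $stmt->execute(array($postId));
    $post = $stmt->fetch(PDO::FETCH_ASSOC);

    // Recompute the score in PHP and persist it once per vote
    $hotness = $ranking->hotness($post['upvotes'], $post['downvotes'], $post['posted']);

    $pdo->prepare('UPDATE posts SET hotness = ? WHERE id = ?')
        ->execute(array($hotness, $postId));
}

// The listing page then only needs a trivial query:
// SELECT id, title FROM posts ORDER BY hotness DESC LIMIT 30;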
You are right, a query like this is rather messy and expensive as well.
Mixed PHP/MySQL on the fly is a bad idea too, as you would have to select the values for all posts, calculate their hotness, and only then select a list of the hottest ones. Extremely expensive.
You should consider saving at least part of your calculation to the database. The order component should definitely go into the database. It's always better to calculate something once on every save/update than to recalculate it every time it is displayed. Try benchmarking how much time you save by calculating the order on save/update instead of every time you compute the hotness. The good thing is that the order never changes unless someone upvotes or downvotes, which you save to the DB anyway; the same goes for the sign.
Even if you save the sign to the DB, you still cannot avoid some on-the-fly calculation because of the posted timestamp parameter.
I would check what difference it makes and where, and recalculate hotness with a CLI script every x amount of time for the posts where freshness is crucial, and every y amount of time where it makes less of a difference.
Taking this approach you will be recalculating hotness only when necessary. This will make your application much more efficient.
I am not sure whether it is possible with your DB and schema, but have you considered writing a UDF for custom sorting?
A post from stackoverflow talks about how to do this here.
I have 30,000 rows in a database that need to be similarity checked (using similar_text or another such function).
Doing this will require 30,000^2 checks per column.
I estimate I will be checking on average 4 columns.
This means I will have to do 3,600,000,000 checks.
What is the best (fastest and most reliable) way to do this with PHP, bearing in mind request memory limits, time limits, etc.?
The server needs to keep actively serving web pages while doing this.
PS. The server we are using is an 8 core Xeon 32 GB ram.
Edit:
The size of each column is normally less that 50 characters.
I guess you just need FULL TEXT search.
If that doesn't fit your needs, you have only one way to solve this: cache the results.
That way you won't have to run 3 billion comparisons for each request.
Anyway here how you can do it:
$result = array();
$sql = "SELECT * FROM TABLE";
while ($row = ...) {
    $result[] = $row; //> Append the current record
}
Now $result contains all the rows from your table.
At this point you said you want to similar_text() all columns with each other.
To do that and cache the results you need at least a table (as I said in the comment).
//> Start calculating the similarity
foreach ($result as $k => $v) {
    foreach ($result as $k2 => $v2) {
        //> At this point you have 2 rows, $v and $v2, containing your columns
        $similarity = 0;
        $similarity += levenshtein($v['column1'], $v2['column1']);
        $similarity += levenshtein($v['column2'], $v2['column2']);
        //> Whatever comparison you need here between columns

        //> Now you can finally store the result by inserting $similarity into a table
        "INSERT DELAYED INTO similarity (value) VALUES ('$similarity')";
    }
}
2 things you have to notice:
I used levenshtein() because it's much faster than similar_text() (note that its value is the opposite of similar_text()'s: the greater the value levenshtein() returns, the lower the affinity between the strings).
I used INSERT DELAYED to greatly lower the database cost.
oy... similar_text() is O(n^3) !
Do you really need a percentage similarity for each comparison, or can you just do a quick compare of the first/middle/last X bytes of the strings to narrow the field?
If you're just looking for dups, say, you can probably narrow down the number of comparisons you need to do, and that will be the most effective tack IMHO.
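For example, a rough sketch of that narrowing idea (the length and prefix thresholds are arbitrary placeholders, and the function names are my own):

// Cheap pre-filter: only rows passing this check get the expensive comparison.
function probably_similar($a, $b)
{
    if (abs(strlen($a) - strlen($b)) > 10) {
        return false;                         // lengths too different to bother
    }
    return substr($a, 0, 8) === substr($b, 0, 8);  // cheap prefix check
}

function similarity_percent($a, $b)
{
    if (!probably_similar($a, $b)) {
        return 0.0;                           // skip the O(n^3) call entirely
    }
    similar_text($a, $b, $percent);           // only run it for the few survivors
    return $percent;
}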