Similar text percentage in php - php

Finding the similarity percentage between two strings in PHP seems easy; I just use
int similar_text ( string $first , string $second [, float &$percent ] )
but suppose I have two strings, for example:
1- Sponsors back away from Sharapova after failed drug test
2- Maria Sharapova failed drugs test at Australian Open
With similar_text I got 53.7%, but that doesn't make sense: both strings are about a "failed drug test" for "Sharapova", so the percentage should be higher than 53.7%.
My question is: is there any way to find the real similarity percentage between two strings?

I have implemented several algorithms that will search for duplicates and they can be quite similar.
The approach I am usually using is the following:
1) normalize the strings
2) use a comparison algorithm (e.g. similar_text, levenshtein, etc.)
It appears to me that implementing step 1) well will improve your results drastically.
Example of normalization algorithm (I use "Sponsors back away from Sharapova after failed drug test" for the details):
1) lowercase the string
-> "sponsors back away from sharapova after failed drug test"
2) explode string in words
-> [sponsors, back, away, from, sharapova, after, failed, drug, test]
3) remove noisy words (like prepositions, e.g. in, for, that, this, etc.). This step can be customized to your needs
-> [sponsors, sharapova, failed, drug, test]
4) sort the array alphabetically (optional, but this can help implementing the algorithm...)
-> [drug, failed, sharapova, sponsors, test]
Applying the very same algorithm to your other string, you would obtain:
[australian, drugs, failed, maria, open, sharapova, test]
This will help you elaborate a clever algorithm. For example:
1) for each word in the first string, search the highest similarity in the words of the second string
2) accumulate the highest similarity
3) divide the accumulated similarity by the number of words
$words1 = ['drug', 'failed', 'sharapova', 'sponsors', 'test'];
$words2 = ['australian', 'drugs', 'failed', 'maria', 'open', 'sharapova', 'test'];
$nbWords1 = count($words1);
$stringSimilarity = 0;
foreach ($words1 as $word1) {
    $max = 0;        // best similarity found so far for $word1
    $similarity = 0; // filled in by similar_text() as a percentage
    foreach ($words2 as $word2) {
        similar_text($word1, $word2, $similarity);
        if ($similarity > $max) { //1)
            $max = $similarity;
        }
    }
    $stringSimilarity += $max; //2)
}
var_dump(($stringSimilarity / $nbWords1)); //3)
Running this code gives 84.83660130719. Not bad, I think ^^. I am sure this algorithm can be refined further, but it is a good start... Also, here we are basically computing the average of the best similarity percentage for each word; you may want a different final approach... tune it to your needs ;-)
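The normalization steps and the averaging loop above can be combined into two small functions. This is only a sketch: the names normalizeWords and wordSimilarity and the STOP_WORDS list are hypothetical, and the stop-word list is chosen just so that the two example headlines reduce to the word lists shown above.

```php
<?php
// Hypothetical stop-word list; extend it for your own data.
const STOP_WORDS = ['a', 'an', 'the', 'at', 'in', 'for', 'from', 'after', 'back', 'away', 'that', 'this'];

function normalizeWords(string $text): array
{
    // Steps 1) and 2): lowercase, then split on anything that is not a letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    // Step 3): drop noisy words.
    $words = array_values(array_diff($words, STOP_WORDS));
    // Step 4): sort alphabetically.
    sort($words);
    return $words;
}

function wordSimilarity(array $words1, array $words2): float
{
    if ($words1 === [] || $words2 === []) {
        return 0.0;
    }
    $total = 0.0;
    foreach ($words1 as $word1) {
        $best = 0.0;
        foreach ($words2 as $word2) {
            similar_text($word1, $word2, $percent);
            $best = max($best, $percent);
        }
        $total += $best; // accumulate the best match for this word
    }
    return $total / count($words1); // average over the first string's words
}

$a = normalizeWords('Sponsors back away from Sharapova after failed drug test');
$b = normalizeWords('Maria Sharapova failed drugs test at Australian Open');
var_dump(wordSimilarity($a, $b));
```

Note that the measure is not symmetric: swapping the two strings averages over a different word list, so you may want to compute both directions and keep the smaller or the mean.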

Related

PHP built in functions complexity (isAnagramOfPalindrome function)

I've been googling for the past 2 hours, and I cannot find a list of php built in functions time and space complexity. I have the isAnagramOfPalindrome problem to solve with the following maximum allowed complexity:
expected worst-case time complexity is O(N)
expected worst-case space complexity is O(1) (not counting the storage required for input arguments).
where N is the input string length. Here is my simplest solution, but I don't know if it is within the complexity limits.
class Solution {
// Function to determine if the input string can make a palindrome by rearranging it
static public function isAnagramOfPalindrome($S) {
// here I am counting how many characters have odd number of occurrences
$odds = count(array_filter(count_chars($S, 1), function($var) {
return($var & 1);
}));
// If the string length is odd, then a palindrome would have 1 character with odd number occurrences
// If the string length is even, all characters should have even number of occurrences
return (int)($odds == (strlen($S) & 1));
}
}
echo Solution :: isAnagramOfPalindrome($_POST['input']);
Anyone have an idea where to find this kind of information?
EDIT
I found out that array_filter has O(N) complexity, and count has O(1) complexity. Now I need to find info on count_chars, but a full list would be very convenient for future problems.
EDIT 2
After some research on space and time complexity in general, I found out that this code has O(N) time complexity and O(1) space complexity because:
count_chars loops N times (the full length of the input string), which gives a starting complexity of O(N). It generates an array with a bounded number of entries (at most 256, one per possible byte value; 26 if the input only contains lowercase letters), and the filter is then applied to this array, so it loops a bounded number of times. As the input length grows towards infinity, this loop is insignificant and counts as a constant. count is also applied to this bounded array, and its complexity is O(1) anyway. Hence, the time complexity of the algorithm is O(N).
The same goes for space complexity. When calculating space complexity, we do not count the input, only the objects generated in the process. These objects are the bounded array and the count variable, and both are treated as constants because their size cannot grow beyond this limit, no matter how big the input is. So we can say that the algorithm has a space complexity of O(1).
Anyway, that list would be still valuable so we do not have to look inside the php source code. :)
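To make the complexity argument above easier to check without knowing how count_chars is implemented, here is a sketch that does the counting by hand. The function name isAnagramOfPalindromeSketch is hypothetical. It assumes single-byte characters, makes one pass over the input (O(N) time) and keeps a fixed table of 256 parity flags (O(1) extra space).

```php
<?php
function isAnagramOfPalindromeSketch(string $s): bool
{
    // One parity flag per possible byte value: fixed size, independent of input length.
    $odd = array_fill(0, 256, false);
    $oddCount = 0;
    $length = strlen($s);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($s[$i]);
        $odd[$byte] = !$odd[$byte];
        // Keep a running total of how many byte values currently have an odd count.
        $oddCount += $odd[$byte] ? 1 : -1;
    }
    // A palindrome allows at most one character with an odd number of occurrences.
    return $oddCount <= 1;
}

var_dump(isAnagramOfPalindromeSketch('aabb'));
```

Checking for "at most one odd count" is equivalent to comparing against the parity of the string length, because the number of odd counts always has the same parity as the length.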
A probable reason for not including this information is that it is likely to change per release, as improvements and optimizations for the general case are made.
PHP is built on C, and some of its functions are simply wrappers around their C counterparts. Take hypot, for example: a Google search, a look at man hypot, or the docs for the math library tell you nothing about its complexity:
http://www.gnu.org/software/libc/manual/html_node/Exponents-and-Logarithms.html#Exponents-and-Logarithms
The source actually provides no better info
https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/math/w_hypot.c (Not official, Just easy to link to)
Not to mention, this is only glibc; Windows will have a different implementation. So there MAY even be a different big O depending on the OS that PHP is compiled on.
Another reason could be that it would confuse most developers.
Most developers I know would simply choose the function with the "best" big O,
but a worse upper bound doesn't always mean it's slower.
http://www.sorting-algorithms.com/
It has a good visualization of what's happening with some sorting algorithms, e.g. bubble sort is a "slow" sort, yet it's one of the fastest for nearly sorted data.
Quick sort is what many will use, and a naive implementation is actually very slow for nearly sorted data.
Big O usually describes the worst case. Between releases, PHP may decide to optimize for a certain condition, which will change the big O of the function, and there's no easy way to document that.
There is a partial list here (which I guess you have seen)
List of Big-O for PHP functions
Which does list some of the more common PHP functions.
For this particular example...
It's fairly easy to solve without using the built-in functions.
Example code
function isPalAnagram($string) {
    $string = str_replace(" ", "", $string);
    $len = strlen($string);
    $oddCount = $len & 1;
    $string = str_split($string);
    while ($len > 0 && $oddCount >= 0) {
        $current = reset($string);
        $replace_count = 0;
        foreach($string as $key => &$char) {
            if ($char === $current){
                unset($string[$key]);
                $len--;
                $replace_count++;
                continue;
            }
        }
        $oddCount -= ($replace_count & 1);
    }
    return ($len - $oddCount) === 0;
}
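Since there cannot be more than one character with an odd count, you can return early from the loop.
I think mine is also O(N) time: each pass removes every occurrence of one character, so the number of passes is bounded by the alphabet size. Note that str_split builds an O(N) array, so it does not meet the O(1) space limit.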
Test
$a = microtime(true);
for($i=1; $i<100000; $i++) {
testMethod("the quick brown fox jumped over the lazy dog");
testMethod("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
testMethod("testest");
}
printf("Took %s seconds, %s memory", microtime(true) - $a, memory_get_peak_usage(true));
Tests run using really old hardware
My way
Took 64.125452041626 seconds, 262144 memory
Your way
Took 112.96145009995 seconds, 262144 memory
I'm fairly sure that my way is not the quickest way either.
I actually can't find much of this information for languages other than PHP either (Java, for example).
I know a lot of this post is speculation about why it's not there and doesn't draw much on credible sources, but I hope it has partially explained why big O isn't listed on the documentation pages.

How unique a 5-digit mt_rand() number is?

I am just wondering how unique an mt_rand() number is if you draw 5-digit numbers.
In the example below, I tried to get a list of 500 random numbers with this function and some of them are repeated.
http://www.php.net/manual/en/function.mt-rand.php
<?php
header('Content-Type: text/plain');
$errors = array();
$uniques = array();
for($i = 0; $i < 500; ++$i)
{
$random_code = mt_rand(10000, 99999);
if(!in_array($random_code, $uniques))
{
$uniques[] = $random_code;
}
else
{
$errors[] = $random_code;
}
}
/**
* If you get any data in this array, it is not exactly unique
* Run this script for few times and you may see some repeats
*/
print_r($errors);
?>
How many digits may be required to ensure that the first 500 random numbers drawn in a loop are unique?
If numbers are truly random, then there's a probability that numbers will be repeated. It doesn't matter how many digits there are -- adding more digits makes it much less likely there will be a repeat, but it's always a possibility.
You're better off checking if there's a conflict, then looping until there isn't like so:
$uniques = array();
for($i = 0; $i < 500; $i++) {
    do {
        $code = mt_rand(10000, 99999);
    } while(in_array($code, $uniques)); // draw again on a conflict
    $uniques[] = $code;
}
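A variation on the retry loop above, sketched under the assumption that the codes are plain integers: storing each code as an array key makes the duplicate check a hash lookup instead of an in_array scan over everything drawn so far. The function name uniqueRandomCodes is hypothetical.

```php
<?php
function uniqueRandomCodes(int $count, int $min, int $max): array
{
    if ($count > $max - $min + 1) {
        throw new InvalidArgumentException('Range is too small for the requested count');
    }
    $seen = [];
    while (count($seen) < $count) {
        // A repeated draw just overwrites an existing key, so the size only grows on new values.
        $seen[mt_rand($min, $max)] = true;
    }
    return array_keys($seen);
}

$codes = uniqueRandomCodes(500, 10000, 99999);
echo count($codes), "\n";
```

The loop still retries on conflicts, so it slows down if $count gets close to the size of the range; for 500 out of 90000 values that is not a concern.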
Why not use range, shuffle, and slice?
<?php
$uniques = range(10000, 99999);
shuffle($uniques);
$uniques = array_slice($uniques, 0, 500);
print_r($uniques);
Output:
Array
(
[0] => 91652
[1] => 87559
[2] => 68494
[3] => 70561
[4] => 16514
[5] => 71605
[6] => 96725
[7] => 15908
[8] => 14923
[9] => 10752
[10] => 13816
*** truncated ***
)
This method is less expensive as it does not search the array each time to see if the item is already added or not. That said, it does make this approach less "random". More information should be provided on where these numbers are going to be used. If this is an online gambling site, this would be the worst! However if this was used in returning "lucky" numbers for a horoscope website, I think it would be fine.
Furthermore, this method could be extended by changing the shuffle method to use mt_rand (whereas the original method simply used rand). It could also use openssl_random_pseudo_bytes, but that might be overkill.
The birthday paradox is at play here. If you pick a random number from 10000-99999 500 times, there's a good chance of duplicates.
Intuitive idea with small numbers
If you flip a coin twice, you'll get a duplicate about half the time. If you roll a six-sided die twice, you'll get a duplicate 1/6 of the time. If you roll it 3 times, you'll get a duplicate 4/9 (44%) of the time. If you roll it 4 times, you'll get at least one duplicate 13/18 (72.2%) of the time. Roll it a fifth time and it's 49/54 (90.7%). Roll it a sixth time and it's 98.5%. Roll it a seventh time and it's 100%.
If you replace the six-sided die with a 20-sided die, the probabilities grow a bit more slowly, but grow they do. After 3 rolls you have a 14.5% chance of duplicates. After 7 rolls it's 69.5%. After 11 rolls it's 96.7%, near certainty.
The math
Let's define a function f(num_rolls, num_sides) to generalize this to any number of rolls of any random number generator that chooses out of a finite set of choices. We'll define f(num_rolls, num_sides) to be the probability of getting no duplicates in num_rolls of a num_sides-side die.
Now we can try to build a recursive definition for this. To get num_rolls unique numbers, you'll need to first roll num_rolls-1 unique numbers, then roll one more unique number, now that num_rolls-1 numbers have been taken. Therefore
f(num_rolls, num_sides) =
f(num_rolls-1, num_sides) * (num_sides - (num_rolls - 1)) / num_sides
Alternately,
f(num_rolls + 1, num_sides) =
f(num_rolls, num_sides) * (num_sides - num_rolls) / num_sides
This function follows a logistic decay curve, starting at 1 and moving very slowly (since num_rolls is very low, the change with each step is very small), then slowly picking up speed as num_rolls grows, then eventually tapering off as the function's value gets closer and closer to 0.
I've created a Google Docs spreadsheet that has this function built in as a formula to let you play with this here: https://docs.google.com/spreadsheets/d/1bNJ5RFBsXrBr_1BEXgWGein4iXtobsNjw9dCCVeI2_8
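If you'd rather not use a spreadsheet, the recursive definition above unrolls into a simple loop. This is a minimal sketch; the function name probabilityAllUnique is hypothetical and it assumes every side is equally likely.

```php
<?php
function probabilityAllUnique(int $numRolls, int $numSides): float
{
    $p = 1.0;
    for ($k = 0; $k < $numRolls; $k++) {
        // After $k unique rolls, ($numSides - $k) values are still unused.
        $p *= max(0, $numSides - $k) / $numSides;
    }
    return $p;
}

// Chance of at least one duplicate in 500 draws from 90000 values.
printf("%.3f\n", 1 - probabilityAllUnique(500, 90000));
```

The chance of at least one duplicate is one minus the returned value, which is how the percentages for the dice examples are obtained.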
Tying this back to your specific problem
You've rolled a 90000-sided die 500 times. The spreadsheet above suggests you'd expect at least one duplicate pair about 75% of the time, assuming a perfectly random mt_rand. Mathematically, the operation your code was performing is choosing N elements from a set with replacement. In other words, you pick a random number out of the bag of 90000 things, write it down, then put it back in the bag, then pick another random number, and repeat 500 times. It sounds like you wanted all of the numbers to be distinct; in other words, you wanted to choose N elements from a set without replacement. There are a few algorithms to do this. Dave Chen's suggestion of shuffle and then slice is a relatively straightforward one. Josh from Qaribou's suggestion of separately rejecting duplicates is another possibility.
Your question deals with a variation of the "Birthday Problem" which asks if there are N students in a class, what is the probability that at least two students have the same birthday? See Wikipedia: The "Birthday Problem".
You can easily modify the formula shown there to answer your problem. Instead of having 365 equally probable possibilities for the birthday of each student, you have 90000 (=99999-10000+1) equally probable integers that can be generated between 10000 and 99999. The probability that at least two of the 500 numbers you generate are the same is:
P(500) = 1 - 90000! / ( 90000^500 * (90000 - 500)! ) ≈ 0.75
So there is a 75% chance that at least two of the 500 numbers that you generate will be the same or, in other words, only a 25% chance that you will be successful in getting 500 different numbers with the method you are currently using.
As others here have already suggested, I would suggest checking for repeated numbers in your algorithm rather than just blindly generating random numbers and hoping that you don't have a match between any pair of numbers.

Finding similarities in text, and more specifically answers to the text? - PHP

I have a unique situation in the sense that what I am asking is for my own convenience rather than the end user of my application.
I am attempting to create an application that tests peoples' IQ scores (I know they are irrelevant and not much use to anyone), nothing too serious, just a project of mine to keep me busy in between assignments.
I am writing it locally in WAMP with PHP. I have found that there are a lot of available IQ questions and answers on the internet that I can use for my project. I have also noticed that there are a lot of the same questions but they are worded slightly differently.
Is there any third party PHP library that I can utilize in order to stop me from including "two" of the same questions in my application?
Some examples of questions that are the "same" but programmatically are considered different:
The average of 20 numbers is zero. Of them, at the most, how many may be greater than zero?
The average of 20 numbers is zero. Of them how many may be greater than zero?
The average of 20 numbers is zero. Of them how many may be greater than zero, at the most?
Obviously, PHP's comparison operators alone cannot accomplish this, and telling apart such similar questions programmatically is beyond my programming skill.
I have looked into plagiarism software but have not found any open source PHP projects.
Is there a simpler solution?
Thanks
** EDIT **
One idea I had was before inserting a question use explode at every space, and then in the resulting array match it against other questions that have also had the same function applied. The more matches the more equal the questions are?
I am a newcomer to PHP; does this sound feasible?
As acfrancis has already answered: it doesn't get much simpler than using the built in levenshtein function.
However, to answer your final question: yes, doing it the way that you suggest is feasible and not too difficult.
Code
function checkQuestions($para1, $para2){
$arr1 = array_unique(array_filter(explode(' ', preg_replace('/[^a-zA-Z0-9]/', ' ', strtolower($para1)))));
$arr2 = array_unique(array_filter(explode(' ', preg_replace('/[^a-zA-Z0-9]/', ' ', strtolower($para2)))));
$intersect = array_intersect($arr1, $arr2);
$p1 = count($arr1); //Number of words in para1
$p2 = count($arr2); //Number of words in para2
$in = count($intersect); //Number of words in intersect
$lowest = ($p1 < $p2) ? $p1 : $p2; //Which is smaller p1 or p2?
return array(
'Average' => number_format((100 / (($p1+$p2) / 2)) * $in, 2), //Percentage the same compared to average length of questions
'Smallest' => number_format((100 / $lowest) * $in, 2) //Percentage the same compared to shortest question
);
}
Explanation
1. We define a function that accepts two arguments (the arguments being the questions that we're comparing).
2. We filter the input and convert to an array
   - Make the input lower case (strtolower)
   - Filter out non-alphanumeric characters (preg_replace)
3. We explode the filtered string on spaces
4. We filter the created array
   - Remove blanks (array_filter)
   - Remove duplicates (array_unique)
5. Repeat 2-4 for the second question
6. Find matching words in both arrays and move them to a new array, $intersect
7. Count the number of words in each of the three arrays: $p1, $p2, and $in
8. Calculate percentage similarity and return
You'd then need to set a threshold for how similar the questions had to be before being deemed the same e.g. 80%.
N.B.
The function returns an array of two values: the first compares the number of matching words to the average length of the two input questions, the second only to the shortest. You could modify it to return a single value.
I used number_format for the percentages... But you'd probably be fine with returning an int
Examples
Example 1
$question1 = 'The average of 20 numbers is zero. Of them, at the most, how many may be greater than zero?';
$question2 = 'The average of 20 numbers is zero. Of them how many may be greater than zero?';
if(checkQuestions($question1, $question2)['Average'] >= 80){
echo "Questions are the same...";
}
else{
echo "Questions are not the same...";
}
//Output: Questions are the same...
Example 2
$para1 = 'The average of 20 numbers is zero. Of them, at the most, how many may be greater than zero?';
$para2 = 'The average of 20 numbers is zero. Of them how many may be greater than zero?';
$para3 = 'The average of 20 numbers is zero. Of them how many may be greater than zero, at the most?';
var_dump(checkQuestions($para1, $para2));
var_dump(checkQuestions($para1, $para3));
var_dump(checkQuestions($para2, $para3));
/**
Output:
array(2) {
["Average"]=>
string(5) "93.33"
["Smallest"]=>
string(6) "100.00"
}
array(2) {
["Average"]=>
string(6) "100.00"
["Smallest"]=>
string(6) "100.00"
}
array(2) {
["Average"]=>
string(5) "93.33"
["Smallest"]=>
string(6) "100.00"
}
*/
Try using the Levenshtein distance algorithm:
http://php.net/manual/en/function.levenshtein.php
I've used it (in C#, not PHP) for a similar problem and it works quite well. The trick I found is to divide the Levenshtein distance by the length of the first sentence (in characters). That will give you a rough percentage of change required to convert question 1 into question 2 (for example).
In my experience, if you get anything less than 50-60% (i.e., less than 0.5 or 0.6), the sentences are the same. It might seem high, but notice that 100% isn't the maximum. For example, converting the string "z" into "abcdefghi" requires 9 character edits (that's the Levenshtein distance: replace z with a, then insert bcdefghi), a change of 900% as per the calculation above. With big enough changes, you can convert any random string into any other random string.
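In PHP, the ratio described above could be sketched as follows. The function name changeRatio is hypothetical, and the handling of an empty first string is my own assumption. As far as I know, PHP versions before 8.0 reject levenshtein() arguments longer than 255 characters, so long questions may need a newer PHP or a custom implementation.

```php
<?php
function changeRatio(string $first, string $second): float
{
    if ($first === '') {
        // Assumption: an empty first string is "no change" only if the second is empty too.
        return $second === '' ? 0.0 : INF;
    }
    // Levenshtein distance relative to the length of the first string (1.0 == 100%).
    return levenshtein($first, $second) / strlen($first);
}

$q1 = 'The average of 20 numbers is zero. Of them how many may be greater than zero?';
$q2 = 'The average of 20 numbers is zero. Of them how many may be greater than zero, at the most?';
var_dump(changeRatio($q1, $q2) < 0.5);
```

The 0.5 cut-off here is just the rule of thumb from the answer; tune it against your own question set.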

PHP Compare whether strings are (almost) equal

I need to compare names which can be written in several ways. For example, a name like St. Thomas is sometimes written like St-Thomas or Sant Thomas. Preferably, I'm looking to build a function that gives a percentage of 'equalness' to a comparison, like some forums do (this post is 5% edited for example).
PHP has two (main) built-in functions for this.
levenshtein which counts how many changes (remove/add/replacements) are needed to produce string2 from string1. (lower is better)
and
similar_text which returns the number of matching characters (higher is better). Note that you can pass a reference as the third parameter and it'll give you a percentage.
<?php
$originalPost = "Here's my question to stack overflou. Thanks /h2ooooooo";
$editedPost = "Question to stack overflow.";
$matchingCharacters = similar_text($originalPost, $editedPost, $matchingPercentage);
var_dump($matchingCharacters); //int(25)
var_dump($matchingPercentage); //float(60.975609756098) (hence edited 40%)
?>
The edit distance between two strings of characters generally refers to the Levenshtein distance.
http://php.net/manual/en/function.levenshtein.php
$v1 = 'pupil';
$v2 = 'people';
# TRUE if $v1 & $v2 have similar pronunciation
soundex($v1) == soundex($v2);
# Same but it use a more accurate comparison algorithm
metaphone($v1) == metaphone($v2);
# Calculate how many common characters between 2 strings
# Percent store the percentage of common chars
$common = similar_text($v1, $v2, $percent);
# Compute the difference of 2 text
$diff = levenshtein($v1, $v2);
So either levenshtein($v1, $v2) or similar_text($v1, $v2, $percent) will do it for you, but there is still a tradeoff. The complexity of the levenshtein() algorithm is O(m*n), where n and m are the lengths of v1 and v2 (rather good when compared to similar_text(), which is O(max(n,m)**3), but still expensive).
Check out levenshtein(), which does what you want and is comparatively efficient (but not extremely efficient):
http://www.php.net/manual/en/function.levenshtein.php
You can use different approaches.
You can use the similar_text() function to check for similarity.
OR
You can use levenshtein() function to find out...
The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2
And then check for a reasonable threshold for your check.
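To turn the Levenshtein distance into the kind of "equalness" percentage asked for, one option is to scale it by the longer string's length. This is a sketch, not the only possible definition; the function name equalnessPercent is hypothetical, and lowercasing and trimming first are my own assumptions about how names should be normalized.

```php
<?php
function equalnessPercent(string $a, string $b): float
{
    // Assumption: case and surrounding whitespace should not count as differences.
    $a = strtolower(trim($a));
    $b = strtolower(trim($b));
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0; // two empty strings are identical
    }
    // The distance can never exceed the longer length, so the result stays within 0..100.
    return (1 - levenshtein($a, $b) / $maxLen) * 100;
}

printf("%.1f\n", equalnessPercent('St. Thomas', 'St-Thomas'));
printf("%.1f\n", equalnessPercent('St. Thomas', 'Sant Thomas'));
```

You would then pick a threshold (say, 75%) above which two names are treated as the same.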

What is the best algorithm to see if my number is in an array of ranges?

I have a 2-dimensional array in PHP containing ranges, for example:
From.........To
---------------
125..........3957
4000.........5500
5217628......52198281
52272128.....52273151
523030528....523229183
and so on
and it is a very long list. now I want to see if a number given by user is in range.
for example numbers 130, 4200, 52272933 are in my range but numbers 1, 5600 are not.
of course I can count all indexes and see if my number is bigger than first and smaller than second item. but is there a faster algorithm or a more efficient way of doing it using php function?
added later
It is sorted. it is actually numbers created with ip2long() showing all IPs of a country.
I just wrote a code for it:
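$ips[1] = array (2,20,100);  // range starts
$ips[2] = array (10,30,200); // range ends
$n=11;// input ip
$count = count($ips[1]); // number of ranges, not number of rows
$found = false;
for ($i = 0; $i < $count; $i++) {
    if ($n < $ips[1][$i]) { break; } // sorted, so no later range can match
    if ($n <= $ips[2][$i]) {
        echo "$i found";
        $found = true;
        break;
    }
}
if (!$found) { echo "not found"; }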
In this situation the numbers 2, 8, 22, and 200 are in range, but the numbers 1, 11, and 300 are not.
Put the ranges in a flat array, sorted from lower to higher, like this:
a[0] = 125
a[1] = 3957
a[2] = 4000
a[3] = 5500
a[4] = 5217628
a[5] = 52198281
a[6] = 52272128
a[7] = 52273151
a[8] = 523030528
a[9] = 523229183
Then do a binary search to determine at what index of this array the number in question should be inserted. If the insertion index is even then the number is not in any sub-range. If the insertion index is odd, then the number falls inside one of the ranges.
Examples:
n = 20 inserts at index 0 ==> not in a range
n = 126 inserts at index 1 ==> within a range
n = 523030529 inserts at index 9 ==> within a range
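A sketch of that lookup in PHP, using the flat array from above. The function name inAnyRange is hypothetical. It assumes the ranges are sorted, do not overlap, and are inclusive at both ends, which is why a number equal to one of the stored bounds is treated as a hit before the odd/even test.

```php
<?php
function inAnyRange(array $bounds, int $n): bool
{
    // Binary search for the first index whose value is >= $n (the insertion index).
    $lo = 0;
    $hi = count($bounds);
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        if ($bounds[$mid] < $n) {
            $lo = $mid + 1;
        } else {
            $hi = $mid;
        }
    }
    // Exactly on a range start or end: inside, because ranges are inclusive.
    if ($lo < count($bounds) && $bounds[$lo] === $n) {
        return true;
    }
    // Odd insertion index: between a start and its end. Even: in a gap.
    return $lo % 2 === 1;
}

$a = [125, 3957, 4000, 5500, 5217628, 52198281, 52272128, 52273151, 523030528, 523229183];
var_dump(inAnyRange($a, 126));
var_dump(inAnyRange($a, 20));
```

Each lookup takes O(log n) comparisons, so even a list with many thousands of ranges needs only a handful of steps.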
You can speed things up by implementing a binary search algorithm. Thus, you don't have to look at every range.
Then you can use in_array to check if the number is in the array.
I'm not sure if I got you right, do your arrays really look like this:
array(125, 126, 127, ..., 3957);
If so, what's the point? Why not just have?
array(125, 3957);
That contains all the information necessary.
The example you give suggests that the numbers may be large and the space sparse by comparison.
At that point, you don't have very many options. If the array is sorted, binary search is about all there is. If the array is not sorted, you're down to plain, old CS101 linear search.
An interval tree is the data structure designed for this kind of problem. It also handles overlapping ranges; for sorted, non-overlapping ranges, a binary search already gives O(log n) lookups.
I am assuming that the ranges do not overlap.
If that is the case, you can maintain a map data structure that is keyed on the lower value of the range.
Now all you have to do (given the number N) is to find the key in the map that is just lower than N (using binary search - logarithmic complexity) and then check if the number is lesser than the right value.
Basically, it is a binary search (logarithmic) on the constructed map.
From a pragmatic point of view, a linear search may very well turn out to be the fastest lookup method. Think of page faults and hard disk seek time here.
If your array is large enough (whatever "enough" actually means), it may be wise to stuff your IPs in a SQL database and let the database figure out how to efficiently compute SELECT ID FROM ip_numbers WHERE x BETWEEN start AND end;.
