I have a portion of text (500-1500 characters),
and I have a list of keywords (1000 records).
What should I do to find the keywords from that list that are related to my given text?
I was thinking of searching the text for occurrences of every keyword in the list, but I think that's a bit "expensive".
Thanks
If the keywords always stay the same, you could build an index over them, which improves search speed tremendously. The standard data structure for this is the trie, but a much better (!) alternative is the Aho-Corasick automaton or another multi-pattern search algorithm such as multi-pattern Horspool (also known as the Wu-Manber algorithm).
Finally, a very simple alternative is to concatenate all your keywords with pipes (|) and use the result as a regular expression. Technically, this approaches the Aho-Corasick automaton and is much simpler for you to implement.
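A minimal sketch of the pipe-concatenation idea, assuming the keywords are plain words (so that \b word boundaries make sense) and that matching should ignore case. The function name findKeywords is hypothetical and only used for illustration:

```php
<?php
// Hypothetical helper: returns the keywords that occur in $text as whole words.
function findKeywords(string $text, array $keywords): array
{
    if ($keywords === []) {
        return [];
    }
    // Try longer keywords first so a longer alternative is preferred
    // over a shorter one that starts at the same position.
    usort($keywords, function ($a, $b) { return strlen($b) - strlen($a); });
    // Escape regex metacharacters (and the "/" delimiter) in every keyword.
    $quoted = array_map(function ($k) { return preg_quote($k, '/'); }, $keywords);
    // Join with pipes: /\b(?:bird|cat|dog)\b/i
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';
    preg_match_all($pattern, $text, $matches);
    // Normalise case and drop duplicates.
    return array_values(array_unique(array_map('strtolower', $matches[0])));
}

print_r(findKeywords('The cat chased a bird; the CAT won.', ['cat', 'dog', 'bird']));
```

With 1000 keywords the pattern gets long, so it is worth building it once and reusing it for every text.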
I throw my hat in the ring …
function extractWords($text, $minWordLength = null, array $stopwords = array(), $caseIgnore = true)
{
    // Match runs of word characters, optionally with a minimum length
    $pattern = '/\w'. (is_null($minWordLength) ? '+' : '{'.$minWordLength.',}') .'/';
    $matches = array();
    preg_match_all($pattern, $text, $matches);
    $words = $matches[0];
    if ($caseIgnore) {
        $words = array_map('strtolower', $words);
        // Lower-case the stop words too (this must reassign $stopwords itself)
        $stopwords = array_map('strtolower', $stopwords);
    }
    $words = array_diff($words, $stopwords);
    return $words;
}
function countKeywords(array $words, array $keywords, $threshold = null, $caseIgnore = true)
{
    if ($caseIgnore) {
        $keywords = array_map('strtolower', $keywords);
    }
    $words = array_intersect($words, $keywords);
    $counts = array_count_values($words);
    arsort($counts, SORT_NUMERIC);
    if (!is_null($threshold)) {
        $counts = array_filter($counts, function ($count) use ($threshold) { return $count >= $threshold; });
    }
    return $counts;
}
Usage:
$text = 'a b c a'; // your text
$keywords = array('a', 'b'); // keywords from your database
$words = extractWords($text);
$count = countKeywords($words, $keywords);
print_r($count);
$total = array_sum($count);
var_dump($total);
$affinity = ($total == 0 ? 0 : 1 / (count($words) / $total));
var_dump($affinity);
Prints
Array
(
[a] => 2
[b] => 1
)
int(3)
float(0.75)
Related
I have a long string variable that contains coordinates.
I want to keep each coordinate in a separate cell of an array, split by Lat and Lon.
For example, given the following string:
string = "(33.110029967689556, 35.60865999564635), (33.093492845160036, 35.63955904349791), (33.0916232355565, 35.602995170206896)";
I want this:
arrayX[0] = "33.110029967689556";
arrayX[1] = "33.093492845160036";
arrayX[2] = "33.0916232355565";
arrayY[0] = "35.60865999564635";
arrayY[1] = "35.63955904349791";
arrayY[2] = "35.602995170206896";
Does anyone have an idea?
Thanks
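Split the string on "),", then strip the leftover spaces and parentheses from each pair; it only takes a few lines of code. Note that substr($at, 1) removes just one character, which leaves a "(" on every pair after the first and a ")" on the last one, so trim is used here instead.
$array_temp = explode('),', $string);
$arrayX = [];
$arrayY = [];
foreach ($array_temp as $at) {
    // Strip surrounding spaces and parentheses, e.g. " (33.09, 35.63" or "(33.09, 35.60)"
    $at = trim($at, ' ()');
    list($x, $y) = explode(',', $at);
    $arrayX[] = trim($x);
    $arrayY[] = trim($y);
}
print_r($arrayX);
print_r($arrayY);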
The simplest way is probably to use a regex to match each tuple:
Each number is a combination of digits and dots: the regex [\d\.]+ matches that.
Each coordinate has the following format: "(", number, ",", space, number, ")". The regex is \([\d\.]+,\s*[\d\.]+\).
Then you can capture each number by using parentheses: \(([\d\.]+),\s*([\d\.]+)\). This produces two capturing groups: the first contains the X coordinate and the second the Y.
This regex can be used with the function preg_match_all.
<?php
$string = '(33.110029967689556, 35.60865999564635), (33.093492845160036, 35.63955904349791), (33.0916232355565, 35.602995170206896)';
preg_match_all('/\(([\d\.]+)\s*,\s*([\d\.]+)\)/', $string, $matches);
$arrayX = $matches[1];
$arrayY = $matches[2];
var_dump($arrayX);
var_dump($arrayY);
For a live example see http://sandbox.onlinephpfunctions.com/code/082e8454486dc568a6557058fef68d6f10c8dbd0
My suggestion, working example here: https://3v4l.org/W99Uu
$string = "(33.110029967689556, 35.60865999564635), (33.093492845160036, 35.63955904349791), (33.0916232355565, 35.602995170206896)";
// Split by each X/Y pair
$array = explode("), ", $string);
// Init result arrays
$arrayX = array();
$arrayY = array();
foreach($array as $pair) {
// Remove parentheses
$pair = str_replace('(', '', $pair);
$pair = str_replace(')', '', $pair);
// Split into two strings
$arrPair = explode(", ", $pair);
// Add the strings to the result arrays
$arrayX[] = $arrPair[0];
$arrayY[] = $arrPair[1];
}
You need first to split the string into an array. Then you clean the value to get only the numbers. Finally, you put the new value into the new array.
<?php
$string = "(33.110029967689556, 35.60865999564635), (33.093492845160036, 35.63955904349791), (33.0916232355565, 35.602995170206896)";
$loca = explode(", ", $string);
$arr_x = array();
$arr_y = array();
$i = 1;
foreach($loca as $index => $value){
$i++;
if ($i % 2 == 0) {
$arr_x[] = preg_replace('/[^0-9.]/', '', $value);
}else{
$arr_y[] = preg_replace('/[^0-9.]/', '', $value);
}
}
print_r($arr_x);
print_r($arr_y);
You can test it here :
http://sandbox.onlinephpfunctions.com/code/4bf04e7aabeba15ecfa114d8951eb771610a43a4
I'm trying to create a method/function that compares two sentences and returns a percentage of their similarity.
For example, PHP has a function called similar_text, but it doesn't work well for this.
Here are a few examples that should get a high similarity when compared against each other:
In the backyard there is a green tree and the sun is shinnying.
The sun is shinnying in the backyard and there is a green tree too.
A yellow tree is in the backyard with a shinnying sun.
In the front yard there is a green tree and the sun is shinnying.
In the front yard there is a red tree and the sun is no shinnying.
Does anyone know a good example of how to do this?
I would prefer to use PHP for it, but I don't mind using Java or Python.
On the internet I found this function:
function compareStrings($s1, $s2) {
//one is empty, so no result
if (strlen($s1)==0 || strlen($s2)==0) {
return 0;
}
//replace non-alphabetic characters with spaces
//"-" is kept in case it is used to combine words
$s1clean = preg_replace("/[^A-Za-z-]/", ' ', $s1);
$s2clean = preg_replace("/[^A-Za-z-]/", ' ', $s2);
//remove double spaces
$s1clean = str_replace("  ", " ", $s1clean);
$s2clean = str_replace("  ", " ", $s2clean);
//create arrays
$ar1 = explode(" ",$s1clean);
$ar2 = explode(" ",$s2clean);
$l1 = count($ar1);
$l2 = count($ar2);
//flip the arrays if needed so ar1 is always largest.
if ($l2>$l1) {
$t = $ar2;
$ar2 = $ar1;
$ar1 = $t;
}
//flip array 2, to make the words the keys
$ar2 = array_flip($ar2);
$maxwords = max($l1, $l2);
$matches = 0;
//find matching words
foreach($ar1 as $word) {
if (array_key_exists($word, $ar2))
$matches++;
}
return ($matches / $maxwords) * 100;
}
But it's only returning 80%. similar_text is returning just 39%.
So what I need is to compare a string to an array (string as a haystack and array as a needle) and get the elements from the string that repeat within the array. For this purpose I've taken a sample function for using an array as a needle in the substr_count function.
$animals = array('cat','dog','bird');
$toString = implode(' ', $animals);
$data = array('a');
function substr_count_array($haystack, $needle){
$initial = 0;
foreach ($needle as $substring) {
$initial += substr_count($haystack, $substring);
}
return $initial;
}
echo substr_count_array($toString, $data);
The problem is that if I search for a character such as 'a', it gets through the check and validates as a legit value because 'a' is contained within the first element. So the above outputs 1. I figured this was due to the foreach() but how do I bypass that? I want to search for a whole string match, not partial.
You can break up the $haystack into individual words, then do an in_array() check over it to make sure the word exists in that array as a whole word before doing your substr_count():
$animals = array('cat','dog','bird', 'cat', 'dog', 'bird', 'bird', 'hello');
$toString = implode(' ', $animals);
$data = array('cat');
function substr_count_array($haystack, $needle){
$initial = 0;
$bits_of_haystack = explode(' ', $haystack);
foreach ($needle as $substring) {
if(!in_array($substring, $bits_of_haystack))
continue; // skip this needle if it doesn't exist as a whole word
$initial += substr_count($haystack, $substring);
}
return $initial;
}
echo substr_count_array($toString, $data);
Here, cat is 2, dog is 2, bird is 3, hello is 1 and lion is 0.
Edit: here's another alternative using array_keys() with the search parameter set to the $needle:
function substr_count_array($haystack, $needle){
$bits_of_haystack = explode(' ', $haystack);
return count(array_keys($bits_of_haystack, $needle[0]));
}
Of course, this approach requires a string as the needle. I'm not 100% sure why you need to use an array as the needle, but perhaps you could do a loop outside the function and call it for each needle if you need to - just another option anyway!
Just throwing my solution in the ring here; the basic idea, as outlined by scrowler as well, is to break up the search subject into separate words so that you can compare whole words.
function substr_count_array($haystack, $needle)
{
$substrings = explode(' ', $haystack);
return array_reduce($substrings, function($total, $current) use ($needle) {
return $total + count(array_keys($needle, $current, true));
}, 0);
}
The array_reduce() step is basically this:
$total = 0;
foreach ($substrings as $substring) {
$total = $total + count(array_keys($needle, $substring, true));
}
return $total;
The array_keys() expression returns the keys of $needle for which the value equals $substring. The size of that array is the number of occurrences.
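To make the array_keys() step concrete, here is a small sketch with made-up data (the values are illustrative only). It shows that a strict search counts whole-element matches, so a lone 'a' is not found inside 'cat':

```php
<?php
// Illustrative data only.
$needle = array('cat', 'dog', 'cat');

// Keys of $needle whose value is strictly equal (===) to 'cat'.
$keys = array_keys($needle, 'cat', true);
print_r($keys);          // the keys 0 and 2
echo count($keys), "\n"; // 2 occurrences

// 'a' is only a substring of 'cat', so no element equals it.
echo count(array_keys($needle, 'a', true)), "\n"; // 0
```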
I have an array:
$myArray=array(
'hello my name is richard',
'hello my name is paul',
'hello my name is simon',
'hello it doesn\'t matter what my name is'
);
I need to find the sub string (min 2 words) that is repeated the most often, maybe in an array format, so my return array could look like this:
$return=array(
array('hello my', 3),
array('hello my name', 3),
array('hello my name is', 3),
array('my name', 4),
array('my name is', 4),
array('name is', 4),
);
So I can see from this array of arrays how often each string was repeated amongst all strings in the array.
Is this the only way to do it?
function repeatedSubStrings($array){
foreach($array as $string){
$phrases=//Split each string into maximum number of sub strings
foreach($phrases as $phrase){
//Then count the $phrases that are in the strings
}
}
}
I've tried a solution similar to the above, but it was too slow, processing around 1000 rows per second. Can anyone do it faster?
A solution to this might be
function getHighestRecurrence($strs){
/*Storage for individual words*/
$words = Array();
/*Process multiple strings*/
if(is_array($strs))
foreach($strs as $str)
$words = array_merge($words, explode(" ", $str));
/*Prepare single string*/
else
$words = explode(" ",$strs);
/*Array for word counters*/
$index = Array();
/*Aggregate word counters*/
foreach($words as $word)
/*Increment count or create if it doesn't exist*/
(isset($index[$word]))? $index[$word]++ : $index[$word] = 1;
/*Sort array by highest value, keeping the word keys*/
arsort($index);
/*Return the word*/
return key($index);
}
While this has a higher runtime, I think it's simpler from an implementation perspective:
$substrings = array();
foreach ($myArray as $str)
{
$subArr = explode(" ", $str);
for ($i=0;$i<count($subArr);$i++)
{
$substring = "";
for ($j=$i;$j<count($subArr);$j++)
{
if ($i==0 && ($j==count($subArr)-1))
break;
$substring = trim($substring . " " . $subArr[$j]);
if (str_word_count($substring, 0) > 1)
{
if (array_key_exists($substring, $substrings))
$substrings[$substring]++;
else
$substrings[$substring] = 1;
}
}
}
}
arsort($substrings);
print_r($substrings);
I'm assuming by "substring" you really mean "substring split along word boundaries" since that's what your example shows.
In that case, assuming any maximum repeated substring will do (since there may be ties), you can always choose just a single word as a maximum repeated substring, if you think about it. For any phrase "A B", the phrases "A" and "B" individually must occur at least as often as "A B", because they both occur every time "A B" does and they may occur at other times. Therefore, a single word must have a count that at least ties with any substring that contains that word.
So you just need to split all phrases into a set of unique words, and then just count the words and return one of the words with the highest count. This will run way faster than actually counting every possible substring.
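A rough sketch of that single-word counting, assuming words are separated by whitespace and that ties are broken by whichever word was seen first. The function name mostFrequentWord is hypothetical:

```php
<?php
// Hypothetical helper: returns the word that occurs most often across all strings,
// or null when there are no words at all.
function mostFrequentWord(array $strings)
{
    $counts = [];
    foreach ($strings as $string) {
        // Split on any run of whitespace and ignore empty pieces.
        foreach (preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
        }
    }
    if ($counts === []) {
        return null;
    }
    // array_search returns the first key holding the maximum count.
    return (string) array_search(max($counts), $counts, true);
}

$myArray = array(
    'hello my name is richard',
    'hello my name is paul',
    'hello my name is simon',
    'hello it doesn\'t matter what my name is'
);
var_dump(mostFrequentWord($myArray));
```

This makes one pass over the words, so the cost grows with the total number of words rather than with the number of possible phrases.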
This should run in O(n) time
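$twoWordPhrases = function($str) {
    $words = preg_split('#\s+#', $str, -1, PREG_SPLIT_NO_EMPTY);
    $phrases = array();
    // A plain for loop yields nothing for fewer than two words,
    // whereas range(0, -1) would count down and produce bogus offsets
    for ($offset = 0; $offset < count($words) - 1; $offset++) {
        $phrases[] = array_slice($words, $offset, 2);
    }
    return $phrases;
};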
$frequencies = array();
foreach ($myArray as $str) {
$phrases = $twoWordPhrases($str);
foreach ($phrases as $phrase) {
$key = join('/', $phrase);
if (!isset($frequencies[$key])) {
$frequencies[$key] = 0;
}
$frequencies[$key]++;
}
}
print_r($frequencies);
I have a set of strings, each string has a variable number of segments separated by pipes (|), e.g.:
$string = 'abc|b|ac';
Each segment with more than one char should be expanded into all the possible one char combinations, for 3 segments the following "algorithm" works wonderfully:
$result = array();
$string = explode('|', 'abc|b|ac');
foreach (str_split($string[0]) as $i)
{
foreach (str_split($string[1]) as $j)
{
foreach (str_split($string[2]) as $k)
{
$result[] = implode('|', array($i, $j, $k)); // more...
}
}
}
print_r($result);
Output:
$result = array('a|b|a', 'a|b|c', 'b|b|a', 'b|b|c', 'c|b|a', 'c|b|c');
Obviously, for more than 3 segments the code starts to get extremely messy, since I need to add (and check) more and more inner loops. I tried coming up with a dynamic solution but I can't figure out how to generate the correct combination for all the segments (individually and as a whole). I also looked at some combinatorics source code but I'm unable to combine the different combinations of my segments.
I'd appreciate it if anyone could point me in the right direction.
Recursion to the rescue (you might need to tweak a bit to cover edge cases, but it works):
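function explodinator($str) {
    // Split into segments, then each segment into its characters
    $segments = explode('|', $str);
    $pieces = array_map('str_split', $segments);
    return e_helper($pieces);
}
function e_helper($pieces) {
    // Base case: a single segment expands to its own characters
    if (count($pieces) == 1)
        return $pieces[0];
    $first = array_shift($pieces);
    $subs = e_helper($pieces);
    $result = array();
    // Prefix every expansion of the remaining segments with each character
    foreach ($first as $char) {
        foreach ($subs as $sub) {
            $result[] = $char . '|' . $sub;
        }
    }
    return $result;
}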
print_r(explodinator('abc|b|ac'));
Outputs:
Array
(
[0] => a|b|a
[1] => a|b|c
[2] => b|b|a
[3] => b|b|c
[4] => c|b|a
[5] => c|b|c
)
As seen on ideone.
This looks like a job for recursive programming! :P
I first looked at this and thought it was going to be a one-liner (and it probably is in Perl).
There are other non-recursive ways (enumerate all combinations of indexes into segments then loop through, for example) but I think this is more interesting, and probably 'better'.
$str = explode('|', 'abc|b|ac');
$strlen = count( $str );
$results = array();
function splitAndForeach( $bchar , $oldindex, $tempthread) {
global $strlen, $str, $results;
$temp = $tempthread;
$newindex = $oldindex + 1;
if ( $bchar != '') { array_push($temp, $bchar ); }
if ( $newindex <= $strlen ){
print "starting foreach loop on string '".$str[$newindex-1]."' \n";
foreach(str_split( $str[$newindex - 1] ) as $c) {
print "Going into next depth ($newindex) of recursion on char $c \n";
splitAndForeach( $c , $newindex, $temp);
}
} else {
$found = implode('|', $temp);
print "Array length (max recursion depth) reached, result: $found \n";
array_push( $results, $found );
$temp = $tempthread;
$index = 0;
print "***************** Reset index to 0 *****************\n\n";
}
}
splitAndForeach('', 0, array() );
print "your results: \n";
print_r($results);
You could have two arrays: the alternatives and a current counter.
$alternatives = array(array('a', 'b', 'c'), array('b'), array('a', 'c'));
$counter = array(0, 0, 0);
Then, in a loop, you increment the "last digit" of the counter, and if that is equal to the number of alternatives for that position, you reset that "digit" to zero and increment the "digit" to its left. This works just like counting with decimal numbers.
The string for each step is built by concatenating the $alternatives[$i][$counter[$i]] for each digit.
You are finished when the "first digit" becomes as large as the number of alternatives for that digit.
Example: for the above variables, the counter would get the following values in the steps:
0,0,0
0,0,1
1,0,0 (overflow in the last two digits)
1,0,1
2,0,0 (overflow in the last two digits)
2,0,1
3,0,0 (finished, since the first "digit" has only 3 alternatives)
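The counter idea could be sketched roughly as follows, assuming the input is non-empty and every segment has at least one character. The function name expandSegments is hypothetical; instead of letting the first "digit" reach 3, this version stops as soon as the carry runs off the left end, which amounts to the same condition:

```php
<?php
// Hypothetical helper implementing the counter ("odometer") approach.
// Assumes a non-empty input in which every segment has at least one character.
function expandSegments(string $input): array
{
    // Each segment becomes its list of alternative characters.
    $alternatives = array_map('str_split', explode('|', $input));
    $n = count($alternatives);
    $counter = array_fill(0, $n, 0);
    $results = [];

    while (true) {
        // Build the string for the current counter state.
        $chars = [];
        for ($i = 0; $i < $n; $i++) {
            $chars[] = $alternatives[$i][$counter[$i]];
        }
        $results[] = implode('|', $chars);

        // Increment the last "digit", carrying to the left on overflow.
        $pos = $n - 1;
        while ($pos >= 0) {
            $counter[$pos]++;
            if ($counter[$pos] < count($alternatives[$pos])) {
                break; // no overflow at this position
            }
            $counter[$pos] = 0;
            $pos--;
        }
        if ($pos < 0) {
            // The first "digit" overflowed: all combinations are done.
            return $results;
        }
    }
}

print_r(expandSegments('abc|b|ac'));
```

Because there is no recursion, the number of segments does not affect nesting depth; only the two arrays grow.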