How to find similar text in a large string?

I have a large string str and a needle ndl. I need to find the text in str that is most similar to ndl. For example:
SOURCE: "This is a demo text and I love you about this".
NEEDLE: "I you love"
OUTPUT: "I love you"
SOURCE: "I have a unique idea. Do you need one?".
NEEDLE: "a unik idia"
OUTPUT: "a unique idea"
I found that I can do this with similarity measures such as cosine or Manhattan similarity. However, I think implementing these algorithms will be difficult. Could you suggest an easy or fast way to do this, perhaps with a PHP library function? TIA

There is no native PHP function that does this, but what you can do with PHP is limited only by your imagination. Library recommendations are off-topic on SO, and questions that ask for them can be flagged, so instead of suggesting a library I will point you in the directions you need to explore.
As posed, your question rules out simple string-matching functions such as stripos and friends, and a regex can't do this either. For example,
unik and unique
and also
idia and idea
can't be matched by those functions, so you need something like the levenshtein function. Because you need substrings rather than the whole string, and to make the work easier for levenshtein and for your server, you could break both the haystack and the needle into words and then use levenshtein to find the haystack words closest to each needle word.
Here is one way to do it. Read the comments carefully to understand the idea, and you will be able to implement something better.
For strings with only ASCII characters this is relatively easy. Other encodings will probably cause many difficulties, but a simple approach that also handles multibyte strings could look like this:
function to_ascii($text,$encoding="UTF-8") {
if (is_string($text)) {
// Includes combinations of characters that present as a single glyph
$text = preg_replace_callback('/\X/u', __FUNCTION__, $text);
}
elseif (is_array($text) && count($text) == 1 && is_string($text[0])) {
// IGNORE characters that can't be TRANSLITerated to ASCII
$text = @iconv($encoding, "ASCII//IGNORE//TRANSLIT", $text[0]);
// The documentation says that iconv() returns false on failure but it returns ''
if ($text === '' || !is_string($text)) {
$text = '?';
}
elseif (preg_match('/\w/', $text)) { // If the text contains any letters...
$text = preg_replace('/\W+/', '', $text); // ...then remove all non-letters
}
}
else { // $text was not a string
$text = '';
}
return $text;
}
function find_similar($needle,$str,$keep_needle_order=false){
if(!is_string($needle)||!is_string($str))
{
return false;
}
$valid=array();
//get encodings and words from haystack and needle
setlocale(LC_CTYPE, 'en_GB.UTF8');
$encoding_s=mb_detect_encoding($str);
$encoding_n=mb_detect_encoding($needle);
mb_regex_encoding ($encoding_n);
$pneed=array_filter(mb_split('\W',$needle));
mb_regex_encoding ($encoding_s);
$pstr=array_filter(mb_split('\W',$str));
foreach($pneed as $k=>$word)//loop trough needle's words
{
foreach($pstr as $key=>$w)
{
if($encoding_n!==$encoding_s)
{//if $encodings are not the same make some transliteration
$tmp_word=($encoding_n!=='ASCII')?to_ascii($word,$encoding_n):$word;
$tmp_w=($encoding_s!=='ASCII')?to_ascii($w,$encoding_s):$w;
}else
{
$tmp_word=$word;
$tmp_w=$w;
}
$tmp[$tmp_w]=levenshtein($tmp_w,$tmp_word);//collect levenshtein distances
$keys[$tmp_w]=array($key,$w);
}
$nominees=array_flip(array_keys($tmp,min($tmp)));//get the nominees
$tmp=10000;
foreach($nominees as $nominee=>$idx)
{//test sound like to get more precision
$idx=levenshtein(metaphone($nominee),metaphone($tmp_word));
if($idx<$tmp){
$tmp=$idx;//remember the best score so far
$answer=$nominee;//get the winner
}
unset($nominees[$nominee]);
}
if(!$keep_needle_order){
$valid[$keys[$answer][0]]=$keys[$answer][1];//get the right form of the winner
}
else{
$valid[$k]=$keys[$answer][1];
}
$tmp=$nominees=array();//clean a little for the next iteration
}
if(!$keep_needle_order)
{
ksort($valid);
}
$valid=array_values($valid);//get only the values
/*return the array of the closest value to the
needle according to this algorithm of course*/
return $valid;
}
var_dump(find_similar('i knew you love me','finally i know you loved me and all my pets'));
var_dump(find_similar('I you love','This is a demo text and I love you about this'));
var_dump(find_similar('a unik idia','I have a unique idea. Do you need?'));
var_dump(find_similar("Goebel, Weiss, Goethe, Goethe und Goetz",'Weiß, Goldmann, Göbel, Weiss, Göthe, Goethe und Götz'));
var_dump(find_similar('Ḽơᶉëᶆ ȋṕšᶙṁ ḍỡḽǭᵳ ʂǐť ӓṁệẗ, ĉṓɲṩḙċťᶒțûɾ ấɖḯƥĭṩčįɳġ ḝłįʈ',
'Ḽơᶉëᶆ ȋṕšᶙṁ ḍỡḽǭᵳ ʂǐť ӓṁệẗ, ĉṓɲṩḙċťᶒțûɾ ấɖḯƥĭṩčįɳġ ḝłįʈ, șếᶑ ᶁⱺ ẽḭŭŝḿꝋď ṫĕᶆᶈṓɍ ỉñḉīḑȋᵭṵńť ṷŧ ḹẩḇőꝛế éȶ đꝍꞎôꝛȇ ᵯáꞡᶇā ąⱡîɋṹẵ.'));
and the output is:
array(5) {
[0]=>
string(1) "i"
[1]=>
string(4) "know"
[2]=>
string(3) "you"
[3]=>
string(5) "loved"
[4]=>
string(2) "me"
}
array(3) {
[0]=>
string(1) "I"
[1]=>
string(4) "love"
[2]=>
string(3) "you"
}
array(3) {
[0]=>
string(1) "a"
[1]=>
string(6) "unique"
[2]=>
string(4) "idea"
}
array(5) {
[0]=>
string(6) "Göbel"
[1]=>
string(5) "Weiss"
[2]=>
string(6) "Goethe"
[3]=>
string(3) "und"
[4]=>
string(5) "Götz"
}
array(8) {
[0]=>
string(13) "Ḽơᶉëᶆ"
[1]=>
string(13) "ȋṕšᶙṁ"
[2]=>
string(14) "ḍỡḽǭᵳ"
[3]=>
string(6) "ʂǐť"
[4]=>
string(11) "ӓṁệẗ"
[5]=>
string(26) "ĉṓɲṩḙċťᶒțûɾ"
[6]=>
string(23) "ấɖḯƥĭṩčįɳġ"
[7]=>
string(9) "ḝłįʈ"
}
If you need the output as a string, you can call join on the result of the function before using it.
You can run the working code and check the result online.
But keep in mind that this will not work for every kind of string, nor for every PHP version.
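The core idea (split both strings into words, then pick the closest haystack word for each needle word) can also be sketched in a much shorter form. This is only a minimal sketch under stated assumptions: closest_words is a hypothetical helper name, the input is assumed to be plain ASCII, and the distance is divided by the longer word length, which is a choice made here and not part of the code above.

```php
<?php
// Hypothetical helper, not part of the answer above. Assumes ASCII input,
// because levenshtein() and strlen() work on bytes, not characters.
function closest_words(string $needle, string $haystack): array
{
    // Split on runs of non-word characters and drop empty pieces.
    $needleWords = preg_split('/\W+/', $needle, -1, PREG_SPLIT_NO_EMPTY);
    $hayWords    = preg_split('/\W+/', $haystack, -1, PREG_SPLIT_NO_EMPTY);

    $found = [];
    foreach ($needleWords as $word) {
        $bestPosition = null;
        $bestScore = INF;
        foreach ($hayWords as $position => $candidate) {
            $distance = levenshtein(strtolower($word), strtolower($candidate));
            // Divide by the longer length so very short words are not favoured.
            $score = $distance / max(strlen($word), strlen($candidate));
            if ($score < $bestScore) {          // the first best match wins ties
                $bestScore = $score;
                $bestPosition = $position;
            }
        }
        if ($bestPosition !== null) {
            // Key by haystack position so the result can be put back in order.
            $found[$bestPosition] = $hayWords[$bestPosition];
        }
    }
    ksort($found);                               // restore haystack order
    return array_values($found);
}

// join() turns the resulting array into a single string.
echo join(' ', closest_words('a unik idia', 'I have a unique idea. Do you need one?')), "\n";
```

Scoring by relative distance is one way to stop a short word such as "I" from tying with "unique" when the needle word is "unik"; other tie-breakers, such as the metaphone check used above, are equally possible.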

Try this code to find an exact substring within a string:
$data = "I have a unique idea. Do you need one?";
$find = "a unique idea";
$start = strpos($data, $find);
// strpos() returns 0 for a match at the very start, so compare against false
if ($start !== false) {
print_r(substr($data, $start, strlen($find)));
} else {
echo "not found";
}
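One detail worth checking with strpos() is a match at the very start of the string: the function then returns 0, which is falsy, so the result should be compared with !== false. A small sketch with made-up values (the strings are illustrative and not from the question):

```php
<?php
// Illustrative values only: the needle sits at the very start of the haystack.
$data = "a unique idea is all I have";
$find = "a unique idea";

$start = strpos($data, $find);       // int(0) here, which is falsy

$looseCheck  = (bool) $start;        // false: a truthiness test misses the match
$strictCheck = ($start !== false);   // true: the match is detected

var_dump($looseCheck, $strictCheck);
```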

This is a very easy way to do that:
$source = "This is a demo text and I love you about this";
$needle = "I you love";
$words = explode(" " , $source);
$needleWords = explode(" ", $needle);
$results = [];
foreach($needleWords as $key => $needleWord) {
foreach($words as $keyWords => $word) {
if(strcasecmp($word, $needleWord) == 0) {
$results[$keyWords] = $needleWord;
}
}
}
uksort($results, function($a , $b) {
return $a - $b;
});
echo(implode(" " , $results));
Output
I love you

Use this function; it works on strings of any length:
function sim($aa,$bb){
$l1 = strlen($aa);
$l2 = strlen($bb);
if ($l1 > $l2) {
$a = $bb;
$b = $aa;
}else{
$a = $aa;
$b = $bb;
}
// Format string
$a = explode(" ", implode(" ", array_unique(explode(" ", preg_replace("~\b[a-z]{1,2}\b\s*~", "", $a)))));
$b = explode(" ", implode(" ", array_unique(explode(" ", preg_replace("~\b[a-z]{1,2}\b\s*~", "", $b)))));
sort($a);
sort($b);
$a = implode(" ",$a);
$b = implode(" ",$b);
$a = strtolower(preg_replace("/[^A-Za-z0-9\-]/", " ", $a));
$b = strtolower(preg_replace("/[^A-Za-z0-9\-]/", " ", $b));
$dc2 = count(array_diff(str_split($a,2),str_split($b,2)));
$cA1 = count(str_split($a));
$cB1 = count(str_split($b));
$a = explode(" ",$a);
$b = explode(" ",$b);
$dc = count(array_diff($a, $b));
$cA = count($a);
$cB = count($b);
// Calculate similarity
$p = (((($cA + $cB) / 2) - $dc) * 100) / (($cA + $cB) / 2);
$p2 = ((((($cA1 / 2) + ($cB1 / 2)) / 2) - $dc2) * 100) / ((($cA1 / 2) + ($cB1 / 2)) / 2);
$percent = ($p + $p2) / 2;
echo $percent . " %";
}

Related

PHP function for reading out values from a string

I want to write a PHP loop that puts the values from a string into two different variables.
I am a beginner. The format is always the same, like "3":"6", but the length and the amount of numbers vary (the amount is always even). It can also be "23":"673",4:6.

You can strip characters other than numbers and delimiters, and then do explode to get an array of values.
$string = '"23":"673",4:6';
$string = preg_replace('/[^\d\:\,]/', '', $string);
$pairs = explode(',', $string);
$pairs_array = [];
foreach ($pairs as $pair) {
$pairs_array[] = explode(':', $pair);
}
var_dump($pairs_array);
This gives you:
array(2) {
[0]=>
array(2) {
[0]=>
string(2) "23"
[1]=>
string(3) "673"
}
[1]=>
array(2) {
[0]=>
string(1) "4"
[1]=>
string(1) "6"
}
}
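Since the question asks for two variables per pair inside a loop, here is a minimal alternative sketch that uses preg_match_all() with PREG_SET_ORDER. The pattern and the variable names $first and $second are assumptions made for illustration; it expects digits separated by a colon, with optional double quotes.

```php
<?php
$string = '"23":"673",4:6';

// Each match is one pair: optional quote, digits, optional quote, colon, same again.
preg_match_all('/"?(\d+)"?:"?(\d+)"?/', $string, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $match) {
    $first  = (int) $match[1];   // number before the colon
    $second = (int) $match[2];   // number after the colon
    $pairs[] = [$first, $second];
    echo $first, ' => ', $second, "\n";
}
```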

<?php
$string = '"23":"673",4:6';
//Remove quotes from string
$string = str_replace('"','',$string);
//Split string at commas (,)
$splited_number_list = explode(',',$string);
//Loop first list
foreach($splited_number_list as $numbers){
//Split numbers via colon
$splited_numbers = explode(':',$numbers);
//Numbers in to variable
$number1 = $splited_numbers[0];
$number2 = $splited_numbers[1];
echo $number1." - ".$number2 . "<br>";
}
?>

Regular expression for UTF-8 string slicing at line breaks or after a number of characters

I found a function on the web that uses a regular expression to iterate over a string and insert line breaks after a specified number of characters, so that the text fits into a narrow table cell with a fixed width.
Here is the function:
/**
* wordwrap for utf8 encoded strings
*
* @param string $str
* @param integer $len
* @param string $what
* @return string
* @author Milian Wolff <mail@milianw.de>
*/
function utf8_wordwrap($str, $width, $break, $cut = false) {
if (!$cut || $_SESSION['wordwrap']) {
$regexp = '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){'.$width.'}#';
} else {
return $str; //if no wordwrap turned on, returns the original string
}
if (function_exists('mb_strlen')) {
$str_len = mb_strlen($str,'UTF-8');
} else {
$str_len = preg_match_all('/[\x00-\x7F\xC0-\xFD]/', $str, $var_empty);
}
$while_what = ceil($str_len / $width);
$i = 1;
$return = '';
while ($i < $while_what) {
preg_match($regexp, $str,$matches);
$string = $matches[0];
$return .= $string.$break;
$str = substr($str, strlen($string));
$i++;
}
return $return.$str;
}
here is the regexp:
#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){20}#
It does its job well when combined with a while loop, as long as there is no line break character in the string.
An example string:
1. first
2. second
3. third
The output of preg_match:
array (
0 => '1. first
2. second
3',
)
So it just counts up to the 20th character and returns everything before it.
What I would need is:
To make it return everything up to a newline character (\n) or, if there isn't one, the first 20 characters.
So the output in this case would be something like this:
array (
0 => '1. first',
1 => '2. second',
2 => '3. third'
)
UPDATE:
I tried Steve Robbins's answer and it worked perfectly until the string had some special UTF-8 characters in it. It's my fault; I didn't provide a decent example in the first place.
Here is what it does:
<?php
header('Content-type: text/html; charset=UTF-8');
$input = '1. first
2. second
3. third
ez eg nyoulőűúúú3456789öüö987654323456789öü
pam
papam';
$output = array();
foreach (explode("\n", $input) as $value) {
foreach (str_split($value, 20) as $v) {
$trimmed = trim($v);
if (!empty($trimmed))
$output[] = $trimmed;
}
}
var_dump($output);
And the output is:
array(8) {
[0]=>
string(8) "1. first"
[1]=>
string(9) "2. second"
[2]=>
string(8) "3. third"
[3]=>
string(20) "ez eg nyoulőűúú�"
[4]=>
string(20) "�3456789öüö987654"
[5]=>
string(13) "323456789öü"
[6]=>
string(3) "pam"
[7]=>
string(5) "papam"
}
http://codepad.org/Gt4CshXt

Why use regex?
<?php
$input = '1. first
2. second
3. third';
$output = array();
foreach (explode("\n", $input) as $value) {
foreach (str_split($value, 20) as $v) {
$trimmed = trim($v);
if (!empty($trimmed))
$output[] = $trimmed;
}
}
var_dump($output);
Gives
array(3) {
[0]=>
string(8) "1. first"
[1]=>
string(9) "2. second"
[2]=>
string(8) "3. third"
}
Example: http://codepad.org/OoillEUu
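Note that str_split() cuts by bytes, so it can split a multibyte UTF-8 character in half. A minimal sketch of the same loop with mb_str_split(), assuming PHP 7.4 or later with the mbstring extension and UTF-8 input; split_lines_utf8 is a hypothetical name and the sample string is made up:

```php
<?php
// Assumes PHP 7.4+ with the mbstring extension and UTF-8 encoded input.
function split_lines_utf8(string $input, int $width = 20): array
{
    $output = [];
    foreach (explode("\n", $input) as $line) {
        // mb_str_split() counts characters, so multibyte characters stay whole.
        foreach (mb_str_split($line, $width, 'UTF-8') as $piece) {
            $trimmed = trim($piece);
            if ($trimmed !== '') {
                $output[] = $trimmed;
            }
        }
    }
    return $output;
}

// The last line holds 24 two-byte characters, so it is cut after the 20th one.
var_dump(split_lines_utf8("1. first\n2. second\nőűúőűúőűúőűúőűúőűúőűúőűú", 20));
```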

Thanks everyone for your efforts! I've found the solution here
<?php
header('Content-Type: text/html; charset=utf-8');
$input = '1. first
2. second
3. third
ez eg nyoulőűúúú3456789öüö987654323456789öü
pam
papam';
var_dump(utf8_wordwrap($input,20,"<br>",true));
function utf8_wordwrap($string, $width=20, $break="\n", $cut=false)
{
if($cut) {
// Match anything 1 to $width chars long followed by whitespace or EOS,
// otherwise match anything $width chars long
$search = '/(.{1,'.$width.'})(?:\s|$)|(.{'.$width.'})/uS';
$replace = '$1$2'.$break;
} else {
// Anchor the beginning of the pattern with a lookahead
// to avoid crazy backtracking when words are longer than $width
$search = '/(?=\s)(.{1,'.$width.'})(?:\s|$)/uS';
$replace = '$1'.$break;
}
return preg_replace($search, $replace, $string);
}
?>
string '1. first
<br>2. second
<br>3. third
<br>ez eg<br>nyoulőűúúú3456789öüö<br>987654323456789öü
<br>pam
<br>papam<br>' (length=122)

Split a comma-separated string into pairs using PHP

I have a string of 128 values in the form:
1,4,5,6,0,0,1,0,0,5,6,...1,2,3.
I want to pair them up in the form:
(1,4),(5,6),(0,0)
so that I can run a for loop over the 64 pairs in PHP.

You can accomplish this in these steps:
Use explode() to turn the string into an array of numbers
Use array_chunk() to form groups of two
Use array_map() to turn each group into a string with brackets
Use join() to glue everything back together.
You can use this delicious one-liner, because everyone loves those:
echo join(',', array_map(function($chunk) {
return sprintf('(%d,%d)', $chunk[0], isset($chunk[1]) ? $chunk[1] : '0');
}, array_chunk(explode(',', $array), 2)));
If the last chunk is smaller than two items, it will use '0' as the second value.
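The same steps can be written out one at a time, which may be easier to follow than the one-liner. This is a sketch with a made-up input of seven values, so the padding of the last chunk is visible; the variable names are chosen here for illustration.

```php
<?php
$csv = '1,4,5,6,0,0,1';   // sample input with an odd number of values

$numbers = explode(',', $csv);            // ['1', '4', '5', '6', '0', '0', '1']
$chunks  = array_chunk($numbers, 2);      // [['1','4'], ['5','6'], ['0','0'], ['1']]

$pairs = array_map(function (array $chunk) {
    // Pad an incomplete last chunk with 0 so every pair has two values.
    return sprintf('(%d,%d)', $chunk[0], $chunk[1] ?? 0);
}, $chunks);

$result = join(',', $pairs);
echo $result, "\n";   // (1,4),(5,6),(0,0),(1,0)

// The chunks can also drive a loop directly, one pair per iteration.
foreach ($chunks as $chunk) {
    $first  = (int) $chunk[0];
    $second = (int) ($chunk[1] ?? 0);
}
```

The null coalescing operator (??) needs PHP 7 or later; on older versions the isset() form from the one-liner does the same job.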

<?php
$a = 'val1,val2,val3,val4';
function x($value)
{
$buffer = explode(',', $value);
$result = array();
while(count($buffer))
{ $result[] = array(array_shift($buffer), array_shift($buffer)); }
return $result;
}
$result = x($a);
var_dump($result);
?>
Shows:
array(2) { [0]=> array(2) { [0]=> string(4) "val1" [1]=> string(4) "val2" } [1]=> array(2) { [0]=> string(4) "val3" [1]=> string(4) "val4" } }
If you modify it, it might help you this way:
<?php
$a = '1,2,3,4';
function x($value)
{
$buffer = explode(',', $value);
$result = array();
while(count($buffer))
{ $result[] = sprintf('(%d,%d)', array_shift($buffer), array_shift($buffer)); }
return implode(',', $result);
}
$result = x($a);
var_dump($result);
?>
Which shows:
string(11) "(1,2),(3,4)"

how to order string from HTTP_ACCEPT_LANGUAGE [duplicate]

I am trying to code a language option tool. For that I use
$default_language = (strtolower($_SERVER["HTTP_ACCEPT_LANGUAGE"]));
if (eregi('af', $default_language)) {do something}
Now I would like to order the string that I get when I echo:
$_SERVER["HTTP_ACCEPT_LANGUAGE"]
For example, a user has specified a number of languages.
Example for Chrome with different languages:
nl,en-gb;q=0.8,en;q=0.6,fr;q=0.4,fr-ca;q=0.2
So how can I read the string into a certain order, so that I can see that nl is the first preferred language?
The code should be something like:
if ('nl'== array[0]) {do something}
If someone could help me out I would really appreciate it.
Thanks a lot.

From HTTP/1.1 Header Field Definitions:
Each language-range MAY be given an associated quality value which represents an estimate of the user's preference for the languages specified by that range. The quality value defaults to "q=1".
You have to loop over the languages and select the one with the highest quality (preference), like this:
$preferred = "en"; // default
if(isset($_SERVER["HTTP_ACCEPT_LANGUAGE"]))
{
$max = 0.0;
$langs = explode(",", $_SERVER["HTTP_ACCEPT_LANGUAGE"]);
foreach($langs as $lang)
{
$lang = explode(';', $lang);
$q = (isset($lang[1])) ? ((float) substr(trim($lang[1]), 2)) : 1.0; // strip the leading "q=" before casting
if ($q > $max)
{
$max = $q;
$preferred = $lang[0];
}
}
$preferred = trim($preferred);
}
// now $preferred is user's preferred language
If Accept-Language header is not sent, all languages are equally acceptable.

How about explode()?
$array = explode(",",$_SERVER["HTTP_ACCEPT_LANGUAGE"]);
Given your example string, you should end up with the following values
$array[0] = "nl"
$array[1] = "en-gb;q=0.8"
$array[2] = "en;q=0.6"
etc.

If you prefer to assume that the string is not always ordered before it is sent by the browser then the following code will parse and sort it. Note that I've changed French's q to 0.9.
<?php
$lang = 'nl,en-gb;q=0.8,en;q=0.6,fr;q=0.9,fr-ca;q=0.2';
$langs = array();
foreach(explode(',', $lang) as $entry) {
$t1 = explode(';', $entry);
switch( count($t1) ) {
case 1:
$langs[] = array($t1[0], 1.0);
break;
case 2:
$t2 = explode('=', $t1[1]);
$langs[] = array($t1[0], floatval($t2[1]));
break;
default:
echo("what is this I don't even");
break;
}
}
function mysort($a, $b) {
if( $a[1] == $b[1] ) { return 0; }
elseif( $a[1] > $b[1] ) { return -1; }
else { return 1; }
}
usort($langs, 'mysort');
var_dump($langs);
Output:
array(5) {
[0]=>
array(2) {
[0]=>
string(2) "nl"
[1]=>
float(1)
}
[1]=>
array(2) {
[0]=>
string(2) "fr"
[1]=>
float(0.9)
}
[2]=>
array(2) {
[0]=>
string(5) "en-gb"
[1]=>
float(0.8)
}
[3]=>
array(2) {
[0]=>
string(2) "en"
[1]=>
float(0.6)
}
[4]=>
array(2) {
[0]=>
string(5) "fr-ca"
[1]=>
float(0.2)
}
}

try this :
<?php
print_r(Get_Client_Prefered_Language(true, $_SERVER['HTTP_ACCEPT_LANGUAGE']));
function Get_Client_Prefered_Language ($getSortedList = false, $acceptedLanguages = false)
{
if (empty($acceptedLanguages))
$acceptedLanguages = $_SERVER["HTTP_ACCEPT_LANGUAGE"];
// regex borrowed from Gabriel Anderson on http://stackoverflow.com/questions/6038236/http-accept-language
preg_match_all('/([a-z]{1,8}(-[a-z]{1,8})?)\s*(;\s*q\s*=\s*(1|0\.[0-9]+))?/i', $acceptedLanguages, $lang_parse);
$langs = $lang_parse[1];
$ranks = $lang_parse[4];
// (recursive anonymous function)
$getRank = function ($j)use(&$getRank, &$ranks)
{
while (isset($ranks[$j]))
if (!$ranks[$j])
return $getRank($j + 1);
else
return $ranks[$j];
};
// (create an associative array 'language' => 'preference')
$lang2pref = array();
for($i=0; $i<count($langs); $i++)
$lang2pref[$langs[$i]] = (float) $getRank($i);
// (comparison function for uksort)
$cmpLangs = function ($a, $b) use ($lang2pref) {
if ($lang2pref[$a] > $lang2pref[$b])
return -1;
elseif ($lang2pref[$a] < $lang2pref[$b])
return 1;
elseif (strlen($a) > strlen($b))
return -1;
elseif (strlen($a) < strlen($b))
return 1;
else
return 0;
};
// sort the languages by prefered language and by the most specific region
uksort($lang2pref, $cmpLangs);
if ($getSortedList)
return $lang2pref;
// return the first value's key
reset($lang2pref);
return key($lang2pref);
}

The languages are ordered as the user prefers them. All you have to do is split the string at the , symbol and, in each part, get rid of everything from the ; to the end (including the ;). Then you have the languages in the user's preferred order.
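A minimal sketch of that approach, using the sample header from the question. It only gives the right order when the browser lists the languages in descending q order, which is common but not guaranteed; otherwise sort by the q values as the other answers do.

```php
<?php
// Sample header value from the question; in a request it would come from
// $_SERVER['HTTP_ACCEPT_LANGUAGE'].
$header = 'nl,en-gb;q=0.8,en;q=0.6,fr;q=0.4,fr-ca;q=0.2';

$languages = [];
foreach (explode(',', $header) as $part) {
    // Keep only the text before the first ';' and lower-case it.
    $tag = strtolower(trim(explode(';', $part)[0]));
    if ($tag !== '') {
        $languages[] = $tag;
    }
}

if ($languages[0] === 'nl') {
    echo "Dutch is listed first\n";
}
```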

PHP get a part of string

How can I check in PHP whether a string contains '-'?
Example
ABC-cde::abcdef
if '-' is found
then I have to perform split() to split ABC from cde::abcdef
else
no need to perform split()
like cde::abcdef

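if (strpos($string, "-") !== false)
{
// split() was removed in PHP 7; explode() with a limit of 2 splits at the first "-"
$parts = explode("-", $string, 2);
}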

Just use explode; that should be sufficient,
e.g. explode('-', $urstring);
This splits the string into an array of strings only if "-" exists; otherwise it returns the entire string as a single-element array.

How about just using the $limit parameter of explode()?
This will return an array in both your examples, with only one element in the latter case.
Note that split() is deprecated as of PHP 5.3 and was removed in PHP 7.0: http://php.net/manual/en/function.split.php
$s1 = 'ABC-cde::abcdef';
$s2 = 'cde::abcdef';
$s3 = 'ABC-with-more-hyphens';
explode('-', $s1, 2);
// array(2) {
// [0]=>
// string(3) "ABC"
// [1]=>
// string(11) "cde::abcdef"
// }
explode('-', $s2, 2);
// array(1) {
// [0]=>
// string(11) "cde::abcdef"
// }
explode('-', $s3, 2);
// array(2) {
// [0]=>
// string(3) "ABC"
// [1]=>
// string(17) "with-more-hyphens"
// }

Just explode() it and count the elements in the returned array. Maybe it's even enough to continue with the first (or last) element, e.g. $newstring = explode('-', $oldstring)[0]...
