Most efficient way to find matching words from paragraph

Most efficient way to find matching words from paragraph - php

I have a Paragraph that I have to parse for different keywords. For example, Paragraph:
"I want to make a change in the world. Want to make it a better place to live. Peace, Love and Harmony. It is all life is all about. We can make our world a good place to live"
And my keywords are
"world", "earth", "place"
I should report whenever I have a match and how many times.
Output should be:
"world" 2 times and "place" 1 time
Currently, I am just converting Paragraph strings to array of characters and then matching each keyword with all of the array contents.
Which is wasting my resources.
Please guide me for an efficient way.( I am using PHP)

As #CasimiretHippolyte commented, regex is the better means as word boundaries can be used. Further caseless matching is possible using the i flag. Use with preg_match_all return value:
Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.
The pattern for matching one word is: /\bword\b/i. Generate an array where the keys are the word values from search $words and values are the mapped word-count, that preg_match_all returns:
$words = array("earth", "world", "place", "foo");
$str = "at Earth Hour the world-lights go out and make every place on the world dark";
$res = array_combine($words, array_map( function($w) USE (&$str) { return
preg_match_all('/\b'.preg_quote($w,'/').'\b/i', $str); }, $words));
print_r($res); test at eval.in outputs to:
Array
(
[earth] => 1
[world] => 2
[place] => 1
[foo] => 0
)
Used preg_quote for escaping the words which is not necessary, if you know, they don't contain any specials. For the use of inline anonymous functions with array_combine PHP 5.3 is required.

<?php
Function woohoo($terms, $para) {
$result ="";
foreach ($terms as $keyword) {
$cnt = substr_count($para, $keyword);
if ($cnt) {
$result .= $keyword. " found ".$cnt." times<br>";
}
}
return $result;
}
$terms = array('world', 'earth', 'place');
$para = "I want to make a change in the world. Want to make it a better place to live.";
$r = woohoo($terms, $para);
echo($r);
?>

I will use preg_match_all(). Here is how it would look in your code. The actual function returns the count of items found, but the $matches array will hold the results:
<?php
$string = "world";
$paragraph = "I want to make a change in the world. Want to make it a better place to live. Peace, Love and Harmony. It is all life is all about. We can make our world a good place to live";
if (preg_match_all($string, $paragraph, &$matches)) {
echo 'world'.count($matches[0]) . "times";
}else {
echo "match NOT found";
}
?>

Related

How can i replace all specific strings in a long text which have dynamic number in between?

I am trying to replace strings contains specific string including a dynamic number in between.
I tried preg_match_all but it give me NULL value
Here is what i am actually looking for with all details:
In my long text there are values which contains this [_wc_acof_(some dynamic number)] , i.e: [_wc_acof_6] i want to convert them to $postmeta['_wc_acof_14'][0]
This can be multiple in the same long text.
I want to run through with this logic:
1- First i get all numbers after [_wc_acof_ and save them in array by using preg_match_all as guided here get number after string php regex
2- Then i run a foreach loop and set my arrays for patterns and replacements with that number i.e:
foreach ($allMatchNumbers as $MatchNumber){
$key = "[_wc_acof_" . $MatchNumber. "]";
$patterns[] = $key;
$replacements[] = $postmeta[$key][0];
}
3- Then i do replace with this echo preg_replace($patterns, $replacements, $string);
But i am unable to get preg_match_all it gives me NULL where i tried below
preg_match_all('/[_wc_acof_/',$string,$allMatchNumbers );
Please Help? i am not sure if preg_grep is better than this?

It seems you want to process the input in stages, to obtain all the numbers in specific lexical context first, and then modify the user input using some lookup technique.
The first step can be implemented as
preg_match_all('~\[_wc_acof_(\d+)]~', $text, $matches)
that extracts all sequences of one or more digit in between [_wc_acof_ and ] into Group 1 (you can access the values via $matches[1]).
Then, you may fill the $replacements array using these values.
Next, you can use
preg_replace_callback('~\[_wc_acof_(\d+)]~', function($m) use ($replacements){
return $replacements[$m[1]];
}, $text)
See the PHP demo:
<?php
$text = '<p>[_wc_acof_6] i want to convert this and it contains also this [_wc_acof_9] or can be this [_wc_acof_11] number can never be static</p>';
if (preg_match_all('~\[_wc_acof_(\d+)]~', $text, $matches)) {
foreach($matches[1] as $matched){
$replacements[$matched] = 'NEW_VALUE_FOR_'.$matched.'_KEY';
}
print_r($replacements);
echo preg_replace_callback('~\[_wc_acof_(\d+)]~', function($m) use ($replacements){
return $replacements[$m[1]];
}, $text);
}
Output:
Array
(
[0] => 6
[1] => 9
[2] => 11
)
NEW_VALUE_FOR_6_KEY i want to convert this and it contains also this NEW_VALUE_FOR_9_KEY or can be this NEW_VALUE_FOR_11_KEY number can never be static

php regex find substring in substring

I am still playing around for one project with matching words.
Let assume that I have a given string, say maxmuster . Then I want to mark this part of my random word maxs which are in maxmuster in the proper order, like the letters are.
I wil give some examples and then I tell what I already did. Lets keep the string maxmuster. The bold part is the matched one by regex (best would be in php, however could be python, bash, javascript,...)
maxs
Mymaxmuis
Lemu
muster
Of course also m, u, ... will be matched then. I know that, I am going to fix that later. However, the solution, I though, should not so difficult, so I try to divide the word in groups like this:
/(maxmuster)?|(maxmuste)?|(maxmust)?|(maxmus)?|(maxmu)?|(maxm)?|(max)?|(ma)?|(m)?/gui
But then I forgot of course the other combinations, like:
(axmuster)(xmus) and so on. Did I really have to do that, or exist there a simple regex trick, to solve this question, like I explained above?
Thank you very much

Sounds like you need string intersection. If you don't mind non regex idea, have a look in Wikibooks Algorithm Implementation/Strings/Longest common substring PHP section.
foreach(["maxs", "Mymaxmuis", "Lemu", "muster"] AS $str)
echo get_longest_common_subsequence($str, "maxmuster") . "\n";
max
maxmu
mu
muster
See this PHP demo at tio.run (caseless comparison).
If you need a regex idea, I would join both strings with space and use a pattern like this demo.
(?=(\w+)(?=\w* \w*?\1))\w
It will capture inside a lookahead at each position before a word character in the first string the longest substring that also matches the second string. Then by PHP matches of the first group need to be sorted by length and the longest match will be returned. See the PHP demo at tio.run.
function get_longest_common_subsequence($w1="", $w2="")
{
$test_str = preg_quote($w1,'/')." ".preg_quote($w2,'/');
if(preg_match_all('/(?=(\w+)(?=\w* \w*?\1))\w/i', $test_str, $out) > 0)
{
usort($out[1], function($a, $b) { return strlen($b) - strlen($a); });
return $out[1][0];
}
}

TL;DR
Using Regular Expressions:
longestSubstring(['Mymaxmuis', 'axmuis', 'muster'], buildRegexFrom('maxmuster'));
Full snippet
Using below regex you are able to match all true sub-strings of string maxmuster:
(?|((?:
m(?=a)
|(?<=m)a
|a(?=x)
|(?<=a)x
|x(?=m)
|(?<=x)m
|m(?=u)
|(?<=m)u
|u(?=s)
|(?<=u)s
|s(?=t)
|(?<=s)t
|t(?=e)
|(?<=t)e
|e(?=r)
|(?<=e)r
)+)|([maxmuster]))
Live demo
You have to cook such a regex from a word like maxmuster so you need a function to call it:
function buildRegexFrom(string $word): string {
// Split word to letters
$letters = str_split($word);
// Creating all side of alternations in our regex
foreach ($letters as $key => $letter)
if (end($letters) != $letter)
$regex[] = "$letter(?={$letters[$key + 1]})|(?<=$letter){$letters[$key + 1]}";
// Return whole cooked pattern
return "~(?|((?>".implode('|', $regex).")+)|([$word]))~i";
}
To return longest match you need to sort results according to matches length from longest to shortest. It means writing another piece of code for it:
function longestSubstring(array $array, string $regex): array {
foreach ($array as $value) {
preg_match_all($regex, $value, $matches);
usort($matches[1], function($a, $b) {
return strlen($b) <=> strlen($a);
});
// Store longest match being sorted
$substrings[] = $matches[1][0];
}
return $substrings;
}
Putting all things together:
print_r(longestSubstring(['Mymaxmuis', 'axmuis', 'muster'], buildRegexFrom('maxmuster')));
Outputs:
Array
(
[0] => maxmu
[1] => axmu
[2] => muster
)
PHP live demo

Here is my take on this problem using regex.
<?php
$subject="maxmuster";
$str="Lemu";
$comb=str_split($subject); // Split into single characters.
$len=strlen($subject);
for ($i=2; $i<=$len; $i++){
for($start=0; $start<$len; $start++){
$temp="";
$inc=$start;
for($j=0; $j<$i; $j++){
$temp=$temp.$subject[$inc];
$inc++;
}
array_push($comb,$temp);
}
}
echo "Matches are:\n";
for($i=0; $i<sizeof($comb); $i++){
$pattern = "/".$comb[$i]."/";
if(preg_match($pattern,$str, $matches)){
print_r($matches);
};
}
?>
And here is an Ideone Demo.

Replace (add) words case sensitive from arrays

I am new to php and especially to regex.
My target is to enrich textes automatically with hints for "keywords" which are listed in arrays.
So far I had come.
$pattern = array("/\bexplanations\b/i",
"/\btarget\b/i",
"/\bhints\b/i",
"/\bhint\b/i",
);
$replacement = array("explanations <i>(Erklärungen)</i>",
"target <i>Ziel</i>",
"hints <i>Hinsweise</i>",
"hint <i>Hinweis</i>",
);
$string = "Target is to add some explanations (hints) from an array to
this text. I am thankful for every hint.";
echo preg_replace($pattern, $replacement, $string);
returns:
target <i>Ziel</i> is to add some explanations <i>(Erklärungen)</i> (hints <i>Hinsweise</i>) from an array to this text. I am thankful for every hint <i>Hinweis</i>
1) In generally I wonder if there are more elegant solutions (eventually without replacing the original word)?
On later state the arrays will contain more than 1000 items... and come from mariadb.
2) How can I achive, that the word "Targets" achives a case sensitive treatment?
(without duplicate the length of my arrays).
Sorry for my English and many thanks in advance.

If you project to increase the size of your array and if the text may be a bit long, processing all the text (once per word) isn't a reliable way. Also, with a large array, it isn't reliable to build a giant alternation with all the words.
But if you store all the translations in an associative array and split the text on word-boundaries, you can do it in one pass:
// Translation array with all keys lowercase
$trans = [ 'explanations' => 'Erklärungen',
'target' => 'Ziel',
'hints' => 'Hinsweise',
'hint' => 'Hinweis'
];
$parts = preg_split('~\b~', $text);
$partsLength = count($parts);
// All words are in the odd indexes
for ($i=1; $i<$partsLength; $i+=2) {
$lcWord = strtolower($parts[$i]);
if (isset($trans[$lcWord]))
$parts[$i] .= ' <i>(' . $trans[$lcWord] . ')</i>';
}
$result = implode('', $parts);
Actually the limitation here is that you can't use a key that contains a word-boundary (if you want to translate a whole expression with several words for instance), but if you want to handle this case, you can use preg_match_all in place of preg_split and build a pattern that tests these special cases before, something like:
preg_match_all('~mushroom pie\b|\w+|\W*~iS', $text, $m);
$parts = &$m[0];
$partsLength = count($parts);
$i = 1 ^ preg_match('~^\w~', $parts[0]);
for (; $i<$partsLength; $i+=2) {
...
(if you have a lot of exceptions (too many) other strategies are possible.)

Enclose search words with parentheses in regex patterns and use backteferences in replacements. 
See this PHP demo:
$pattern = array("/\b(explanations)\b/i", "/\b(target)\b/i", "/\b(hints)\b/i", "/\b(hint)\b/i", );
$replacement = array('$1 <i>(Erklärungen)</i>', '$1 <i>Ziel</i>', '$1 <i>Hinsweise</i>', '$1 <i>Hinweis</i>', );
$string = "Target is to add some explanations (hints) from an array to this text. I am thankful for every hint.";
echo preg_replace($pattern, $replacement, $string);
That way, you will replace with the words found with actual case used in the text.
Note it is very important to make sure the patterns go in the descending order with longer patterns coming before shorter ones (first Targets, then Target, etc.)

Matching a substring (an apostrophe) in a given word using regex

I have a server application which looks up where the stress is in Russian words. The end user writes a word жажда. The server downloads a page from another server which contains the stresses indicated with apostrophes for each case/declension like this жа'жда. I need to find that word in the downloaded page.
In Russian the stress is always written after a vowel. I've been using so far a regex that is a grouping of all possible combinations (жа'жда|жажда'). Is there a more elegant solution using just a regex pattern instead of making a PHP script which creates all these combinations?
EDIT:
I have a word жажда
The downloaded page contains the string жа'жда. (notice the
apostrophe, I do not before-hand know where the apostrophe in the
word is)
I want to match the word with apostrophe (жа'жда).
P.S.: So far I have a PHP script creating the string (жа'жда|жажда') used in regex (apostrophe is only after vowels) which matches it. My goal is to get rid of this script and use just regex in case it's possible.

If I understand your question,
have these options (d'isorder|di'sorder|dis'order|diso'rder|disor'der|disord'er|disorde'r|disorder‌') and one of these is in the downloaded page and I need to find out which one it is
this may suit your needs:
<pre>
<?php
$s = "d'isorder|di'sorder|dis'order|diso'rder|disor'der|disord'er|disorde'r|disorder'|disorde'";
$s = explode("|",$s);
print_r($s);
$matches = preg_grep("#[aeiou]'#", $s);
print_r($matches);
running example: https://eval.in/207282

Uhm... Is this ok with you?
<?php
function find_stresses($word, $haystack) {
$pattern = preg_replace('/[aeiou]/', '\0\'?', $word);
$pattern = "/\b$pattern\b/";
// word = 'disorder', pattern = "diso'?rde'?r"
preg_match_all($pattern, $haystack, $matches);
return $matches[0];
}
$hay = "something diso'rder somethingelse";
find_stresses('disorder', $hay);
// => array(diso'rder)
You didn't specify if there can be more than one match, but if not, you could use preg_match instead of preg_match_all (faster). For example, in Italian language we have àncora and ancòra :P
Obviously if you use preg_match, the result would be a string instead of an array.

Based, on your code, and the requirements that no function is called and disorder is excluded. I think this is what you want. I have added a test vector.
<pre>
<?php
// test code
$downloadedPage = "
there is some disorde'r
there is some disord'er in the example
there is some di'sorder in the example
there also' is some order in the example
there is some disorder in the example
there is some dso'rder in the example
";
$word = 'disorder';
preg_match_all("#".preg_replace("#[aeiou]#", "$0'?", $word)."#iu"
, $downloadedPage
, $result
);
print_r($result);
$result = preg_grep("#'#"
, $result[0]
);
print_r($result);
// the code you need
$word = 'also';
preg_match("#".preg_replace("#[aeiou]#", "$0'?", $word)."#iu"
, $downloadedPage
, $result
);
print_r($result);
$result = preg_grep("#'#"
, $result
);
print_r($result);
Working demo: https://eval.in/207312

Regex, PHP - finding words that need correction

I have a long string with words. Some of the words have special letters.
For example a string "have now a rea$l problem with$ dolar inp$t"
and i have a special letter "$".
I need to find and return all the words with special letters in a quickest way possible.
What I did is a function that parse this string by space and then using “for” going over all the words and searching for special character in each word. When it finds it—it saves it in an array. But I have been told that using regexes I can have it with much better performance and I don’t know how to implement it using them.
What is the best approach for it?
I am a new to regex but I understand it can help me with this task?
My code: (forbiden is a const)
The code works for now, only for one forbidden char.
function findSpecialChar($x){
$special = "";
$exploded = explode(" ", $x);
foreach ($exploded as $word){
if (strpos($word,$forbidden) !== false)
$special .= $word;
}
return $special;
}

You could use preg_match like this:
// Set your special word here.
$special_word = "café";
// Set your sentence here.
$string = "I like to eat food at a café and then read a magazine.";
// Run it through 'preg_match''.
preg_match("/(?:\W|^)(\Q$special_word\E)(?:\W|$)/i", $string, $matches);
// Dump the results to see it working.
echo '<pre>';
print_r($matches);
echo '</pre>';
The output would be:
Array
(
[0] => café
[1] => café
)
Then if you wanted to replace that, you could do this using preg_replace:
// Set your special word here.
$special_word = "café";
// Set your special word here.
$special_word_replacement = " restaurant ";
// Set your sentence here.
$string = "I like to eat food at a café and then read a magazine.";
// Run it through 'preg_replace''.
$new_string = preg_replace("/(?:\W|^)(\Q$special_word\E)(?:\W|$)/i", $special_word_replacement, $string);
// Echo the results.
echo $new_string;
And the output for that would be:
I like to eat food at a restaurant and then read a magazine.
I am sure the regex could be refined to avoid having to add spaces before and after " restaurant " like I do in this example, but this is the basic concept I believe you are looking for.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Most efficient way to find matching words from paragraph - php

Related

How can i replace all specific strings in a long text which have dynamic number in between?

php regex find substring in substring

Replace (add) words case sensitive from arrays

Matching a substring (an apostrophe) in a given word using regex

Regex, PHP - finding words that need correction

Categories

Resources