Replace (add) words case sensitive from arrays


I am new to PHP and especially to regex.
My goal is to automatically enrich texts with hints for "keywords" that are listed in arrays.
This is what I have so far:
$pattern = array("/\bexplanations\b/i",
"/\btarget\b/i",
"/\bhints\b/i",
"/\bhint\b/i",
);
$replacement = array("explanations <i>(Erklärungen)</i>",
"target <i>Ziel</i>",
"hints <i>Hinweise</i>",
"hint <i>Hinweis</i>",
);
$string = "Target is to add some explanations (hints) from an array to
this text. I am thankful for every hint.";
echo preg_replace($pattern, $replacement, $string);
returns:
target <i>Ziel</i> is to add some explanations <i>(Erklärungen)</i> (hints <i>Hinweise</i>) from an array to this text. I am thankful for every hint <i>Hinweis</i>.
1) In general, I wonder whether there are more elegant solutions (possibly without replacing the original word)?
Later on, the arrays will contain more than 1000 items and will come from MariaDB.
2) How can I make sure that a word like "Targets" keeps its original case
(without doubling the length of my arrays)?
Sorry for my English and many thanks in advance.

If you plan to grow your array and the text may be long, processing the whole text once per word isn't a viable approach. With a large array, building one giant alternation out of all the words isn't viable either.
But if you store all the translations in an associative array and split the text on word boundaries, you can do it in one pass:
// Translation array with all keys lowercase
$trans = [ 'explanations' => 'Erklärungen',
           'target'       => 'Ziel',
           'hints'        => 'Hinweise',
           'hint'         => 'Hinweis'
];
// Splitting on \b yields alternating non-word and word chunks
$parts = preg_split('~\b~', $text);
$partsLength = count($parts);
// All words are at the odd indexes
for ($i = 1; $i < $partsLength; $i += 2) {
    $lcWord = strtolower($parts[$i]);
    if (isset($trans[$lcWord])) {
        $parts[$i] .= ' <i>(' . $trans[$lcWord] . ')</i>';
    }
}
$result = implode('', $parts);
The limitation here is that you can't use a key that itself contains a word boundary (for instance, if you want to translate a whole expression made of several words). To handle that case, you can use preg_match_all instead of preg_split and build a pattern that tests these special cases first, something like:
preg_match_all('~mushroom pie\b|\w+|\W*~iS', $text, $m);
$parts = &$m[0];
$partsLength = count($parts);
$i = 1 ^ preg_match('~^\w~', $parts[0]);
for (; $i<$partsLength; $i+=2) {
...
(If you have too many such exceptions, other strategies are possible.)
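As a rough illustration of the preg_match_all idea, here is a minimal sketch. The function name annotateText and the sample data are hypothetical, and it assumes PHP 7.4+ (arrow functions) and that all keys of $trans are lowercase. Instead of relying on index parity, it simply looks up every matched part, and it uses \W+ rather than \W* so that no empty part is produced.

```php
<?php
// Hypothetical helper: annotates every known word or multi-word phrase in $text.
// $trans maps lowercase keys (single words or phrases) to their hints.
function annotateText(string $text, array $trans): string
{
    // Keys containing a non-word character (e.g. a space) need their own branch.
    $phrases = array_filter(array_keys($trans), fn($k) => preg_match('~\W~', $k) === 1);
    // Longest phrase first, so a longer phrase wins over a phrase it starts with.
    usort($phrases, fn($a, $b) => strlen($b) - strlen($a));

    $branches = array_map(fn($p) => '\b' . preg_quote($p, '~') . '\b', $phrases);
    $branches[] = '\w+'; // any single word
    $branches[] = '\W+'; // any run of non-word characters
    $pattern = '~' . implode('|', $branches) . '~i';

    preg_match_all($pattern, $text, $m);

    $out = '';
    foreach ($m[0] as $part) {
        $key = strtolower($part);
        $out .= isset($trans[$key])
            ? $part . ' <i>(' . $trans[$key] . ')</i>'
            : $part;
    }
    return $out;
}

echo annotateText(
    'I like Mushroom Pie and every hint.',
    ['mushroom pie' => 'Pilzkuchen', 'hint' => 'Hinweis']
), "\n";
```

The original case of each match is kept because the hint is appended to the matched text rather than replacing it.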

Enclose the search words in parentheses in the regex patterns and use backreferences in the replacements.
See this PHP demo:
$pattern = array("/\b(explanations)\b/i", "/\b(target)\b/i", "/\b(hints)\b/i", "/\b(hint)\b/i");
$replacement = array('$1 <i>(Erklärungen)</i>', '$1 <i>Ziel</i>', '$1 <i>Hinweise</i>', '$1 <i>Hinweis</i>');
$string = "Target is to add some explanations (hints) from an array to this text. I am thankful for every hint.";
echo preg_replace($pattern, $replacement, $string);
That way, the replacement reuses the word exactly as it was found, with the case used in the text.
Note that it is important to order the patterns by descending length, with longer patterns coming before shorter ones (first Targets, then Target, etc.).
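A minimal sketch of how the pattern and replacement arrays could be generated from one associative array, sorted longest key first. The $hints array is a hypothetical stand-in for the rows that would come from MariaDB, and PHP 7.4+ is assumed for the arrow function.

```php
<?php
// Hypothetical word list: lowercase keyword => hint.
$hints = ['hint' => 'Hinweis', 'hints' => 'Hinweise', 'target' => 'Ziel'];

// Sort by key length, longest first.
uksort($hints, fn($a, $b) => strlen($b) - strlen($a));

$patterns = [];
$replacements = [];
foreach ($hints as $word => $hint) {
    // preg_quote guards against keywords that contain regex metacharacters.
    $patterns[] = '/\b(' . preg_quote($word, '/') . ')\b/i';
    // $1 re-inserts the word with the case found in the text.
    $replacements[] = '$1 <i>(' . $hint . ')</i>';
}

$string = 'Target is to add hints. I am thankful for every hint.';
$result = preg_replace($patterns, $replacements, $string);
echo $result, "\n";
```

One caveat with this design: preg_replace applies the patterns one after another, so a hint that itself contains a later keyword would be annotated again.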

Related

preg_replace: how to consider whole array of patterns before replacing?

I'm using preg_replace to match and replace improperly encoded UTF-8 characters with their proper characters. I've created an "old" array containing the wrong characters, and a corresponding "new" array with the replacements. Here is a snippet of each array:
$old = array(
'/â€/',
'/â€™/',
);
$new = array(
'†',
'’',
);
(Note: If you're curious about why I'm doing this, read more here)
A sample string that may contain the wrong data could be:
The programmerâ€™s becoming very frustrated
Which should become:
The programmer's becoming very frustrated
I'm using this function:
$result = preg_replace($old, $new, $str);
But the subject is actually becoming:
The programmer†™s becoming very frustrated
It's clear that PHP is doing what I would call a non-greedy match on the subject (not the correct term to use here, I know). preg_replace executes the replacement for the first pair in the old/new arrays without considering whether a different pattern in the pattern array may be more appropriate. If I reverse the order of the replacement pairs, it works as expected.
My question is: Is there an approach that will allow preg_replace to consider all elements of the pattern array before executing a replacement, or is my only option to re-order the arrays?

I don't think there is any option like that. However, you could use an associative array to store your replacements and sort it using uasort and strlen, so larger matches would come first and you wouldn't need to manage your array order manually.
Then you can use array_keys and array_values to act just like your separated $old and $new arrays.
$replacements = array(
'†' => '/â€/',
'’' => '/â€™/',
);
// sorts the replacements array by value string length keeping the indexes intact
uasort($replacements, function($a, $b) {
return strlen($b) - strlen($a);
});
$str = 'The programmerâ€™s becoming very frustrated';
$result = preg_replace(array_values($replacements), array_keys($replacements), $str);
EDIT: As @CasimiretHippolyte pointed out, using array_values is not necessary on the first parameter of the preg_replace function in this case. It would only return a copy of $replacements with numerical indexes, but the order would be the same. Unless you need an array with a structure identical to $old from your question, you do not need to use it.

Order the arrays $old and $new in such a way that the longest regex comes first:
$old = array(
'/â€™/',
'/â€/',
);
$new = array(
'’',
'†',
);
$str = 'The programmerâ€™s becoming very frustrated';
$result = preg_replace($old, $new, $str);
echo $result,"\n";
output:
The programmer’s becoming very frustrated

I don't believe there is a way to do this using only preg_replace. However, you can easily do it by sorting your array beforehand:
$replacements = array_combine($old, $new);
// sort keys in reverse order so that '/â€™/' comes before '/â€/'
krsort($replacements);
$result = preg_replace(array_keys($replacements), array_values($replacements), $string);
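If the search terms are plain strings rather than real regular expressions, as they appear to be in the question, strtr with an array argument may be an alternative worth considering. This is a different technique from the preg_replace answers above: strtr tries the longest key first and does not rescan text it has already replaced, so the array order does not matter. A minimal sketch, using the two literal sequences from the question without regex delimiters:

```php
<?php
// Literal search strings (no regex delimiters), taken from the question.
// The shorter key is deliberately listed first to show that order is irrelevant.
$map = [
    'â€'  => '†',
    'â€™' => '’',
];

$str = 'The programmerâ€™s becoming very frustrated';

// strtr picks the longest matching key at each position.
$result = strtr($str, $map);
echo $result, "\n";
```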

PHP Using str_word_count with strsplit to form array after x words

I've got a large string that I want to split into an array after every 50 words. I thought about using str_split to cut it, but realised that it won't take the words into consideration; it just splits when it reaches x characters.
I've read about str_word_count but can't work out how to put the two together.
What I've got at the moment is:
$outputArr = str_split($output, 250);
foreach($outputArr as $arOut){
echo $arOut;
echo "<br />";
}
But I want to change that so that each array item holds 50 words instead of 250 characters.
Any help will be much appreciated.

Assuming that str_word_count is sufficient for your needs¹, you can simply call it with 1 as the second parameter and then use array_chunk to group the words in groups of 50:
$words = str_word_count($string, 1);
$chunks = array_chunk($words, 50);
You now have an array of arrays; to join every 50 words together and make it an array of strings you can use
foreach ($chunks as &$chunk) { // important: iterate by reference!
    $chunk = implode(' ', $chunk);
}
unset($chunk); // break the reference so later code cannot modify the last element by accident
¹ Most probably it is not. If you want to get what most humans consider acceptable results when processing written language you will have to use preg_split with some suitable regular expression instead.
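As a rough sketch of the preg_split variant mentioned in the footnote, the following splits on whitespace only, so punctuation stays attached to its word. The function name chunkWords is hypothetical, the whitespace-based definition of a "word" is an assumption that may not suit every text, and PHP 7.4+ is assumed for the arrow function.

```php
<?php
// Hypothetical helper: groups the whitespace-separated words of $text
// into strings of at most $size words each ($size must be at least 1).
function chunkWords(string $text, int $size): array
{
    // PREG_SPLIT_NO_EMPTY drops empty pieces caused by leading/trailing whitespace.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // array_chunk groups the words; each group is joined back with single spaces.
    return array_map(fn($chunk) => implode(' ', $chunk), array_chunk($words, $size));
}

foreach (chunkWords('one two three four five', 2) as $piece) {
    echo $piece, "<br />\n";
}
```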

There's another way:
<?php
$someBigString = <<<SAMPLE
This, actually, is a nice' old'er string, as they said, "divided and conquered".
SAMPLE;
// change this to whatever you need to:
$number_of_words = 7;
$arr = preg_split("#([a-z]+[a-z'-]*(?<!['-]))#i",
$someBigString, $number_of_words + 1, PREG_SPLIT_DELIM_CAPTURE);
$res = implode('', array_slice($arr, 0, $number_of_words * 2));
echo $res;
I consider preg_split a better tool than str_word_count here. Not because the latter is inflexible (it is not: you can define which symbols can make up a word with its third parameter), but because preg_split essentially stops processing the string after getting N items.
The trick, as is quite common with this function, is to capture the delimiters as well, then use them to reconstruct the string with the first N words (where N is given) and the punctuation marks preserved.
(Of course, the regex used in my example does not strictly match the locale-dependent behavior of str_word_count. But it still restricts words to letters, ' and -, with the latter two not allowed at the beginning or the end of a word.)

using preg_split to fetch terms between two markers in string

OK, I have two strings.
(I use this for a language library system that lets translators provide translations with placeholders.)
The first string contains two instances. Note that this is not always the case: there may be none, one, two, or more.
This is a {[John Doe]} and this is {[Jane Doe]}
and then I have a string that is stored like this:
C'est {[1]} et c'est {[2]}
(translation)
This is a {[1]} and this is a {[2]}
So what I need to do is take the first string, extract everything between {[ and ]}, and match each instance with the second string, i.e. the first instance of the first string goes to {[1]} of the second string, and so on. Keep in mind that the reason I am using {[1]} and {[2]} is that in some languages the terms may appear in a different order for grammatical accuracy, but they are still terms that don't need translation themselves (names).
So the question is: how do I do this? I am thinking of preg_split and then matching index+1 of each piece with the second string. That part I can handle. The problem I am having is getting the right regex.
This is as close as I could get:
preg_split('/[(\{\[).*(\]\})]/', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
That returns an array of everything before and after each instance of {[ and ]}, when I am just trying to get the contents in between the two.
EDIT: solution derived from NikiC's answer.
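function lang($str){
    $nwStr = $str;
    preg_match_all('(\{\[(.+?)\]\})', $str, $placeholders);
    // Build the label string by numbering the placeholders: {[1]}, {[2]}, ...
    foreach ($placeholders[0] as $mk => $match) {
        $pos = $mk+1;
        $nwStr = str_replace("$match","{[$pos]}",$nwStr);
    }
    // Note: $translation is not defined in this snippet; it is presumably looked up in the language files using $nwStr
    $result = preg_replace_callback('(\{\[(\d+)\]\})', function ($matches) use ($placeholders) {
        $n = $matches[1]-1;
        return $placeholders[1][$n];
    }, $translation);
    return $result;
}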
Basically, what I am doing here is first looping through the matches to replace them with numbered placeholders, so that I can find the matching entry in my language files (i.e. create the right label string out of the input string).

First grab the placeholders from the string:
preg_match_all('(\{\[(.+?)\]\})', $string, $matches);
$placeholders = $matches[1];
Now replace with a callback:
$result = preg_replace_callback('(\{\[(\d+)\]\})', function ($matches) use ($placeholders) {
    // {[1]} refers to the first captured term, which is stored at index 0
    $n = $matches[1] - 1;
    return $placeholders[$n];
}, $translation);
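Put together with the sample strings from the question, the two steps could look like the sketch below. The variable names $source and $index are illustrative, and leaving an unknown placeholder number untouched is an assumption about the desired fallback, not something the question specifies.

```php
<?php
$source      = 'This is a {[John Doe]} and this is {[Jane Doe]}';
$translation = "C'est {[1]} et c'est {[2]}";

// Step 1: collect the terms between {[ and ]} from the source string.
preg_match_all('(\{\[(.+?)\]\})', $source, $matches);
$placeholders = $matches[1]; // ['John Doe', 'Jane Doe']

// Step 2: replace each numbered placeholder in the translation.
$result = preg_replace_callback('(\{\[(\d+)\]\})', function ($m) use ($placeholders) {
    $index = (int) $m[1] - 1;              // {[1]} maps to index 0
    return $placeholders[$index] ?? $m[0]; // assumed fallback: keep unknown placeholders as-is
}, $translation);

echo $result, "\n";
```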

You're almost there. PREG_SPLIT_DELIM_CAPTURE captures the groups between ( and ), so this:
preg_split('/(\{\[.*\]\})/U', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
should work better. I also added the U modifier so that * is ungreedy.
Edit: also, you have a pair of [ and ] which definitely don't belong there!
Another thing: you probably want only the parts inside the {[...]} construct, so this is better:
preg_split('/\{\[(.*)\]\}/U', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
By removing the PREG_SPLIT_NO_EMPTY, you now know for certain that you will find the tagged parts at odd indexes.
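To illustrate the odd-index property, here is a small sketch that runs the last pattern on the sample string from the question and collects the tagged parts. The variable name $terms is illustrative.

```php
<?php
$str = 'This is a {[John Doe]} and this is {[Jane Doe]}';

// Without PREG_SPLIT_NO_EMPTY, plain text and captured terms strictly alternate.
$parts = preg_split('/\{\[(.*)\]\}/U', $str, -1, PREG_SPLIT_DELIM_CAPTURE);

$terms = [];
foreach ($parts as $i => $part) {
    if ($i % 2 === 1) { // odd indexes hold the captured terms
        $terms[] = $part;
    }
}

print_r($terms);
```

The trailing empty string in $parts comes from the match at the very end of the input; it is what keeps the alternation intact.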

PHP: Bolding of overlapping keywords in string

This is a problem that I have figured out how to solve, but I want to solve it in a simpler way... I'm trying to improve as a programmer.
Have done my research and have failed to find an elegant solution to the following problem:
I have a hypothetical array of keywords to search for:
$keyword_array = array('he','heather');
and a hypothetical string:
$text = "What did he say to heather?";
And, finally, a hypothetical function:
function bold_keywords($text, $keyword_array)
{
$pattern = array();
$replace = array();
foreach($keyword_array as $keyword)
{
$pattern[] = "/($keyword)/is";
$replace[] = "<b>$1</b>";
}
$text = preg_replace($pattern, $replace, $text);
return $text;
}
The function (not too surprisingly) is returning something like this:
"What did <b>he</b> say to <b>he</b>ather?"
Because it is not recognizing "heather" when there is a bold tag in the middle of it.
What I want the final solution to do is, as simply as possible, return one of the two following strings:
"What did <b>he</b> say to <b>heather</b>?"
"What did <b>he</b> say to <b><b>he</b>ather</b>?"
Some final conditions:
--I would like the final solution to deal with a very large number of possible keywords
--I would like it to deal with the following two situations (lines represent overlapping strings):
One string engulfs the other, like the following two examples:
-- he, heather
-- sanding, and
Or one string does not engulf the other:
-- entrain, training
Possible way to solve:
-A regex that ignores tags in keywords
-Long way (that I am trying to avoid):
*Search string for all occurrences of each keyword, store an array of positions (start and end) of keywords to be bolded
*Process this array recursively to combine overlapping keywords, so there is no redundancy
*Add the bold tags (starting from the end of the string, to avoid the positions of information shifting from the additional characters)
Many thanks in advance!

Example
$keyword_array = array('he','heather');
$text = "What did he say to heather?";
$pattern = array();
$replace = array();
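// Longest keywords first, so that "heather" is wrapped before "he" gets a chance to match inside it
usort($keyword_array, function ($a, $b) {
    return strlen($b) - strlen($a);
});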
foreach($keyword_array as $keyword)
{
$pattern[] = "/ ($keyword)/is";
$replace[] = " <b>$1</b>";
}
$text = preg_replace($pattern, $replace, $text);
echo $text; // What did <b>he</b> say to <b>heather</b>?

You need to change your regex pattern so that each term you are searching for must be followed by whitespace or punctuation; that way the pattern does not match terms that are followed by an alphanumeric character.

A simplistic and lazy-ish approach off the top of my head:
Sort your initial array by item length, descending! No more "not recognized because there's already a tag in the middle" issues!
Edit: The nested tags issue is then easily fixed by extending your regex so that >foo and foo< are no longer matched.
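One way to combine the length sort with a single pass is to join the keywords into one alternation, so the inserted tags are never scanned again. This is only a sketch under assumptions: the function name boldKeywords is illustrative, PHP 7.4+ is assumed, and it covers the case where one keyword engulfs another (he/heather, sanding/and) but not partially overlapping keywords such as entrain/training. A very large keyword list may also make the alternation unwieldy.

```php
<?php
// Hypothetical helper: wraps every keyword occurrence in <b> tags in one pass.
function boldKeywords(string $text, array $keywords): string
{
    if (count($keywords) === 0) {
        return $text; // an empty alternation would match everywhere
    }

    // Longest first: at any position the regex engine tries "heather" before "he".
    usort($keywords, fn($a, $b) => strlen($b) - strlen($a));

    $alternation = implode('|', array_map(fn($k) => preg_quote($k, '/'), $keywords));

    // $0 is the whole match, so the original case is preserved.
    return preg_replace('/' . $alternation . '/i', '<b>$0</b>', $text);
}

echo boldKeywords('What did he say to heather?', ['he', 'heather']), "\n";
```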

Regular Expressions: how to do "option split" replaces

Those regular expressions drive me crazy. I'm stuck with this one:
test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not
Task:
Remove all [[ and ]], and if there is an option split, choose the last one, so the output should be:
test1:link test2:silver test3:out1insideout2 test4:this|not
I came up with (PHP)
$text = preg_replace("/\\[\\[|\\]\\]/",'',$text); // remove [[ or ]]
This works for part 1 of the task, but I think I should do the option split before that. My best solution:
$text = preg_replace("/\\[\\[(.*\|)(.*?)\\]\\]/",'$2',$text);
Result:
test1:silver test3:[[out1[[inside]]out2]] this|not
I'm stuck. Could someone with a few free minutes help me? Thanks!

I think the easiest way to do this would be multiple passes. Use a regular expression like:
\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]
This replaces each option group with its last option. If you run it repeatedly until it no longer matches, you should get the right result (the first pass will replace [[out1[[inside]]out2]] with [[out1insideout2]] and the second will ditch the brackets).
Edit 1: By way of explanation,
\[\[ # Opening [[
(?: # A non-capturing group (we don't want this bit)
[^\[\]] # Non-bracket characters
* # Zero or more of them
\| # A literal '|' character representing the end of the discarded options
)? # This group is optional: if there is only one option, it won't be present
( # The group we're actually interested in ($1)
[^\[\]] # All the non-bracket characters
+ # Must be at least one
) # End of $1
\]\] # Closing ]]
Edit 2: Changed expression to ignore ']' as well as '[' (it works a bit better like that).
Edit 3: There is no need to know the number of nested brackets as you can do something like:
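$pattern = '/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/';
$oldtext = "";
$newtext = $text;
while ($newtext !== $oldtext)
{
    $oldtext = $newtext;
    // keep only group 1 (the last option) of every innermost [[...]]
    $newtext = preg_replace($pattern, '$1', $oldtext);
}
$text = $newtext;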
Basically, this keeps running the regular expression replace until the output is the same as the input.
Note that I don't know PHP, so there are probably syntax errors in the above.
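For reference, the same repeat-until-stable idea can be sketched in PHP using the optional fifth parameter of preg_replace, which receives the number of replacements made. The function name resolveOptions is illustrative; the pattern is the one given above.

```php
<?php
// Hypothetical helper: resolves nested [[...]] groups, keeping the last option of each.
function resolveOptions(string $text): string
{
    $pattern = '/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/';

    do {
        // Each pass resolves the innermost groups; $count reports how many were replaced.
        $text = preg_replace($pattern, '$1', $text, -1, $count);
    } while ($count > 0);

    return $text;
}

echo resolveOptions('test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not'), "\n";
```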

This is impossible to do in one regular expression since you want to keep content in multiple "hierarchies" of the content. It would be possible otherwise, using a recursive regular expression.
Anyways, here's the simplest, most greedy regular expression I can think of. It should only replace if the content matches your exact requirements.
You will need to escape all backslashes when putting it into a string (\ becomes \\.)
\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]
As others have already explained, you use this with multiple passes. Keep looping while there are matches, performing replacement (only keeping match group 1.)
Difference from other regular expressions here is that it will allow you to have single brackets in the content, without breaking:
test1:[[link]] test2:[[gold|si[lv]er]]
test3:[[out1[[in[si]de]]out2]] test4:this|not
becomes
test1:link test2:si[lv]er
test3:out1in[si]deout2 test4:this|not

Why try to do it all in one go. Remove the [[]] first and then deal with options, do it in two lines of code.
When trying to get something going favour clarity and simplicity.
Seems like you have all the pieces.

Why not just simply remove any brackets that are left?
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$str = preg_replace('/\\[\\[(?:[^|\\]]+\\|)+([^\\]]+)\\]\\]/', '$1', $str);
$str = str_replace(array('[', ']'), '', $str);

Well, I didn't stick to just regex, because I'm of a mind that trying to do stuff like this with one big regex leads you to the old joke about "Now you have two problems". However, give something like this a shot:
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$reg = '/(.*?):(.*?)( |$)/';
preg_match_all($reg, $str, $m);
foreach ($m[2] as $pos => $match) {
    if (strpos($match, '|') !== FALSE && strpos($match, '[[') !== FALSE) {
        $opt = explode('|', $match);
        $match = $opt[count($opt)-1];
    }
    $m[2][$pos] = str_replace(array('[', ']'), '', $match);
}
foreach ($m[1] as $k => $v) $result[$k] = $v.':'.$m[2][$k];

This is C# using only verbatim (non-escaped) strings, so you will have to double the backslashes in other languages.
String input = "test1:[[link]] " +
"test2:[[gold|silver]] " +
"test3:[[out1[[inside]]out2]] " +
"test4:this|not";
String step1 = Regex.Replace(input, @"\[\[([^|]+)\|([^\]]+)\]\]", @"[[$2]]");
String step2 = Regex.Replace(step1, @"\[\[|\]\]", String.Empty);
// Prints "test1:silver test3:out1insideout2 test4:this|not"
Console.WriteLine(step2);

$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$s = preg_split("/\s+/",$str);
foreach ($s as $k=>$v){
$v = preg_replace("/\[\[|\]\]/","",$v);
$j = explode(":",$v);
$j[1]=preg_replace("/.*\|/","",$j[1]);
print implode(":",$j)."\n";
}
