Using preg_replace to get characters before/after match - php

I have the following code to get characters before/after the regex match:
$searchterm = 'blue';
$string = 'Here is a sentence talking about blue. This sentence talks about red.';
$regex = '/.*(.{10}\b' . $searchterm . '\b.{10}).*/si';
echo preg_replace($regex, '$1', $string);
Output: "ing about blue. This se" (expected).
When I change $searchterm = 'red', then I get this:
Output: "Here is a sentence talking about blue. This sentence talks about red."
I am expecting this: "lks about red." The same thing happens if you start at the beginning of the sentence. Is there a way to use a similar regex to not pull back the entire string when it's at the start/end?
Example of what is happening: https://sandbox.onlinephpfunctions.com/code/e500b505860ded429e78869f61dbf4128ff368b3

Converting my comment to answer so that solution is easy to find for future visitors.
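Your regex is almost correct, but make sure to use a non-greedy quantifier for the leading .* and a .{0,10} limit for the surrounding substring. With .{10} the pattern needs exactly ten characters on each side of the term, so it cannot match near the start or end of the string, and preg_replace then returns the input unchanged: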
$searchterm = 'blue';
$string = 'Here is a sentence talking about blue. This sentence talks about red.';
$regex = '/.*?(.{0,10}\b' . $searchterm . '\b.{0,10}).*/si';
echo preg_replace($regex, '$1', $string);

You'd better use preg_match with .{0,10} quantifiers instead of .{10}:
function truncateString($searchterm){
$string = 'Here is a sentence talking about blue. This sentence talks about red.';
$regex = '/.{0,10}\b' . $searchterm . '\b.{0,10}/si';
if (preg_match($regex, $string, $m)) {
echo $m[0] . "\n";
}
}
truncateString('blue');
// => ing about blue. This se
truncateString('red');
// => lks about red.
See the PHP demo.
preg_match will find and return the first match only. The .{0,10} pattern will match zero to ten occurrences of any char (since the s modifier is used, the . matches even line break chars).
One more thing: if your $searchterm can contain special regex metacharacters, anywhere in the term, you should consider refactoring the code to
$regex = '/.{0,10}(?<!\w)' . preg_quote($searchterm, '/') . '(?!\w).{0,10}/si';
where (?<!\w) / (?!\w) are unambiguous word boundaries and the preg_quote is used to escape all special chars.
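Putting those pieces together, here is a minimal sketch of a reusable helper. The function name contextAround and the $radius parameter are made up for illustration and are not part of the original code. It assumes you only want the first match and that returning null is acceptable when the term is not found.

```php
<?php
// Hypothetical helper: returns up to $radius characters on each side of
// the first whole-word occurrence of $term, or null if there is no match.
function contextAround(string $text, string $term, int $radius = 10): ?string
{
    // preg_quote() escapes regex metacharacters in the term;
    // (?<!\w) and (?!\w) act as boundaries even for terms like "C++".
    $pattern = '/.{0,' . $radius . '}(?<!\w)'
        . preg_quote($term, '/')
        . '(?!\w).{0,' . $radius . '}/si';

    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

echo contextAround('The quick brown fox jumps over the lazy dog', 'fox', 5), "\n";
echo contextAround('fox runs', 'fox', 5), "\n";
```

Because the quantifier is {0,N} rather than {N}, a term at the very start or end of the text still matches, just with less context on that side.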

Related

Remove s or 's from all words in a string with PHP

I have a string in PHP
$string = "Dogs are Jonny's favorite pet";
I want to use regex or some method to remove s or 's from the end of all words in the string.
The desired output would be:
$revisedString = "Dog are Jonny favorite pet";
Here is my current approach:
<?php
$string = "Dogs are Jonny's favorite pet";
$stringWords = explode(" ", $string);
$counter = 0;
foreach($stringWords as $string) {
if(substr($string, -1) == 's'){
$stringWords[$counter] = trim($string, "s");
}
if(strpos($string, "'s") !== false){
$stringWords[$counter] = trim($string, "'s");
}
$counter = $counter + 1;
}
print_r($stringWords);
$newString = "";
foreach($stringWords as $string){
$newString = $newString . $string . " ";
}
echo $newString;
?>
How would this be achieved with REGEX?
For general use, you must leverage a much more sophisticated technique than an English-ignorant regex pattern. There may be fringe cases where the following pattern fails by removing an s that it shouldn't: it could be part of a name, an acronym, or something else.
As an unreliable solution, you can optionally match an apostrophe then match a literal s if it is not immediately preceded by another s. Adding a word boundary (\b) on the end improves the accuracy that you are matching the end of words.
Code: (Demo)
$string = "The bass can access the river's delta from the ocean. The fishermen, assassins, and their friends are happy on the banks";
var_export(preg_replace("~'?(?<!s)s\b~", '', $string));
Output:
'The bass can access the river delta from the ocean. The fishermen, assassin, and their friend are happy on the bank'
PHP Live Regex has always helped me a lot in such moments. Even though I already know how REGEX works, I still use it sometimes just to be sure.
To make use of REGEX in your case, you can use preg_replace().
<?php
// Your string.
$string = "Dogs are Jonny's favorite pet";
// The vertical bar means "or" and the backslash
// before the apostrophe is needed so you don't end
// your pattern string since we're using single quotes
// to delimit it. "\s" matches a single whitespace character.
$regex_pattern = '/\'s\s|s\s|s$/';
// Fill the preg_replace() with the pattern, the replacement
// (a single space in this case), your string, -1 (so preg_replace()
// will replace all the matches) and a variable of your desire
// to be the "counter" (preg_replace() will automatically
// fill it).
$newString = preg_replace($regex_pattern, ' ', $string, -1, $counter);
// Use the rtrim() to remove spaces at the right of the sentence.
$newString = rtrim($newString, " ");
echo "New string: " . $newString . ". ";
echo "Replacements: " . $counter . ".";
?>
In this case, the function will identify any "'s" or "s" with spaces (\s) after them and then replace them with a single space.
The preg_replace() will also count all the replacements and register them automatically on $counter or any variable you place there instead.
Edit:
Phil's comment is right and indeed my previous REGEX would lose a "s" at the end of the string. Adding "|s$" will solve it. Again, "|" means "or" and the "$" means that the "s" must be at the end of the string.
In response to mickmackusa's comment: my solution is only meant to remove "s" characters at the end of words in the string, as that was Sparky Johnson's request here. Removing plurals properly would require more complex code, since we would need not only to strip "s" from plural words but also to adjust verbs and handle other cases.

How to match regex patterns that contain special characters?

I have a generic routine which used to substitute out short-codes (which begin with a "^" character) with gender specific options. I have been asked to extend this to correct some common misspellings. These words won't have a special character at the start.
Until now I have been using PHP's str_replace function but because of the possibility of some words appearing within others, I need to ensure that the code uses word boundaries when matching. I am now attempting to use preg_replace.
While the actual code is getting data from a database table, including the gender specific replacements, I can reproduce the issue with simpler code for the purposes of asking this question.
Consider the following array with $search => $replace structure:
$subs = array("^Heshe" => "He",
"apples" => "bananas");
I then want to cycle through the array to replace the tokens:
$message = "^Heshe likes apples but not crabapples.";
foreach ($subs as $search => $replace)
{
$pattern = '/\b' . preg_quote($search, '/') . '\b/u';
$message = preg_replace($pattern, $replace, $message);
}
echo $message;
I expect the message He likes bananas but not crabapples. to be displayed, but instead I get the message ^Heshe likes bananas but not crabapples.
I have also tried $pattern = '/\b\Q' . $search . '\E\b/u', also with the same results.
Unfortunately, the "^" characters are part of some legacy system and changing it is not feasible. How do I get the regex to work?
Problem is this line:
$pattern = '/\b' . preg_quote($search, '/') . '\b/u';
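Since $search is ^Heshe, \b (word boundary) cannot match before the ^ here: \b requires a word character on exactly one side, and ^ is not a word character, so at the start of the string or after a space there is no boundary.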
You can use lookarounds instead in your pattern like this:
$pattern = '/(?<!\w)' . preg_quote($search, '/') . '(?!\w)/u';
This matches $search only if it is neither preceded nor followed by a word character.
Or else use:
$pattern = '/(?<=\s|^)' . preg_quote($search, '/') . '(?=\s|$)/u';
This matches $search only if it is preceded by whitespace or the start of the string, and followed by whitespace or the end of the string.
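For completeness, here is a small sketch of the first lookaround variant applied to the whole substitution loop from the question. The wrapper function applySubstitutions is a name invented for illustration. It assumes the replacement strings contain no $ or backslash sequences, since preg_replace would interpret those as backreferences.

```php
<?php
// Hypothetical wrapper around the loop from the question, using
// (?<!\w) / (?!\w) instead of \b so tokens starting with "^" still match.
function applySubstitutions(string $message, array $subs): string
{
    foreach ($subs as $search => $replace) {
        $pattern = '/(?<!\w)' . preg_quote($search, '/') . '(?!\w)/u';
        $message = preg_replace($pattern, $replace, $message);
    }
    return $message;
}

$subs = array("^Heshe" => "He", "apples" => "bananas");
echo applySubstitutions("^Heshe likes apples but not crabapples.", $subs), "\n";
// => He likes bananas but not crabapples.
```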

Replace only the last match using preg_replace()

Say a user supplies a regex pattern and wants only the last match to be replaced by a given input string.
Example:
$str = "hello, world, hello!";
// For now, regex will be for example just word,
// but it should work with match too
replaceLastMatch($str, "hello", "replacement");
echo $str; // Should output "hello, world, replacement!";
Use a negative lookahead to ensure that you only match the last occurrence of the search string:
function replaceLastMatch($str, $search, $replace) {
$pattern = sprintf('~%s(?!.*%1$s)~', $search);
return preg_replace($pattern, $replace, $str, 1);
}
Usage:
$str = "hello, world, hello!";
echo replaceLastMatch($str, 'h\w{4}', 'replacement'), "\n";
echo replaceLastMatch($str, 'hello', 'replacement'), "\n";
Output:
hello, world, replacement!
hello, world, replacement!
Here is what I came up with:
Short version:
It is vulnerable though (e.g. if user uses groups (abc), this will break):
function replaceLastMatch($string, $search, $replacement) {
// Escape all / as it delimits the regex
// Construct the regex pattern to be ungreedy at the right (? behind .*)
$search = '/^(.*)' . str_replace('/', '\\/', $search) . '(.*?)$/s';
return preg_replace($search, '${1}' . $replacement . '${2}', $string);
}
Longer version (personally recommended):
This version allows the user to use groups without interfering with this function (e.g. pattern ((ab[cC])+(XY)*){1,5}):
function replaceLastMatch($string, $search, $replacement) {
// Escape all '/' as it delimits the regex
// Construct the regex pattern to be ungreedy at the right (? behind .*)
$search = '/^.*(' . str_replace('/', '\\/', $search) . ').*?$/s';
// Match our regex and store matches including offsets
// If regex does not match, return $string as-is
if(1 !== preg_match($search, $string, $matches, PREG_OFFSET_CAPTURE))
return $string;
return substr($string, 0, $matches[1][1]) . $replacement
. substr($string, $matches[1][1] + strlen($matches[1][0]));
}
One general warning: You should be very careful with user input, as it can do all kinds of nasty stuff. Always be prepared for inputs that are rather "unproductive".
Explanation:
The core of the match last functionality is the ? (greediness inversion) operator (see Repetition - somewhere in the middle).
While repetition patterns (e.g. .*) are greedy by default, consuming as much as they can possibly match, making a pattern ungreedy (e.g. .*?) will make it match as little as possible (while still allowing the overall pattern to match).
Hence, in our case, the greedy front part of the pattern will always have precedence over the non-greedy back part and our custom middle part will match the very last instance possible.
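To see the effect of greediness in isolation, here is a tiny sketch, separate from the functions above. The sample string is made up. The same digit pattern picks a different digit depending on whether the .* in front of it is greedy or lazy.

```php
<?php
$s = 'a1b2c3';

// Greedy prefix: .* grabs everything, then backtracks just enough
// for \d to match, so \d ends up on the LAST digit.
preg_match('/^(.*)(\d)/', $s, $greedy);

// Lazy prefix: .*? expands one character at a time,
// so \d ends up on the FIRST digit.
preg_match('/^(.*?)(\d)/', $s, $lazy);

echo $greedy[1], ' | ', $greedy[2], "\n"; // a1b2c | 3
echo $lazy[1], ' | ', $lazy[2], "\n";     // a | 1
```

This is the reason a greedy ^.* in front of the user's pattern lands on the last occurrence.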

regex to replace a given word with space at either side or not at all

I am working with some code in PHP that grabs the referrer data from a search engine, giving me the query that the user entered.
I would then like to remove certain stop words from that string if they exist. However, the word may or may not have a space at either end.
For example, I have been using str_replace to remove a word as follows:
$keywords = str_replace("for", "", $keywords);
$keywords = str_replace("sale", "", $keywords);
but if the $keywords value is "baby formula" it will change it to "baby mula" - removing the "for" part.
Without having to create further str_replace's that account for " for" and "for " - is there a preg_replace type command I could use that would remove the given word if it is found with a space at either end?
My idea would be to put all of the stop words into an array and step through them that way and I suspect that a preg_replace is going to be quicker than stepping through multiple str_replace lines.
UPDATE:
Solved thanks to you guys using the following combination:
$keywords = "...";
$stopwords = array("for","each");
foreach($stopwords as $stopWord)
{
$keywords = preg_replace("/(\b)$stopWord(\b)/", "", $keywords);
}
$keywords = "...";
$stopWords = array("for","sale");
foreach($stopWords as $stopWord){
$keywords = preg_replace("/(\b)$stopWord(\b)/", "", $keywords);
}
Try it this way
$keywords = preg_replace( '/(?<!\w)(for|sale)(?!\w)/', '', $keywords );
You can use word boundaries for this
$keywords = preg_replace('/\bfor\b/', '', $keywords);
or with multiple words
$keywords = preg_replace('/\b(?:for|sale)\b/', '', $keywords);
While Armel's answer will work, it is not performing optimally. Yes, your desired output will require wordboundaries and probably case-insensitive matching, but:
Wordboundaries gain nothing from being wrapped in parentheses.
Performing iterated preg_replace() calls for each element in the blacklist array is not efficient. Doing so will ask the regex engine to perform wave after wave of individual keyword checks on the full string.
I recommend building a single regex pattern that will check for all keywords during each step of traversing the string -- one time. To generate the single pattern dynamically, you only need to implode your blacklist array of elements with | (pipes) which represent the "OR" command in regex. By wrapping all of the pipe-delimited keywords in a non-capturing group ((?:...)), the wordboundaries (\b) serve their purpose for all keywords in the blacklist array.
Code: (Demo)
$string = "Each person wants peaches for themselves forever";
$blacklist = array("for", "each");
// if you might have non-letter characters that have special meaning to the regex engine
//$blacklist = array_map(function($v){return preg_quote($v, '/');}, $blacklist);
//print_r($blacklist);
echo "Without wordboundaries:\n";
var_export(preg_replace('/' . implode('|', $blacklist) . '/i', '', $string));
echo "\n\n---\n";
echo "With wordboundaries:\n";
var_export(preg_replace('/\b(?:' . implode('|', $blacklist) . ')\b/i', '', $string));
echo "\n\n---\n";
echo "With wordboundaries and consecutive space mop up:\n";
var_export(trim(preg_replace(array('/\b(?:' . implode('|', $blacklist) . ')\b/i', '/ \K +/'), '', $string)));
Output:
Without wordboundaries:
' person wants pes  themselves ever'
---
With wordboundaries:
' person wants peaches  themselves forever'
---
With wordboundaries and consecutive space mop up:
'person wants peaches themselves forever'
p.s. / \K +/ is the second pattern fed to preg_replace() which means the input string will be read a second time to search for 2 or more consecutive spaces. \K means "restart the fullstring match from here"; effectively it releases the previously matched space. Then one or more spaces to follow are matched and replaced with an empty string.
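The approach above can be wrapped into one function. This is only a sketch: the name removeStopWords is invented for illustration, and the arrow function assumes PHP 7.4 or newer. It applies preg_quote() to each stop word, builds the single alternation pattern, and then uses the / \K +/ mop-up.

```php
<?php
// Hypothetical helper: strips whole-word stop words (case-insensitively)
// and collapses the runs of spaces that the removal leaves behind.
function removeStopWords(string $text, array $stopWords): string
{
    // Escape each word so regex metacharacters are treated literally.
    $quoted = array_map(fn($w) => preg_quote($w, '/'), $stopWords);

    // One pattern for all words: \b(?:for|sale)\b
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';
    $stripped = preg_replace($pattern, '', $text);

    // Keep the first space of each run (\K), remove the rest, then trim.
    return trim(preg_replace('/ \K +/', '', $stripped));
}

echo removeStopWords('baby formula for sale', array('for', 'sale')), "\n";
// => baby formula
```

Note that "formula" is left alone because there is no word boundary after its leading "for".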

Get first word from a string, but different dividers

I found this solution on stackoverflow for getting the first word from a sentence.
$myvalue = 'Test me more';
$arr = explode(' ',trim($myvalue));
echo $arr[0]; // will print Test
However, this case takes ' ' (a space) as the divider. Does anyone know how to get the first word from a string if you do not know what the divider is? It can be ' ' (space), '.' (full stop), or ',' (comma). Basically, how do you take anything that is a letter from a string up to the point where there is no letter?
E.g.:
'House, rest of sentence here' would give 'House'
'House.' would also give 'House'
'House thing' would also give 'House'
Thanks!
There is a string function (strtok) which can be used to split a string into smaller strings (tokens) based on some separator(s). For the purposes of this thread, the first word (defined as anything before the first space character) of Test me more can be obtained by tokenizing the string on the space character.
<?php
$value = "Test me more";
echo strtok($value, " "); // Test
?>
For more details and examples, see the strtok PHP manual page.
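Since strtok treats every character of its second argument as a separate delimiter, the same idea should extend to the dividers from the question. A small sketch follows. The wrapper name firstToken and its default delimiter list are assumptions for illustration.

```php
<?php
// Hypothetical wrapper: returns the first token of $text, splitting on
// any single character in $delimiters (space, full stop, comma by default).
function firstToken(string $text, string $delimiters = " .,"): ?string
{
    $token = strtok($text, $delimiters);
    // strtok() returns false when no token is found (e.g. empty input).
    return $token === false ? null : $token;
}

echo firstToken('House, rest of sentence here'), "\n"; // House
echo firstToken('House.'), "\n";                       // House
echo firstToken('House thing'), "\n";                  // House
```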
preg_split is what you're looking for.
$str = "bla1 bla2,bla3";
$words = preg_split("/[\s,]+/", $str);
This snippet splits $str on whitespace (spaces, tabs, newlines) and commas; the first word is then $words[0].
Use the preg_match() function with a regular expression:
if (preg_match('/^\w*/', 'Your text here', $matches) > 0) {
echo $matches[0]; // $matches[0] will contain the first word of your sentence
} else {
// no match found
}
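Note that \w also matches digits and underscores. If "word" should mean letters only, as the question describes, a Unicode letter class might fit better. Here is a sketch under that assumption. The function name firstWord is made up, and the pattern is deliberately not anchored, so it returns the first run of letters anywhere in the string.

```php
<?php
// Hypothetical helper: returns the first run of letters in $text,
// or null if the text contains no letters at all.
function firstWord(string $text): ?string
{
    // \p{L} matches any Unicode letter; the u modifier enables UTF-8 mode.
    return preg_match('/\p{L}+/u', $text, $m) === 1 ? $m[0] : null;
}

echo firstWord('House, rest of sentence here'), "\n"; // House
echo firstWord('House.'), "\n";                       // House
echo firstWord('House thing'), "\n";                  // House
```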
