How to match regex patterns that contain special characters? - php

I have a generic routine which is used to substitute out short-codes (which begin with a "^" character) with gender-specific options. I have been asked to extend this to correct some common misspellings. These words won't have a special character at the start.
Until now I have been using PHP's str_replace function but because of the possibility of some words appearing within others, I need to ensure that the code uses word boundaries when matching. I am now attempting to use preg_replace.
While the actual code is getting data from a database table, including the gender specific replacements, I can reproduce the issue with simpler code for the purposes of asking this question.
Consider the following array with $search => $replace structure:
$subs = array("^Heshe" => "He",
"apples" => "bananas");
I then want to cycle through the array to replace the tokens:
$message = "^Heshe likes apples but not crabapples.";
foreach ($subs as $search => $replace)
{
$pattern = '/\b' . preg_quote($search, '/') . '\b/u';
$message = preg_replace($pattern, $replace, $message);
}
echo $message;
I expect the message He likes bananas but not crabapples. to be displayed, but instead I get the message ^Heshe likes bananas but not crabapples.
I have also tried $pattern = '/\b\Q' . $search . '\E\b/u', also with the same results.
Unfortunately, the "^" characters are part of some legacy system and changing it is not feasible. How do I get the regex to work?

Problem is this line:
$pattern = '/\b' . preg_quote($search, '/') . '\b/u';
As $search is ^Heshe, \b (word boundary) cannot match before the ^: a word boundary needs a word character on one side, and here ^ is a non-word character sitting at the start of the string.
You can use lookarounds instead in your pattern like this:
$pattern = '/(?<!\w)' . preg_quote($search, '/') . '(?!\w)/u';
Which means match $search only if it is neither preceded nor followed by a word character.
Or else use:
$pattern = '/(?<=\s|^)' . preg_quote($search, '/') . '(?=\s|$)/u';
Which means match $search only if it is preceded by whitespace or the start of the string and followed by whitespace or the end of the string.
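A minimal sketch of the first lookaround pattern applied to the sample data from the question. The helper name replaceTokens is hypothetical, and the sketch assumes the replacement values contain no $ or \ sequences that preg_replace would treat as back-references:

```php
<?php
// Hypothetical helper: replaces each token only when it is not
// directly attached to a word character on either side.
function replaceTokens(string $message, array $subs): string
{
    foreach ($subs as $search => $replace) {
        // preg_quote() escapes the leading "^" so it is matched literally.
        $pattern = '/(?<!\w)' . preg_quote($search, '/') . '(?!\w)/u';
        // preg_replace() returns null on error; keep the old text in that case.
        $message = preg_replace($pattern, $replace, $message) ?? $message;
    }
    return $message;
}

$subs = array("^Heshe" => "He", "apples" => "bananas");
echo replaceTokens("^Heshe likes apples but not crabapples.", $subs), "\n";
// He likes bananas but not crabapples.
```

The word "crabapples" is left alone because the "apples" inside it is preceded by a word character.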

Related

Replace words in a string including plural variations with apostrophes

I want to link matches for specific words in a sentence. Overall this is easy, and sample code could go like this:
$words = array("Facebook", "Apple");
$text = "Is Facebook's vr hardware better than Apple's current prototype?";
foreach($words as $w) {
$pattern = '/' . $w .'\b/i';
$link = '' . $w . '';
$text = preg_replace($pattern, $link, $text);
}
print $text;
However I would like to catch variations of words that have 's (apostrophe-s).
To do that I need to search for the two possible variations (with and without the 's), but the outcome also affects what text is used in the replacement.
I'm drawing a blank on how to proactively use preg_match and then alter preg_replace based on the outcome. Any advice appreciated.
Try using the optional ? quantifier and parentheses.
$pattern = '/' . $w .'(\'s)?\b/i';
This should match either version.
Now, to use the match in your replacement, you can add an extra set of parentheses, like this:
$pattern = '/(' . $w .'(\'s)?)\b/i';
Then insert the matched string into your replacement, like this:
$link = '$1';
The $1 in the replacement string will be replaced with whatever the outer parentheses of the match contain.
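Putting the pieces together, a rough sketch might look like the following. The helper name linkWords and the "#" href are placeholders of mine, not the link markup from the original question, and preg_quote is added on the assumption that a word could contain regex metacharacters:

```php
<?php
// Hypothetical helper: wraps each listed word, with or without a
// trailing 's, in an anchor tag. The "#" href is only a placeholder.
function linkWords(string $text, array $words): string
{
    foreach ($words as $w) {
        // Outer group = $1 (the word plus optional 's); inner group is non-capturing.
        $pattern = '/(' . preg_quote($w, '/') . "(?:'s)?)\\b/i";
        $text = preg_replace($pattern, '<a href="#">$1</a>', $text) ?? $text;
    }
    return $text;
}

$text = "Is Facebook's vr hardware better than Apple's current prototype?";
echo linkWords($text, array("Facebook", "Apple")), "\n";
// Is <a href="#">Facebook's</a> vr hardware better than <a href="#">Apple's</a> current prototype?
```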

Using preg_replace to get characters before/after match

I have the following code to get characters before/after the regex match:
$searchterm = 'blue';
$string = 'Here is a sentence talking about blue. This sentence talks about red.';
$regex = '/.*(.{10}\b' . $searchterm . '\b.{10}).*/si';
echo preg_replace($regex, '$1', $string);
Output: "ing about blue. This se" (expected).
When I change $searchterm = 'red', then I get this:
Output: "Here is a sentence talking about blue. This sentence talks about red."
I am expecting this: "lks about red." The same thing happens if you start at the beginning of the sentence. Is there a way to use a similar regex to not pull back the entire string when it's at the start/end?
Example of what is happening: https://sandbox.onlinephpfunctions.com/code/e500b505860ded429e78869f61dbf4128ff368b3
Converting my comment to answer so that solution is easy to find for future visitors.
Your regex is almost correct, but make sure to use a non-greedy quantifier at the start and a .{0,10} limit for the surrounding substring:
$searchterm = 'blue';
$string = 'Here is a sentence talking about blue. This sentence talks about red.';
$regex = '/.*?(.{0,10}\b' . $searchterm . '\b.{0,10}).*/si';
echo preg_replace($regex, '$1', $string);
You'd better use preg_match with .{0,10} quantifiers instead of .{10}:
function truncateString($searchterm){
$string = 'Here is a sentence talking about blue. This sentence talks about red.';
$regex = '/.{0,10}\b' . $searchterm . '\b.{0,10}/si';
if (preg_match($regex, $string, $m)) {
echo $m[0] . "\n";
}
}
truncateString('blue');
// => ing about blue. This se
truncateString('red');
// => lks about red.
preg_match will find and return the first match only. The .{0,10} pattern will match zero to ten occurrences of any char (since the s modifier is used, the . matches even line break chars).
One more thing: if your $searchterm can contain special regex metacharacters, anywhere in the term, you should consider refactoring the code to
$regex = '/.{0,10}(?<!\w)' . preg_quote($searchterm, '/') . '(?!\w).{0,10}/si';
where (?<!\w) / (?!\w) are unambiguous word boundaries and the preg_quote is used to escape all special chars.
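To illustrate why the quoting matters, here is a small sketch that wraps this pattern in a function. The name contextSnippet and the $radius parameter are my own additions, and the sample strings are made up:

```php
<?php
// Hypothetical helper: returns up to $radius characters of context on each
// side of the first whole-word occurrence of $term, or null if none is found.
function contextSnippet(string $text, string $term, int $radius = 10): ?string
{
    $regex = '/.{0,' . $radius . '}(?<!\w)' . preg_quote($term, '/')
           . '(?!\w).{0,' . $radius . '}/si';
    return preg_match($regex, $text, $m) ? $m[0] : null;
}

echo contextSnippet('It costs $5.00 today', '$5.00', 5), "\n";
// osts $5.00 toda

// Without preg_quote() the "." in the term would match any character,
// so "5.00" would wrongly match "5100". With quoting there is no match.
var_dump(contextSnippet('It costs 5100 today', '5.00', 5));
// NULL
```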

how to use regex special characters in pattern in preg_replace

I am trying to replace 2.0 with stack,
but the following code changes 2008 to 2.08 instead.
Here is my code:
$string = 'The story is inspired by the Operation Batla House that took place in 2008 ';
$tag = '2.0';
$pattern = '/(\s|^)'.($tag).'(?=[^a-z^A-Z])/i';
echo preg_replace($pattern, '2.0', $string);
Use preg_quote and make sure you pass the regex delimiter as the second argument:
$string = 'The story is inspired by the Operation Batla House that took place in 2008 ';
$tag = '2.0';
$pattern = '/(\s|^)' . preg_quote($tag, '/') . '(?=[^a-zA-Z])/i';
// preg_quote($tag, '/') escapes the dot, so it only matches a literal "."
echo preg_replace($pattern, '2.0', $string);
The string is not modified. The regex delimiter here is /, thus it is passed as the 2nd parameter to preg_quote.
Note that [^a-z^A-Z] matches any chars but ASCII letters and ^ since you added the second ^ in the character class. I changed [^a-z^A-Z] to [^a-zA-Z].
Also, the capturing group at the start may be replaced with a single lookbehind, (?<!\S), it will make sure your match occurs only at the string start or after a whitespace.
If you expect to also match at the end of the string, replace (?=[^a-zA-Z]) (that requires a char other than a letter immediately to the right of the current location) with (?![a-zA-Z]) (that requires a char other than a letter or end of string immediately to the right of the current location).
So, use
$pattern = '/(?<!\S)' . preg_quote($tag, '/') . '(?![a-zA-Z])/i';
Also, consider using unambiguous word boundaries
$pattern = '/(?<!\w)' . preg_quote($tag, '/') . '(?!\w)/i';
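As a quick check of the last pattern, a sketch along these lines leaves 2008 untouched while still replacing a standalone 2.0. The function name replaceTag and the sample sentence are illustrative only:

```php
<?php
// Hypothetical helper: replaces $tag only where it stands alone,
// i.e. with no word character directly before or after it.
function replaceTag(string $text, string $tag, string $replacement): string
{
    $pattern = '/(?<!\w)' . preg_quote($tag, '/') . '(?!\w)/i';
    return preg_replace($pattern, $replacement, $text) ?? $text;
}

echo replaceTag('Released in 2008, then rewritten for 2.0 later', '2.0', 'stack'), "\n";
// Released in 2008, then rewritten for stack later
```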

Replace only the last match using preg_replace()

For example, a user supplies a regex and wants only the last match to be replaced by a given string.
Example:
$str = "hello, world, hello!";
// For now the regex is just a plain word,
// but it should work with any pattern too
replaceLastMatch($str, "hello", "replacement");
echo $str; // Should output "hello, world, replacement!";
Use a negative lookahead to ensure that you only match the last occurrence of the search string:
function replaceLastMatch($str, $search, $replace) {
$pattern = sprintf('~%s(?!.*%1$s)~', $search);
return preg_replace($pattern, $replace, $str, 1);
}
Usage:
$str = "hello, world, hello!";
echo replaceLastMatch($str, 'h\w{4}', 'replacement'), "\n";
echo replaceLastMatch($str, 'hello', 'replacement'), "\n";
Output:
hello, world, replacement!
hello, world, replacement!
Here is what I came up with:
Short version:
It is vulnerable though (e.g. if user uses groups (abc), this will break):
function replaceLastMatch($string, $search, $replacement) {
// Escape all / as it delimits the regex
// Construct the regex pattern to be ungreedy at the right (? behind .*)
$search = '/^(.*)' . str_replace('/', '\\/', $search) . '(.*?)$/s';
return preg_replace($search, '${1}' . $replacement . '${2}', $string);
}
Longer version (personally recommended):
This version allows the user to use groups without interfering with this function (e.g. pattern ((ab[cC])+(XY)*){1,5}):
function replaceLastMatch($string, $search, $replacement) {
// Escape all '/' as it delimits the regex
// Construct the regex pattern to be ungreedy at the right (? behind .*)
$search = '/^.*(' . str_replace('/', '\\/', $search) . ').*?$/s';
// Match our regex and store matches including offsets
// If regex does not match, return $string as-is
if(1 !== preg_match($search, $string, $matches, PREG_OFFSET_CAPTURE))
return $string;
return substr($string, 0, $matches[1][1]) . $replacement
. substr($string, $matches[1][1] + strlen($matches[1][0]));
}
One general warning: You should be very careful with user input, as it can do all kinds of nasty stuff. Always be prepared for inputs that are rather "unproductive".
Explanation:
The core of the match-last functionality is the ? (greediness inversion) operator (see the Repetition section of the PHP regex documentation).
While repetition patterns (e.g. .*) are greedy by default, consuming as much as they can possibly match, making a pattern ungreedy (e.g. .*?) will make it match as little as possible (while still matching at all).
Hence, in our case, the greedy front part of the pattern will always have precedence over the non-greedy back part and our custom middle part will match the very last instance possible.
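The effect can be seen by comparing the offset of the captured group when the leading part is greedy versus lazy. This is only an illustrative sketch using the sample string from the question:

```php
<?php
$str = "hello, world, hello!";

// Greedy prefix: .* consumes as much as it can, so the group
// captures the LAST "hello" (byte offset 14).
preg_match('/^.*(hello).*?$/s', $str, $greedy, PREG_OFFSET_CAPTURE);
echo $greedy[1][1], "\n"; // 14

// Lazy prefix: .*? consumes as little as it can, so the group
// captures the FIRST "hello" (byte offset 0).
preg_match('/^.*?(hello).*$/s', $str, $lazy, PREG_OFFSET_CAPTURE);
echo $lazy[1][1], "\n"; // 0
```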

regex to replace a given word with space at either side or not at all

I am working with some code in PHP that grabs the referrer data from a search engine, giving me the query that the user entered.
I would then like to remove certain stop words from that string if they exist. However, the word may or may not have a space at either end.
For example, I have been using str_replace to remove a word as follows:
$keywords = str_replace("for", "", $keywords);
$keywords = str_replace("sale", "", $keywords);
but if the $keywords value is "baby formula" it will change it to "baby mula" - removing the "for" part.
Without having to create further str_replace's that account for " for" and "for " - is there a preg_replace type command I could use that would remove the given word if it is found with a space at either end?
My idea would be to put all of the stop words into an array and step through them that way and I suspect that a preg_replace is going to be quicker than stepping through multiple str_replace lines.
UPDATE:
Solved thanks to you guys using the following combination:
$keywords = "...";
$stopwords = array("for","each");
foreach($stopwords as $stopWord)
{
$keywords = preg_replace("/(\b)$stopWord(\b)/", "", $keywords);
}
$keywords = "...";
$stopWords = array("for","sale");
foreach($stopWords as $stopWord){
$keywords = preg_replace("/(\b)$stopWord(\b)/", "", $keywords);
}
Try it this way
$keywords = preg_replace( '/(?<!\w)(for|sale)(?!\w)/', '', $keywords );
You can use word boundaries for this
$keywords = preg_replace('/\bfor\b/', '', $keywords);
or with multiple words
$keywords = preg_replace('/\b(?:for|sale)\b/', '', $keywords);
While Armel's answer will work, it does not perform optimally. Yes, your desired output will require word boundaries and probably case-insensitive matching, but:
Word boundaries gain nothing from being wrapped in parentheses.
Performing iterated preg_replace() calls for each element in the blacklist array is not efficient. Doing so will ask the regex engine to perform wave after wave of individual keyword checks on the full string.
I recommend building a single regex pattern that will check for all keywords during each step of traversing the string -- one time. To generate the single pattern dynamically, you only need to implode your blacklist array of elements with | (pipes) which represent the "OR" command in regex. By wrapping all of the pipe-delimited keywords in a non-capturing group ((?:...)), the word boundaries (\b) serve their purpose for all keywords in the blacklist array.
Code:
$string = "Each person wants peaches for themselves forever";
$blacklist = array("for", "each");
// if you might have non-letter characters that have special meaning to the regex engine
//$blacklist = array_map(function($v){return preg_quote($v, '/');}, $blacklist);
//print_r($blacklist);
echo "Without wordboundaries:\n";
var_export(preg_replace('/' . implode('|', $blacklist) . '/i', '', $string));
echo "\n\n---\n";
echo "With wordboundaries:\n";
var_export(preg_replace('/\b(?:' . implode('|', $blacklist) . ')\b/i', '', $string));
echo "\n\n---\n";
echo "With wordboundaries and consecutive space mop up:\n";
var_export(trim(preg_replace(array('/\b(?:' . implode('|', $blacklist) . ')\b/i', '/ \K +/'), '', $string)));
Output:
Without wordboundaries:
' person wants pes  themselves ever'
---
With wordboundaries:
' person wants peaches  themselves forever'
---
With wordboundaries and consecutive space mop up:
'person wants peaches themselves forever'
p.s. / \K +/ is the second pattern fed to preg_replace() which means the input string will be read a second time to search for 2 or more consecutive spaces. \K means "restart the fullstring match from here"; effectively it releases the previously matched space. Then one or more spaces to follow are matched and replaced with an empty string.
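For reuse, the single-pattern approach could be wrapped roughly like this. The function name removeStopWords is hypothetical, and preg_quote is applied on the assumption that a stop word might contain regex metacharacters:

```php
<?php
// Hypothetical helper: strips whole-word stop words in one pass,
// then collapses the runs of spaces that the removals leave behind.
function removeStopWords(string $text, array $stopWords): string
{
    if ($stopWords === []) {
        return $text; // an empty alternation would match everywhere
    }
    $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $stopWords);
    $patterns = array(
        '/\b(?:' . implode('|', $quoted) . ')\b/i', // whole-word stop words
        '/ \K +/',                                  // extra spaces after the first one
    );
    return trim(preg_replace($patterns, '', $text));
}

echo removeStopWords("baby formula for sale", array("for", "sale")), "\n";
// baby formula
```

Note that "formula" survives because the "for" inside it is not followed by a word boundary.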
