Replace only the last match using preg_replace() - php

So, for example user input some regex match and he wants that last match will be replaced by input-string.
Example:
$str = "hello, world, hello!";
// For now, regex will be for example just word,
// but it should work with match too
replaceLastMatch($str, "hello", "replacement");
echo $str; // Should output "hello, world, replacement!";

Use a negative lookahead to ensure that you only match the last occurrence of the search string:
function replaceLastMatch($str, $search, $replace) {
$pattern = sprintf('~%s(?!.*%1$s)~', $search);
return preg_replace($pattern, $replace, $str, 1);
}
Usage:
$str = "hello, world, hello!";
echo replaceLastMatch($str, 'h\w{4}', 'replacement');
echo replaceLastMatch($str, 'hello', 'replacement');
Output:
hello, world, replacement!
Demo

Here is what I came up with:
Short version:
It is vulnerable though (e.g. if user uses groups (abc), this will break):
function replaceLastMatch($string, $search, $replacement) {
// Escape all / as it delimits the regex
// Construct the regex pattern to be ungreedy at the right (? behind .*)
$search = '/^(.*)' . str_replace('/', '\\/', $search) . '(.*?)$/s';
return preg_replace($search, '${1}' . $replacement . '${2}', $string);
}
Longer version (personally recommended):
This version allows the user to use groups without interfering with this function (e.g. pattern ((ab[cC])+(XY)*){1,5}):
function replaceLastMatch($string, $search, $replacement) {
// Escape all '/' as it delimits the regex
// Construct the regex pattern to be ungreedy at the right (? behind .*)
$search = '/^.*(' . str_replace('/', '\\/', $search) . ').*?$/s';
// Match our regex and store matches including offsets
// If regex does not match, return $string as-is
if(1 !== preg_match($search, $string, $matches, PREG_OFFSET_CAPTURE))
return $string;
return substr($string, 0, $matches[1][1]) . $replacement
. substr($string, $matches[1][1] + strlen($matches[1][0]));
}
One general warning: You should be very careful with user input, as it can do all kings of nasty stuff. Be always prepared for inputs that are rather "unproductive".
Explanation:
The core of the match last functionality is the ? (greediness inversion) operator (see Repetition - somewhere in the middle).
While repetition patterns (e.g. .*) are greedy by default, consuming as much as it can possibly match, making a pattern ungreedy (e.g. .*?) will make it match as little as possible (while still matching at all).
Hence, in our case, the greedy front part of the pattern will always have precedence over the non-greedy back part and our custom middle part will match the very last instance possible.

Related

Using preg_replace to get characters before/after match

I have the following code to get characters before/after the regex match:
$searchterm = 'blue';
$string = 'Here is a sentence talking about blue. This sentence talks about red.';
$regex = '/.*(.{10}\b' . $searchterm . '\b.{10}).*/si';
echo preg_replace($regex, '$1', $string);
Output: "ing about blue. This se" (expected).
When I change $searchterm = 'red', then I get this:
Output: "Here is a sentence talking about blue. This sentence talks about red."
I am expecting this: "lks about red." The same thing happens if you start at the beginning of the sentence. Is there a way to use a similar regex to not pull back the entire string when it's at the start/end?
Example of what is happening: https://sandbox.onlinephpfunctions.com/code/e500b505860ded429e78869f61dbf4128ff368b3
Converting my comment to answer so that solution is easy to find for future visitors.
You regex regex is almost correct but make sure to use a non-greedy quantifier with .{0,10} limit for surrounding substring:
$searchterm = 'blue';
$string = 'Here is a sentence talking about blue. This sentence talks about red.';
$regex = '/.*?(.{0,10}\b' . $searchterm . '\b.{0,10}).*/si';
echo preg_replace($regex, '$1', $string);
Updated Code Demo
RegEx Demo
You'd better use preg_match with .{0,10} quantifiers instead of {10},
function truncateString($searchterm){
$string = 'Here is a sentence talking about blue. This sentence talks about red.';
$regex = '/.{0,10}\b' . $searchterm . '\b.{0,10}/si';
if (preg_match($regex, $string, $m)) {
echo $m[0] . "\n";
}
}
truncateString('blue');
// => ing about blue. This se
truncateString('red');
// => lks about red.
See the PHP demo.
preg_match will find and return the first match only. The .{0,10} pattern will match zero to ten occurrences of any char (since the s modifier is used, the . matches even line break chars).
One more thing: if your $searchterm can contain special regex metacharacters, anywhere in the term, you should consider refactoring the code to
$regex = '/.{0,10}(?<!\w)' . preg_quote($searchterm, '/') . '(?!\w).{0,10}/si';
where (?<!\w) / (?!\w) are unambiguous word boundaries and the preg_quote is used to escape all special chars.

Looking for specific character in capture group

I need to replace all double quotes in any (variable) given string.
For example:
$text = 'data-caption="hello"world">';
$pattern = '/data-caption="[[\s\S]*?"|(")]*?">/';
$output = preg_replace($pattern, '"', $text);
should result in:
"hello"world"
(The above pattern is my attempt at getting it to work)
The problem is that I don't now in advance if and how many double quotes are going to be in the string.
How can i replace the " with quot; ?
You may match strings between data-caption=" and "> and then replace all " inside that match with " using a mere str_replace:
$text = 'data-caption="<element attribute1="wert" attribute2="wert">Name</element>">';
$pattern = '/data-caption="\K.*?(?=">)/';
$output = preg_replace_callback($pattern, function($m) {
return str_replace('"', '"', $m[0]);
}, $text);
print_r($output);
// => data-caption="<element attribute1="wert" attribute2="wert">Name</element>">
See the PHP demo
Details
data-caption=" - starting delimiter
\K - match reset operator
.*? - any 0+ chars other than line break chars, as few as possible
(?=">) - a positive lookahead that requires the "> substring immediately to the right of the current location.
The match is passed to the anonymous function inside preg_replace_callback (accessible via $m[0]) and that is where it is possible to replace all " symbols in a convenient way.

How to cut string from start to second last dot of the string?

I have some string, for example:
cats, e.g. Barsik, are funny. And it is true. So,
And I want to get as result:
cats, e.g. Barsik, are funny.
My try:
mb_ereg_search_init($text, '((?!e\.g\.).)*\.[^\.]');
$match = mb_ereg_search_pos();
But it gets position of second dot (after word "true").
How to get desired result?
Since a naive approach works for you, I am posting an answer. However, please note that detecting a sentence end is a very difficult task for a regex, and although it is possible to some degree, an NLP package should be used for that.
Having said that, I suggested using
'~(?<!\be\.g)\.(?=\s+\p{Lu})~ui'
The regex matches any dot (\.) that is not preceded with a whole word e.g (see the negative lookbehind (?<!\be\.g)), but that is followed with 1 or more whitespaces (\s+) followed with 1 uppercase Unicode letter \p{Lu}.
See the regex demo
The case insensitive i modifier does not impact what \p{Lu} matches.
The ~u modifier is required since you are working with Unicode texts (like Russian).
To get the index of the first occurrence, use a preg_match function with the PREG_OFFSET_CAPTURE flag. Here is a bit simplified regex you supplied in the comments:
preg_match('~(?<!т\.н)(?<!т\.к)(?<!e\.g)\.(?=\s+\p{L})~iu', $text, $match, PREG_OFFSET_CAPTURE);
See the lookaheads are executed one by one, and at the same location in string, thus, you do not have to additionally group them inside a positive lookahead. See the regex demo.
IDEONE demo:
$re = '~(?<!т\.н)(?<!т\.к)(?<!e\.g)\.(?=\s+\p{L})~iu';
$str = "cats, e.g. Barsik, are funny. And it is true. So,";
preg_match($re, $str, $match, PREG_OFFSET_CAPTURE);
echo $match[0][1];
Here are two approaches to get substring from start to second last . position of the initial string:
using strrpos and substr functions:
$str = 'cats, e.g. Barsik, and e.g. Lusya are funny. And it is true. So,';
$len = strlen($str);
$str = substr($str, 0, (strrpos($str, '.', strrpos($str, '.') - $len - 1) - $len) + 1);
print_r($str); // "cats, e.g. Barsik, and e.g. Lusya are funny."
using array_reverse, str_split and array_search functions:
$str = 'cats, e.g. Barsik, and e.g. Lusya are funny. And it is true. So,';
$parts = array_reverse(str_split($str));
$pos = array_search('.', $parts) + 1;
$str = implode("", array_reverse(array_slice($parts, array_search('.', array_slice($parts, $pos)) + $pos)));
print_r($str); // "cats, e.g. Barsik, and e.g. Lusya are funny."

Masking all but first letter of a word using Regex

I'm attempting to create a bad word filter in PHP that will analyze the word and match against an array of known bad words, but keep the first letter of the word and replace the rest with asterisks. Example:
fook would become f***
shoot would become s**
The only part I don't know is how to keep the first letter in the string, and how to replace the remaining letters with something else while keeping the same string length.
$string = preg_replace("/\b(". $word .")\b/i", "***", $string);
Thanks!
$string = 'fook would become';
$word = 'fook';
$string = preg_replace("~\b". preg_quote($word, '~') ."\b~i", $word[0] . str_repeat('*', strlen($word) - 1), $string);
var_dump($string);
$string = preg_replace("/\b".$word[0].'('.substr($word, 1).")\b/i", "***", $string);
This can be done in many ways, with very weird auto-generated regexps...
But I believe using preg_replace_callback() would end up being more robust
<?php
# as already pointed out, your words *may* need sanitization
foreach($words as $k=>$v)
$words[$k]=preg_quote($v,'/');
# and to be collapsed into a **big regexpy goodness**
$words=implode('|',$words);
# after that, a single preg_replace_callback() would do
$string = preg_replace_callback('/\b('. $words .')\b/i', "my_beloved_callback", $string);
function my_beloved_callback($m)
{
$len=strlen($m[1])-1;
return $m[1][0].str_repeat('*',$len);
}
Here is unicode-friendly regular expression for PHP:
function lowercase_except_first_letter($s) {
// the following line SKIP the first word and pass it to callback func...
// \W it allows to keep the first letter even in words in quotes and brackets
return preg_replace_callback('/(?<!^|\s|\W)(\w)/u', function($m) {
return mb_strtolower($m[1]);
}, $s);
}

Put URLs from string into array using regex (problem with trailing period)

I am trying to write a function that pulls all url's from a string and remove a potential trailing slash from the end.
function getUrls($string) {
$regex = '/https?\:\/\/[^\" ]+/i';
preg_match_all($regex, $string, $matches);
return ($matches[0]);
}
But that returns http://test.com. (trailing period) If i have
$string = "Hi I am sharing http://test.com.";
$urls = getUrls($string);
It returns the URL with the period at the end.
This one seems to work (taken from here)
$regex="/(https?:\/\/+[\w\-]+\.[\w\-]+)/i";
In case anyone comes across this, here is what I put together:
$aProtocols = array('http:\/\/', 'https:\/\/', 'ftp:\/\/', 'news:\/\/', 'nntp:\/\/', 'telnet:\/\/', 'irc:\/\/', 'mms:\/\/', 'ed2k:\/\/', 'xmpp:', 'mailto:');
$aSubdomains = array('www'=>'http://', 'ftp'=>'ftp://', 'irc'=>'irc://', 'jabber'=>'xmpp:');
$sRELinks = '/(?:(' . implode('|', $aProtocols) . ')[^\^\[\]{}|\\"\'<>`\s]*[^!#\^()\[\]{}|\\:;"\',.?<>`\s])|(?:(?:(?:(?:[^#:<>(){}`\'"\/\[\]\s]+:)?[^#:<>(){}`\'"\/\[\]\s]+#)?(' . implode('|', array_keys($aSubdomains)) . ')\.(?:[^`~!##$%^&*()_=+\[{\]}\\|;:\'",<.>\/?\s]+\.)+[a-z]{2,6}(?:[\/#?](?:[^\^\[\]{}|\\"\'<>`\s]*[^!#\^()\[\]{}|\\:;"\',.?<>`\s])?)?)|(?:(?:[^#:<>(){}`\'"\/\[\]\s]+#)?((?:(?:(?:(?:[0-1]?[0-9]?[0-9])|(?:2[0-4][0-9])|(?:25[0-5]))(?:\.(?:(?:[0-1]?[0-9]?[0-9])|(?:2[0-4][0-9])|(?:25[0-5]))){3})|(?:[A-Fa-f0-9:]{16,39}))|(?:(?:[^`~!##$%^&*()_=+\[{\]}\\|;:\'",<.>\/?\s]+\.)+[a-z]{2,6}))\/(?:[^\^\[\]{}|\\"\'<>`\s]*[^!#\^()\[\]{}|\\:;"\',.?<>`\s](?:[#?](?:[^\^\[\]{}|\\"\'<>`\s]*[^!#\^()\[\]{}|\\:;"\',.?<>`\s])?)?)?)|(?:[^#:<>(){}`\'"\/\[\]\s]+:[^#:<>(){}`\'"\/\[\]\s]+#((?:(?:(?:(?:[0-1]?[0-9]?[0-9])|(?:2[0-4][0-9])|(?:25[0-5]))(?:\.(?:(?:[0-1]?[0-9]?[0-9])|(?:2[0-4][0-9])|(?:25[0-5]))){3})|(?:[A-Fa-f0-9:]{16,39}))|(?:(?:[^`~!##$%^&*()_=+\[{\]}\\|;:\'",<.>\/?\s]+\.)+[a-z]{2,6}))(?:\/(?:(?:[^\^\[\]{}|\\"\'<>`\s]*[^!#\^()\[\]{}|\\:;"\',.?<>`\s])?)?)?(?:[#?](?:[^\^\[\]{}|\\"\'<>`\s]*[^!#\^()\[\]{}|\\:;"\',.?<>`\s])?)?))|([^#:<>(){}`\'"\/\[\]\s]+#(?:(?:(?:[^`~!##$%^&*()_=+\[{\]}\\|;:\'",<.>\/?\s]+\.)+[a-z]{2,6})|(?:(?:(?:(?:(?:[0-1]?[0-9]?[0-9])|(?:2[0-4][0-9])|(?:25[0-5]))(?:\.(?:(?:[0-1]?[0-9]?[0-9])|(?:2[0-4][0-9])|(?:25[0-5]))){3})|(?:[A-Fa-f0-9:]{16,39}))))(?:[^\^*\[\]{}|\\"<>\/`\s]+[^!#\^()\[\]{}|\\:;"\',.?<>`\s])?)/i';
function getUrls($string) {
global $sRELinks;
preg_match_all($sRELinks, $string, $matches);
return ($matches[0]);
}
From http://yellow5.us/journal/server_side_text_linkification/
Depending on how strict you want to be, consider the Liberal, Accurate Regex Pattern for Matching URLs regular expression pattern discussed on Daring Fireball. The pattern in full is:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
If you are interested in how it works, Alan Storm has a great explanation.

Categories