Is it possible to find overlapping matches with a single regex? - php

Here's a sample which executes the preg_replace multiple times to find nested/overlapping matches:
$text = '[foo][foo][/foo][/foo]';
//1st: ^^^^^ ^^^^^^
//2nd: ^^^^^ ^^^^^^
//3rd: fails
do {
$text = preg_replace('~\[foo](.*?)\[/foo]~', '[bar]$1[/bar]', $text, -1, $replace_count);
} while ($replace_count);
echo $text; //'[bar][bar][/bar][/bar]'
I'm satisfied with the result and the behavior. However, it seems inefficient to scan through the whole string 3 times as in the example above. Is there any regex magic to do this in a single replace?
Conditions:
I can't simply replace ~\[(/)?foo]~ with [$1bar], I need to make sure there is a matching closing [/foo] tag after an opening [foo] tag and replace them both at a time. It doesn't matter whether they're nested or not. Unpaired [foo] and [/foo] should not be replaced.
In JS I could set the RegExp object's lastIndex property to the beginning of the match so that it starts matching again from the beginning of the last match. I couldn't find any startIndex option for regex replacing in PHP, and working with substr() could also be inefficient. I've looked around for a PCRE anchor meaning "start next match at this position" or similar, but I had no luck.
Is there a better approach?
To clarify on unpaired tags, given the input:
[foo][foo][/foo]
I'm fine with either [bar][foo][/bar] or [foo][bar][/bar] as output. The former is the legacy behavior.

A full regex solution is not possible for this specific case.
Your solution adapted to match paired tags (in the common sense):
$pattern = '~\[foo]((?>[^[]++|\[(?!/?foo]))*)\[/foo]~';
$result = $text;
do {
$result = preg_replace($pattern, '[bar]$1[/bar]', $result, -1, $count);
} while ($count);
Another way that parses the string only once:
$arr = preg_split('~(\[/?foo])~', $text, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
$stack = array();
foreach ($arr as $key=>$item) {
if ($item == '[foo]') $stack[] = $key;
else if ($item == '[/foo]' && !empty($stack)) {
$arr[array_pop($stack)] = '[bar]';
$arr[$key] = '[/bar]';
}
}
$result = implode($arr);
The performance of this second script is independent of the nesting depth.
To answer the title question: yes, it is possible to find overlapping matches with a single regex. However, you can't perform a replacement with this kind of pattern. Example:
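To check how the split-and-stack approach treats unpaired tags, here is a small sketch that wraps the same logic in a function. The function name replacePairedTags is made up for illustration; the body follows the script above.

```php
<?php
// Hypothetical wrapper around the split-and-stack script above.
function replacePairedTags(string $text): string
{
    // Split on the tags and keep them as separate array items.
    $parts = preg_split('~(\[/?foo])~', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $stack = []; // indexes of opening tags that are still waiting for a closing tag

    foreach ($parts as $key => $item) {
        if ($item === '[foo]') {
            $stack[] = $key;
        } elseif ($item === '[/foo]' && $stack) {
            // Pair this closing tag with the most recent open one.
            $parts[array_pop($stack)] = '[bar]';
            $parts[$key] = '[/bar]';
        }
    }

    return implode('', $parts);
}

echo replacePairedTags('[foo][foo][/foo][/foo]'), "\n"; // [bar][bar][/bar][/bar]
echo replacePairedTags('[foo][foo][/foo]'), "\n";       // [foo][bar][/bar]
echo replacePairedTags('[/foo][foo]'), "\n";            // [/foo][foo]
```

With an unpaired opening tag, the innermost pair is replaced, which is the second of the two outputs the question accepts.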
$pattern = '~(?=(\[foo]((?>[^[]++|\[(?!/?foo)|(?1))*)\[/foo]))~';
preg_match_all($pattern, $text, $matches);
The trick is to use a lookahead and a capturing group. Note that the whole match is always an empty string, which is why you can't use this pattern with preg_replace.
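To see the lookahead trick in isolation, here is a minimal sketch. It uses the simpler lazy pattern from the question inside the lookahead (not the balanced pattern above), purely to show that the captured substrings can overlap while the match itself stays empty.

```php
<?php
$text = '[foo][foo][/foo][/foo]';

// The lookahead consumes nothing, so the next attempt starts one character
// further on and can capture text that overlaps the previous capture.
preg_match_all('~(?=(\[foo].*?\[/foo]))~', $text, $matches);

print_r($matches[1]); // the two overlapping captures
print_r($matches[0]); // the whole matches, all empty strings
```

Here $matches[1] holds '[foo][foo][/foo]' and '[foo][/foo]', and $matches[0] holds two empty strings, so there is nothing for preg_replace to replace.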

A better way to do this is to find the closing [/foo] and search backwards until you find an opening [foo] or [foo(space).*]. Replace the matched region with something else, and keep doing it until no closing tag is found. Do this with plain strpos/stripos or substr rather than regex.
It might be achievable with regex, but I've always done this kind of thing with regular seeks, as it's also faster.
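A rough sketch of that seek-based idea, assuming plain [foo] tags without attributes. The function name replacePairsBySeeking is hypothetical, and this is one possible reading of the description above rather than a reference implementation.

```php
<?php
// Hypothetical helper: pairs each closing tag with the nearest opening tag before it.
function replacePairsBySeeking(string $text): string
{
    $open = '[foo]';
    $close = '[/foo]';
    $offset = 0;

    // Find the next closing tag, starting at $offset.
    while (($end = strpos($text, $close, $offset)) !== false) {
        // Search backwards from the closing tag for the nearest opening tag.
        $start = strrpos(substr($text, 0, $end), $open);

        if ($start === false) {
            // Unpaired closing tag: leave it alone and move on.
            $offset = $end + strlen($close);
            continue;
        }

        $inner = substr($text, $start + strlen($open), $end - $start - strlen($open));
        $replacement = '[bar]' . $inner . '[/bar]';

        // Swap the whole [foo]...[/foo] region for its [bar] version.
        $text = substr_replace($text, $replacement, $start, $end + strlen($close) - $start);
        $offset = $start + strlen($replacement);
    }

    return $text;
}

echo replacePairsBySeeking('[foo][foo][/foo][/foo]'), "\n"; // [bar][bar][/bar][/bar]
echo replacePairsBySeeking('[foo][foo][/foo]'), "\n";       // [foo][bar][/bar]
```

Replaced opening tags become [bar], so later backward searches skip them and nested pairs are resolved from the inside out.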

Related

Find next word after colon in regex

I am getting a result returned from a Laravel console command, like
Some text as: 'Nerad'
Now I tried
$regex = '/(?<=\bSome text as:\s)(?:[\w-]+)/is';
preg_match_all( $regex, $d, $matches );
but it returns an empty result.
My guess is that something is wrong with the single quotes, so I need to change the regex.
Any ideas?
Note that you get no match because the ' before Nerad is not matched, nor checked with the lookbehind.
If you need to check the context, but avoid including it into the match, in PHP regex, it can be done with a \K match reset operator:
$regex = '/\bSome text as:\s*\'\K[\w-]+/i';
The output array structure will be cleaner than when using a capturing group and you may check for unknown width context (lookbehind patterns are fixed width in PHP PCRE regex):
$re = '/\bSome text as:\s*\'\K[\w-]+/i';
$str = "Some text as: 'Nerad'";
if (preg_match($re, $str, $match)) {
echo $match[0];
} // => Nerad
Just anchor at the end of the string and capture the word in a group. Group 1 will hold the required string.
/:\s*'(\w+)'$/
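A quick sketch of how that pattern could be used with preg_match. The sample string is the one from the question; the variable names are just for illustration.

```php
<?php
$str = "Some text as: 'Nerad'";
$word = null;

// The quotes are matched but sit outside the group,
// so group 1 contains only the word between them.
if (preg_match('/:\s*\'(\w+)\'$/', $str, $m)) {
    $word = $m[1];
}

echo $word; // Nerad
```

Unlike the \K version, \w+ here does not allow hyphens, so a value such as 'Ne-rad' would need [\w-]+ in the group.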

Search for matching words without false positives

I found this link and am working off of it, but I need to extend it a little further.
Check if string contains word in array
I am trying to create a script that checks a webpage for known bad words. I have one array with a list of bad words, and it compares it to the string from file_get_contents.
This works at a basic level, but returns false positives. For example, if I am loading a webpage with the word "title" it returns that it found the word "tit".
Is my best bet to strip all HTML and punctuation, then explode it based on spaces and put each individual word into an array? I am hoping there is a more efficient process than that.
Here is my code so far:
$url = 'http://somewebsite.com/';
$content = strip_tags(file_get_contents($url));
//list of bad words separated by commas
$badwords = 'tit,butt,etc'; //this will eventually come from a db
$badwordList = explode(',', $badwords);
foreach($badwordList as $bad) {
$place = strpos($content, $bad);
if (!empty($place)) {
$foundWords[] = $bad;
}
}
print_r($foundWords);
Thanks in advance!
You can just use a regex with preg_match_all():
$badwords = 'tit,butt,etc';
$regex = sprintf('/\b(%s)\b/', implode('|', explode(',', $badwords)));
if (preg_match_all($regex, $content, $matches)) {
print_r($matches[1]);
}
The second statement creates the regex that we use to match and capture the required words from the webpage. First, it splits the $badwords string on commas and joins the pieces with |. The resulting string is then used as the pattern, like so: /\b(tit|butt|etc)\b/. \b (a word boundary) ensures that only whole words are matched.
This pattern matches any of those words, and the words that are found in the webpage are stored in the array $matches[1].
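A slightly hardened sketch of the same idea, assuming the word list might one day contain regex metacharacters (it will come from a database). It escapes each word with preg_quote(), matches case-insensitively, and de-duplicates the hits. The sample $content string is made up for the demonstration.

```php
<?php
// Made-up page text: "title" must not be reported as "tit".
$content = 'The title of this page mentions a butt joke, etc.';
$badwordList = explode(',', 'tit,butt,etc');

// Escape each word so characters like . or + are treated literally.
$alternatives = array_map(function ($word) {
    return preg_quote($word, '/');
}, $badwordList);

$regex = '/\b(' . implode('|', $alternatives) . ')\b/i';

$foundWords = [];
if (preg_match_all($regex, $content, $matches)) {
    // Normalise case and drop duplicates.
    $foundWords = array_values(array_unique(array_map('strtolower', $matches[1])));
}

print_r($foundWords); // butt, etc
```

"title" is skipped because there is no word boundary between "tit" and the following "l".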

Make two simple regex's into one

I am trying to make a regex that will look behind .txt and then behind the "-" and get the first digit .... in the example, it would be a 1.
$record_pattern = '/.txt.+/';
preg_match($record_pattern, $decklist, $record);
print_r($record);
.txt?n=chihoi%20%283-1%29
I want to write this as one expression but can only seem to do it as two. This is the first time working with regex's.
You can use this:
$record_pattern = '/\.txt.+-(\d)/';
Now, the first group contains what you want.
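For illustration, this is roughly how the capturing-group version could be run against the sample string from the question. The digit ends up in $record[1], while $record[0] holds the whole match.

```php
<?php
$decklist = '.txt?n=chihoi%20%283-1%29';
$record_pattern = '/\.txt.+-(\d)/';
$digit = null;

if (preg_match($record_pattern, $decklist, $record)) {
    $digit = $record[1]; // group 1: the digit right after the hyphen
}

echo $digit; // 1
```

Because .+ is greedy, a string with several hyphens would give the digit after the last hyphen that is followed by a digit, not the first one.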
Your regex would be,
\.txt[^-]*-\K\d
You don't need any groups. The pattern matches from .txt up to the literal -. The \K discards everything matched so far (here, the string .txt?n=chihoi%20%283-) from the reported match, so only the first digit right after the - is returned.
Your PHP code would be,
<?php
$mystring = ".txt?n=chihoi%20%283-1%29";
$regex = '~\.txt[^-]*-\K\d~';
if (preg_match($regex, $mystring, $m)) {
$yourmatch = $m[0];
echo $yourmatch; // => 1
}

preg replace complete word using partial patterns in PHP

I am using preg_replace($oldWords, $newWords, $string); to replace an array of words.
I wish to replace all words starting with foo into hello, and all words starting with bar into world
i.e foo123 should change to hello , foobar should change to hello, barx5 should change to world, etc.
If my arrays are defined as:
$oldWords = array('/foo/', '/bar/');
$newWords = array('hello', 'world');
then foo123 changes to hello123 and not hello. similarly barx5 changes to worldx5 and not world
How do I replace the complete matched word?
Thanks.
This is actually pretty simple if you understand regex, as well as how preg_replace works.
Firstly, your replacement arrays are incorrectly formed. What is:
$oldWords = array('\foo\', '\bar\');
Should instead be:
$oldWords = array('/foo/', '/bar/');
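A backslash inside a single-quoted PHP string escapes a following single quote, so the trailing \' in '\foo\' escapes the closing quote instead of ending the string. The string literal does not terminate where you expect, which breaks the rest of your code. The pattern also needs a matching pair of delimiters, such as /.../.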
As to your actual question, however, you can achieve the desired effect with this:
$oldWords = array('/foo\w*/', '/bar\w*/');
\w matches any word character (letter, digit or underscore), and * is a quantifier meaning zero or more repetitions.
Together they make the regex match foo plus any number of word characters directly after it, and preg_replace replaces that whole match. To match only words that start with foo, rather than foo anywhere inside a word, add a word boundary: '/\bfoo\w*/'.
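A short sketch of the array form of preg_replace with the inputs from the question. The \b word boundary is an addition here (an assumption about what "words starting with foo" should mean), so that foo is only matched at the start of a word.

```php
<?php
$oldWords = array('/\bfoo\w*/', '/\bbar\w*/');
$newWords = array('hello', 'world');

$string = 'foo123 foobar barx5';

// Each pattern is applied in turn, paired with the replacement at the same index.
$result = preg_replace($oldWords, $newWords, $string);

echo $result; // hello hello world
```

foobar becomes hello because the first pattern consumes the whole word before the second pattern runs.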
One way to do it is to loop through the array and check each word. Since we are only checking the first three letters, I would use substr() instead of a regex, because regex functions are slower.
foreach( $oldWords as &$word ) { // by reference, so the assignment updates the array
$newWord = substr( $word, 0, 3 ); // the first three characters
if( $newWord === 'foo' ) {
$word = 'hello';
}
else if( $newWord === 'bar' ) {
$word = 'world';
}
}
unset( $word ); // break the reference left over from the loop

Replace after a needle in a string?

I have a string, something like
bbbbabbbbbccccc
Is there any way for me to replace all the letters "b" after the only letter "a" with "c" without having to split the string, using PHP?
bbbbacccccccccc
odd question.
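echo preg_replace('/a(.*)$/e', "'a'.strtr('$1', 'b', 'c')", 'bbbbabbbbbccccc');
preg_replace matches the first 'a' and everything to the right of it. The e modifier evaluates the replacement string as PHP code, and that code uses strtr() to replace the 'b's with 'c's. The backreference has to be quoted ('$1') so that the evaluated code sees a string literal. Note that the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0; preg_replace_callback() is the supported replacement.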
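On current PHP versions the same idea can be written with preg_replace_callback(), which passes the match array to a function instead of evaluating a string as code. A minimal sketch using the string from the question:

```php
<?php
$input = 'bbbbabbbbbccccc';

// Group 1 is everything after the first 'a'; the callback rewrites it
// and puts the 'a' back in front.
$output = preg_replace_callback('/a(.*)$/', function (array $m): string {
    return 'a' . strtr($m[1], 'b', 'c');
}, $input);

echo $output; // bbbbacccccccccc
```

If the input can contain newlines, adding the s modifier lets .* run across them.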
Here are three options.
First, a split. Yes, I know you want to do it without a split.
$string = 'bbbbabbbbbccccc';
$parts = preg_split('/(a)/', $string, 2, PREG_SPLIT_DELIM_CAPTURE);
// Parts now looks like:
// array('bbbb', 'a', 'bbbbbccccc');
$parts[2] = str_replace('b', 'c', $parts[2]);
$correct_string = join('', $parts);
Second, a position search and a substring replacement.
$string = 'bbbbabbbbbccccc';
$first_a_index = strpos($string, 'a');
if($first_a_index !== false) {
// Now, grab everything from that first 'a' to the end of the string.
$replaceable = substr($string, $first_a_index);
// Replace it.
$replaced = str_replace('b', 'c', $replaceable );
// Now splice it back in
$string = substr_replace($string, $replaced, $first_a_index);
}
Third, I was going to post a regex, but the one dqhendricks posted is just as good.
These code examples are verbose for clarity, and can be reduced to one-or-two-liners.
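As an example of that reduction, the position-search option might be condensed to something like the following. This is only a sketch; the variable names are arbitrary.

```php
<?php
$s = 'bbbbabbbbbccccc';

$pos = strpos($s, 'a');

// Keep everything before the first 'a' untouched; swap b for c in the rest.
$result = ($pos === false)
    ? $s
    : substr($s, 0, $pos) . str_replace('b', 'c', substr($s, $pos));

echo $result; // bbbbacccccccccc
```

The strict comparison with false matters, because an 'a' at position 0 would otherwise be treated as "not found".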
$s = 'bbbbabbbbbccccc';
echo preg_replace('/((?:(?!\A)\G|(?<!a)a(?!a))[^b]*)b/', '$1c', $s);
\G matches the position where the previous match ended. On the first match attempt, \G matches the beginning of the string like \A. We don't want that, so we use (?!\A) to prevent it.
(?<!a)a(?!a) matches an a that's neither preceded nor followed by an a. The a is captured in group #1 so we can plug it back into the replacement with $1.
This is a "pure regex" solution, meaning it does the whole job in one call to preg_replace and doesn't rely on embedded code and the /e modifier. It's good to know in case you ever find yourself working within those constraints, but it definitely shouldn't be your first resort.
