Regex: Replace unknown number of occurrences after a given marker - php

I am trying to figure out a way of replacing each / with - in the query string of an href attribute in an HTML file that looks like this:
blah blah <a href="aaaaa/aaaaa/aaaaa/?q=43/23"> blah blah <a
href="aaaaa/aaaaa/aaaaa/?q=43/11/1"> blah blah blah
So basically I'm looking to make the two URLs end with ?q=43-23 and ?q=43-11-1 respectively.
How can I do this with a preg_replace? I can obviously get the 43/23 to be 43-23 with
/(\?.+?)\/(.+?)$/is
And I can get 43/11/1 to be 43-11-1 with
/(\?.+?)\/(.+?)\/(.+?)$/is
But how can I do this with a single regex, given that there may be any number of slashes after the ? character? Any suggestions, or can someone point me in the right direction?

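For content like yours, I think this can be done quite simply with a callback that matches the q parameter and replaces the slashes inside it: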
print preg_replace_callback('~\?q=([^&"]*)~', function($m) {
return '?q='. str_replace('/', '-', $m[1]);
}, $s);
// for PHP < 5.3.0
print preg_replace_callback('~\?q=([^&"]*)~', create_function(
'$m', 'return "?q=". str_replace("/", "-", $m[1]);'
), $s);
Output:
blah blah <a href="aaaaa/aaaaa/aaaaa/?q=43-23"> blah blah <a
href="aaaaa/aaaaa/aaaaa/?q=43-11-1"> blah blah blah
blah blah blah blah blah blah blah

This is not the simplest search and replace because of how regex engines handle repeated capture groups. Applying repeated capture group principles, you can use the regex to capture the repeating group and then do a simple string replace.
$result = preg_replace_callback('/
( # start capture
\? # question mark
.+? # reluctantly capture all until...
) # end capture
( # start capture
(?: # start group (no capture)
\/ # ...a literal slash
.+? # reluctantly capture all until...
) # end group
+ # repeat capture group
) # end capture
( # start capture
\b # ...a word boundary
) # end capture
/isx', function ($matches) {
return $matches[1] . str_replace('/', '-', $matches[2]) . $matches[3];
}, $str);
You do the string replace on the second match which is the repeated group capture. The word boundary at the end is necessary, but it can be replaced with something more sensible or correct such as " (if you know the URL ends here), or even ("|').

You can use this regex to match an unlimited amount of (slash) levels after the query parameter q=.
// Using tilde delimiters because hash signs are interpreted as comments here :)
~q=((?:[^/]+|/|)*)$~i
For example with the string "aaaaa/aaaaa/aaaaa/?q=43/11/1/5/10" the first captured group will contain 43/11/1/5/10.
Afterwards you can do the following to replace slashes with hyphens:
<?php $string = str_replace( '/', '-', $string );
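For comparison with the callback-based answers above, here is a minimal sketch of doing the replacement with a single preg_replace call and no callback. It relies on the \G anchor (the position where the previous match ended) and on \K (which drops everything matched so far from the reported match). The $html sample and the assumption that the value of q ends at a double quote or an ampersand are illustrative choices, not something stated in the question.

```php
<?php
// Sample input modeled on the question (assumed, shortened).
$html = 'blah <a href="aaaaa/aaaaa/aaaaa/?q=43/23"> blah <a href="aaaaa/aaaaa/aaaaa/?q=43/11/1"> blah';

// (?:\?q=|\G(?!^))  start at "?q=", or continue where the previous match ended
// [^/"&]*           skip the characters of the value that are not a slash
// \K/               forget what was matched so far, then match a single slash
$result = preg_replace('~(?:\?q=|\G(?!^))[^/"&]*\K/~', '-', $html);

echo $result, PHP_EOL;
```

Because every match must either begin at ?q= or continue directly from the previous match, the slashes in the path part of the URL are left alone.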

Related

How to get a string in regex and delete other after matching the string

My input is the following:
1 blah blah blah #username_. sblah sblah sblah
The output I need is:
username_.
So far I have this expression:
^.*\#([a-zA-Z0-9\.\_]+)$
which works on the following:
1 blah blah blah #username_.
but it does not work on the full line.
It finds the username and drops everything before it, but how can I make it also drop everything after the username?
Note: I use regex101 for testing; if you know a better tool, please mention it below.
Your pattern is anchored with ^ and $, which means it has to match the whole line, but it only describes the part up to the username.
Adding a .* before the $ makes it cover the full line, and it matches as expected.
"/^.*\#([a-zA-Z0-9\.\_]+).*$/"
https://3v4l.org/i4pVd
Another way to do it is to use an unanchored pattern like this.
It skips everything up to the # and then captures up to and including the next dot:
$str = "1 blah blah blah #username_. sblah sblah sblah";
preg_match("/.*?\#(.*?\.)/", $str, $match);
var_dump($match);
https://3v4l.org/mvBYI
To match the username in your example data, you could use preg_match and omit the $ that asserts the position at the end of the string. Note that you don't have to escape the #, nor the dot and underscore in the character class.
To get the username in your example data, you could also use:
#\K[\w.]+
That would match
# Match literally
\K Forget what was previously matched
[\w.]+ Match 1+ times a word character or a dot
$re = '/#\K[\w.]+/';
$str = '1 blah blah blah #username_. sblah sblah sblah #test';
preg_match($re, $str, $matches);
echo $matches[0]; // username_.
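Since the sample string above contains a second tag (#test), a small variation may be useful: the same pattern with preg_match_all collects every username rather than only the first. This is a sketch that reuses the $str value from the answer; nothing else is assumed.

```php
<?php
$str = '1 blah blah blah #username_. sblah sblah sblah #test';

// "#" is matched literally, \K drops it from the reported match,
// and [\w.]+ is the username itself (word characters and dots).
preg_match_all('/#\K[\w.]+/', $str, $matches);

print_r($matches[0]);
```

$matches[0] holds the full matches, which thanks to \K are the names without the leading #.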

PHP Regex Negation For Youtube URLs

Let's say I have HTML in a database that looks like this:
Hello world!
<a href="https://www.youtube.com/watch?v=m7t75u72vd">ABC</a>
Blah blah blah...
https://www.youtube.com/watch?v=df82vnx07s
Blah blah blah...
<p>https://www.youtube.com/watch?v=nvs70fh17f3fg</p>
Now I want to use PHP regex to grab the 2nd and 3rd URLs, but ignore the first.
The regex I have so far is:
\s*[a-zA-Z\/\/:\.]*youtu(be.com\/watch\?v=|.be\/)([a-zA-Z0-9\-_]+)
It works pretty well, but I don't know how to make it exclude/negate the first type of URL, one which starts with: href="
Please help, thanks!
You can use the "negative lookbehind" regular expression feature to accomplish what you're after. I've modified the very beginning of your regex by adding ((?<!href=[\'"])http) to implement one. Hope it helps!
$regex = '/((?<!href=[\'"])http)[a-zA-Z\/\/:\.]*youtu(be.com\/watch\?v=|.be\/)([a-zA-Z0-9\-_]+)/';
$useCases = [
1 => 'ABC',
2 => "<a href='https://www.youtube.com/watch?v=m7t75u72vd'>ABC</a>",
3 => 'https://www.youtube.com/watch?v=df82vnx07s',
4 => '<p>https://www.youtube.com/watch?v=nvs70fh17f3fg</p>'
];
foreach ($useCases as $index => $useCase) {
$matches = [];
preg_match($regex, $useCase, $matches);
if ($matches) {
echo 'The regex was matched in usecase #' . $index . PHP_EOL;
}
}
// Echoes:
// The regex was matched in usecase #3
// The regex was matched in usecase #4
All you need is to add a (?![^<]*>) negative lookahead that fails the match if it is followed by 0+ chars other than < and then a >:
[a-zA-Z\/:.]*youtu(?:be\.com\/watch\?v=|\.be\/)([a-zA-Z0-9\-_]+)(?![^<]*>)
                                                                ^^^^^^^^^^
Note I also escaped . symbols to match literal dots, and used a non-capturing group with be part. You may replace ([a-zA-Z0-9\-_]+) with [a-zA-Z0-9_-]+ if you are not interested in the capture, and you also may replace [a-zA-Z\/\/:\.]* part with a more precise pattern, like https?:\/\/[a-zA-Z.]*.
Example solution:
(?![^<]*>)[a-zA-Z\/\/:\.]*youtu(be.com\/watch\?v=|.be\/)([a-zA-Z0-9\-_]+)
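To see the trailing-lookahead variant in action, here is a minimal sketch that runs it over a document shaped like the one in the question and collects the video IDs. The $html sample (including the ID inside the href, taken from the use cases above) and the stricter https?:// prefix are assumptions made for the example.

```php
<?php
// Sample modeled on the question; the first URL sits inside an href attribute.
$html = <<<'HTML'
Hello world!
<a href="https://www.youtube.com/watch?v=m7t75u72vd">ABC</a>
Blah blah blah...
https://www.youtube.com/watch?v=df82vnx07s
Blah blah blah...
<p>https://www.youtube.com/watch?v=nvs70fh17f3fg</p>
HTML;

// The final (?![^<]*>) rejects a URL if a ">" follows before any "<",
// i.e. if the URL appears to be inside a tag.
$pattern = '~https?://[a-zA-Z.]*youtu(?:be\.com/watch\?v=|\.be/)([a-zA-Z0-9_-]+)(?![^<]*>)~';
preg_match_all($pattern, $html, $matches);

$ids = $matches[1];
print_r($ids);
```

One caveat of this heuristic: a plain-text URL followed by a literal > before the next < would also be rejected, so it suits markup where stray > characters are escaped.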

Replace a character between two words

I have a string like blah blah [START]Hello-World[END] blah blah.
I want to replace - with , between [START] and [END].
So the result should be blah blah [START]Hello,World[END] blah blah.
I would suggest to use preg_replace_callback:
$string = "blah-blah [START]Hello-World. How-are-you?[END] blah-blah" .
" [START]More-text here, [END] end of-message";
$string = preg_replace_callback('/(\[START\])(.*?)(\[END\])/', function($matches) {
return $matches[1] . str_replace("-", ",", $matches[2]). $matches[3];
}, $string);
echo $string;
Output:
blah-blah [START]Hello,World. How,are,you?[END] blah-blah [START]More,text here, [END] end of-message
The idea of the regular expression is to capture three parts: "[START]", "[END]" and the text between them. preg_replace_callback passes these three fragments to the callback function, which performs a simple str_replace on the middle part and returns the three fragments joined back together.
This way you are sure that the replacements will happen not only for the first occurrence (of the hyphen or whatever character you replace), but for every occurrence of it.
You will have to use regular expressions to accomplish what you need
$string = "blah blah [START]Hello-World[END] blah blah";
$string = preg_replace('/\[START\](.*)-(.*)\[END\]/', '[START]$1,$2[END]', $string);
Here's what the regular expression does:
\[START\] The backslashes escape the square brackets, so this matches the literal text [START].
(.*) This captures everything after [START] up to the dash and is referenced later as $1.
- This matches the character you want to replace, in our case the dash.
(.*) This captures everything after the dash and is referenced later as $2.
\[END\] This matches the literal text [END], which ends the pattern.
As for the replacement, [START]$1,$2[END], it rebuilds the matched text from the references $1 and $2 captured above, with a comma in place of the dash. Note that only one dash per match is replaced this way; because the first (.*) is greedy, it is the last dash before [END].
The var_dump of $string would be:
string(43) "blah blah [START]Hello,World[END] blah blah"
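When there can be several hyphens between the markers, a lookahead-based pattern is one possible alternative to the callback. The sketch below replaces a hyphen only if an [END] follows before the next [START]. It assumes the markers are well-formed pairs on a single line; the sample string is borrowed from the first answer.

```php
<?php
$string = 'blah-blah [START]Hello-World. How-are-you?[END] blah-blah [START]More-text here, [END] end of-message';

// -                       a hyphen ...
// (?=                     ... that is followed by
//   (?:(?!\[START\]).)*   any characters that do not begin a new [START]
//   \[END\]               and then an [END] marker
// )
$result = preg_replace('/-(?=(?:(?!\[START\]).)*\[END\])/', ',', $string);

echo $result, PHP_EOL;
```

A stray [END] without a preceding [START] would defeat this check, which is why the callback version is the safer choice for untrusted input.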

PHP - preg_match/preg_replace problems

I'm a little confused with preg_match and preg_replace. I have a very long content string (from a blog), and I want to find, separate and replace all [caption] tags. Possible tags can be:
[caption]test[/caption]
[caption align="center" caption="test" width="123"]<img src="...">[/caption]
[caption caption="test" align="center" width="123"]<img src="...">[/caption]
etc.
Here's the code I have (but I'm finding that it's not working the way I want it to...):
public function parse_captions($content) {
if(preg_match("/\[caption(.*) align=\"(.*)\" width=\"(.*)\" caption=\"(.*)\"\](.*)\[\/caption\]/", $content, $c)) {
$caption = $c[4];
$code = "<div>Test<p class='caption-text'>" . $caption . "</p></div>";
// Here, I'd like to ONLY replace what was found above (since there can be
// multiple instances)
$content = preg_replace("/\[caption(.*) width=\"(.*)\" caption=\"(.*)\"\](.*)\[\/caption\]/", $code, $content);
}
return $content;
}
The goal is to ignore the content position. You can try this:
$subject = <<<'LOD'
[caption]test1[/caption]
[caption align="center" caption="test2" width="123"][/caption]
[caption caption="test3" align="center" width="123"][/caption]
LOD;
$pattern = <<<'LOD'
~
\[caption # beginning of the tag
(?>[^]c]++|c(?!aption\b))* # followed by anything but c and ]
# or c not followed by "aption"
(?| # alternation group
caption="([^"]++)"[^]]*+] # the content is inside the beginning tag
| # OR
]([^[]+) # outside
) # end of alternation group
\[/caption] # closing tag
~x
LOD;
$replacement = "<div>Test<p class='caption-text'>$1</p></div>";
echo htmlspecialchars(preg_replace($pattern, $replacement, $subject));
pattern (condensed version):
$pattern = '~\[caption(?>[^]c]++|c(?!aption\b))*(?|caption="([^"]++)"[^]]*+]|]([^[]++))\[/caption]~';
pattern explanation:
After the beginning of the tag there can be content before the ] or before the caption attribute. This content is described with:
(?> # atomic group
[^]c]++ # all characters that are not ] or c, 1 or more times
| # OR
c(?!aption\b) # c not followed by aption (to avoid the caption attribute)
)* # zero or more times
The alternation group (?| (a branch reset group) allows the capture groups in each branch to share the same number:
(?|
# case: the target is in the caption attribute #
caption=" # (you can replace it by caption\s*+=\s*+")
([^"]++) # all that is not a " one or more times (capture group)
"
[^]]*+ # all that is not a ] zero or more times
| # OR
# case: the target is outside the opening tag #
] # square bracket close the opening tag
([^[]+) # all that is not a [ 1 or more times (capture group)
)
The two captures now have the same number, #1.
Note: if you are sure that no caption tag spans several lines, you can add the m modifier at the end of the pattern.
Note 2: all quantifiers are possessive, and I use atomic groups where possible, for quick failures and better performance.
Hint (and not an answer, per se)
Your best method of action would be:
Match everything after caption.
preg_match("#\[caption(.*?)\]#", $q, $match)
Use an explode function for extracting values in $match[1], if any.
explode(' ', trim($match[1]))
Check the values in array returned, and use in your code accordingly.
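The steps in this hint can be sketched as follows. One deviation, stated plainly: instead of explode(' ', ...), which would split a value such as caption="my test" at the space, the sketch collects name="value" pairs with preg_match_all. The function name parse_caption_attributes and the sample tag are hypothetical.

```php
<?php
// Hypothetical helper: returns the attributes of the first [caption ...] tag.
function parse_caption_attributes(string $content): array
{
    // Step 1: grab everything between "[caption" and the first "]".
    if (!preg_match('#\[caption(.*?)\]#', $content, $match)) {
        return [];
    }

    // Step 2: collect name="value" pairs; values may contain spaces.
    preg_match_all('/(\w+)="([^"]*)"/', $match[1], $pairs, PREG_SET_ORDER);

    // Step 3: build a name => value map the caller can inspect.
    $attributes = [];
    foreach ($pairs as $pair) {
        $attributes[$pair[1]] = $pair[2];
    }
    return $attributes;
}

$attributes = parse_caption_attributes(
    '[caption caption="my test" align="center" width="123"]<img src="...">[/caption]'
);
print_r($attributes);
```

Because the result is keyed by attribute name, the order in which the attributes appear in the tag no longer matters.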

Looping within a regular expression

Can a regex find a pattern for this?
{{foo.bar1.bar2.bar3}}
where the groups would be
$1 = foo $2 = bar1 $3 = bar2 $4 = bar3 and so on..
It would be like re-applying the expression over and over again until it fails to get a match.
The current expression I am working on is
(?:\{{2})([\w]+).([\w]+)(?:\}{2})
Here's a link from regexr.
http://regexr.com?3203h
--
OK, I guess I didn't explain well what I'm trying to achieve here.
Let's say I am trying to replace every
.barX inside a {{foo . . . }}
My expected result should be
$foo->bar1->bar2->bar3
This should work, assuming no braces are allowed within the match:
preg_match_all(
'%(?<= # Assert that the previous character(s) are either
\{\{ # {{
| # or
\. # .
) # End of lookbehind
[^{}.]* # Match any number of characters besides braces/dots.
(?= # Assert that the following regex can be matched here:
(?: # Try to match
\. # a dot, followed by
[^{}]* # any number of characters except braces
)? # optionally
\}\} # Match }}
) # End of lookahead%x',
$subject, $result, PREG_PATTERN_ORDER);
$result = $result[0];
I'm not a PHP person, but I managed to construct this piece of code here:
preg_match_all("([a-z0-9]+)",
"{{foo.bar1.bar2.bar3}}",
$out, PREG_PATTERN_ORDER);
foreach($out[0] as $val)
{
echo($val);
echo("<br>");
}
The code above prints the following:
foo
bar1
bar2
bar3
It should allow you to exhaustively search a given string by using a simple regular expression. I think that you should also be able to get what you want by removing the braces and splitting the string.
I don't think so, but it's relatively painless to just split the string on periods like so:
$str = "{{foo.bar1.bar2.bar3}}";
$str = str_replace(array("{","}"), "", $str);
$values = explode(".", $str);
print_r($values); // Yields an array with values foo, bar1, bar2, and bar3
EDIT: In response to your question edit, you could replace all barX in a string by doing the following:
$str = "{{foo.bar1.bar2.bar3}}";
$newStr = preg_replace("#bar\d#", "hi", $str);
echo $newStr; // outputs "{{foo.hi.hi.hi}}"
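I don't know the correct syntax in PHP for pulling out the results, but you could do:
\{{2}(\w+)(?:\.(\w+))*\}{2}
That would capture the first name in the first capturing group and the remaining ones through the second capturing group. Note that PCRE, and therefore PHP, keeps only the last repetition of a repeated group, so $2 would end up holding just bar3; the .NET regex engine keeps every capture. regexr.com is lacking the ability to show that as far as I can see though. Try out Expresso, and you'll see what I mean.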
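For the result the asker describes in the edit ($foo->bar1->bar2->bar3), one possible approach is to match the whole placeholder once and rewrite the dots inside a callback, rather than looping inside the regex. This is a sketch under the assumption that names consist only of word characters; the $template string is made up for the example.

```php
<?php
$template = 'Hello {{foo.bar1.bar2.bar3}} and {{user.name}}';

// Match a whole {{a.b.c}} placeholder, capturing the dotted path,
// then turn each dot into "->" and prefix the result with "$".
$result = preg_replace_callback(
    '~\{\{(\w+(?:\.\w+)*)\}\}~',
    function ($m) {
        return '$' . str_replace('.', '->', $m[1]);
    },
    $template
);

echo $result, PHP_EOL;
```

The repeated group is only used to validate the shape of the path; the actual splitting is left to str_replace, which sidesteps the last-repetition-only behaviour of repeated capture groups.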
