PHP - Preg match reversal?

How do you invert a regular expression in PHP?
This is my code:
preg_match("!<div class=\"foo\">.*?</div>!is", $source, $matches);
This checks the $source string for everything within the container and stores it in the $matches variable.
But what I want to do is reverse the expression, i.e. I want to get everything that is NOT inside the container.
I know there is something called negative lookahead, but I am really bad with Regular expressions and didn't manage to come up with a working solution.
Simply using ?!, as in
preg_match("?!<div class=\"foo\">.*?</div>!is", $source, $matches);
does not seem to work.
Thanks!

New solution
Since your goal is to remove the matching divs, as mentioned in the comment, using the original regex with preg_split, plus implode would be the simpler solution:
implode('', preg_split('~<div class="foo">.*?</div>~is', $text))
Demo on ideone
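As a minimal sketch of the preg_split approach, assuming a made-up sample input (the $text value below is hypothetical, only the pattern comes from the answer):

```php
<?php
// Hypothetical sample input; the class name "foo" comes from the question.
$text = 'before<div class="foo">inside 1</div>middle<div class="foo">inside 2</div>after';

// Split on every matching div, then glue the remaining pieces back together.
$pieces  = preg_split('~<div class="foo">.*?</div>~is', $text);
$outside = implode('', $pieces);

echo $outside, PHP_EOL; // beforemiddleafter
```

The split pieces are exactly the text outside the matching divs, so joining them removes the divs in one pass.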
Old solution
I'm not sure whether this is a good idea, but here is my solution:
~(.*?)(?:<div class="foo">.*?</div>|$)~is
Demo on regex101
The result can be picked out of capturing group 1 of each match.
Note that the last match is always an empty string, and there can be an empty-string match between two adjacent matching divs or when the string starts with a matching div. However, you need to concatenate the captures anyway, so this is a non-issue.
The idea is to rely on the fact that the lazy quantifier .*? always tries the sequel (whatever comes after it) first before advancing itself. The effect is similar to a look-ahead assertion which makes sure that whatever is matched by .*? is not inside <div class="foo">.*?</div>.
The div tag is matched along in each match in order to advance the cursor past the closing tag. $ is used to match the text after the last matching div.
The s flag makes . match any character, including line separators.
Revision: I had to change .+? to .*?, since .+? cannot handle strings with two matching divs next to each other, or strings that start with a matching div.
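A small sketch of how the captures from this pattern could be collected with preg_match_all. The sample $source is an assumption, chosen so that two matching divs sit next to each other:

```php
<?php
// Hypothetical input with two adjacent matching divs.
$source = 'head<div class="foo">a</div><div class="foo">b</div>tail';

// Group 1 holds the text in front of each div (or in front of the end of the string).
preg_match_all('~(.*?)(?:<div class="foo">.*?</div>|$)~is', $source, $matches);

// Some captures are empty strings; concatenation simply ignores them.
$outside = implode('', $matches[1]);

echo $outside, PHP_EOL; // headtail
```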
Anyway, it's not a good idea to modify HTML with regular expressions. Use a parser instead.

<div class=\"foo\">.*?</div>\K|.
You can simply do this by using \K.
\K resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match.
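One way this pattern might be applied is with preg_match_all, joining the full matches afterwards. This is a sketch under assumptions: the sample $source and the use of implode are not part of the answer.

```php
<?php
// Hypothetical sample input.
$source = 'ab<div class="foo">x</div>cd';

// Either a whole div is consumed and then dropped by \K (leaving an empty match),
// or a single character outside a div is matched by the final dot.
preg_match_all('~<div class="foo">.*?</div>\K|.~is', $source, $m);

$outside = implode('', $m[0]);

echo $outside, PHP_EOL; // abcd
```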

Related

Query regarding Regex pattern

The data along with the regex pattern I'm using is linked here:
(?m)(?<=Note:)(\w+|\s+)*$
The sample text is:
Date:21
Month:03
Year:2017
Amount:50
Category:Test
Account:Testimg
Note:Tested
Date:21
Month:03
Year:2017
Amount:48
Category:Great
Account:Good
Note:Better
As you can imagine, I want all the text after the word "Note:" including the spaces and right up to the end of the line. I'm getting the results I need, but I'm not sure if this is a proper solution.
Is this the right way of going about it? Could it be made simpler?
Thank you.
Since your lines start with Note:, you need to use the ^ anchor before it. You may use capturing, as I suggested in my first comment:
preg_match_all('/^Note:(.+)/m', $s, $matches)
See this demo.
Here, ^Note:(.+) will assert the position at the start of the line, then Note: will get matched, and then any 1+ chars other than line break chars will get captured into Group 1, you will just need to access it using the right index.
Alternatively, use \K to drop the Note::
preg_match_all('/^Note:\K.+/m', $s, $matches)
See another regex demo
Here, ^Note:\K.+ will also match the Note: at the start of the line, and then the text will be dropped due to \K match reset operator, and then 1+ chars other than line break chars will get consumed and placed into the match buffer.
Note the $ anchor is not even necessary here, since .+ will only match greedily up to the end of line on its own.
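The two variants can be compared side by side. The sketch below uses a shortened version of the sample text from the question; the variable names are placeholders:

```php
<?php
// Shortened version of the sample text from the question.
$s = "Date:21\nAmount:50\nNote:Tested\nDate:21\nAmount:48\nNote:Better";

// Variant 1: capture the text after "Note:" in group 1.
preg_match_all('/^Note:(.+)/m', $s, $viaGroup);

// Variant 2: drop "Note:" from the match with \K, so the text is the full match.
preg_match_all('/^Note:\K.+/m', $s, $viaK);

print_r($viaGroup[1]); // Tested, Better
print_r($viaK[0]);     // Tested, Better
```

With the capturing variant the notes are at index 1; with \K they are at index 0.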
You can simplify this to just /Note:(.*)$/m (the g flag shown on regex101 does not exist in PHP; preg_match_all plays that role). I've updated your regex101 example. But other than that, yes, you're going about it the right way.

Fall back to beginning of string in RegEx

Is it possible to have a RegEx fall back to the beginning of the string and begin matching again?
Here's why I ask. Given the below string, I'd like to capture the sub strings black, red, blue, and green in that order, regardless of the order of occurrence in the subject string and only if all substrings are present in the subject string.
$str = 'blue-ka93-red-kdke3-green-weifk-black';
So, for all of the below strings, the RegEx should capture black, red, blue, and green (in that order)
'blue-ka93-red-kdke3-green-weifk-black'
'green-ka93-red-kdke3-blue-weifk-black'
'blue-ka93-black-kdke3-green-weifk-red'
'green-ka93-black-kdke3-blue-weifk-red'
I wonder if there isn't a way to match a capture group then fall back to the start of the string and find the next capture group. I was hoping that something like ^.*(?=(black))^.*(?=(red))^.*(?=(blue))^.*(?=(green)) would work but of course the ^ and lookaheads do not behave this way.
Is it possible to construct such a RegEx?
For context, I'll be using the RegEx in PHP.
You can use
^(?=.*(black))(?=.*(red))(?=.*(blue))(?=.*(green))
Note: This will require all these keywords to be in the string.
See demo
There is no way to reset the RegEx index when matching, so you can only use the capturing mechanism inside positive lookaheads anchored at the start. The lookaheads match at the empty position at the start of the string (due to ^), and each of those lookaheads in the RegEx above is executed one after another, provided the previous one returned true (found a string of text meeting its pattern).
Your RegEx did not work the same way because you matched and consumed the text with .* (this subpattern was outside the lookaheads) and repeated the start-of-string anchor, which automatically fails a RegEx if you do not use a multiline modifier.
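A short sketch of the lookahead pattern in PHP. The helper name colorsInOrder is hypothetical and only wraps preg_match for illustration:

```php
<?php
// Pattern from the answer: four capturing lookaheads anchored at the start.
$pattern = '/^(?=.*(black))(?=.*(red))(?=.*(blue))(?=.*(green))/';

// Hypothetical helper, not part of the original answer.
function colorsInOrder(string $pattern, string $subject): ?array
{
    if (preg_match($pattern, $subject, $m) !== 1) {
        return null; // at least one keyword is missing
    }
    // $m[0] is the empty overall match; groups 1 to 4 hold the keywords.
    return array_slice($m, 1);
}

$all     = colorsInOrder($pattern, 'green-ka93-black-kdke3-blue-weifk-red');
$missing = colorsInOrder($pattern, 'green-ka93-black-kdke3-blue');

print_r($all);      // black, red, blue, green
var_dump($missing); // NULL, because "red" is absent
```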
Why not just use capture groups to maintain the order?
^(?:(black)|(red)|(blue)|(green)|.)+$
This will match any string; all colors are optional.
See demo at regex101 or php demo at eval.in
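Because every color is optional in this pattern, the caller has to check that all four groups were filled. A possible sketch, where captureColors is a hypothetical helper name:

```php
<?php
// Hypothetical helper: returns the four colors in group order,
// or null if the pattern fails or any color group stayed empty.
function captureColors(string $str): ?array
{
    if (preg_match('/^(?:(black)|(red)|(blue)|(green)|.)+$/', $str, $m) !== 1) {
        return null;
    }
    $colors = [];
    for ($i = 1; $i <= 4; $i++) {
        // Unmatched groups are either absent or empty strings in $m.
        if (!isset($m[$i]) || $m[$i] === '') {
            return null;
        }
        $colors[] = $m[$i];
    }
    return $colors;
}

$all     = captureColors('blue-ka93-black-kdke3-green-weifk-red');
$partial = captureColors('blue-ka93-red');
```

Here $all holds black, red, blue, green, while $partial is null because two colors are missing.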

(PHP) How to find words beginning with a pattern and replace all of them?

I have a string. An example might be "Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2"
I want to replace any instance of "/r/x" and "/u/x" with "[/r/x](http://reddit.com/r/x)" and "[/u/x](http://reddit.com/u/x)" basically.
So I'm not sure how to 1) find "/r/" and then expand that to the rest of the word (until there's a space), then 2) take that full "/r/x" and replace with my pattern, and most importantly 3) do this for all "/r/" and "/u/" matches in a single go...
The only way I know to do this would be to write a function to walk the string, character by character, until I found "/", then look for "r" and "/" to follow; then keep going until I found a space. That would give me the beginning and ending characters, so I could do a string replacement; then calculate the new end point, and continue walking the string.
This feels... dumb. I have a feeling there's a relatively simple way to do this, and I just don't know how to google to get all the relevant parts.
A simple preg_replace will do what you want.
Try:
$string = preg_replace('#(/(?:u|r)/[a-zA-Z0-9_-]+)#', '[\1](http://reddit.com\1)', $string);
Here is an example: http://ideone.com/dvz2zB
You should see if you can discover what characters are valid in a Reddit name or in a Reddit username and modify the [a-zA-Z0-9_-] charset accordingly.
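Applied to the example string from the question, the call could look like this (a sketch; the character class is the answer's guess at valid name characters):

```php
<?php
$string = 'Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2';

// \1 in the replacement refers to the whole "/u/name" or "/r/name" capture.
$linked = preg_replace(
    '#(/(?:u|r)/[a-zA-Z0-9_-]+)#',
    '[\1](http://reddit.com\1)',
    $string
);

echo $linked, PHP_EOL;
```

All three references are replaced in a single call, since preg_replace handles every match in the subject by default.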
You are looking for a regular expression.
A basic pattern starts out as a fixed string: /u/ or /r/, each of which matches itself exactly. The two can be combined into /(?:u|r)/, which matches either one. Next you want to match everything from that point up to a space. Use a negated character class, [^ ], which matches any character that is not a space, and apply a quantifier, *, to match as many such characters as possible: /(?:u|r)/[^ ]*
You can take that pattern further and add a lookbehind, (?<= ), to ensure your match is preceded by a space so that you're not matching part of a longer string, which results in (?<= )/(?:u|r)/[^ ]*. Wrap all of that in parentheses to make a capturing group: ((?<= )/(?:u|r)/[^ ]*). This captures the contents within the parentheses for use in a replacement pattern. You can express your chosen replacement using the \1 reference to the first captured group: [\1](http://reddit.com\1).
In PHP you would pass the matching pattern, replacement pattern, and subject string to the preg_replace function.
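A sketch of the lookbehind variant passed to preg_replace. The sample input is an assumption, picked to show that a "/r/..." that is part of a URL is left alone because it is not preceded by a space:

```php
<?php
// Hypothetical input: one reference inside a URL, one standing alone.
$input = 'see http://reddit.com/r/php and /r/php';

// (?<= ) requires a space directly before the match.
$linked = preg_replace(
    '#((?<= )/(?:u|r)/[^ ]*)#',
    '[\1](http://reddit.com\1)',
    $input
);

echo $linked, PHP_EOL;
// see http://reddit.com/r/php and [/r/php](http://reddit.com/r/php)
```

Note that with this lookbehind a reference at the very start of the string is not matched either, since nothing precedes it.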
In my opinion regex would be overkill for such a simple operation. If you just want to replace instances of "/r/x" with "[/r/x](http://reddit.com/r/x)" and "/u/x" with "[/u/x](http://reddit.com/u/x)", you should use str_replace, although preg_replace will need less code.
str_replace("/r/x","[/r/x](http://reddit.com/r/x)","whatever_string");
Use regex for intricate search-and-replace. You can also use the http://www.jslab.dk/tools.regex.php regular expression generator if you have something complex to capture in the string.

PHP regex lookbehind with wildcard

I have two strings in PHP:
$string = '<a href="http://localhost/image1.jpeg" /></a>';
and
$string2 = '[caption id="attachment_5" align="alignnone" width="483"]<a href="http://localhost/image1.jpeg" /></a>[/caption]';
I'm trying to match strings of the first type. That is strings that are not surrounded by '[caption ... ]' and '[/caption]'. So far, I would like to use something like this:
$pattern = '/(?<!\[caption.*\])(?!\[\/caption\])(<a.*><img.*><\/a>)/';
but PHP matches out the first string as well with this pattern even though it is NOT preceded by '[caption' and zero or more characters followed by ']'. What gives? Why is this and what's the correct pattern?
Thanks.
Variable length look-behind is not supported in PHP, so this part of your pattern is not valid:
(?<!\[caption.*\])
It should be warning you about this.
In addition, .* always matches the largest possible amount. Thus your pattern may result in a match that spans multiple tags. Instead, use [^>]* (match anything that is not a closing angle bracket), because closing brackets should not occur inside the img tag.
To solve the look-behind problem, why not just check for the closing tag only? This should be sufficient (assuming the caption tags are only used in a way similar to what you have shown).
$pattern = '|(<a[^>]*><img[^>]*></a>)(?!\[/caption\])|';
When matching patterns that contain /, use another character as the pattern delimiter to avoid leaning toothpick syndrome. You can use nearly any non-alphanumeric character around the pattern.
Update: the previous regex is based on the example regex you gave, rather than the example data. If you want to match links that don't contain images, do this:
$pattern = '|(<a[^>]*>[^<]*</a>)(?!\[/caption\])|';
Note that this doesn't allow any tags in the middle of the link. If you allow tags (such as by using .*?), a regex could match something starting within the [caption] and ending elsewhere.
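The updated pattern can be checked against the two strings from the question. A minimal sketch; the result variable names are placeholders:

```php
<?php
$string  = '<a href="http://localhost/image1.jpeg" /></a>';
$string2 = '[caption id="attachment_5" align="alignnone" width="483"]<a href="http://localhost/image1.jpeg" /></a>[/caption]';

// Link with no nested tags, not followed by a closing [/caption].
$pattern = '|(<a[^>]*>[^<]*</a>)(?!\[/caption\])|';

$plain     = preg_match($pattern, $string, $m1);   // 1: standalone link matches
$captioned = preg_match($pattern, $string2, $m2);  // 0: link inside a caption is skipped

var_dump($plain, $captioned);
```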
I don't see how your regexp could match either string, since you're looking for <a.*><img.*><\/a>, and neither anchor contains an <img... tag. Also, the two subexpressions looking for and prohibiting the caption bits look oddly positioned to me. Finally, you need to ensure your tag-matching bits don't act greedy, i.e. don't use .* but [^>]*.
Do you mean something like this?
$pattern = '/(<a[^>]*>(<img[^>]*>)?<\/a>)(?!\[\/caption\])/';
Test it on regex101.
Edit: Removed useless lookahead as per dan1111's suggestion and updated regex101 link.
Lookbehind doesn't allow variable-length patterns (i.e. quantifiers such as *, +, ?). I think /<a.*><\/a>(?!\[\/caption\])/ is enough for your requirement.

recursive regular expression to process nested strings enclosed by {| and |}

In a project I have a text with patterns like that:
{| text {| text |} text |}
more text
I want to get the first part with brackets. For this I use preg_match recursively. The following code works fine already:
preg_match('/\{((?>[^\{\}]+)|(?R))*\}/x',$text,$matches);
But if I add the symbol "|", I got an empty result and I don't know why:
preg_match('/\{\|((?>[^\{\}]+)|(?R))*\|\}/x',$text,$matches);
I can't use the first solution because in the text something like { text } can also exist. Can somebody tell me what I do wrong here? Thx
Try this:
'/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/'
In your original regex you use the character class [^{}] to match anything except a delimiter. That's fine when the delimiters are only one character, but yours are two characters. To not-match a multi-character sequence you need something like this:
(?:(?!\{\||\|\}).)++
The dot matches any character (including newlines, thanks to the (?s)), but only after the lookahead has determined that it's not part of a {| or |} sequence. I also dropped your atomic group ((?>...)) and replaced it with a possessive quantifier (++) to reduce clutter. But you should definitely use one or the other in that part of the regex to prevent catastrophic backtracking.
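A quick sketch of the pattern in use. The first subject is the sample from the question; the second ($mixed) is a made-up string that contains a plain { text } pair inside the {| ... |} block:

```php
<?php
$pattern = '/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/';

// Sample text from the question: a nested block followed by more text.
$text = "{| text {| text |} text |}\nmore text";
preg_match($pattern, $text, $matches);
echo $matches[0], PHP_EOL; // {| text {| text |} text |}

// Hypothetical input: single braces are ordinary characters here.
$mixed = '{| a { b } c |} rest';
preg_match($pattern, $mixed, $mixedMatches);
echo $mixedMatches[0], PHP_EOL; // {| a { b } c |}
```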
You've got a few suggestions for working regular expressions, but if you're wondering why your original regexp failed, read on. The problem lies when it comes time to match a closing "|}" tag. The (?>[^{}]+) (or [^{}]++) sub expression will match the "|", causing the |} sub expression to fail. With no backtracking in the sub expression, there's no way to recover from the failed match.
See PHP - help with my REGEX-based recursive function
To adapt it to your use:
preg_match_all('/\{\|(?:(?!\{\||\|\}).|(?R))*\|\}/s', $text, $matches);
