PHP PCRE regex with multiple SKIP FAIL in a pattern

PHP PCRE regex with multiple SKIP FAIL in a pattern - php

I have a simple string:
$string = '--#--%--%2B--';
I want to percent-encode all characters (inclusive the "lonely" %), except the - character and the triplets of the form %xy. So I wrote the following pattern alternatives:
$pattern1 = '/(?:[\-]+|%[A-Fa-f0-9]{2})(*SKIP)(*FAIL)|./us';
$pattern2 = '/(?:[\-]+)(*SKIP)(*FAIL)|(?:%[A-Fa-f0-9]{2})(*SKIP)(*FAIL)|./us';
Please notice the use of (multiple) (*SKIP)(*FAIL) and of (?:).
The result of matching and replacing is the same - and the correct one too:
--%23--%25--%2B--
I would like to ask:
Are the two patterns equivalent? If not, which one whould be the proper one to use for url-encoding? Could you please explain in few words, why?
Would you suggest other alternatives (implying backtracking control verbs), or are my patterns a good choice?
Can I apply only one (?:) around the whole (chosen) pattern, even if the (multiple) (*SKIP)(*FAIL) will be inside it?
I know that I request a little too much from you by asking more questions at once. Please accept my apology! Thank you very much.
P.S: I've tested with the following PHP code:
$result = preg_replace_callback($patternX, function($matches) {
return rawurlencode($matches[0]);
}, $string);
echo $result;

First of all, both the patterns leverage the SKIP-FAIL PCRE verb sequence that is quite a well-known "trick" to match some text and skip it. See How do (*SKIP) or (*F) work on regex? for some more details.
The two patterns yield the same results, (?:[\-]+|%[A-Fa-f0-9]{2})(*SKIP)(*FAIL) matches either [\-]+ or %[A-Fa-f0-9]{2} and then skips the match, and (?:[\-]+)(*SKIP)(*FAIL)|(?:%[A-Fa-f0-9]{2})(*SKIP)(*FAIL) first tries to match [\-]+ and skips it if found, and then tries to match %[A-Fa-f0-9]{2} and skips the match if it is found. The (?:...) non-capturing groups in the second pattern are redundant as there is no alternation inside and the groups are not quantified. You may use any number of (*SKIP)(*FAIL) in your pattern, just make sure you use them before the | to skip the relevant match.
SKIP-FAIL technique is used when you want to match some text in specific context, when a char should be skipped/"avoided" if it is preceded and followed with some chars, or when you need to "avoid" matching a whole sequence of chars, like in this scenario, thus, the SKIP-FAIL is good to use.

Related

Optional Group Expression

Today I was working with regular expressions at work and during some experimentation I noticed that a regex such as (\w|) compiled. This seems to be an optional group but looking online didn't yield any results.
Is there any practical use of having a group that matches something, but otherwise can match anything? What's the difference between that and (\w|.*)? Thanks.

(\w|) is a verbose way of writing \w?, which checks for \w first, then empty string.
I remove the capturing group, since it seems that () is used for grouping property only. If you actually need the capturing group, then (\w?).
On the same vein, (|\w) is a verbose way of writing \w??, which tries for empty string first, before trying for \w.
(\w|.*) is a different regex altogether. It tries to match (in that order) one word character \w, or 0 or more of any character (except line terminators) .*.
I can't imagine how this regex fragment would be useful, though.

PHP - Preg match reversal?

How do you inverse a Regex expression in PHP?
This is my code:
preg_match("!<div class=\"foo\">.*?</div>!is", $source, $matches);
This is checking the $source String for everything within the Container and stores it in the $matches variable.
But what I want to do is reversing the expression i.e. I want to get everything that is NOT inside the container.
I know there is something called negative lookahead, but I am really bad with Regular expressions and didn't manage to come up with a working solution.
Simply using ?!
preg_match("?!<div class=\"foo\">.*?</div>!is", $source, $matches);
Does not seem to work.
Thanks!

New solution
Since your goal is to remove the matching divs, as mentioned in the comment, using the original regex with preg_split, plus implode would be the simpler solution:
implode('', preg_split('~<div class="foo">.*?</div>~is', $text))
Demo on ideone
Old solution
I'm not sure whether this is a good idea, but here is my solution:
~(.*?)(?:<div class="foo">.*?</div>|$)~is
Demo on regex101
The result can be picked out from capturing group 1 of each matches.
Note that the last match is always an empty string, and there can be empty string match between 2 matching divs or if the string starts with matching div. However, you need to concatenate them anyway, so it seems to be a non-issue.
The idea is to rely on the fact that lazy quantifier .*? will always try the sequel (whatever comes after it) first before advancing itself, resulting in something similar to look-ahead assertion that makes sure that whatever matched by .*? will not be inside <div class="foo">.*?</div>.
The div tag is matched along in each match in order to advance the cursor past the closing tag. $ is used to match the text after the last matching div.
The s flag makes . matches any character, including line separators.
Revision: I had to change .+? to .*?, since .+? handle strings with 2 matching div next to each other and strings start with matching div.
Anyway, it's not a good idea to modify HTML with regular expression. Use a parser instead.

<div class=\"foo\">.*?</div>\K|.
You can simply do this by using \K.
\K resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match

(PHP) How to find words beginning with a pattern and replace all of them?

I have a string. An example might be "Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2"
I want to replace any instance of "/r/x" and "/u/x" with "[/r/x](http://reddit.com/r/x)" and "[/u/x](http://reddit.com/u/x)" basically.
So I'm not sure how to 1) find "/r/" and then expand that to the rest of the word (until there's a space), then 2) take that full "/r/x" and replace with my pattern, and most importantly 3) do this for all "/r/" and "/u/" matches in a single go...
The only way I know to do this would be to write a function to walk the string, character by character, until I found "/", then look for "r" and "/" to follow; then keep going until I found a space. That would give me the beginning and ending characters, so I could do a string replacement; then calculate the new end point, and continue walking the string.
This feels... dumb. I have a feeling there's a relatively simple way to do this, and I just don't know how to google to get all the relevant parts.

A simple preg_replace will do what you want.
Try:
$string = preg_replace('#(/(?:u|r)/[a-zA-Z0-9_-]+)#', '[\1](http://reddit.com\1)', $string);
Here is an example: http://ideone.com/dvz2zB
You should see if you can discover what characters are valid in a Reddit name or in a Reddit username and modify the [a-zA-Z0-9_-] charset accordingly.

You are looking for a regular expression.
A basic pattern starts out as a fixed string. /u/ or /r/ which would match those exactly. This can be simplified to match one or another with /(?:u|r)/ which would match the same as those two patterns. Next you would want to match everything from that point up to a space. You would use a negative character group [^ ] which will match any character that is not a space, and apply a modifier, *, to match as many characters as possible that match that group. /(?:u|r)/[^ ]*
You can take that pattern further and add a lookbehind, (?<= ) to ensure your match is preceded by a space so you're not matching a partial which results in (?<= )/(?:u|r)/[^ ]*. You wrap all of that to make a capturing group ((?<= )/(?:u|r)/[^ ]*). This will capture the contents within the parenthesis to allow for a replacement pattern. You can express your chosen replacement using the \1 reference to the first captured group as [\1](http://reddit.com\1).
In php you would pass the matching pattern, replacement pattern, and subject string to the preg_replace function.

In my opinion regex would be an overkill for such a simple operation. If you just want to replace instance of "/r/x" with "[r/x](http://reddit.com/r/x)" and "/u/x" with "[/u/x](http://reddit.com/u/x)" you should use str_replace although with preg_replace it'll lessen the code.
str_replace("/r/x","[/r/x](http://reddit.com/r/x)","whatever_string");
use regex for intricate search string and replace. you can also use http://www.jslab.dk/tools.regex.php regular expression generator if you have something complex to capture in the string.

How to match required characters in random order using regular expression?

I need to match text which has #, #, and any number in it. The characters can be in random position as long as they are in the text. Given this input:
abc##d9
a9b#c#d
##abc#9
abc9d##
a#b#c#d
The regex should match the first 3 lines. Currently my regex is:
/#.*?#.*?[0-9]/
Which doesn't work since it will only match the three chars in sequence. How to match the three chars in random order?

Found one of this ugly regex, if you really must use one:
/(?=.*#)(?=.*#)(?=.*[0-9]).*/
http://jsfiddle.net/BP53f/2/
The regex is basically using what they call lookahead
http://www.regular-expressions.info/lookaround.html
A simple case from the link above is trying to match q, followed by u, by doing q(?=u), that's why it's called lookahead, it finds q followed by u ahead.
Let's take one of your valid case: a9b#c#d
The first lookahead is (?=.*#), which states: Match anything, followed by a #. So it does, which is the string a9b#c, then since the match from the lookahead must be discarded, the engine steps back to the start of the string, which is an a. Then it goes to
(?=.*#), which states: Match anything that is followed by #, then it finds it at a9b. etc. The difference between using lookahead and (a)(b)(c) is basically the stepping back.
From the link above:
Let's take one more look inside, to make sure you understand the
implications of the lookahead. Let's apply q(?=u)i to quit. I have
made the lookahead positive, and put a token after it. Again, q
matches q and u matches u. Again, the match from the lookahead must be
discarded, so the engine steps back from i in the string to u. The
lookahead was successful, so the engine continues with i. But i cannot
match u. So this match attempt fails. All remaining attempts will fail
as well, because there are no more q's in the string.
It is ugly because it's difficult to maintain... You basically have 3 different sub-regex inside the brackets.

Use separate expressions to make sure # and # are present. Once they are, remove them and match for the rest of the characters/digits.

Decided I better write this as an answer:
$text = "a9b#c#d";
$themAll = "##";
$themAny = "0123456789";
echo (strspn($themAll, $text)==strlen($themAll) && strpbrk($text, $themAny));
For maintenance and some (limited) extending this should be as easy as it gets, especially whth longer $themAll lists.

Regex/PHP check if group of characters appears only once

I am trying to validate an input in PHP with REGEX. I want to check whether the input has the %s character group inside it and that it appears only once. Otherwise, the rule should fail.
Here's what I've tried:
preg_match('|^[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$|u', $value); (there are also some other rules besides this; I've tried the (%s){1} part and it doesn't work).
I believe it is a very easy solution to this, but I'm not really into REGEX's...Thank you for your help!

If I understand your question, you need a positive lookahead. The lookahead causes the expression to only match if it finds a single %s.
preg_match('|^(?=[^%s].*?[%s][^%s]*$)[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$|u', $value);
I'll explain how each part works
^(?=[^%s].*?[%s][^%s]*$) is a zero-width assertion -- (?=regex) a positive lookahead -- (meaning it must match, but does not "eat" any characters). It means that the whole line can have only 1 %s.
[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$ The remaining part of the regex also looks at the entire string and ensures that the whole string is composed only of the characters in the character class (like your original regex).

I managed to do this with PHP's substr_count() function, following Johnsyweb suggestion to use an alternate way to perform the validation and because the REGEX's suggested seem pretty complicated.
Thank you again!

Alternatively, you can use preg_match_all with your pattern and check the number of matches. If it's 1, then you're ok - something like this:
$result = (preg_match_all('|^[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$|u', $value) == 1)

Try this:
'|^(?=(?:(?!%s).)*%s(?:(?!%s).)*$)[0-9_\s:;,.?!()\p{L}-]+$|u'
The (%s){1} sequence inside the square brackets probably doesn't do what you think it does, but never mind, the solution is more complex. In fact, {1} should never appear anywhere in a regex. It doesn't ensure that there's only one of something, as many people assume. As a matter of fact, it doesn't do anything; it's pure clutter.
EDIT (in answer to the comment): To ensure that only one of a particular sequence is present in a string, you have to actively examine every single character, classifying it as either part-of-%s or not part-of-%s. To that end, (?:(?!%s).)* consumes one character at a time, after the negative lookahead has confirmed that the character is not the start of %s.
When that part of the lookahead expression quits matching, the next thing in the string has to be %s. Then the second (?:(?!%s).)*$ kicks in to confirm that there are no more %s sequences until the end of the string.
And don't forget that the lookahead expression must be anchored at both ends. Because the lookahead is the first thing after the main regex's start anchor you don't need to add another ^. But the lookahead must end with its own $ anchor.

If you're not "into" regular expressions, why not solve this with PHP?
One call to the builtin strpos() will tell you if the string has a match. A second call will tell you if it appears more than once.
This will be easier for you to read and for others to maintain.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

PHP PCRE regex with multiple SKIP FAIL in a pattern - php

Related

Optional Group Expression

PHP - Preg match reversal?

(PHP) How to find words beginning with a pattern and replace all of them?

How to match required characters in random order using regular expression?

Regex/PHP check if group of characters appears only once

Categories

Resources