regex matches numbers, but not letters - php

I have a string that looks like this:
[if-abc] 12345 [if-def] 67890 [/if][/if]
I have the following regex:
/\[if-([a-z0-9-]*)\]([^\[if]*?)\[\/if\]/s
This matches the inner brackets just like I want it to. However, when I replace the 67890 with text (e.g. abcdef), it doesn't match it.
[if-abc] 12345 [if-def] abcdef [/if][/if]
I want to be able to match ANY characters, including line breaks, except for another opening bracket [if-.

This part doesn't work like you think it does:
[^\[if]
This matches a single character that is none of [, i or f, regardless of the order in which they appear; a character class cannot exclude a sequence of characters. You can mimic the desired behavior using a negative lookahead though:
~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s
I've also included closing tags in the lookahead, as this avoids the ungreedy repetition (which is usually worse performance-wise). Plus, I've changed the delimiters, so that you don't have to escape the slash in the pattern.
So this is the interesting part ((?:(?!\[/?if).)*) explained:
( # capture the contents of the tag-pair
(?: # start a non-capturing group (the ?: are just a performance
# optimization). this group represents a single "allowed" character
(?! # negative lookahead - makes sure that the next character does not mark
# the start of either [if or [/if (the negative lookahead will cause
# the entire pattern to fail if its contents match)
\[/?if
# match [if or [/if
) # end of lookahead
. # consume/match any single character
)* # end of group - repeat 0 or more times
) # end of capturing group
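As a quick check of the pattern above, here is a minimal PHP sketch (the variable names are illustrative, not from the question). It runs the lookahead-based pattern against the input that the original regex failed on:

```php
<?php
// Each character is consumed only if it does not start "[if" or "[/if".
$pattern = '~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s';
$input   = '[if-abc] 12345 [if-def] abcdef [/if][/if]';

if (preg_match($pattern, $input, $m)) {
    // $m[0] is the innermost tag pair, $m[1] the tag suffix, $m[2] its contents.
    var_export($m);
}
```

The outer [if-abc] pair is not matched because its contents contain another [if, which is exactly the behavior asked for.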

Modifying a little results in:
/\[if-([a-z0-9-]+)\](.+?)(?=\[if)/s
Running it on [if-abc] 12345 [if-def] abcdef [/if][/if]
Results in a first match as: [if-abc] 12345
Your groups are: abc and 12345
And modifying even further:
/\[if-([a-z0-9-]+)\](.+?)(?=(?:\[\/?if))/s
matches both groups. Although the delimiter [/if] is not captured by either of these.
NOTE: Instead of matching the delimiters I used a lookahead ((?=)) in the regex to stop when the text ahead matches the lookahead.
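A small sketch of how the second pattern behaves with preg_match_all(), assuming the sample string from the question (variable names are made up for illustration):

```php
<?php
$pattern = '/\[if-([a-z0-9-]+)\](.+?)(?=(?:\[\/?if))/s';
$input   = '[if-abc] 12345 [if-def] abcdef [/if][/if]';

// PREG_SET_ORDER groups the results per match instead of per capture group.
preg_match_all($pattern, $input, $sets, PREG_SET_ORDER);

foreach ($sets as $set) {
    // $set[1] is the tag suffix, $set[2] the text up to the next [if or [/if.
    printf("tag=%s content=%s\n", $set[1], trim($set[2]));
}
```

Because the lookahead does not consume anything, the closing [/if] is never part of a match.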

Use a period to match any character.

Related

Match regular expression specific character quantities in any order

I need to match a series of strings that:
Contain at least 3 numbers
0 or more letters
0 or 1 - (not more)
0 or 1 \ (not more)
These characters can be in any position in the string.
The regular expression I have so far is:
([A-Z0-9]*[0-9]{3,}[\/]?[\-]?[0-9]*[A-Z]*)
This matches the following data in the following cases. The only one that does not match is the first one:
02ABU-D9435
013DFC
1123451
03323456782
ADS7124536768
03SDFA9433/0
03SDFA9433/
03SDFA9433/1
A41B03423523
O4AGFC4430
I think perhaps I am being too prescriptive about positioning. How can I update this regex to match all possibilities?
PHP PCRE
The following would not match:
01/01/2018 [multiple / or -]
AA-AA [no numbers]
Thanks
One option could be using lookaheads to assert 3 digits, not 2 backslashes and not 2 times a hyphen.
(?<!\S)(?=(?:[^\d\s]*\d){3})(?!(?:[^\s-]*-){2})(?!(?:[^\s\\]*\\){2})[A-Z0-9/\\-]+(?!\S)
About the pattern
(?<!\S) Assert that what is on the left is not a non-whitespace char
(?=(?:[^\d\s]*\d){3}) Assert that what is on the right contains 3 digits, each optionally preceded by chars that are neither digits nor whitespace
(?!(?:[^\s-]*-){2}) Assert that what is on the right does not contain 2 hyphens before the next whitespace char
(?!(?:[^\s\\]*\\){2}) Assert that what is on the right does not contain 2 backslashes before the next whitespace char
[A-Z0-9/\\-]+ Match any of the listed chars 1+ times
(?!\S) Assert that what is on the right is not a non-whitespace char
Regex demo
Your patterns can be checked with positive/negative lookaheads anchored at the start of the string:
at least 3 digits -> find (not necessarily consecutive) 3 digits
no more than 1 '-' -> assert absence of (not necessarily consecutive) 2 '-' characters
no more than 1 '/' -> assert absence of (not necessarily consecutive) 2 '/' characters
0 or more letters -> no check needed.
If these conditions are met, any content is permitted.
The regex implementing this:
^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$
Check out this Regex101 demo.
Remark
This solution assumes that each string tested resides on its own line, i.e. not just being separated by whitespace.
In case the strings are separated by whitespace, choose the solution of user @TheFourthBird (which is essentially the same as this one but caters for the whitespace separation).
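A minimal sketch of this line-based pattern in PHP. The m modifier is an assumption added here so that ^ and $ anchor at each line; the input is a subset of the question's data:

```php
<?php
// One candidate per line, as the remark above requires.
$input = "02ABU-D9435\n013DFC\n01/01/2018\nAA-AA";

// m: ^ and $ match at line boundaries.
$pattern = '/^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$/m';

preg_match_all($pattern, $input, $out);
var_export($out[0]); // lines with 3+ digits and at most one "-" and one "/"
```

Here 01/01/2018 is rejected by the second negative lookahead and AA-AA by the digit lookahead.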
You can test the condition for both the hyphen and the slash into a same lookahead using a capture group and a backreference:
~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~
demo
Detailed (note the x modifier, which is needed for the whitespace and comments):
~ # using the tilde as the pattern delimiter avoids having to escape the slashes in the pattern
\A # start of the string
(?! .* ([-/]) .* \1 ) # negative lookahead:
# check that neither the hyphen nor the slash occurs more than once
(?: [A-Z/-]* \d ){3,} # at least 3 digits
[A-Z/-]* # any remaining characters until the end of the string
\z # end of the string.
~x
To better understand this (if you are not familiar with lookarounds): these three subpatterns start from the same position (in this case the beginning of the string):
\A
(?! .* ([-/]) .* \1 )
(?: [A-Z/-]* \d ){3,}
This is possible only because the first two are zero-width assertions that are simple tests and don't consume any characters.
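To try the backreference version against the question's data, a sketch along these lines should do; matchesCodeFormat() is a hypothetical helper name introduced only for this example:

```php
<?php
// Hypothetical helper wrapping the pattern from the answer above.
function matchesCodeFormat(string $s): bool
{
    // (?!.*([-/]).*\1) rejects a second "-" or a second "/";
    // (?:[A-Z/-]*\d){3,} requires at least three digits.
    return preg_match('~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~', $s) === 1;
}

foreach (['02ABU-D9435', '03SDFA9433/0', '01/01/2018', 'AA-AA'] as $candidate) {
    printf("%-14s %s\n", $candidate, matchesCodeFormat($candidate) ? 'match' : 'no match');
}
```

The first two candidates match; 01/01/2018 fails on the repeated slash and AA-AA on the missing digits.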

How to match multiple substrings that occur after a specific substring?

I am trying to read out the server names from a nginx config file.
I need to use regex on a line like this:
server_name this.com www.this.com someother-example.com;
I am using PHP's preg_match_all() and I've tried different things so far:
/^(?:server_name[\s]*)(?:(.*)(?:\s*))*;$/m
// no output
/^(?:server_name[\s]*)((?:(?:.*)(?:\s*))*);$/m
// this.com www.this.com someother-example.com
But I can't find the right one to list the domains as separate values.
[
0 => 'this.com',
1 => 'www.this.com',
2 => 'someother-example.com'
]
as Bob's your uncle wrote:
(?:server_name|\G(?!^))\s*\K[^;|\s]+
Does the trick!
The plain English requirement is to extract the space-delimited strings that immediately follow server_name then several spaces.
The dynamic duo of \G (start from the start / continue from the end of the last match) and \K (restart the fullstring match) will be the heroes of the day.
Code: (Demo)
$string = "server_name this.com www.this.com someother-example.com;";
var_export(preg_match_all('~(?:server_name +|\G(?!^) )\K[^; ]+~', $string, $out) ? $out[0] : 'no matches');
Output:
array (
0 => 'this.com',
1 => 'www.this.com',
2 => 'someother-example.com',
)
Pattern Explanation:
(?: # start of non-capturing group (to separate piped expressions from end of the pattern)
server_name + # literally match "server_name" followed by one or more spaces
| # OR
\G(?!^) # continue searching for matches immediately after the previous match, then match a single space
) # end of the non-capturing group
\K # restart the fullstring match (aka forget any previously matched characters in "this run through")
[^; ]+ # match one or more characters that are NOT a semicolon or a space
The reason that you see \G(?!^) versus just \G (which, for the record, will work just fine on your sample input) is because \G can potentially match from two different points by its default behavior. https://www.regular-expressions.info/continue.html
If you were to use the naked \G version of my pattern AND add a single space to the front of the input string, you would not make the intended matches. \G would successfully start at the beginning of the string, then match the single space, then server_name via the negated character class [^; ].
For this reason, disabling \G's "start at the start of the string" ability makes the pattern more stable/reliable/accurate.
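The difference can be seen with a small sketch; the input with a leading space is a made-up variation of the sample line:

```php
<?php
$input = ' server_name this.com www.this.com;'; // note the leading space

// With (?!^): \G may only continue from the end of a previous match.
preg_match_all('~(?:server_name +|\G(?!^) )\K[^; ]+~', $input, $guarded);

// Naked \G: it also succeeds at the very start of the string.
preg_match_all('~(?:server_name +|\G )\K[^; ]+~', $input, $naked);

var_export($guarded[0]); // only the two domains
var_export($naked[0]);   // 'server_name' is wrongly included as the first match
```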
preg_match_all() returns an array of matches. The first element [0] is a collection of fullstring matches (what is matched regardless of capture groups). If there are any capture groups, they begin from [1] and increment with each new group.
Because you need to match server_name before targeting the substrings to extract, using capture groups would mean a bloated output array and an unusable [0] subarray of fullstring matches.
To extract the desired space-delimited substrings and omit server_name from the results, \K is used to "forget" the characters that are matched prior to finding the desired substrings. https://www.regular-expressions.info/keep.html
Without the \K to purge the unwanted leading characters, the output would be:
array (
0 => 'server_name this.com',
1 => ' www.this.com',
2 => ' someother-example.com',
)
If anyone is comparing my answer to user3776824's or HamZa's:
I am electing to be very literal with space character matching. There are 4 spaces after server_name, so I could have used an exact quantifier {4} but opted for a bit of flexibility here. \s* isn't ideal because there will always be one or more spaces to match. I don't have a problem with \s, but to be clear it does match spaces, tabs, newlines, and carriage returns.
I am using (?!^) -- a negative lookahead -- versus (?<!^) -- a negative lookbehind -- because it does the same job with one less character. You will more commonly see \G(?!^) used by experienced regex craftsmen.
There is never a need to use "alternative" syntax (|) within a character class to separate values. user3776824's pattern will actually exclude pipes in addition to semicolons and spaces -- though I don't expect any negative impact in the outcome based on the sample data. The pipe in the pattern simply should not be written.

Find words without repeated characters using php regex

I have an initial string of words, like:
abab sbs abc ffuuu qwerty uii onnl ghj
And I would like to be able to extract only the words that do not contain repeated characters, so that the above string is returned as:
abc qwerty ghj
How to accomplish this task using Regular Expressions?
I guess the post is open again after a little rewording of the question.
This is moved from the comments, to the answer region.
A while ago I saw this style of problem in a question about no duplicate characters
that encompassed the entire string. I just translated it to word boundaries.
@Michael J Mulligan did a test case for it (see comments).
The links:
'Working Regex test (regex101.com/r/bA2wB0/1) and a working PHP example (ideone.com/7ID8Ct)'
# For NO duplicate letters anywhere within word characters
# -----------------------------------------------------------
# \b(?!\w*(\w)\w*\1)\w+
\b # Word boundary
# Only word chars now
(?! # Lookahead assertion (like a true/false conditional)
# It doesn't matter if the assertion is negative or positive.
# In this section, the engine is forced to match if it can,
# it has no choice, it can't backtrack its way out of here.
\w*
( \w ) # (1), Pick a word char, any word char
\w*
\1 # Now it is here again
# Ok, the expression matched, time to check if the assertion is correct.
) # End assertion
\w+ # It's here now, looks like the assertion let us through
# The assert is that no duplicate word chars ahead,
# so free to match word chars 'en masse'
# For ONLY duplicate letters anywhere within word characters
# just do the inverse. In this case, the inverse is changing
# the lookahead assertion to positive (want duplicates).
# -----------------------------------------------------------
# \b(?=\w*(\w)\w*\1)\w+
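A short PHP sketch applying both variants to the string from the question (variable names are illustrative only):

```php
<?php
$input = 'abab sbs abc ffuuu qwerty uii onnl ghj';

// Negative lookahead: words in which no character occurs twice.
preg_match_all('~\b(?!\w*(\w)\w*\1)\w+~', $input, $unique);

// Positive lookahead: the inverse, words with at least one repeated character.
preg_match_all('~\b(?=\w*(\w)\w*\1)\w+~', $input, $repeated);

echo implode(' ', $unique[0]), "\n";   // abc qwerty ghj
echo implode(' ', $repeated[0]), "\n"; // abab sbs ffuuu uii onnl
```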

Regular Expression to match ([^>(),]+) but include some \w's in it?

I'm using php's preg_replace function, and I have the following regex:
(?:[^>(),]+)
to match any characters but >(),. The problem is that I want to make sure that there is at least one letter in it (\w) and the match is not empty, how can I do that?
Is there a way to say what i DO WANT to match in the [^>(),]+ part?
You can add a lookahead assertion:
(?:(?=.*\p{L})[^>(),]+)
This makes sure that there will be at least one letter (\p{L}; \w also matches digits and underscores) somewhere in the string.
You don't really need the (?:...) non-capturing parentheses, though:
(?=.*\p{L})[^>(),]+
works just as well. Also, to ensure that we always match the entire string, it might be a good idea to surround the regex with anchors:
^(?=.*\p{L})[^>(),]+$
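A minimal sketch of the anchored version in PHP. The u modifier is an assumption added here so that \p{L} works on UTF-8 input, and the sample strings are made up:

```php
<?php
$pattern = '/^(?=.*\p{L})[^>(),]+$/u';

$samples = ['abc 123', '123 456', 'abc, def', ''];
$results = [];
foreach ($samples as $sample) {
    // 1 only if the string has a letter and none of the characters > ( ) ,
    $results[$sample] = preg_match($pattern, $sample);
    printf("'%s' => %d\n", $sample, $results[$sample]);
}
```

Only 'abc 123' passes: '123 456' has no letter, 'abc, def' contains a comma, and the empty string fails both the lookahead and the + quantifier.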
EDIT:
For the added requirement of not including surrounding whitespace in the match, things get a little more complicated. Try
^(?=.*\p{L})(\s*)((?:(?!\s*$)[^>(),])+)(\s*)$
In PHP, for example to replace all those strings we found with REPLACEMENT, leaving leading and trailing whitespace alone, this could look like this:
$result = preg_replace(
'/^ # Start of string
(?=.*\p{L}) # Assert that there is at least one letter
(\s*) # Match and capture optional leading whitespace (--> \1)
( # Match and capture... (--> \2)
(?: # ...at least one character of the following:
(?!\s*$) # (unless it is part of trailing whitespace)
[^>(),] # any character except >(),
)+ # End of repeating group
) # End of capturing group
(\s*) # Match and capture optional trailing whitespace (--> \3)
$ # End of string
/xu',
'\1REPLACEMENT\3', $subject);
You can just "insert" \w inside: (?:[^>(),]+\w[^>(),]+). So it will have at least one letter and obviously not be empty. BTW, \w matches digits and underscores as well as letters. If you want only letters, you can use the Unicode letter character class \p{L} instead of \w.
How about this:
(?:[^>(),]*\w[^>(),]*)

How does this PCRE pattern detect palindromes?

This question is an educational demonstration of the usage of lookahead, nested reference, and conditionals in a PCRE pattern to match ALL palindromes, including the ones that can't be matched by the recursive pattern given in the PCRE man page.
Examine this PCRE pattern in PHP snippet:
$palindrome = '/(?x)
^
(?:
(.) (?=
.*
(
\1
(?(2) \2 | )
)
$
)
)*
.?
\2?
$
/';
This pattern seems to detect palindromes, as seen in these test cases (see also on ideone.com):
$tests = array(
# palindromes
'',
'a',
'aa',
'aaa',
'aba',
'aaaa',
'abba',
'aaaaa',
'abcba',
'ababa',
# non-palindromes
'aab',
'abab',
'xyz',
);
foreach ($tests as $test) {
echo sprintf("%s '%s'\n", preg_match($palindrome, $test), $test);
}
So how does this pattern work?
Notes
This pattern uses a nested reference, which is a similar technique used in How does this Java regex detect palindromes?, but unlike that Java pattern, there's no lookbehind (but it does use a conditional).
Also, note that the PCRE man page presents a recursive pattern to match some palindromes:
# the recursive pattern to detect some palindromes from PCRE man page
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
The man page warns that this recursive pattern can NOT detect all palindromes (see: Why will this recursive regex only match when a character repeats 2n - 1 times? and also on ideone.com), but the nested reference/positive lookahead pattern presented in this question can.
Let's try to understand the regex by constructing it. Firstly, a palindrome must start and end with the same sequence of characters, in opposite directions:
^(.)(.)(.) ... \3\2\1$
We want to rewrite this so that the ... is only followed by a pattern of finite length, which makes it possible to convert it into a *. This is possible with a lookahead:
^(.)(?=.*\1$)
(.)(?=.*\2\1$)
(.)(?=.*\3\2\1$) ...
but the lookaheads still differ from line to line. What if we could "record" the previously captured groups? If that were possible, we could rewrite it as:
^(.)(?=.*(?<record>\1\k<record>)$) # \1 = \1 + (empty)
(.)(?=.*(?<record>\2\k<record>)$) # \2\1 = \2 + \1
(.)(?=.*(?<record>\3\k<record>)$) # \3\2\1 = \3 + \2\1
...
which could be converted into
^(?:
(.)(?=.*(\1\2)$)
)*
Almost good, except that \2 (the recorded capture) is not set initially, so the backreference will simply fail to match anything. We need it to match empty if the recorded capture doesn't exist yet. This is where the conditional expression comes in.
(?(2)\2|) # matches \2 if it exists, empty otherwise.
so our expression becomes
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*
Now it matches the first half of the palindrome. How about the 2nd half? Well, after the 1st half is matched, the recorded capture \2 will contain the 2nd half. So let's just put it in the end.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*\2$
We want to take care of odd-length palindrome as well. There would be a free character between the 1st and 2nd half.
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2$
This works well except in one case: when there is only 1 character. This is again because \2 is unset and so cannot match. So
^(?:
(.)(?=.*(\1(?(2)\2|))$)
)*.?\2?$
# ^ since \2 must be at the end in the look-ahead anyway.
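The finished pattern can be tried against a few of the question's test strings. This is only a sketch; isPalindrome() is a hypothetical helper name introduced for this example:

```php
<?php
// Hypothetical helper wrapping the pattern built step by step above.
function isPalindrome(string $s): bool
{
    // Group 2 grows by one character per iteration and records the mirrored suffix.
    return preg_match('/^(?:(.)(?=.*(\1(?(2)\2|))$))*.?\2?$/', $s) === 1;
}

foreach (['', 'a', 'aba', 'abba', 'abcba', 'aab', 'abab', 'xyz'] as $test) {
    printf("%d '%s'\n", isPalindrome($test), $test);
}
```

The first five strings should be reported as palindromes and the last three rejected, in line with the test list in the question.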
I want to bring my very own solution to the table.
This is a regex that I wrote a while ago to match palindromes using PCRE/PCRE2:
^((\w)(((\w)(?5)\5?)*|(?1)|\w?)\2)$
Example:
https://regex101.com/r/xvZ1H0/1
