PHP regex and adjacent capturing groups

I'm using capturing groups in regular expressions for the first time and I'm wondering what my problem is, as I assume that the regex engine looks through the string left-to-right.
I'm trying to convert an UpperCamelCase string into a hyphened-lowercase-string, so for example:
HelloWorldThisIsATest => hello-world-this-is-a-test
My precondition is an alphabetic string, so I don't need to worry about numbers or other characters. Here is what I tried:
mb_strtolower(preg_replace('/([A-Za-z])([A-Z])/', '$1-$2', "HelloWorldThisIsATest"));
The result:
hello-world-this-is-atest
This is almost what I want, except there should be a hyphen between a and test. I've already included A-Z in my first capturing group so I would assume that the engine sees AT and hyphenates that.
What am I doing wrong?

The Reason your Regex will Not Work: Overlapping Matches
Your regex matches sA in IsATest, allowing you to insert a - between the s and the A.
In order to insert a - between the A and the T, the regex would have to match AT.
This is impossible because the A has already been consumed as part of sA. Matches from a single regex pass cannot overlap.
Is all hope lost? No! This is a perfect situation for lookarounds.
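To see the non-overlapping behaviour for yourself, here is a minimal sketch (the variable names are illustrative, not from the question) that lists what the original pattern actually matches:

```php
<?php
// The asker's original pattern: one letter followed by one upper-case letter.
$subject = 'HelloWorldThisIsATest';
preg_match_all('/([A-Za-z])([A-Z])/', $subject, $matches);

// $matches[0] holds the full matches, in order. Each match consumes two
// characters, so the A in "sA" is no longer available to start "AT".
print_r($matches[0]); // oW, dT, sI, sA
```

AT never shows up in the list, which is why no hyphen lands between A and Test.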
Do it in Two Easy Lines
Here's the easy way to do it with regex:
$regex = '~(?<=[a-zA-Z])(?=[A-Z])~';
echo strtolower(preg_replace($regex,"-","HelloWorldThisIsATest"));
See the output at the bottom of the php demo:
Output: hello-world-this-is-a-test
The regex doesn't match any characters. Rather, it targets positions in the string: the positions where a letter is immediately followed by an upper-case letter. To do so, it uses a lookbehind and a lookahead:
The (?<=[a-zA-Z]) lookbehind asserts that what precedes the current position is a letter.
The (?=[A-Z]) lookahead asserts that what follows the current position is an upper-case letter.
We just replace these positions with a -, and convert the lot to lowercase.
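As a rough illustration of "targeting positions", the sketch below asks preg_match_all for offsets via PREG_OFFSET_CAPTURE. The variable names are my own. Every match is an empty string, and only the offsets differ:

```php
<?php
$subject = 'HelloWorldThisIsATest';
$regex = '~(?<=[a-zA-Z])(?=[A-Z])~';

// With PREG_OFFSET_CAPTURE each entry of $matches[0] is [matched text, byte offset].
preg_match_all($regex, $subject, $matches, PREG_OFFSET_CAPTURE);

$texts   = array_column($matches[0], 0); // all empty strings: nothing is consumed
$offsets = array_column($matches[0], 1); // positions just before W, T, I, A, T

echo implode(',', $offsets), PHP_EOL; // 5,10,14,16,17
```

Offsets 16 and 17 are adjacent, which is exactly the A/T case that the capturing-group version could not handle.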
If you look carefully on this regex101 screen, you can see lines between the words, where the regex matches.
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind

I've separated the two regular expressions for simplicity:
preg_replace(array('/([a-z])([A-Z])/', '/([A-Z]+)([A-Z])/'), '$1-$2', $string);
It processes the string twice to find:
lowercase -> uppercase boundaries
multiple uppercase letters followed by another uppercase letter
This will have the following behaviour:
ThisIsHTMLTest -> This-Is-HTML-Test
ThisIsATest -> This-Is-A-Test
Alternatively, use a look-ahead assertion. Because the look-ahead does not consume the capital letter it checks, that letter is still available to start the next match:
preg_replace('/([A-Z]+|[a-z]+)(?=[A-Z])/', '$1-', $string);
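A quick way to compare the two variants is to wrap each in a small function and run both on the same inputs. The function names below are hypothetical, chosen only for this sketch:

```php
<?php
// Variant 1: two patterns applied one after the other.
function hyphenateTwoPass(string $s): string {
    return preg_replace(['/([a-z])([A-Z])/', '/([A-Z]+)([A-Z])/'], '$1-$2', $s);
}

// Variant 2: a single pattern with a look-ahead.
function hyphenateLookahead(string $s): string {
    return preg_replace('/([A-Z]+|[a-z]+)(?=[A-Z])/', '$1-', $s);
}

foreach (['ThisIsHTMLTest', 'ThisIsATest'] as $input) {
    echo hyphenateTwoPass($input), ' | ', hyphenateLookahead($input), PHP_EOL;
}
// This-Is-HTML-Test | This-Is-HTML-Test
// This-Is-A-Test | This-Is-A-Test
```

Neither function lowercases the result; wrap the call in strtolower() or mb_strtolower() if you need the output from the question.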

To fix the interesting use case Jack mentioned in your comments (avoid splitting of abbreviations), I went with zx81's route of using lookahead and lookbehinds.
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
You can split it in two for the explanation:
First part
(?<= look behind to see if there is:
[a-z] any character of: 'a' to 'z'
) end of look-behind
(?= look ahead to see if there is:
[A-Z] any character of: 'A' to 'Z'
) end of look-ahead
(TL;DR: Match between strings of the CamelCase Pattern.)
Second part
(?<= look behind to see if there is:
[A-Z] any character of: 'A' to 'Z'
) end of look-behind
(?= look ahead to see if there is:
[A-Z] any character of: 'A' to 'Z'
[a-z] any character of: 'a' to 'z'
) end of look-ahead
(TL;DR: Special case, match between abbreviation and CamelCase pattern)
So your code would then be:
mb_strtolower(preg_replace('/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/', '-', "HelloWorldThisIsATest"));
Demo of matches
Demo of code
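Putting it together, here is a sketch of a helper (camelToKebab and camelToKebabSimple are made-up names, not PHP built-ins) that puts the abbreviation-aware pattern next to zx81's simpler one, to show where the two differ. It assumes the mbstring extension is available, as the question's own code does:

```php
<?php
// Hypothetical helper: hyphenate at case boundaries, keeping abbreviations together.
function camelToKebab(string $s): string {
    $regex = '/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/';
    return mb_strtolower(preg_replace($regex, '-', $s));
}

// zx81's pattern, for comparison: hyphenate before every upper-case letter
// that follows a letter.
function camelToKebabSimple(string $s): string {
    return mb_strtolower(preg_replace('/(?<=[a-zA-Z])(?=[A-Z])/', '-', $s));
}

echo camelToKebab('HelloWorldThisIsATest'), PHP_EOL; // hello-world-this-is-a-test
echo camelToKebab('ThisIsHTMLTest'), PHP_EOL;        // this-is-html-test
echo camelToKebabSimple('ThisIsHTMLTest'), PHP_EOL;  // this-is-h-t-m-l-test
```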

Related

How to capture all phrases that don't have a pattern in the middle of them?

I want to capture all strings that don't have the pattern _ a[a-z]* _ in the specified position in the example below:
<?php
$myStrings = array(
"123-456",
"123-7-456",
"123-Apple-456",
"123-0-456",
"123-Alphabet-456"
);
foreach($myStrings as $myStr){
echo var_dump(
preg_match("/123-(?!a[a-z]*)-456/i", $myStr)
);
}
?>
You can check the following solution at this Regex101 share link.
^(123-(?:(?![aA][a-zA-Z]*).*)-456)|(123-456)$
It uses a non-capturing group (?:) and a negative lookahead (?!) to find all inner sections that do not start with 'a' (or 'A') followed by any letters. The case with no inner section (123-456) is added (with the | sign) as a second alternative.
A lookahead is a zero-length assertion. The middle part also needs to be consumed to meet 456. For consuming use e.g. \w+- for one or more word characters and hyphen inside an optional group that starts with your lookahead condition. See this regex101 demo (i flag for caseless matching).
Furthermore, preg_grep can be used for searching an array (see php demo at tio.run).
preg_grep('~^123-(?:(?!a[a-z]*-)\w+-)?456$~i', $myStrings);
There is also an invert option: PREG_GREP_INVERT. If you don't need to check for start and end, a simpler pattern like -a[a-z]*- without a lookahead could be used (another php demo).
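A small sketch of both preg_grep variants on the sample array from the question ($kept and $inverted are names I picked). Note that preg_grep preserves the original array keys, so array_values is used here to reindex:

```php
<?php
$myStrings = ["123-456", "123-7-456", "123-Apple-456", "123-0-456", "123-Alphabet-456"];

// Keep the strings that match the anchored pattern with the lookahead.
$kept = array_values(
    preg_grep('~^123-(?:(?!a[a-z]*-)\w+-)?456$~i', $myStrings)
);

// Same idea the other way round: drop every string containing -a...-.
$inverted = array_values(
    preg_grep('~-a[a-z]*-~i', $myStrings, PREG_GREP_INVERT)
);

print_r($kept);     // 123-456, 123-7-456, 123-0-456
print_r($inverted); // 123-456, 123-7-456, 123-0-456
```

The inverted version is looser: it does not check that the string starts with 123- or ends with 456, so it only gives the same result when the input is already known to have that shape.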
Match the pattern and invert the result:
!preg_match('/a[a-z]*/i', $yourStr);
Don't try to do everything with a regex when programming languages exist to do the job.
You are not getting a match because in the pattern 123-(?!a[a-z]*)-456 the lookahead assertion (?!a[a-z]*) is always true: directly after matching the first - the pattern has to match another hyphen, so the pattern is effectively 123--456.
If you move the last hyphen inside the lookahead, like 123-(?!a[a-z]*-)456, you only get 1 match, for 123-456, because you are still not matching the middle part of the string.
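The two observations above can be checked with a short sketch. The $matching helper is something I made up for this example; it simply returns the strings from the question's array that a given pattern matches:

```php
<?php
$myStrings = ["123-456", "123-7-456", "123-Apple-456", "123-0-456", "123-Alphabet-456"];

// Hypothetical helper for this sketch: list the strings a pattern matches.
$matching = fn(string $pattern) => array_values(
    array_filter($myStrings, fn($s) => preg_match($pattern, $s) === 1)
);

// Original pattern: the character after "123-" has to be another hyphen.
$original = $matching('/123-(?!a[a-z]*)-456/i');               // nothing matches
$literal  = preg_match('/123-(?!a[a-z]*)-456/i', '123--456');  // 1

// Hyphen moved inside the lookahead: the middle part is still never consumed.
$moved = $matching('/123-(?!a[a-z]*-)456/i');                  // only "123-456"

print_r($original);
var_dump($literal);
print_r($moved);
```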
Another option with PHP is to consume the part that you don't want, and then use SKIP FAIL:
^123-(?:a[a-z]*-(*SKIP)(*F)|\w+-)?456$
Explanation
^ Start of string
123- Match literally
(?: Non capture group for the alternation
a[a-z]*-(*SKIP)(*F) Match a, then optional chars a-z, then match - and skip the match
| Or
\w+- Match 1+ word chars followed by -
)? Close the non capture group and make it optional to also match when there is no middle part
456 Match literally
$ End of string
Regex demo
Example
$myStrings = array(
"123-456",
"123-7-456",
"123-Apple-456",
"123-0-456",
"123-Alphabet-456",
"123-b-456"
);
foreach($myStrings as $myStr) {
if (preg_match("/^123-(?:a[a-z]*-(*SKIP)(*F)|\w+-)?456$/i", $myStr, $match)) {
echo "Match for $match[0]" . PHP_EOL;
} else {
echo "No match for $myStr" . PHP_EOL;
}
}
Output
Match for 123-456
Match for 123-7-456
No match for 123-Apple-456
Match for 123-0-456
No match for 123-Alphabet-456
Match for 123-b-456

PHP/Laravel trim all but last word in a namespace

Trying to trim a fully qualified namespace so as to use just the last word. An example namespace is App\Models\FruitTypes\Apple, where that final word could be any number of fruit types. Shouldn't this...
$fruitName = 'App\Models\FruitTypes\Apple';
trim($fruitName, "App\\Models\\FruitTypes\\");
...do the trick? It is returning an empty string. If I try to trim just App\\Models\\ it returns FruitTypes\Apples as expected. I know the backslash is an escape character, but doubling should treat those as actual backslashes.
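For background on why the call returns an empty string: the second argument of trim() is a list of individual characters to strip from both ends, not a prefix. Every character of the sample name appears somewhere in that list, so nothing survives. A minimal sketch (the $mask, $prefix and $short names are mine) that shows this, plus one plain string-function way to drop a known prefix:

```php
<?php
$fruitName = 'App\Models\FruitTypes\Apple';

// trim() strips any of these characters from both ends, in any order.
$mask = "App\\Models\\FruitTypes\\";
$trimmed = trim($fruitName, $mask);
var_dump($trimmed); // string(0) ""

// Removing a known prefix as a whole string instead.
$prefix = 'App\Models\FruitTypes\\';
$short = strpos($fruitName, $prefix) === 0
    ? substr($fruitName, strlen($prefix))
    : $fruitName;
echo $short, PHP_EOL; // Apple
```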
If you want to use native functionality for this rather than string manipulation, then ReflectionClass::getShortName will do the job:
$reflection = new ReflectionClass('App\\Models\\FruitTypes\\Apple');
echo $reflection->getShortName();
Apple
See https://3v4l.org/eVl9v
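preg_match() with the regex pattern \\([[:alpha:]]*)$ should do the trick. In a single-quoted PHP string each backslash of the regex must itself be escaped, so the regex \\ is written as four backslashes:
preg_match('/\\\\([[:alpha:]]*)$/', $fruitName, $trimmed);
Your result will then live in $trimmed[1]. If you don't mind the pattern being a bit less explicit, you could do: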
preg_match('/([[:alpha:]]*)$/', $fruitName, $trimmed);
And your result would then be in $trimmed[0].
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
preg_match - php.net
(matches is the third parameter that I named $trimmed, see documentation for full explanation)
An explanation for the regex pattern
\\ matches the character \ literally to establish the start of the match.
The parentheses () create a capturing group to return the match or a substring of the match.
In the capturing group ([[:alpha:]]*):
[:alpha:] matches an alphabetic character [a-zA-Z]
The * quantifier means match between zero and unlimited times, as many times as possible
Then $ asserts position at the end of the string.
So basically, "Find the last \ then return all letters between this and the end of the string".
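A minimal sketch of the explanation above, showing what ends up in the full match and in the capturing group. The variable name $matches is my choice, and the regex backslash is written as four backslashes because the pattern sits in a single-quoted PHP string:

```php
<?php
$fruitName = 'App\Models\FruitTypes\Apple';

// Regex seen by PCRE: \\([[:alpha:]]*)$
preg_match('/\\\\([[:alpha:]]*)$/', $fruitName, $matches);

echo $matches[0], PHP_EOL; // \Apple  (full match, including the backslash)
echo $matches[1], PHP_EOL; // Apple   (first capturing group)
```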

Regex capturing words that have at least one lowercase letter

I'm trying to capture words in a string like:
1vTvFpU
KOoy6Cc
With regex pattern:
\b(?=(?:.*?[a-z]){1,})[A-Za-z0-9\/\-_.]{7,7}\b
But I have a problem because it also matches words like:
FDSFDFI
WEWEFDP
RRRRRRR
In a string:
FDSFDFI sdfdfdf
WEWEFDP traliii
RRRRRRR sdfdfdf
What am I doing wrong?
I suggest you use \S* instead of .* inside the lookahead. When you put .*? inside the lookahead, it checks for at least one lower-case letter in the rest of the line, not just in the word.
\b(?=(?:\S*?[a-z]))[A-Za-z0-9\/\-_.]{7}\b
{7,7} is equal to {7}
DEMO
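The difference is easy to see on one line from the question. In this sketch the patterns are the ones quoted above, only with ~ as the delimiter; $loose and $strict are names I picked:

```php
<?php
$line = 'FDSFDFI sdfdfdf';

// Original lookahead: .*? may run past the word, so any lower-case letter
// later on the line satisfies it.
preg_match_all('~\b(?=(?:.*?[a-z]){1,})[A-Za-z0-9\/\-_.]{7,7}\b~', $line, $loose);

// Suggested fix: \S*? cannot cross whitespace, so the lower-case letter has
// to sit in the same whitespace-delimited chunk.
preg_match_all('~\b(?=(?:\S*?[a-z]))[A-Za-z0-9\/\-_.]{7}\b~', $line, $strict);

print_r($loose[0]);  // FDSFDFI, sdfdfdf
print_r($strict[0]); // sdfdfdf
```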
No need to use a lookahead to do that, character classes suffice:
[^\Wa-z]*+\w+
Then check the string length with PHP (for example with array_filter).
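A sketch of that two-step approach, with a sample string I assembled from the words in the question. Keep in mind that \w does not cover the /, - and . characters allowed by the original character class, so this is only equivalent for plain alphanumeric words:

```php
<?php
$text = '1vTvFpU KOoy6Cc FDSFDFI sdfdfdf RRRRRRR abc';

// [^\Wa-z]*+ possessively eats word characters that are not lower-case
// letters, so the following \w+ can only start on a lower-case letter.
preg_match_all('~[^\Wa-z]*+\w+~', $text, $m);

// Length check done in PHP rather than in the regex.
$words = array_values(array_filter($m[0], fn($w) => strlen($w) === 7));

print_r($words); // 1vTvFpU, KOoy6Cc, sdfdfdf
```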

REGEX - match words that contain letters repeating next to each other

I'm looking for a regex that matches words in which a letter (or several letters) is repeated more than once, with the repeats next to each other.
Here's an example:
This is an exxxmaple oooonnnnllllyyyyy!
So far I haven't found anything that can exactly match:
exxxmaple and oooonnnnllllyyyyy
I need to find them and place them in an array, like this:
preg_match_all('/\b(???)\b/', $str, $arr);
Can somebody explain what regexp I have to use?
You can use a very simple regex like
\S*(\w)(?=\1+)\S*
See how the regex matches at http://regex101.com/r/rF3pR7/3
\S matches anything other than a space
* quantifier, zero or more occurrences of \S
(\w) matches a single character, captures in \1
(?=\1+) positive lookahead. Asserts that the captured character is followed by itself \1
+ quantifier, one or more occurrences of the repeated character
\S* matches anything other than a space
EDIT
If the repeating must be more than once, a slight modification of the regex would do the trick
\S*(\w)(?=\1{2,})\S*
for example http://regex101.com/r/rF3pR7/5
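Here is a sketch of both patterns with preg_match_all, on the question's sentence with the word apple added by me to show the difference. The array names are illustrative:

```php
<?php
$str = 'This is an apple exxxmaple oooonnnnllllyyyyy!';

// Single quotes matter here: in a double-quoted PHP string "\1" would be
// read as an octal escape rather than a backreference.
preg_match_all('/\S*(\w)(?=\1+)\S*/', $str, $anyRepeat);
preg_match_all('/\S*(\w)(?=\1{2,})\S*/', $str, $tripleOrMore);

print_r($anyRepeat[0]);    // apple, exxxmaple, oooonnnnllllyyyyy!
print_r($tripleOrMore[0]); // exxxmaple, oooonnnnllllyyyyy!
```

Because \S matches any non-space character, the trailing ! is part of the last match.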
Use this if you want to discard words like apple etc.
\b\w*(\w)(?=\1\1+)\w*\b
or
\b(?=[^\s]*(\w)\1\1+)\w+\b
Try this. See demo.
http://regex101.com/r/kP8uF5/20
http://regex101.com/r/kP8uF5/21
You can use this pattern:
\b\w*?(\w)\1{2}\w*
The \w class and the word boundary \b limit the search to words. Note that the word boundary can be removed; however, it reduces the number of steps needed to obtain a match (as does the lazy quantifier). Note too that if you are looking for words (in the common meaning), you need to remove the word boundary and use [a-zA-Z] instead of \w.
(\w)\1{2} checks if a repeated character is present. A word character is captured in group 1 and must be followed twice by the content of the capture group (the backreference \1).
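A short sketch of this pattern on the question's sentence, with apple added by me as a word that should not match:

```php
<?php
$str = 'This is an apple exxxmaple oooonnnnllllyyyyy!';

// \w* stops at the "!", so punctuation is not part of the match.
preg_match_all('/\b\w*?(\w)\1{2}\w*/', $str, $arr);

print_r($arr[0]); // exxxmaple, oooonnnnllllyyyyy
```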

Regex: how to match a word that doesn't end with a specific character

I would like to match the whole "word"—one that starts with a number character and that may include special characters but does not end with a '%'.
Match these:
112 (whole numbers)
10-12 (ranges)
11/2 (fractions)
11.2 (decimal numbers)
1,200 (thousand separator)
but not
12% (percentages)
A38 (words starting with an alphabetic character)
I've tried these regular expressions:
(\b\p{N}\S*)
but that returns '12%' in '12%'
(\b\p{N}(?:(?!%)\S)*)
but that returns '12' in '12%'
Can I make an exception to the \S term that disregards %?
Or will I have to do something else?
I'll be using it in PHP, but just write as you would like and I'll convert it to PHP.
This matches your specification:
\b\p{N}\S*+(?<!%)
Explanation:
\b # Start of number
\p{N} # One Digit
\S*+ # Any number of non-space characters, match possessively
(?<!%) # Last character must not be a %
The possessive quantifier \S*+ makes sure that the regex engine will not backtrack into a string of non-space characters it has already matched. Therefore, it will not "give back" a % to match 12 within 12%.
Of course, that will also match 1!abc, so you might want to be more specific than \S which matches anything that's not a whitespace character.
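To illustrate the role of the possessive quantifier, this sketch runs the pattern with and without the extra + on the examples from the question (the input string and variable names are mine):

```php
<?php
$input = '112 10-12 11/2 11.2 1,200 12% A38';

// Possessive \S*+ : once "2%" is consumed it is never given back.
preg_match_all('/\b\p{N}\S*+(?<!%)/', $input, $possessive);

// Plain greedy \S* : the engine backtracks off the % and matches "12".
preg_match_all('/\b\p{N}\S*(?<!%)/', $input, $greedy);

print_r($possessive[0]); // 112, 10-12, 11/2, 11.2, 1,200
print_r($greedy[0]);     // 112, 10-12, 11/2, 11.2, 1,200, 12
```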
Can i make an exception to the \S term that disregards %
Yes you can:
[^%\s]
See this expression \b\d[^%\s]* here on Regexr
\d+([-/\.,]\d+)?(?!%)
Explanation:
\d+ one or more digits
(
[-/\.,] one "-", "/", "." or ","
\d+ one or more digits
)? the group above zero or one times
(?!%) not followed by a "%" (negative lookahead)
KISS (restrictive):
/[0-9][0-9.,-/]*\s/
Try this one:
preg_match("/^[0-9].*[^%]$/", $string);
Try this PCRE regex:
/^(\d[^%]+)$/
It should give you what you need.
I would suggest just:
(\b[\p{N},.-]++(?!%))
That's not very exact regarding decimal delimiters or ranges, for example. But the ++ possessive quantifier will eat up as many digits and separators as it can, so you really just need to check the following character with a simple assertion. It did work for your examples.
