Why doesn't this regexp match? - PHP

Below is a pattern that matches numbers. It almost works: the second line should match 99, but there is no match. Why?
(?<!\d[- ]|[\d.,])\(?-?(?:(?:[1-9]\d{0,2}(?:(?:[. ]\d{3})*|\d*))|0)(?:\b|[,]\d{1,3})-?\)?(?![\d.,\/]|-[\d\/])
100,00stk => 100,00
99stk => 99 \\ this is not matched
10,45stk => 10,45
https://regex101.com/r/nwRCKo/1

The main problem here is the word boundary: in 99stk there is no \b between the last 9 and the s, because both are word characters, and the other alternative, [,]\d{1,3}, cannot match there either. Fixing it is less obvious than just dropping \b, though.
Your regex matches numbers only in specific contexts, and the lookarounds on both sides are meant to reject the match completely when the context is wrong. If a negative lookahead follows an optional pattern such as \)?, the regex engine can backtrack into that optional pattern and still return a match. Once the word boundary is removed, you need to prevent that backtracking.
So, replace (?:\b|[,]\d{1,3}) with (?:[,]\d{1,3})? and make all the subsequent optional patterns atomic by applying possessive quantifiers:
(?<!\d[- ]|[\d.,])\(?-?(?:(?:[1-9]\d{0,2}(?:(?:[. ]\d{3})*|\d*))|0)(?:,\d{1,3})?+-?+\)?+(?![\d.,\/]|-[\d\/])
See this regex demo.
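As a quick check, here is a minimal PHP sketch that runs the fixed pattern over the three sample lines from the question. The helper name firstNumber and the ~ delimiter are illustrative choices, not part of the original answer.

```php
<?php
// Fixed pattern from the answer: no \b, possessive quantifiers on the optional tail.
$pattern = '~(?<!\d[- ]|[\d.,])\(?-?(?:(?:[1-9]\d{0,2}(?:(?:[. ]\d{3})*|\d*))|0)(?:,\d{1,3})?+-?+\)?+(?![\d.,\/]|-[\d\/])~';

// Hypothetical helper: return the first number found, or null when nothing matches.
function firstNumber(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

foreach (['100,00stk', '99stk', '10,45stk'] as $line) {
    echo $line, ' => ', firstNumber($pattern, $line) ?? '(no match)', PHP_EOL;
}
// 100,00stk => 100,00
// 99stk => 99
// 10,45stk => 10,45
```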

Related

regexp - match pattern and prefix before pattern

I need to match a specific pattern
(?<!\d|\d )(?:dk)?(\d{2})\D?(\d{2})\D?(\d{2})\D?(\d{2})(?!\d)
eg.
dk30344510
dk30 34 45 10
30344510
30 34 45 10
But I also need to fetch the "prefix" string before the pattern
This is my solution, but it doesn't always work
^(.*)(?<!\d|\d )(?:dk)?(\d{2})\D?(\d{2})\D?(\d{2})\D?(\d{2})(?!\d)
It's hard to explain so check it here.
https://regex101.com/r/fM1xD3/2
It's too "greedy" and swallows several patterns in the string: the first actual match ends up as part of the "prefix" of the second match.
The example should output two matches: one with dk30344510 and one with 62226420.
The first match should have CVR-nr. as the prefix and dk30344510 as the pattern; the second match should have / Tlf. as the prefix and 62226420 as the pattern.
Your regex doesn't produce the expected results because it has a start-of-string anchor ^ followed by a greedy .*. The match can only start at the beginning of the string, so you get one successful match at most, and the greedy .* pushes the prefix as far to the right as possible.
Solution
Regex:
\s*(.*?)\s*\b((?i:dk)?(?:\d{2}\D?){3}\d{2})\b
I didn't change much in your main regex. I collapsed the repeated \d{2}\D? into a quantified group and replaced the lookarounds with the word boundary token \b.
Live demo
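To see what the pattern above returns, here is a small sketch using preg_match_all. The input string is hypothetical, reconstructed from the expected output described in the question (the original test string lives only in the linked demo).

```php
<?php
$pattern = '/\s*(.*?)\s*\b((?i:dk)?(?:\d{2}\D?){3}\d{2})\b/';

// Hypothetical input, built from the prefixes and numbers named in the question.
$subject = 'CVR-nr. dk30344510 / Tlf. 62226420';

// PREG_SET_ORDER groups the results per match: $m[1] is the prefix, $m[2] the number.
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo 'prefix: "', $m[1], '", number: "', $m[2], '"', PHP_EOL;
}
// prefix: "CVR-nr.", number: "dk30344510"
// prefix: "/ Tlf.", number: "62226420"
```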
You can try this one with the 'g' option to get multiple results (PHP has no 'g' flag; use preg_match_all there):
^(.*?)\s(dk\d+)\s(.*?)\s(\d+)
demo

Php regex that matches substring followed by any length of character and then comma

I have a long string containing Copyright: 'any length of unknown string here',
what regex should I write to exactly match this as substring in a string?
I tried this preg_replace('/Copyright:(.*?)/', 'mytext', $str); but it's not working: it only matches the Copyright: part.
A lazily quantified pattern at the end of the pattern will always match no text in case of *? and 1 char only in case of +?, i.e. will match as few chars as possible to return a valid match.
You need to make sure you get to the ', by putting them into the pattern:
'/Copyright:.*?\',/'
^^^
See the regex demo
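The difference is easy to see side by side. The sketch below uses a made-up input string; only the two patterns come from the question and the answer.

```php
<?php
// Hypothetical input string for illustration.
$str = "Header Copyright: 'Example Corp 2020', footer";

// Lazy group at the very end of the pattern: it matches nothing, so only "Copyright:" is replaced.
$before = preg_replace('/Copyright:(.*?)/', 'mytext', $str);

// With the closing quote and comma in the pattern, the lazy part has to reach them.
$after = preg_replace('/Copyright:.*?\',/', 'mytext', $str);

echo $before, PHP_EOL; // Header mytext 'Example Corp 2020', footer
echo $after, PHP_EOL;  // Header mytext footer
```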
The ? in your group 1 (.*?) makes this block lazy, i.e. it matches as few characters as possible. Removing it and adding the closing ', would solve it:
Copyright:(.*)',
However, the greedy version matches up to the last ', on that line, so if more text follows on the same line, make sure to limit the pattern further. The parentheses () are only there to make the group easier to see; you can do without them.
I usually use Regxr.com to test my regular expressions, and there are many similar tools online. Note that this one has a great UX but does not support lookbehind.

REGEX - match words that contain letters repeating next to each other

I'm looking for a regex that matches words in which a letter is repeated more than once in a row.
Here's an example:
This is an exxxmaple oooonnnnllllyyyyy!
So far I haven't found anything that matches exactly:
exxxmaple and oooonnnnllllyyyyy
I need to find them and place them in an array, like this:
preg_match_all('/\b(???)\b/', $str, $arr);
Can somebody explain what regexp I have to use?
You can use a very simple regex like
\S*(\w)(?=\1+)\S*
See how the regex matches at http://regex101.com/r/rF3pR7/3
\S matches anything other than whitespace
* quantifier, zero or more occurrences of \S
(\w) matches a single word character and captures it in \1
(?=\1+) positive lookahead, asserts that the captured character is followed by itself (\1)
+ quantifier, one or more occurrences of the repeated character
\S* matches anything other than whitespace
EDIT
If the repeating must be more than once, a slight modification of the regex would do the trick
\S*(\w)(?=\1{2,})\S*
for example http://regex101.com/r/rF3pR7/5
Use this if you want to discard words like apple:
\b\w*(\w)(?=\1\1+)\w*\b
or
\b(?=[^\s]*(\w)\1\1+)\w+\b
Try these. See the demos:
http://regex101.com/r/kP8uF5/20
http://regex101.com/r/kP8uF5/21
You can use this pattern:
\b\w*?(\w)\1{2}\w*
The \w class and the word boundary \b limit the search to words. The word boundary can be removed, but keeping it (like the lazy quantifier) reduces the number of steps needed to find a match. Note too that if you are looking for words in the common sense, you need to remove the word boundary and use [a-zA-Z] instead of \w.
(\w)\1{2} checks if a repeated character is present. A word character is captured in group 1 and must be followed with the content of the capture group (the backreference \1).
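For comparison, here is a short sketch that runs the last pattern and the earlier \S-based pattern against the sentence from the question. The variable names are illustrative. One visible difference: \S also matches punctuation, so the \S version keeps the trailing "!".

```php
<?php
$str = 'This is an exxxmaple oooonnnnllllyyyyy!';

// \w-based pattern: the match stops at the end of the word, so "!" is excluded.
preg_match_all('/\b\w*?(\w)\1{2}\w*/', $str, $wordMatches);

// \S-based pattern: \S matches any non-whitespace, so "!" becomes part of the match.
preg_match_all('/\S*(\w)(?=\1{2,})\S*/', $str, $nonSpaceMatches);

print_r($wordMatches[0]);     // exxxmaple, oooonnnnllllyyyyy
print_r($nonSpaceMatches[0]); // exxxmaple, oooonnnnllllyyyyy!
```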

PHP regex and adjacent capturing groups

I'm using capturing groups in regular expressions for the first time and I'm wondering what my problem is, as I assume that the regex engine looks through the string left-to-right.
I'm trying to convert an UpperCamelCase string into a hyphened-lowercase-string, so for example:
HelloWorldThisIsATest => hello-world-this-is-a-test
My precondition is an alphabetic string, so I don't need to worry about numbers or other characters. Here is what I tried:
mb_strtolower(preg_replace('/([A-Za-z])([A-Z])/', '$1-$2', "HelloWorldThisIsATest"));
The result:
hello-world-this-is-atest
This is almost what I want, except there should be a hyphen between a and test. I've already included A-Z in my first capturing group so I would assume that the engine sees AT and hyphenates that.
What am I doing wrong?
The Reason your Regex will Not Work: Overlapping Matches
Your regex matches sA in IsATest, allowing you to insert a - between the s and the A
In order to insert a - between the A and the T, the regex would have to match AT.
This is impossible because the A is already matched as part of sA. You cannot have overlapping matches in direct regex.
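The consumed pairs can be listed directly. This is a small sketch using the pattern from the question; strtolower stands in for mb_strtolower here, which is safe only because the input is plain ASCII.

```php
<?php
$subject = 'HelloWorldThisIsATest';

// Each match consumes two letters, so a letter cannot take part in two matches.
preg_match_all('/([A-Za-z])([A-Z])/', $subject, $matches);
print_r($matches[0]); // oW, dT, sI, sA (no "AT": the A was already consumed by "sA")

$result = strtolower(preg_replace('/([A-Za-z])([A-Z])/', '$1-$2', $subject));
echo $result, PHP_EOL; // hello-world-this-is-atest
```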
Is all hope lost? No! This is a perfect situation for lookarounds.
Do it in Two Easy Lines
Here's the easy way to do it with regex:
$regex = '~(?<=[a-zA-Z])(?=[A-Z])~';
echo strtolower(preg_replace($regex,"-","HelloWorldThisIsATest"));
See the output at the bottom of the php demo:
Output: hello-world-this-is-a-test
The regex doesn't match any characters. Rather, it targets positions in the string: the positions between the change in letter case. To do so, it uses a lookbehind and a lookahead
The (?<=[a-zA-Z]) lookbehind asserts that what precedes the current position is a letter
The (?=[A-Z]) lookahead asserts that what follows the current position is an upper-case letter.
We just replace these positions with a -, and convert the lot to lowercase.
If you look carefully on this regex101 screen, you can see lines between the words, where the regex matches.
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
I've separated the two regular expressions for simplicity:
preg_replace(array('/([a-z])([A-Z])/', '/([A-Z]+)([A-Z])/'), '$1-$2', $string);
It processes the string twice to find:
lowercase -> uppercase boundaries
multiple uppercase letters followed by another uppercase letter
This will have the following behaviour:
ThisIsHTMLTest -> This-Is-HTML-Test
ThisIsATest -> This-Is-A-Test
Alternatively, use a look-ahead assertion (because the look-ahead does not consume the capital letter it tests, that letter can be reused by the next match):
preg_replace('/([A-Z]+|[a-z]+)(?=[A-Z])/', '$1-', $string);
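Both variants can be checked against the two sample strings. The wrapper functions below (hyphenateTwoPass, hyphenateLookahead) are hypothetical names added only to make the sketch easy to run.

```php
<?php
// Two-pass version: lowercase->uppercase boundaries first, then runs of capitals.
function hyphenateTwoPass(string $s): string
{
    return preg_replace(['/([a-z])([A-Z])/', '/([A-Z]+)([A-Z])/'], '$1-$2', $s);
}

// Look-ahead version: the capital after the run is tested but not consumed.
function hyphenateLookahead(string $s): string
{
    return preg_replace('/([A-Z]+|[a-z]+)(?=[A-Z])/', '$1-', $s);
}

foreach (['ThisIsHTMLTest', 'ThisIsATest'] as $s) {
    echo $s, ' -> ', hyphenateTwoPass($s), ' / ', hyphenateLookahead($s), PHP_EOL;
}
// ThisIsHTMLTest -> This-Is-HTML-Test / This-Is-HTML-Test
// ThisIsATest -> This-Is-A-Test / This-Is-A-Test
```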
To fix the interesting use case Jack mentioned in your comments (avoid splitting of abbreviations), I went with zx81's route of using lookahead and lookbehinds.
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
You can split it in two for the explanation:
First part
(?<= look behind to see if there is:
[a-z] any character of: 'a' to 'z'
) end of look-behind
(?= look ahead to see if there is:
[A-Z] any character of: 'A' to 'Z'
) end of look-ahead
(TL;DR: Match between strings of the CamelCase Pattern.)
Second part
(?<= look behind to see if there is:
[A-Z] any character of: 'A' to 'Z'
) end of look-behind
(?= look ahead to see if there is:
[A-Z] any character of: 'A' to 'Z'
[a-z] any character of: 'a' to 'z'
) end of look-ahead
(TL;DR: Special case, match between abbreviation and CamelCase pattern)
So your code would then be:
mb_strtolower(preg_replace('/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/', '-', "HelloWorldThisIsATest"));
Demo of matches
Demo of code
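The effect on abbreviations shows up when this pattern is compared with the simpler lookaround pattern (?<=[a-zA-Z])(?=[A-Z]). In the sketch below, the hyphenate helper and the variable names are illustrative, and strtolower is used instead of mb_strtolower because the inputs are plain ASCII.

```php
<?php
// Hypothetical helper: insert "-" at every position the regex matches, then lowercase.
function hyphenate(string $regex, string $s): string
{
    return strtolower(preg_replace($regex, '-', $s));
}

// Splits before every capital that follows a letter.
$everyCapital = '~(?<=[a-zA-Z])(?=[A-Z])~';
// Splits at lower->upper boundaries and before the last capital of an abbreviation.
$keepAbbreviations = '~(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])~';

echo hyphenate($everyCapital, 'ThisIsHTMLTest'), PHP_EOL;             // this-is-h-t-m-l-test
echo hyphenate($keepAbbreviations, 'ThisIsHTMLTest'), PHP_EOL;        // this-is-html-test
echo hyphenate($keepAbbreviations, 'HelloWorldThisIsATest'), PHP_EOL; // hello-world-this-is-a-test
```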

Lookahead, lookbehind condition in regular expression

The following example is about using lookahead assertion as a condition. I found it in the PHP manual at: http://www.php.net/manual/en/regexp.reference.conditional.php
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
Here's the description about this regex:
The condition is a positive lookahead assertion that matches an optional sequence of non-letters followed by a letter. In other words, it tests for the presence of at least one letter in the subject. If a letter is found, the subject is matched against the first alternative; otherwise it is matched against the second. This pattern matches strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
Could anyone tell me why we use lookahead assertion as the condition in this example? Why don't we use lookbehind assertion? I get confused when they're used as conditions like this because I don't know how do they match the subject string. Thanks in advance!
In this case the lookahead assertion decides which alternative to use. The pattern is choosing between dates of the form 01-Jan-12 and 01-01-12. The lookahead checks whether there are any letters in what we are about to match; if so, it uses \d{2}-[a-z]{3}-\d{2} to try to match 01-Jan-12, and if not, it uses \d{2}-\d{2}-\d{2} to try to match 01-01-12. A lookahead is used rather than a lookbehind because the condition is tested before anything has been consumed, so the text that matters lies ahead of the current position.
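A minimal sketch of the conditional in PHP follows. Three things here are assumptions added for the example and are not in the manual's pattern: the ^ and $ anchors, the i flag (so that the capital J in Jan matches [a-z]), and the names DATE_PATTERN and looksLikeDate.

```php
<?php
// Conditional group: (?(condition)yes-pattern|no-pattern).
// Assumptions: ^...$ anchors and the i flag were added for this example.
const DATE_PATTERN = '/^(?(?=[^a-z]*[a-z])\d{2}-[a-z]{3}-\d{2}|\d{2}-\d{2}-\d{2})$/i';

// Hypothetical helper: true when the whole string is in one of the two date forms.
function looksLikeDate(string $s): bool
{
    return preg_match(DATE_PATTERN, $s) === 1;
}

var_dump(looksLikeDate('01-Jan-12')); // bool(true):  a letter is ahead, first alternative used
var_dump(looksLikeDate('01-01-12'));  // bool(true):  no letter ahead, second alternative used
var_dump(looksLikeDate('01-Ja-12'));  // bool(false): a letter is ahead, but the first alternative fails
```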
