regexp - match pattern and prefix before pattern - php

I need to match a specific pattern
(?<!\d|\d )(?:dk)?(\d{2})\D?(\d{2})\D?(\d{2})\D?(\d{2})(?!\d)
eg.
dk30344510
dk30 34 45 10
30344510
30 34 45 10
But I also need to fetch the "prefix" string before the pattern.
This is my solution, but it doesn't always work:
^(.*)(?<!\d|\d )(?:dk)?(\d{2})\D?(\d{2})\D?(\d{2})\D?(\d{2})(?!\d)
It's hard to explain so check it here.
https://regex101.com/r/fM1xD3/2
It's too "greedy" and swallows multiple occurrences of the pattern in the string: what should be the first match ends up as part of the "prefix" of the second match.
The example should output two matches, one with dk30344510 and one with 62226420.
The first match should have CVR-nr. as the prefix and dk30344510 as the pattern; the second should have / Tlf. as the prefix and 62226420 as the pattern.

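Your regex doesn't output the expected results because it has a start-of-string anchor ^ followed by a greedy .*. The anchor means a match can only begin at the start of the string, and the greedy .* consumes as much as possible before the pattern, so you get at most one match, with every earlier occurrence swallowed into the prefix.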
Solution
Regex:
\s*(.*?)\s*\b((?i:dk)?(?:\d{2}\D?){3}\d{2})\b
I didn't change much in your main regex. I collapsed the repeated \d{2}\D? pattern into a quantified group and replaced the lookarounds with the word boundary token \b.
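As a rough illustration of how this regex could be used from PHP, here is a minimal sketch with preg_match_all. The input string is a made-up sample assembled from the prefixes and numbers quoted in the question; it is not the actual text behind the regex101 link.

```php
<?php
// Hypothetical sample input built from the values mentioned in the question.
$text = 'CVR-nr. dk30344510 / Tlf. 62226420';

$pattern = '/\s*(.*?)\s*\b((?i:dk)?(?:\d{2}\D?){3}\d{2})\b/';

// PREG_SET_ORDER yields one array per match: [full match, prefix, number].
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo "prefix: '{$m[1]}', number: '{$m[2]}'\n";
}
```

Because the prefix group is lazy and there is no ^ anchor, each match stops at the first number it can reach, so the second number gets its own match and its own prefix.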

You can try this one with the 'g' option (in PHP, preg_match_all) to get multiple results:
^(.*?)\s(dk\d+)\s(.*?)\s(\d+)

Related

why regexp doesn't match?

Below is a pattern that matches numbers. It almost works: the second line should match 99, but there is no match. Why?
(?<!\d[- ]|[\d.,])\(?-?(?:(?:[1-9]\d{0,2}(?:(?:[. ]\d{3})*|\d*))|0)(?:\b|[,]\d{1,3})-?\)?(?![\d.,\/]|-[\d\/])
100,00stk => 100,00
99stk => 99 \\ this is not matched
10,45stk => 10,45
https://regex101.com/r/nwRCKo/1
The main problem here is the use of the word boundary, but fixing the issue is not that obvious.
Your regex matches numbers only in specific contexts: the lookarounds on both sides are meant to reject the whole match when the context is wrong. However, when a negative lookahead follows optional patterns such as -? and \)?, the regex engine can backtrack, give up the optional part, and still find a (shorter) match. So once the word boundary is removed, you need to prevent any backtracking into those optional patterns.
So, replace (?:\b|[,]\d{1,3}) with (?:[,]\d{1,3})? and make all the subsequent optional patterns atomic by using possessive quantifiers:
(?<!\d[- ]|[\d.,])\(?-?(?:(?:[1-9]\d{0,2}(?:(?:[. ]\d{3})*|\d*))|0)(?:,\d{1,3})?+-?+\)?+(?![\d.,\/]|-[\d\/])
See this regex demo.
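To check the revised pattern against the three sample lines from the question, a small PHP sketch along these lines could be used. The ~ delimiter and the variable names are arbitrary choices made for this example.

```php
<?php
// Revised pattern from the answer above, with "~" as delimiter (arbitrary choice).
$pattern = '~(?<!\d[- ]|[\d.,])\(?-?(?:(?:[1-9]\d{0,2}(?:(?:[. ]\d{3})*|\d*))|0)'
         . '(?:,\d{1,3})?+-?+\)?+(?![\d.,\/]|-[\d\/])~';

$samples = ['100,00stk', '99stk', '10,45stk'];

// Map each sample line to the number extracted from it (null if nothing matches).
$found = [];
foreach ($samples as $s) {
    $found[$s] = preg_match($pattern, $s, $m) ? $m[0] : null;
}

print_r($found);
```

With the word boundary gone, 99stk is no longer rejected for lacking a boundary between the last 9 and the s.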

I cannot make this regular expression work

Maybe it's simple, but I cannot make it work.
I have two filename strings:
wrap.html
wrap-popup.html
I try to select both using
/.*wrap.*\.htm.*/ mask
But it only matches the first one "wrap.html".
If I use /.*wrap.+\.htm.*/, it only matches the second one "wrap-popup.html"
I thought * means zero to infinitely many characters.
What's the correct mask to select both strings?
Consider the string "this is text with 2 html pages: wrap.html and wrap-popup.html"
The first regex /.*wrap.*\.htm.*/ will match that whole string.
So if you don't want to include the first part of the string then you need to remove the first .*
Now /wrap.*\.htm.*/ will match "wrap.html and wrap-popup.html" from the string.
That's because the first .* is a greedy match.
So when we change the regex to /wrap.*?\.html?/ the .*? is now a lazy match. And the l? is an optional l. So the regex will return "wrap.html".
But if we want to retrieve both we need a global search, or it would only find the first match.
A preg_match_all (instead of preg_match) with the regex /wrap[\w\-]*?\.html?/ will match both "wrap.html" and "wrap-popup.html".
That second regex of yours wouldn't match wrap.html because, with the .+, it expects at least one character between "wrap" and the dot.
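The difference between the greedy, lazy, and global variants described above can be sketched in PHP as follows. The variable names are chosen just for this example.

```php
<?php
$text = 'this is text with 2 html pages: wrap.html and wrap-popup.html';

// Greedy: .* runs on to the last ".htm" it can find.
preg_match('/wrap.*\.htm.*/', $text, $greedy);

// Lazy with an optional "l": stops at the first ".htm" or ".html".
preg_match('/wrap.*?\.html?/', $text, $lazy);

// Global search with a restricted character class finds both file names.
preg_match_all('/wrap[\w\-]*?\.html?/', $text, $all);

echo $greedy[0], "\n";
echo $lazy[0], "\n";
print_r($all[0]);
```

Restricting the middle part to [\w\-] keeps a single match from running across the spaces between the two file names.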

Regex to get the first number after a certain string followed by any data until the number

I have a piece of data, retrieved from the database and containing information I need. The text is entered in free form, so it's written in many different ways. The only thing I know for sure is that I'm looking for the first number after a given string, but between that string and the number there can be any other text.
I tried these (where mytoken is the string I know for sure is there), but none of them work:
/(mytoken|MYTOKEN)(.*)\d{1}/
/(mytoken|MYTOKEN)[a-zA-Z]+\d{1}/
/(mytoken|MYTOKEN)(.*)[0-9]/
/(mytoken|MYTOKEN)[a-zA-Z]+[0-9]/
mytoken can also be written in capitals, lowercase, or a mix of both. Can the expression be case-insensitive?
You do not need any lazy matching since you want to match any number of non-digit symbols up to the first digit. It is better done with a \D*:
/(mytoken)(\D*)(\d+)/i
See the regex demo
The pattern details:
(mytoken) - Group 1 matching mytoken (case insensitively, as there is a /i modifier)
(\D*) - Group 2 matching zero or more characters other than a digit
(\d+) - Group 3 matching 1 or more digits.
Note that \D also matches newlines, . needs a DOTALL modifier to match across newlines.
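A minimal PHP sketch of this approach, assuming a made-up input string in which "mytoken" stands in for the known marker, might look like this:

```php
<?php
// Hypothetical free-form text; the newline shows that \D* can cross line breaks.
$text = "Order info MyToken: ref\nno. 42, then 7 more";

$number = null;
if (preg_match('/(mytoken)(\D*)(\d+)/i', $text, $m)) {
    // Group 3 holds the first run of digits after the token.
    $number = $m[3];
}

var_dump($number);
```

Since \D* cannot consume digits, the match necessarily ends at the first number after the token, with no lazy quantifier needed.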
You need to use a lazy quantifier. You can do that by putting a question mark after the star quantifier in the regex: .*?. Otherwise, the greedy .* will consume everything up to the last digit, which is the only one left for \d to match.
Regex: /(mytoken|MYTOKEN)(.*?)\d/
You can use the opposite:
/(mytoken|MYTOKEN)(\D+)(\d)/
This says: mytoken, followed by anything not a number, followed by a number. The (lazy) dot-star-soup is not always your best bet. The desired number will be in $3 in this example.

PHP RegEx get first letter after set of characters

I have some text consisting of a leading string of letters followed by digits.
I need to get the first single digit after the letters.
Example text:
ABC105001
ABC205001
ABC305001
ABCD105001
ABCD205001
ABCD305001
My RegEx:
^(\D*)(\d{1})(?=\d*$)
Link: http://www.regexr.com/390gv
As you can see, the RegEx works OK, but it also captures the first group in the results. I need to get only this digit, and when I try to put ?= in the first group like this: ^(?=\D*)(\d{1})(?=\d*$), the RegEx doesn't work.
Any ideas?
Thanks in advance.
(?=..) is a lookahead that means followed by and checks the string on the right of the current position.
(?<=...) is a lookbehind that means preceded by and checks the string on the left of the current position.
What is interesting about these two features is that the content matched inside them is not part of the overall match result. The only problem is that a lookbehind can't match variable-length content.
A way to avoid the problem is to use the \K feature, which removes everything to its left from the match result:
^[A-Z]+\K\d(?=\d*$)
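As a small sketch of how the \K pattern behaves in PHP, the following runs it over a few of the sample lines from the question. The $digits array is just a name chosen for this example.

```php
<?php
$lines = ['ABC105001', 'ABCD205001', 'ABCD305001'];

$digits = [];
foreach ($lines as $line) {
    // \K drops the leading letters from the reported match,
    // so $m[0] contains only the single digit.
    if (preg_match('/^[A-Z]+\K\d(?=\d*$)/', $line, $m)) {
        $digits[] = $m[0];
    }
}

print_r($digits);
```

No capture group is needed here: the full match itself is the digit.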
You're trying to use a positive lookahead when really you want to use non-capturing groups.
The one match you want will work with this regex:
^(?:\D*)(\d)\d*$
The (?: string will start a non-capturing group. This will not come back in matches.
So, if you used preg_match(';^(?:\D*)(\d)\d*$;', $string, $matches) to find your match, $matches[1] would be the digit you're looking for. (This is because $matches[0] will always be the full match from preg_match.)
try:
^(?:\D*)(\d{1})(?=\d*$) // (?: is the beginning of a no capture group

Matching ugly extra abbreviations and numbers in titles with PHP regex

I have to create regex to match ugly abbreviations and numbers. These can be one of following "formats":
1) [any alphabet char length of 1 char][0-9]
2) [double][whitespace][2-3 length of any alphabet char]
I tried to match double:
preg_match("/^-?(?:\d+|\d*\.\d+)$/", $source, $matches);
But I couldn't get it to match the following example: 1.1 AA My test title. What is wrong with my regex, and how can I add the other formats to it too?
In your regex you say: "start of string, followed by an optional -, followed by either at least one digit, or zero or more digits, a dot and at least one digit, followed by the end of string."
So your regex could match, for example, 4.5, -.1, etc. This is exactly what you tell it to do.
Your test input string does not match, since there are other characters after the number 1.1. And even if it somehow magically matched, your "double"-matching regex is wrong.
For a double without scientific notation you usually use this regex :
[-+]?\b[0-9]+(\.[0-9]+)?\b
Now that we have this out of our way we need a whitespace \s and
[2-3 length of alphabet]
Now I have no idea what [2-3 length of alphabet] means but by combining the above you get a regex like this :
[-+]?\b[0-9]+(\.[0-9]+)?\b\s[2-3 length of alphabet]
You can also place anchors ^$ if you want the string to match entirely :
^[-+]?\b[0-9]+(\.[0-9]+)?\b\s[2-3 length of alphabet]$
Feel free to ask if you are stuck! :)
I see multiple issues with your regex:
You try to match the whole string (as a number) by the anchors: ^ at the beginning and $ at the end. If you don't want that, remove those.
The number group is non-capturing. It will be checked for matches, but those won't be added to $matches. That's because of the ?: you put at the start of (?:...). Remove ?: to make that group capturing.
You place the shorter digit-pattern before the longer one. If you swap the order, the regex engine will look for it first and on success prefer it over the shorter one.
Maybe this already solves your issue:
preg_match("/-?(\d*\.\d+|\d+)/", $source, $matches);
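The effect of the alternation order can be seen in a short PHP sketch like the one below. The third pattern is only an assumption about what format 2 means: it reads "[2-3 length of any alphabet char]" as two to three ASCII letters, which the question does not confirm.

```php
<?php
$source = '1.1 AA My test title';

// Shorter alternative first: \d+ succeeds on "1" and the engine stops there.
preg_match('/-?(\d+|\d*\.\d+)/', $source, $short);

// Longer alternative first, as suggested above: the whole "1.1" is captured.
preg_match('/-?(\d*\.\d+|\d+)/', $source, $long);

// Assumption: format 2 is a double, one whitespace, then 2 to 3 ASCII letters.
preg_match('/^-?(\d*\.\d+|\d+)\s([A-Za-z]{2,3})\b/', $source, $full);

echo $short[1], "\n", $long[1], "\n", $full[0], "\n";
```

The trailing \b in the third pattern keeps it from matching the first letters of a longer word.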
