Regex to get all numbers after a character - php

I have strings that are expected to be in the format of something like
"C 1,13,7,2,55"
I would expect matches to be [1,13,7,2,55].
I want to match all numbers in that "csv" portion of the string, but only if it comes after "C " (note the space after the 'C').
This comes from user input, so I want to account for case, multiple spaces between tokens, accidental double commas, etc.
I.e.
"c 1 , 12,15 , 8 , 9,10,11"
I want matches to be [1,12,15,8,9,10,11]
But I only want to attempt to match on numbers after the "C" char (case-insensitive).
So "1,2 , 4,5" and "d 12456, 9890" should fail.
Here's the regex I have half-baked so far.
Note: This will ultimately get ported over to PHP and so I will be using preg_match_all
/(?<=C)*\d+/gim
I use a positive lookbehind (but match as many times as needed) for the "C" char. Then match on 1 or more digits globally.
I haven't created all my unit tests yet, but I think this may work.
Is there a better way to do this?
Is matching on 1 or more positive lookbehinds standard?
Why don't I need to include a \s* after the 'C' in the positive lookbehind?
When would including the 'm' multi-line flag even make a difference here?
Thanks!

The simplest option is probably to first test for the "C" using the case-insensitive stripos before matching the digits with \d+. For example:
$input = "c 1 , 12,15 , 8 , 9,10,11";
if (stripos($input, "C") === 0) {
preg_match_all("/\d+/", $input, $matches);
print_r($matches);
}
The condition could instead be, for example, stripos($input, "C") !== false if the "C" does not have to be the first character.
To validate that the string starts with "C" (possibly after horizontal whitespace) and otherwise contains only horizontal whitespace, commas and digits, the test could instead be
if (preg_match("/^\h*C[\h\d,]+$/i", $input)) {
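As a rough sketch, the validation test and the \d+ extraction can be combined in one helper. The function name extractNumbersAfterC is made up for illustration, and returning null for rejected input is an assumption about how failures should be reported:

```php
<?php
// Hypothetical helper (the name is illustrative, not from the question).
// Returns the list of digit runs, or null when the input is rejected.
function extractNumbersAfterC(string $input): ?array
{
    // Validate: optional leading whitespace, a "C" (case-insensitive),
    // then only horizontal whitespace, digits and commas up to the end.
    if (!preg_match('/^\h*C[\h\d,]+$/i', $input)) {
        return null;
    }
    // The input is known to be well-formed, so a plain \d+ is enough.
    preg_match_all('/\d+/', $input, $matches);
    return $matches[0];
}

var_export(extractNumbersAfterC("c 1 , 12,15 , 8 , 9,10,11")); // 1, 12, 15, 8, 9, 10, 11 (as strings)
var_export(extractNumbersAfterC("d 12456, 9890"));             // NULL
```

Note that this validation also accepts input with no space after the C, such as "C1,2"; the pattern would need a \h+ after the C if the space is mandatory.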
The lookbehind in your regex /(?<=C)*\d+/gim is made optional by the *, so the regex does not require that a "C" is present for the digits to be matched. It is functionally equivalent to just /\d+/g.
Is matching on 1 or more positive lookbehinds standard?
In this case, the lookbehind would need to be variable-width, (?<=C.*), and PHP does not support variable-width lookbehinds.
Why don't I need to include a \s* after the 'C' in the positive lookbehind?
PHP does not support the use of the * quantifier inside a lookbehind.
When would including the 'm' multi-line flag even make a difference here?
You might only want to use the m flag if your input spans multiple lines and you are using ^ or $, which then assert the start or end of each line rather than of the whole string.

In JavaScript, for example, the pattern /(?<=C)*\d+/gim is not valid, because a lookbehind assertion cannot be followed by a quantifier.
JavaScript does support quantifiers inside a lookbehind, so there you can get all the digits after a C at the start of the string with:
(?<=^C [\d, ]*)\d+
Regex demo
In PHP, the * quantifier makes the lookbehind in (?<=C)*\d+ optional, so the pattern would also match 8 and 9 in, for example, the string 8,9 C 1,13,7,2,55
A quantifier of unbounded length is not supported inside a lookbehind assertion in PHP, so you cannot use (?<=C\h+)\d+, where \h+ would match 1 or more horizontal whitespace characters.
If you are using PHP, you can make use of the \G anchor to match only consecutive numbers after the first C character.
For a single line, you don't need the multiline flag. For multiple lines you do need it, so that the ^ anchor can match at the start of each line.
(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+
The pattern matches:
(?: Non capture group
^ Start of string
\h*C\h+ Match optional spaces, then C and 1+ spaces
| Or
\G(?!^) Assert the position at the end of the previous match (not at the start)
) Close the non capture group
\h*,*\h*\K Match optional commas between optional spaces, then forget what has been matched so far
\d+ Match 1 or more digits
Regex demo | Php demo
$regex = '/(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+/i';
$strings = [
"C 1,13,7,2,55",
"c 1 , 12,15 , 8 , 9,10,11",
"1,2 , 4,5",
"d 12456, 9890"
];
foreach ($strings as $s) {
if (preg_match_all($regex, $s, $matches)) {
print_r($matches[0]);
}
}
Output
Array
(
[0] => 1
[1] => 13
[2] => 7
[3] => 2
[4] => 55
)
Array
(
[0] => 1
[1] => 12
[2] => 15
[3] => 8
[4] => 9
[5] => 10
[6] => 11
)
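To illustrate the remark about the multiline flag, here is a small sketch that runs the same \G pattern with /im over input containing several lines. The sample text is invented for the example; only lines that start with C (in either case) contribute numbers:

```php
<?php
// Same pattern as above, with the m flag added so ^ matches at each line start.
$regex = '/(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+/im';

// Invented multi-line input: the middle line starts with "d" and is ignored.
$text = "C 1,13\nd 5,6\nc 7 , 8";

preg_match_all($regex, $text, $matches);
$numbers = $matches[0];

print_r($numbers); // 1, 13, 7, 8
```

The \G chain cannot run from one line into the next, because \h and the comma never match a newline.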

Related

Double regex matches

I'm preg_match_all looping through a string using different patterns. Sometimes these patterns look a lot like each other, but differ slightly.
Right now I'm looking for a way to stop pattern A from matching strings that only pattern B - which has a 'T' in front of the 4 digits - should match.
The problem I'm running into is that pattern A also matches pattern B:
A:
(\d{4})(A|B)?(C|D)?
... matches 1234, 1234A, 1234AD, etc.
B:
I also have another pattern:
T(\d{4})\/(\d{4})
... which matches strings like: T7878/6767
The result
When running a preg_match_all on "T7878/6767 1234AD", A will give the following matches:
7878, 6767, 1234AD
Does anyone have a suggestion how to prevent A from matching B in a string like "Some text T7878/6767 1234AD and some more text"?
Your help is greatly appreciated!
Scenario with boundaries
If you only want to match those specific strings within some boundaries, use those boundary patterns on each side of the pattern.
If you expect a whitespace boundary before each match, then add the (?<!\S) negative lookbehind at the start of the pattern. If you expect a whitespace boundary at the end of the match, add the (?!\S) negative lookahead. If there can be any chars (as is in your original question), then SKIP-FAIL is the only way (see below).
So, in this first case, you may use
(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)
and
(?<!\S)T(\d{4})\/(\d{4})(?!\S)
See Pattern 1 demo and Pattern 2 demo.
Scenario with no specific boundaries
You need to make sure the second pattern is skipped when you parse the string with the first one. Use SKIP-FAIL technique for this:
'~T\d{4}/\d{4}(*SKIP)(*F)|(\d{4})(A|B)?(C|D)?~'
See the regex demo.
If you do not need the capturing groups, you may simplify it to
'~T\d{4}/\d{4}(*SKIP)(*F)|\d{4}[AB]?[CD]?~'
See another demo
Details
T\d{4}/\d{4} - T followed with 4 digits, / and another 4 digits
(*SKIP)(*F) - the matched text is discarded and the next match is searched from the matched text end
| - or
\d{4}[AB]?[CD]? - 4 digits, then optionally A or B and then optionally C or D.
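A minimal sketch comparing the original pattern A with the SKIP-FAIL version on the string from the question (the variable names are only for illustration):

```php
<?php
$text = "Some text T7878/6767 1234AD and some more text";

// Original pattern A: also picks up the digits that belong to pattern B.
preg_match_all('~\d{4}[AB]?[CD]?~', $text, $plain);

// SKIP-FAIL version: the T####/#### part is matched first and then discarded.
preg_match_all('~T\d{4}/\d{4}(*SKIP)(*F)|\d{4}[AB]?[CD]?~', $text, $skipped);

print_r($plain[0]);   // 7878, 6767, 1234AD
print_r($skipped[0]); // 1234AD
```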
From what you're asking, your current regexes don't really work. (A|B)?(C|D)? will never match AB. So I think you meant [ABCD]
Here's your new regex:
T(\d{4})\/(\d{4}) (\d{4}[ABCD]*)
For the string input:
T7878/6767 1234AB
We get the groups:
Match 1
Full match 0-17 `T7878/6767 1234AB`
Group 1. 1-5 `7878`
Group 2. 6-10 `6767`
Group 3. 11-17 `1234AB`
Regex101
Your syntax is pretty specific, so your regex needs to be just as specific. Get rid of all your capture groups because they are screwing things up. You only need two groups which match your string syntax exactly.
The first group looks for a word boundary followed by T, then 4 digits, then /, then 4 more digits and a word boundary.
The second group matches 4 digits and then between 0 and 2 letters from A to D. It is preceded by a negative lookbehind, so it will only match if there is a whitespace character (or the start of the string) before the 4 digits.
(\bT\d{4}\/\d{4}\b)|(?<!\S)(\d{4}[A-D]{0,2})
Preg match all output:
Array
(
[0] => Array
(
[0] => T7878/6767
[1] => 1234AB
)
[1] => Array
(
[0] => T7878/6767
[1] =>
)
[2] => Array
(
[0] =>
[1] => 1234AB
)
)

How to extract only floats(decimals) from a string that also includes integers

I can't find the appropriate regex to extract only floats from a string.
Consider the following string:
$string = "8x2.1 3x2";
I want to extract 2.1, I tried the following but this gives me integers and floats:
preg_match_all('/[0-9,]+(?:\.[0-9]*)?/', $string, $matches);
I then tried using is_float to check for floats, but this also returns the integers for some reason. Any ideas?
Thanks
Consider this simple example:
<?php
$input = "8x2.1 3x2";
preg_match('/\d+\.\d+/', $input, $tokens);
print_r($tokens);
Matches "one or more digits, followed by exactly one full stop and again one or more digits".
The output obviously is:
Array
(
[0] => 2.1
)
Your regex matches floats and integers, and even strings consisting of just commas.
[0-9,]+ - 1 or more digits or ,
(?:\.[0-9]*)? - one or zero sequences of . + zero or more digits.
You need
/\d+\.\d+/
That will match 1+ digits, . and 1+ digits.
Or, to also match negative and positive floats, add an optional - at the beginning:
/-?\d+\.\d+/
Details
-? - one or zero hyphens (? means match one or zero occurrences)
\d+ - one or more digits (+ means match one or more occurrences, \d matches a digit char)
\. - a literal dot (since a dot in a regex is a special metacharacter, it should be escaped to denote a literal dot)
\d+ - one or more digits
PHP demo:
$string = "8x2.1 3x2";
preg_match_all('/\d+\.\d+/', $string, $matches);
print_r($matches[0]);
// => Array ( [0] => 2.1 )
A bonus regex that will also match only float numbers with optional exponent (a variant of the regex at regular-expressions.info):
/[-+]?\d+\.\d+(?:e[-+]?\d+)?/i
Here, you can see that an optional + or - is matched first ([-+]?), then the same pattern as above is used, then comes an optional non-capturing group (?:...)? that matches 1 or 0 occurrences of the following sequence: e or E (since /i is a case insensitive modifier), [-+]? matches an optional + or -, and \d+ matches 1+ digits.
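As a quick sketch of the exponent variant in use, the following runs it over a sample string that extends the one from the question with a signed value in exponent notation (the extra values are made up for the example):

```php
<?php
// Input from the question plus two invented tokens: "-1.5e3" and "7".
$string = "8x2.1 3x2 -1.5e3 7";

preg_match_all('/[-+]?\d+\.\d+(?:e[-+]?\d+)?/i', $string, $matches);
$floats = $matches[0];

print_r($floats); // 2.1, -1.5e3
```

The integers 8, 3, 2 and 7 are skipped because the pattern requires a dot with digits on both sides.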

Regex to separate words numbers and symbols in php

I have the following sample string
Lot99. Is it 1+3 or 5 or 6.53
I would like the following result
["Lot99",".","Is","it","1","+","3","or","5","or","6.53"]
So the results eliminate spaces, separate words but keep words and numbers together if there is no space between them, and separate numbers if they are not at the start or end of a word. Symbols like +-.,!@#$%^&*();\/|<> are separated too, but not a decimal point between 2 numbers, e.g. 2.2 should be kept as 2.2
So far I have this regex /s+[a-zA-Z]+|\b(?=\W)/
I know it's not much, but I have been visiting a number of websites to learn RegEx and I am still trying to get my head around this language. If your answer could please include comments, I can break it down and learn from it, and eventually start to modify it further.
Use preg_match_all
preg_match_all('~(?:\d+(?:\.\d+)?|\w)+|[^\s\w]~', $str, $matches);
Regex101 Demo
Explanation:
(?:\d+(?:\.\d+)?|\w)+ matches numbers (float or int) or word characters, one or more times, so it matches strings like foo99.9, 88gg etc.
| OR
[^\s\w] matches a single character that is neither a word character nor whitespace.
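A short sketch that runs this pattern on the sample string from the question, to check it against the requested result:

```php
<?php
$str = "Lot99. Is it 1+3 or 5 or 6.53";

// Either a run of numbers/word characters, or one symbol character.
preg_match_all('~(?:\d+(?:\.\d+)?|\w)+|[^\s\w]~', $str, $matches);
$tokens = $matches[0];

echo json_encode($tokens);
// ["Lot99",".","Is","it","1","+","3","or","5","or","6.53"]
```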
To provide yet another alternative, PHP offers the wonderful (*SKIP)(*FAIL) construct. What it says, is the following:
dont_match_this|forget_about_this|(but_keep_this)
Breaking it down to your actual problem, this would be:
(?:\d+\.\d+) # looks for digits with a point (float)
(*SKIP)(*FAIL) # all of the left alternatives should fail
| # OR
([.\s+]+) # a point, whitespace or plus sign
# this should match and be captured
# for PREG_SPLIT_DELIM_CAPTURE
In PHP this would be:
<?php
$string = "Lot99. Is it 1+3 or 5 or 6.53";
$regex = '~
(?:\d+\.\d+) # looks for digits with a point (float)
(*SKIP)(*FAIL) # all of the left alternatives should fail
| # OR
([.\s+]+) # a point, whitespace or plus sign
# this should match and be captured
# for PREG_SPLIT_DELIM_CAPTURE
~x'; # verbose modifier
$parts = preg_split($regex, $string, 0, PREG_SPLIT_DELIM_CAPTURE);
print_r($parts);
?>
See a demo on ideone.com and on regex101.com.
@Jan is definitely using the ideal function preg_split(). I'll offer an alternative pattern that doesn't need to use (*SKIP)(*FAIL) or a capture group.
Code: (Demo)
$txt = 'Lot99. Is it 1+3 or 5 or 6.53';
var_export(
preg_split('~(?:\d+\.\d+|\w+|\S)\K *~', $txt, 0, PREG_SPLIT_NO_EMPTY)
);
Output:
array (
0 => 'Lot99',
1 => '.',
2 => 'Is',
3 => 'it',
4 => '1',
5 => '+',
6 => '3',
7 => 'or',
8 => '5',
9 => 'or',
10 => '6.53',
)
Effectively, the pattern says: match (1) a single float value, (2) one or more consecutive numbers/letters/underscores, or (3) a single non-whitespace character; THEN forget the matched characters and consume zero or more spaces. The spaces are the only characters discarded while splitting.

Capture two regular expression from string

I have strings containing this (where the number are integers representing the user id)
#[calbert](3)
#[username](684684)
I figured I need the following to get the username and user id
\((.*?)\)
and
\[(.*?)\]
But is there a way to get both at once?
Also, given what PHP returns below, is it possible to get only the result without the parentheses (and without the brackets in the username case)?
Array
(
[0] => (3)
[1] => 3
)
\[([^\]]*)\]|\(([^)]*)\)
Try this. See the demo. You need to use the | (or) operator. It lets the regex engine try the second capturing group as an alternative if the first one fails.
https://regex101.com/r/tX2bH4/31
$re = "/\\[([^\\]]*)\\]|\\(([^)]*)\\)/im";
$str = " #[calbert](3)\n #[username](684684)";
preg_match_all($re, $str, $matches);
Or you can use your own regex: \((.*?)\)|\[(.*?)\]
Through positive lookbehind and lookahead assertion.
(?<=\[)[^\]]*(?=])|(?<=\()\d+(?=\))
(?<=\[) Asserts that the match must be preceded by a [ character.
[^\]]* Matches any character except ], zero or more times.
(?=]) Asserts that the match must be followed by a ] symbol.
| The alternation (logical OR) operator, used to combine the two regexes.
(?<=\()\d+(?=\)) Matches one or more digits that are preceded by ( and followed by ).
DEMO
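A possible way to use the lookaround pattern from PHP is sketched below. Because the brackets and parentheses are only asserted, not consumed, the full matches are already the bare values. Grouping them with array_chunk is an extra step that is not part of the answer, and it assumes every username is immediately followed by its id:

```php
<?php
$str = "#[calbert](3)\n#[username](684684)";

// Full matches alternate between a username and its id.
preg_match_all('/(?<=\[)[^\]]*(?=])|(?<=\()\d+(?=\))/', $str, $matches);
$flat = $matches[0];

// Assumption: values arrive strictly as username, id, username, id, ...
$pairs = array_chunk($flat, 2);

print_r($flat);  // calbert, 3, username, 684684
print_r($pairs); // [calbert, 3], [username, 684684]
```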

preg_split() String into Text and Numbers [duplicate]

This question already has answers here:
Get non-numeric characters then number on each line of a block of texf
(3 answers)
Closed 8 years ago.
I want to split the string into numbers and text separately.
print_r(preg_split("/[^0-9]+/", "12345hello"));
output:
Array ( [0] => 12345 [1] => )
By using [^0-9]+ you are actually matching the non-digit characters and splitting on them, so the text is consumed as the delimiter and you are left with an empty array element instead of the expected result. You can use a workaround to do this.
print_r(preg_split('/\d+\K/', '12345hello'));
# Array ([0] => 12345 [1] => hello)
The \K verb tells the engine to drop whatever it has matched so far from the match to be returned.
If you want to consistently do this with larger text, you need multiple lookarounds.
print_r(preg_split('/(?<=\D)(?=\d)|\d+\K/', '12345hello6789foo123bar'));
# Array ([0] => 12345 [1] => hello [2] => 6789 [3] => foo [4] => 123 [5] => bar)
You can split it with a lookahead and a lookbehind at every position where a digit follows a non-digit, and vice versa.
(?<=\D)(?=\d)|(?<=\d)(?=\D)
explanation:
\D Non-Digit [^0-9]
\d any digit [0-9]
Here is online demo
Detail pattern explanation:
(?<= look behind to see if there is:
\D non-digits (all but 0-9)
) end of look-behind
(?= look ahead to see if there is:
\d digits (0-9)
) end of look-ahead
| OR
(?<= look behind to see if there is:
\d digits (0-9)
) end of look-behind
(?= look ahead to see if there is:
\D non-digits (all but 0-9)
) end of look-ahead
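The pattern explained above can be tried out with preg_split as in this sketch, which reuses the longer sample string from the other answer:

```php
<?php
// Split at every boundary between a non-digit and a digit, in either direction.
// Both alternatives are zero-width, so no characters are removed.
$parts = preg_split('/(?<=\D)(?=\d)|(?<=\d)(?=\D)/', '12345hello6789foo123bar');

print_r($parts); // 12345, hello, 6789, foo, 123, bar
```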
