preg_split() String into Text and Numbers [duplicate] - php

This question already has answers here:
Get non-numeric characters then number on each line of a block of texf
(3 answers)
Closed 8 years ago.
I want to split a string into its numbers and its text as separate elements.
print_r(preg_split("/[^0-9]+/", "12345hello"));
output:
Array ( [0] => 12345 [1] => )

By using [^0-9]+ you are actually matching the non-digit characters (hello) and splitting on them, so they are consumed as the delimiter and you are left with an empty array element instead of the expected result. You can use a workaround to do this.
print_r(preg_split('/\d+\K/', '12345hello'));
# Array ([0] => 12345 [1] => hello)
The \K verb tells the engine to drop whatever it has matched so far from the match to be returned.
If you want to consistently do this with larger text, you need multiple lookarounds.
print_r(preg_split('/(?<=\D)(?=\d)|\d+\K/', '12345hello6789foo123bar'));
# Array ([0] => 12345 [1] => hello [2] => 6789 [3] => foo [4] => 123 [5] => bar)
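Another way to get the same result, not taken from the answer above, is to split on the digit runs themselves and ask preg_split to return the delimiters as well. A minimal sketch using the standard PREG_SPLIT_DELIM_CAPTURE and PREG_SPLIT_NO_EMPTY flags (the variable names are only illustrative):

```php
<?php
$input = '12345hello6789foo123bar';

// Split on each run of digits; the capture group makes preg_split
// return the digits too, and NO_EMPTY drops the empty leading piece.
$parts = preg_split('/(\d+)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

This avoids zero-width matches entirely, which some readers find easier to reason about than \K.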

You can split it with a lookahead and a lookbehind, at each position where a digit follows a non-digit and vice versa.
(?<=\D)(?=\d)|(?<=\d)(?=\D)
explanation:
\D Non-Digit [^0-9]
\d any digit [0-9]
Detail pattern explanation:
(?<= look behind to see if there is:
\D non-digits (all but 0-9)
) end of look-behind
(?= look ahead to see if there is:
\d digits (0-9)
) end of look-ahead
| OR
(?<= look behind to see if there is:
\d digits (0-9)
) end of look-behind
(?= look ahead to see if there is:
\D non-digits (all but 0-9)
) end of look-ahead
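As a small sketch of the lookaround pattern in use with preg_split (the sample string and variable names are assumptions for illustration):

```php
<?php
// Zero-width split points: non-digit followed by digit, or digit followed by non-digit.
$pattern = '/(?<=\D)(?=\d)|(?<=\d)(?=\D)/';

$parts = preg_split($pattern, '12345hello6789foo');

print_r($parts);
```

Because both alternatives are zero-width, nothing is consumed and every character of the input ends up in one of the pieces.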

Related

Regex to get all numbers after a character

I have strings that are expected to be in the format of something like
"C 1,13,7,2,55"
I would expect matches to be [1,13,7,2,55].
I want to match all numbers in that "csv" portion of the string, but only if it comes after "C " (note the space after the 'C').
This comes from user input, so I want to account for case, multiple spaces between tokens, accidental double commas, etc.
I.e.
"c 1 , 12,15 , 8 , 9,10,11"
I want matches to be [1,12,15,8,9,10,11]
But I only want to attempt to match on numbers after the "C" char (case-insensitive).
So "1,2 , 4,5" and "d 12456, 9890" should fail .
Here's the regex I have half-baked so far.
Note: This will ultimately get ported over to PHP and so I will be using preg_match_all
/(?<=C)*\d+/gim
I use a positive lookbehind (but match as many times as needed) for the "C" char. Then match on 1 or more digits globally.
I haven't created all my unit tests yet, but I think this may work.
Is there a better way to do this?
Is matching on 1 or more positive lookbehinds standard?
Why don't I need to include a \s* after the 'C' in the positive lookbehind?
When would including the 'm' multi-line flag even make a difference here?
Thanks!
The simplest option is probably to first test for the "C" using the case-insensitive stripos before matching the digits with \d+. For example:
$input = "c 1 , 12,15 , 8 , 9,10,11";
if (stripos($input, "C") === 0) {
preg_match_all("/\d+/", $input, $matches);
print_r($matches);
}
The condition could be for example stripos($input, "C") !== false if the "C" does not have to be the first character.
To validate that the string starts with "C" (possibly after horizontal whitespace), and contains only horizontal whitespace, commas and digits then the test could instead be
if (preg_match("/^\h*C[\h\d,]+$/i", $input)) {
The lookbehind in your regex /(?<=C)*\d+/gim is made optional by the *, so the regex does not require that a "C" is present for the digits to be matched. It is functionally equivalent to just /\d+/g.
Is matching on 1 or more positive lookbehinds standard?
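In this case, the lookbehind would need to be variable-width, (?<=C.*), and PHP does not support variable-width lookbehinds.
Why don't I need to include a \s* after the 'C' in the positive lookbehind?
Because the lookbehind is optional, it is never enforced, so nothing after the 'C' matters. Adding \s* inside it would not work either: PHP does not support the use of the * quantifier inside a lookbehind.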
When would including the 'm' multi-line flag even make a difference here?
You might only want to use the m flag if your input is multi-line and you are using ^ or $; with m they assert the start or end of each line instead of the whole string. Your pattern uses neither, so the flag makes no difference.
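The validate-then-extract idea from this answer can be put together as follows. This is only a sketch: the function name extractNumbersAfterC is hypothetical, and it assumes that an input failing validation should simply yield an empty array.

```php
<?php
// Hypothetical helper: returns the numbers only when the whole string
// looks like "C" followed by digits, commas and horizontal whitespace.
function extractNumbersAfterC(string $input): array
{
    // Validation pattern taken from the answer above.
    if (!preg_match('/^\h*C[\h\d,]+$/i', $input)) {
        return [];
    }
    preg_match_all('/\d+/', $input, $matches);
    return $matches[0];
}

print_r(extractNumbersAfterC('c 1 , 12,15 , 8 , 9,10,11'));
print_r(extractNumbersAfterC('d 12456, 9890'));
```

Note that the validation pattern does not require a space directly after the C, so an input such as "C1,2" would also pass.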
In JavaScript, for example, the pattern /(?<=C)*\d+/gim would not be valid because of the quantifier after the lookbehind assertion.
If you want to write it in JavaScript and get all the digits after a C at the start of the string, and your engine supports a quantifier inside the lookbehind:
(?<=^C [\d, ]*)\d+
Using (?<=C)*\d+ in PHP, the quantifier makes the lookbehind optional, so it would also match 8 and 9 in, for example, the string 8,9 C 1,13,7,2,55
Using a quantifier with infinite length in a lookbehind assertion is not supported in PHP so you can not use (?<=C\h+)\d+ where \h+ would match 1+ spaces due to S
If you are using PHP, you can make use of the \G anchor to match only consecutive numbers after the first C character.
For a single line, you don't need the multi line flag. You do need it for multiple lines due to the anchor.
(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+
The pattern matches:
(?: Non capture group
^ Start of string
\h*C\h+ Match optional spaces, then C and 1+ spaces
| Or
\G(?!^) Assert the position at the end of the previous match (not at the start)
) Close the non capture group
\h*,*\h*\K Match optional commas between optional spaces, then reset the start of the reported match
\d+ Match 1 or more digits
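// Same pattern as explained above, including the leading ^ anchor.
$regex = '/(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+/i';
$strings = [
    "C 1,13,7,2,55",
    "c 1 , 12,15 , 8 , 9,10,11",
    "1,2 , 4,5",
    "d 12456, 9890"
];
foreach ($strings as $s) {
    // Only print when at least one number was matched after the C.
    if (preg_match_all($regex, $s, $matches)) {
        print_r($matches[0]);
    }
}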
Output
Array
(
[0] => 1
[1] => 13
[2] => 7
[3] => 2
[4] => 55
)
Array
(
[0] => 1
[1] => 12
[2] => 15
[3] => 8
[4] => 9
[5] => 10
[6] => 11
)
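The leading ^ in the pattern matters once the input can have text before the C. A short sketch comparing the pattern with and without the anchor on the 8,9 C 1,13,7,2,55 string mentioned earlier (the variable names are illustrative):

```php
<?php
$anchored   = '/(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+/i';
$unanchored = '/(?:\h*C\h+|\G(?!^))\h*,*\h*\K\d+/i';

$input = '8,9 C 1,13,7,2,55';

// With ^ the C must open the string, so nothing is matched here.
$countAnchored = preg_match_all($anchored, $input, $a);

// Without ^ the chain can start at a C anywhere in the string.
$countUnanchored = preg_match_all($unanchored, $input, $b);

var_dump($countAnchored);
print_r($b[0]);
```

In both variants the 8 and 9 are left out, because \G only continues from a previous match and the first match has to begin at a C.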

Double regex matches

I'm looping through a string with preg_match_all using different patterns. Sometimes these patterns look a lot like each other, but differ slightly.
Right now I'm looking for a way to stop pattern A from matching strings that only pattern B - which has a 'T' in front of the 4 digits - should match.
The problem I'm running into is that pattern A also matches pattern B:
A:
(\d{4})(A|B)?(C|D)?
... matches 1234, 1234A, 1234AD, etc.
B:
I also have another pattern:
T(\d{4})\/(\d{4})
... which matches strings like: T7878/6767
The result
When running a preg_match_all on "T7878/6767 1234AD", A will give the following matches:
7878, 6767, 1234AD
Does anyone have a suggestion how to prevent A from matching B in a string like "Some text T7878/6767 1234AD and some more text"?
Your help is greatly appreciated!
Scenario with boundaries
If you only want to match those specific strings within some boundaries, use those boundary patterns on each side of the pattern.
If you expect a whitespace boundary before each match, then add the (?<!\S) negative lookbehind at the start of the pattern. If you expect a whitespace boundary at the end of the match, add the (?!\S) negative lookahead. If there can be any chars (as is in your original question), then SKIP-FAIL is the only way (see below).
So, in this first case, you may use
(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)
and
(?<!\S)T(\d{4})\/(\d{4})(?!\S)
See Pattern 1 demo and Pattern 2 demo.
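A quick sketch of both boundary patterns run against the string from the question (variable names are illustrative):

```php
<?php
$input = 'T7878/6767 1234AD';

// Pattern A with whitespace boundaries: the digits inside T7878/6767
// are preceded by "T" or "/", so the (?<!\S) lookbehind rejects them.
preg_match_all('/(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)/', $input, $a);

// Pattern B with the same boundaries.
preg_match_all('/(?<!\S)T(\d{4})\/(\d{4})(?!\S)/', $input, $b);

print_r($a[0]);
print_r($b[0]);
```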
Scenario with no specific boundaries
You need to make sure the second pattern is skipped when you parse the string with the first one. Use SKIP-FAIL technique for this:
'~T\d{4}/\d{4}(*SKIP)(*F)|(\d{4})(A|B)?(C|D)?~'
If you do not need the capturing groups, you may simplify it to
'~T\d{4}/\d{4}(*SKIP)(*F)|\d{4}[AB]?[CD]?~'
Details
T\d{4}/\d{4} - T followed with 4 digits, / and another 4 digits
(*SKIP)(*F) - the matched text is discarded and the next match is searched from the matched text end
| - or
\d{4}[AB]?[CD]? - 4 digits, then optionally A or B and then optionally C or D.
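To see the effect, here is a small sketch that runs pattern A with and without the SKIP-FAIL prefix on the sentence from the question (variable names are illustrative):

```php
<?php
$input = 'Some text T7878/6767 1234AD and some more text';

// Plain pattern A: also picks up the digits that belong to pattern B.
preg_match_all('~\d{4}[AB]?[CD]?~', $input, $plain);

// SKIP-FAIL version: T7878/6767 is consumed and discarded first.
preg_match_all('~T\d{4}/\d{4}(*SKIP)(*F)|\d{4}[AB]?[CD]?~', $input, $skipped);

print_r($plain[0]);
print_r($skipped[0]);
```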
From what you're asking, your current regexes don't really work. (A|B)?(C|D)? will never match AB. So I think you meant [ABCD]
Here's your new regex:
T(\d{4})\/(\d{4}) (\d{4}[ABCD]*)
For the string input:
T7878/6767 1234AB
We get the groups:
Match 1
Full match 0-17 `T7878/6767 1234AB`
Group 1. 1-5 `7878`
Group 2. 6-10 `6767`
Group 3. 11-17 `1234AB`
Your syntax is pretty specific, so your regex needs to be as well. Get rid of all your capture groups because they are screwing things up. You only need two groups which match your string syntax exactly.
The first group looks for a word boundary followed by T, then 4 digits, then /, then 4 more digits and a word boundary.
The second group matches 4 digits and then letters A-D between 0 and 2 times. It has a negative lookbehind, (?<!\S), so it will only match if the 4 digits are preceded by a whitespace character or the start of the string.
(\bT\d{4}\/\d{4}\b)|(?<!\S)(\d{4}[A-D]{0,2})
Preg match all output:
Array
(
[0] => Array
(
[0] => T7878/6767
[1] => 1234AB
)
[1] => Array
(
[0] => T7878/6767
[1] =>
)
[2] => Array
(
[0] =>
[1] => 1234AB
)
)

PHP Regex how to capture floating point numbers that do not have a letter at the end

I'm using preg_match_all and I want to capture the floating point numbers that do not have a letter following them.
For example
-20.4a 110b 139 31c 10.4
Desired
[0] => Array
(
[0] => 139
[1] => 10.4
)
I was able to do the opposite using this pattern:
/\d+(.\d+)?(?=[a-z])/i
which captures the numbers with letters that you can see in this demo. But I can't figure out how to capture the numbers that have no trailing letters.
Use negative lookahead:
/\d+(\.\d+)?(?![a-z])/i
But that is not sufficient: the engine can backtrack and match part of a number (for example 20 from 20.4a), so you also have to exclude a following digit or dot:
/\d+(?:\.\d+)?(?![a-z\d.])/i
PHP:
$string = '-20.4a 110b 139 31c 10.4';
preg_match_all('/\d+(?:\.\d+)?(?![a-z\d.])/i', $string, $match);
print_r($match);
Output:
Array
(
[0] => Array
(
[0] => 139
[1] => 10.4
)
)
You can use this regex with a positive lookahead:
[+-]?\b\d*\.?\d+(?=\h|$)
(?=\h|$) asserts presence of a horizontal white space or end of line after matched number.
Alternatively you can use this regex with a possessive quantifier:
[+-]?\b\d*\.?\d++(?![.a-zA-Z])
There are a few approaches one can take here.
Atomic group matching and a negative lookahead or word boundary:
(?>\d+(?:\.\d+)?)(?![a-z])
(?>\d+(?:\.\d+)?)\b
Using a negative lookahead that also denies a dot and numbers:
\d+(?:\.\d+)?(?![a-z.\d])
Positive lookahead to a space (seems to be the separator in here) or the end of string
\d+(?:\.\d+)?(?=\s|$)
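The approaches listed above can be compared side by side on the sample input from the question. This is only a sketch; the array labels are made up for readability, and the i flag is added where the pattern contains a letter class.

```php
<?php
$input = '-20.4a 110b 139 31c 10.4';

$patterns = [
    'atomic + negative lookahead' => '/(?>\d+(?:\.\d+)?)(?![a-z])/i',
    'atomic + word boundary'      => '/(?>\d+(?:\.\d+)?)\b/',
    'wide negative lookahead'     => '/\d+(?:\.\d+)?(?![a-z.\d])/i',
    'positive lookahead'          => '/\d+(?:\.\d+)?(?=\s|$)/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    preg_match_all($pattern, $input, $m);
    $results[$label] = $m[0]; // full matches only
}

print_r($results);
```

None of these patterns include the sign, so a negative number without a trailing letter would be returned without its leading minus.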

Quick regex pattern in PHP

I have a large string chunk of text, and I need to extract all occurrences of text matching the following pattern:
QXXXXX-X (where X can be any digit, 0-9).
How do I do this in PHP?
<?php
preg_match_all("","Q05546-8 XXX Q13323-0",$output,PREG_PATTERN_ORDER);
print_r($output);
?>
Here you go:
preg_match_all('/\bQ[0-9]{5}-[0-9]\b/',"Q05546-8 XXX Q13323-0",$output,PREG_PATTERN_ORDER);
print_r($output);
Or, you can use shorthand class \d for a digit: \bQ\d{5}-\d\b.
Regex explanation:
\b - Word boundary (we are either at the start or end of the string next to a word character, or between a word character ([a-zA-Z0-9_]) and a non-word one (all others))
Q - Literal case-sensitive Q
[0-9]{5} - Exactly 5 (due to {5}) digits in the 0 to 9 range
- - Literal hyphen
[0-9] - Exactly 1 digit from 0 to 9
\b - Again a word boundary.
If you have these values inside longer sequences, you may consider using \bQ[0-9]{5}-[0-9](?![0-9]) or using shorthand classes, \bQ\d{5}-\d(?!\d).
Output of the demo:
Array
(
[0] => Array
(
[0] => Q05546-8
[1] => Q13323-0
)
)
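The difference between the trailing \b and the (?!\d) lookahead shows up when an ID is glued to other characters. A small sketch on a made-up input string:

```php
<?php
// Made-up sample: an ID followed by letters, one followed by an extra digit,
// and one that stands alone.
$input = 'Q05546-8abc Q13323-07 Q00001-2';

// Trailing \b: the last digit must not be followed by a word character.
preg_match_all('/\bQ\d{5}-\d\b/', $input, $strict);

// Trailing (?!\d): the last digit must only not be followed by another digit.
preg_match_all('/\bQ\d{5}-\d(?!\d)/', $input, $loose);

print_r($strict[0]);
print_r($loose[0]);
```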

Regular expressions: matching words containing sequences

I am trying to match words containing the following: eph gro iss
I have eph|gro|iss which will match eph gro iss in this example: new grow miss eph.
However, I need to match the whole word. For example, it should match all of miss, not just iss, and all of grow, not just gro.
Thanks
You can do it like this:
\b(\w*(eph|gro|iss)\w*)\b
How it works:
The expression is bracketed with word-boundary anchors \b, so it only matches whole words. These words must contain one of the literals eph, gro or iss somewhere, but the \w* parts allow the literals to appear anywhere within the whole word.
The important thing here is that you need to adopt some specific definition for "words". If you are OK with the regex definition that words are sequences that match [a-zA-Z0-9_]+ then you can use the above verbatim.
If your definition of word is something else, you will need to replace the \b anchors and \w classes appropriately.
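As one possible illustration of a different word definition, the sketch below treats a word as a run of Unicode letters only, replacing \w with \p{L} and the \b anchors with lookarounds. This letters-only definition and the sample input are assumptions for the example, not something the question asked for.

```php
<?php
$input = 'new grow miss eph gro_up';

// Regex definition of a word: letters, digits and underscore.
preg_match_all('/\b\w*(?:eph|gro|iss)\w*\b/', $input, $wordChars);

// Letters-only definition: the word may not touch another letter on either side.
preg_match_all('/(?<!\p{L})\p{L}*(?:eph|gro|iss)\p{L}*(?!\p{L})/u', $input, $lettersOnly);

print_r($wordChars[0]);
print_r($lettersOnly[0]);
```

With \w the underscore counts as part of the word, so gro_up is returned whole; with \p{L} the match stops at the underscore.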
Try this:
\b([a-zA-Z]*(?:eph|gro|iss)[a-zA-Z]*)\b
Breakdown:
\b - word boundary
( - start capture
[a-zA-Z]* - zero or more letters
(?:eph|gro|iss) - your original regex, non-capturing
[a-zA-Z]* - zero or more letters
) - end capture
\b - word boundary
Example output:
php > $string = "new grow miss eph";
php > preg_match_all("/\b([a-zA-Z]*(?:eph|gro|iss)[a-zA-Z]*)\b/", $string, $matches);
php > print_r($matches);
Array
(
[0] => Array
(
[0] => grow
[1] => miss
[2] => eph
)
[1] => Array
(
[0] => grow
[1] => miss
[2] => eph
)
)
