I have a large chunk of text in a string, and I need to extract all occurrences of text matching the following pattern:
QXXXXX-X (where X can be any digit, 0-9).
How do I do this in PHP?
<?php
preg_match_all("","Q05546-8 XXX Q13323-0",$output,PREG_PATTERN_ORDER);
print_r($output);
?>
Here you go:
preg_match_all('/\bQ[0-9]{5}-[0-9]\b/',"Q05546-8 XXX Q13323-0",$output,PREG_PATTERN_ORDER);
print_r($output);
Or, you can use shorthand class \d for a digit: \bQ\d{5}-\d\b.
Regex explanation:
\b - Word boundary (we are either at the start or end of the string next to a word character, or between a word character ([a-zA-Z0-9_]) and a non-word one (all others))
Q - Literal case-sensitive Q
[0-9]{5} - Exactly 5 (due to {5}) digits, each in the range 0 to 9
- - Literal hyphen
[0-9] - Exactly 1 digit from 0 to 9
\b - Again a word boundary.
If you have these values inside longer sequences, you may consider using \bQ[0-9]{5}-[0-9](?![0-9]) or using shorthand classes, \bQ\d{5}-\d(?!\d).
Output of the demo:
Array
(
[0] => Array
(
[0] => Q05546-8
[1] => Q13323-0
)
)
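To see how the trailing \b and the (?!\d) lookahead differ in practice, here is a small sketch. The sample string is made up for illustration and does not come from the question.

```php
<?php
// Hypothetical sample: one ID glued to letters, one followed by an extra digit, one clean.
$text = "Q05546-8abc Q13323-01 Q00001-2";

// Trailing \b: the last digit must not be followed by any word character.
preg_match_all('/\bQ\d{5}-\d\b/', $text, $strict);

// Trailing (?!\d): the last digit only must not be followed by another digit.
preg_match_all('/\bQ\d{5}-\d(?!\d)/', $text, $loose);

print_r($strict[0]); // only Q00001-2
print_r($loose[0]);  // Q05546-8 and Q00001-2
```

The lookahead version accepts an ID that runs straight into letters, while both versions reject Q13323-01 because of the extra digit.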
I have strings that are expected to be in the format of something like
"C 1,13,7,2,55"
I would expect matches to be [1,13,7,2,55].
I want to match all numbers in that "csv" portion of the string, but only if it comes after "C " (note the space after the 'C').
This comes from user input, so I want to account for case, multiple spaces between tokens, accidental double commas, etc.
I.e.
"c 1 , 12,15 , 8 , 9,10,11"
I want matches to be [1,12,15,8,9,10,11]
But I only want to attempt to match on numbers after the "C" char (case-insensitive).
So "1,2 , 4,5" and "d 12456, 9890" should fail.
Here's the regex I have half-baked so far.
Note: This will ultimately get ported over to PHP and so I will be using preg_match_all
/(?<=C)*\d+/gim
I use a positive lookbehind (but match as many times as needed) for the "C" char. Then match on 1 or more digits globally.
I haven't created all my unit tests yet, but I think this may work.
Is there a better way to do this?
Is matching on 1 or more positive lookbehinds standard?
Why don't I need to include a \s* after the 'C' in the positive lookbehind?
When would including the 'm' multi-line flag even make a difference here?
Thanks!
The simplest option is probably to first test for the "C" using the case-insensitive stripos before matching the digits with \d+. For example:
$input = "c 1 , 12,15 , 8 , 9,10,11";
if (stripos($input, "C") === 0) {
preg_match_all("/\d+/", $input, $matches);
print_r($matches);
}
The condition could be for example stripos($input, "C") !== false if the "C" does not have to be the first character.
To validate that the string starts with "C" (possibly after horizontal whitespace), and contains only horizontal whitespace, commas and digits then the test could instead be
if (preg_match("/^\h*C[\h\d,]+$/i", $input)) {
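As a rough sketch, the validation test and the \d+ extraction can be combined into one helper. The function name extractAfterC is hypothetical, and returning an empty array for invalid input is an assumption about what "should fail" means.

```php
<?php
// Hypothetical helper: validate the whole string first, then pull out the numbers.
function extractAfterC(string $input): array
{
    // Must start with C (any case, optional leading horizontal whitespace),
    // followed only by horizontal whitespace, digits and commas.
    if (!preg_match('/^\h*C[\h\d,]+$/i', $input)) {
        return []; // assumption: invalid input yields no matches
    }
    preg_match_all('/\d+/', $input, $matches);
    return $matches[0];
}

print_r(extractAfterC("c 1 , 12,15 , 8 , 9,10,11"));
print_r(extractAfterC("d 12456, 9890")); // empty array
```

Note that the numbers come back as strings; cast them with intval if integers are needed.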
The lookbehind in your regex /(?<=C)*\d+/gim is made optional by the *, so the regex does not require that a "C" is present for the digits to be matched. It is functionally equivalent to just /\d+/g.
Is matching on 1 or more positive lookbehinds standard?
In this case, the lookbehind would need to be variable-width, (?<=C.*), and PHP does not support variable-width lookbehinds.
Why don't I need to include a \s* after the 'C' in the positive lookbehind?
PHP does not support the use of the * quantifier inside a lookbehind.
When would including the 'm' multi-line flag even make a difference here?
You would only want the m flag if your input spans multiple lines and you are using ^ or $, which then assert the start or end of each line rather than of the whole string.
In JavaScript, for example, the pattern /(?<=C)*\d+/gim is not valid because of the quantifier after the lookbehind assertion.
If you want to write it in JavaScript, where a quantifier inside the lookbehind is supported, you can get all the digits after a C at the start of the string with:
(?<=^C [\d, ]*)\d+
Using (?<=C)*\d+ in PHP, the quantifier makes the lookbehind optional, so it would also match 8 and 9 in, for example, the string 8,9 C 1,13,7,2,55
A quantifier of unbounded length inside a lookbehind assertion is not supported in PHP, so you cannot use (?<=C\h+)\d+, where \h+ would match 1 or more horizontal whitespace characters.
If you are using PHP, you can make use of the \G anchor to match only consecutive numbers after the first C character.
For a single line, you don't need the multi line flag. You do need it for multiple lines due to the anchor.
(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+
The pattern matches:
(?: Non capture group
^ Start of string
\h*C\h+ Match optional spaces, then C and 1+ spaces
| Or
\G(?!^) Assert the position at the end of the previous match (not at the start)
) Close the non capture group
\h*,*\h*\K Match optional commas between optional spaces, then forget what was matched so far
\d+ Match 1 or more digits
$regex = '/(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+/i';
$strings = [
"C 1,13,7,2,55",
"c 1 , 12,15 , 8 , 9,10,11",
"1,2 , 4,5",
"d 12456, 9890"
];
foreach ($strings as $s) {
if (preg_match_all($regex, $s, $matches)) {
print_r($matches[0]);
}
}
Output
Array
(
[0] => 1
[1] => 13
[2] => 7
[3] => 2
[4] => 55
)
Array
(
[0] => 1
[1] => 12
[2] => 15
[3] => 8
[4] => 9
[5] => 10
[6] => 11
)
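For input that spans several lines, the same \G pattern can be tried with the m flag so that ^ matches at the start of every line. This is only a sketch; the three-line sample input is invented for illustration.

```php
<?php
// Same pattern as above, with ^ anchoring each line thanks to the added m flag.
$regex = '/(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+/im';

// Hypothetical multi-line input: lines 1 and 3 start with C, line 2 does not.
$input = "C 1,2\nd 3,4\nc 5 , 6";

preg_match_all($regex, $input, $matches);
print_r($matches[0]); // 1, 2, 5, 6
```

The numbers on the "d" line are skipped because \h does not match a newline, so the \G chain breaks at the end of each line and can only restart where a line begins with C.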
I'm looping through a string with preg_match_all using different patterns. Sometimes these patterns look a lot like each other, but differ slightly.
Right now I'm looking for a way to stop pattern A from matching strings that only pattern B - which has a 'T' in front of the 4 digits - should match.
The problem I'm running into is that pattern A also matches pattern B:
A:
(\d{4})(A|B)?(C|D)?
... matches 1234, 1234A, 1234AD, etc.
B:
I also have another pattern:
T(\d{4})\/(\d{4})
... which matches strings like: T7878/6767
The result
When running a preg_match_all on "T7878/6767 1234AD", A will give the following matches:
7878, 6767, 1234AD
Does anyone have a suggestion how to prevent A from matching B in a string like "Some text T7878/6767 1234AD and some more text"?
Your help is greatly appreciated!
Scenario with boundaries
If you only want to match those specific strings within some boundaries, use those boundary patterns on each side of the pattern.
If you expect a whitespace boundary before each match, then add the (?<!\S) negative lookbehind at the start of the pattern. If you expect a whitespace boundary at the end of the match, add the (?!\S) negative lookahead. If there can be any chars around the match (as in your original question), then SKIP-FAIL is the only way (see below).
So, in this first case, you may use
(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)
and
(?<!\S)T(\d{4})\/(\d{4})(?!\S)
See Pattern 1 demo and Pattern 2 demo.
Scenario with no specific boundaries
You need to make sure the second pattern is skipped when you parse the string with the first one. Use SKIP-FAIL technique for this:
'~T\d{4}/\d{4}(*SKIP)(*F)|(\d{4})(A|B)?(C|D)?~'
See the regex demo.
If you do not need the capturing groups, you may simplify it to
'~T\d{4}/\d{4}(*SKIP)(*F)|\d{4}[AB]?[CD]?~'
See another demo
Details
T\d{4}/\d{4} - T followed with 4 digits, / and another 4 digits
(*SKIP)(*F) - the matched text is discarded and the search for the next match resumes at the end of the discarded text
| - or
\d{4}[AB]?[CD]? - 4 digits, then optionally A or B and then optionally C or D.
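A short sketch of both patterns run against the sample string from the question may help. The variable names are chosen here for illustration only.

```php
<?php
$text = "Some text T7878/6767 1234AD and some more text";

// Pattern A with SKIP-FAIL: anything shaped like pattern B is consumed and discarded first.
preg_match_all('~T\d{4}/\d{4}(*SKIP)(*F)|\d{4}[AB]?[CD]?~', $text, $a);

// Pattern B unchanged, with its two capturing groups.
preg_match_all('~T(\d{4})/(\d{4})~', $text, $b);

print_r($a[0]); // 1234AD
print_r($b[0]); // T7878/6767
```

Pattern A no longer reports 7878 or 6767, because the T7878/6767 token is skipped as a whole before the second alternative gets a chance to match inside it.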
From what you're asking, your current regexes don't really work. (A|B)?(C|D)? will never match AB. So I think you meant [ABCD]
Here's your new regex:
T(\d{4})\/(\d{4}) (\d{4}[ABCD]*)
For the string input:
T7878/6767 1234AB
We get the groups:
Match 1
Full match 0-17 `T7878/6767 1234AB`
Group 1. 1-5 `7878`
Group 2. 6-10 `6767`
Group 3. 11-17 `1234AB`
Your syntax is pretty specific, so your regex needs to be just as specific. Get rid of all your capture groups because they are screwing things up. You only need two groups which match your string syntax exactly.
The first group looks for a word boundary followed by T, then 4 digits, then /, then 4 more digits and a word boundary.
The second group matches 4 digits and then letters A-D between 0 and 2 times. It is preceded by a negative lookbehind, so it will only match if there is a whitespace character (or the start of the string) before the 4 digits.
(\bT\d{4}\/\d{4}\b)|(?<!\S)(\d{4}[A-D]{0,2})
Preg match all output:
Array
(
[0] => Array
(
[0] => T7878/6767
[1] => 1234AB
)
[1] => Array
(
[0] => T7878/6767
[1] =>
)
[2] => Array
(
[0] =>
[1] => 1234AB
)
)
I can't find the appropriate regex to extract only floats from a string.
Consider the following string:
$string = "8x2.1 3x2";
I want to extract 2.1, I tried the following but this gives me integers and floats:
preg_match_all('/[0-9,]+(?:\.[0-9]*)?/', $string, $matches);
I then tried using is_float to check for floats, but this also returns the integers for some reason. Any ideas?
Thanks
Consider this simple example:
<?php
$input = "8x2.1 3x2";
preg_match('/\d+\.\d+/', $input, $tokens);
print_r($tokens);
Matches "one or more digits, followed by exactly one full stop and again one or more digits".
The output obviously is:
Array
(
[0] => 2.1
)
Your regex matches floats and integers, and even strings consisting of just commas.
[0-9,]+ - 1 or more digits or ,
(?:\.[0-9]*)? - one or zero sequences of . + zero or more digits.
You need
/\d+\.\d+/
That will match 1+ digits, . and 1+ digits.
Or, to also match negative and positive floats, add an optional - at the beginning:
/-?\d+\.\d+/
Details
-? - one or zero hyphens (? means match one or zero occurrences)
\d+ - one or more digits (+ means match one or more occurrences, \d matches a digit char)
\. - a literal dot (since a dot in a regex is a special metacharacter, it should be escaped to denote a literal dot)
\d+ - one or more digits
PHP demo:
$string = "8x2.1 3x2";
preg_match_all('/\d+\.\d+/', $string, $matches);
print_r($matches[0]);
// => Array ( [0] => 2.1 )
A bonus regex that will also match only float numbers with optional exponent (a variant of the regex at regular-expressions.info):
/[-+]?\d+\.\d+(?:e[-+]?\d+)?/i
Here, you can see that an optional + or - is matched first ([-+]?), then the same pattern as above is used, then comes an optional non-capturing group (?:...)? that matches 1 or 0 occurrences of the following sequence: e or E (since /i is a case insensitive modifier), [-+]? matches an optional + or -, and \d+ matches 1+ digits.
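As an illustration of the exponent variant, here is a small sketch. The input extends the string from the question with a few made-up values.

```php
<?php
// Hypothetical input: the original string plus a negative float, an exponent float,
// and an integer with an exponent (which has no dot and should be ignored).
$string = "8x2.1 3x2 -0.5 1.5e10 7E3";

preg_match_all('/[-+]?\d+\.\d+(?:e[-+]?\d+)?/i', $string, $matches);
print_r($matches[0]); // 2.1, -0.5, 1.5e10
```

The matches are still strings; passing them through floatval would turn them into PHP floats if that is what the rest of the code expects.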
I want to split the string into numbers and text as separate elements.
print_r(preg_split("/[^0-9]+/", "12345hello"));
output:
Array ( [0] => 12345 [1] => )
By using [^0-9]+ you are actually matching the non-digit characters and splitting on them, which leaves you with an empty array element instead of the expected result. You can use a workaround to do this.
print_r(preg_split('/\d+\K/', '12345hello'));
# Array ([0] => 12345 [1] => hello)
The \K verb tells the engine to drop whatever it has matched so far from the match to be returned.
If you want to consistently do this with larger text, you need multiple lookarounds.
print_r(preg_split('/(?<=\D)(?=\d)|\d+\K/', '12345hello6789foo123bar'));
# Array ([0] => 12345 [1] => hello [2] => 6789 [3] => foo [4] => 123 [5] => bar)
You can split it with a lookahead and a lookbehind at each position where a digit follows a non-digit and vice versa.
(?<=\D)(?=\d)|(?<=\d)(?=\D)
explanation:
\D Non-Digit [^0-9]
\d any digit [0-9]
Detail pattern explanation:
(?<= look behind to see if there is:
\D non-digits (all but 0-9)
) end of look-behind
(?= look ahead to see if there is:
\d digits (0-9)
) end of look-ahead
| OR
(?<= look behind to see if there is:
\d digits (0-9)
) end of look-behind
(?= look ahead to see if there is:
\D non-digits (all but 0-9)
) end of look-ahead
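In PHP, the lookaround pattern explained above could be used with preg_split roughly like this. The input string is an invented example.

```php
<?php
// Split at every zero-width position where digits meet non-digits, in either direction.
$parts = preg_split('/(?<=\D)(?=\d)|(?<=\d)(?=\D)/', '12345hello6789foo');
print_r($parts); // 12345, hello, 6789, foo
```

Because both alternatives are zero-width, no characters are consumed, so every piece of the original string ends up in exactly one array element.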
I am trying to match words containing the following: eph gro iss
I have eph|gro|iss which will match eph gro iss in this example: new grow miss eph.
However, I need to match the whole word. For example, it should match all of miss, not just iss, and all of grow, not just gro.
Thanks
You can do it like this:
\b(\w*(eph|gro|iss)\w*)\b
How it works:
The expression is bracketed with word-boundary anchors \b, so it only matches whole words. These words must contain one of the literals eph, gro or iss somewhere, but the \w* parts allow the literals to appear anywhere within the whole word.
The important thing here is that you need to adopt some specific definition for "words". If you are OK with the regex definition that words are sequences that match [a-zA-Z0-9_]+ then you can use the above verbatim.
If your definition of word is something else, you will need to replace the \b anchors and \w classes appropriately.
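As one possible sketch of a different word definition, assume that hyphens also count as part of a word. The \b anchors are then replaced with lookarounds on a custom character class. Both the class [\w-] and the sample sentence are assumptions made for illustration.

```php
<?php
// Assumption: a "word" is a run of word characters and hyphens.
$text = "re-grow dismiss nephew new";

// Default definition: \b and \w treat the hyphen as a word break.
preg_match_all('/\b\w*(?:eph|gro|iss)\w*\b/', $text, $default);

// Custom definition: lookarounds make sure the match is not cut out of a longer hyphenated word.
preg_match_all('/(?<![\w-])[\w-]*(?:eph|gro|iss)[\w-]*(?![\w-])/', $text, $custom);

print_r($default[0]); // grow, dismiss, nephew
print_r($custom[0]);  // re-grow, dismiss, nephew
```

Only the hyphenated word differs between the two results, which shows that the choice of word definition matters mainly at such edge cases.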
Try this:
\b([a-zA-Z]*(?:eph|gro|iss)[a-zA-Z]*)\b
Breakdown:
\b - word boundary
( - start capture
[a-zA-Z]* - zero or more letters
(?:eph|gro|iss) - your original regex, non-capturing
[a-zA-Z]* - zero or more letters
) - end capture
\b - word boundary
Example output:
php > $string = "new grow miss eph";
php > preg_match_all("/\b([a-zA-Z]*(?:eph|gro|iss)[a-zA-Z]*)\b/", $string, $matches);
php > print_r($matches);
Array
(
[0] => Array
(
[0] => grow
[1] => miss
[2] => eph
)
[1] => Array
(
[0] => grow
[1] => miss
[2] => eph
)
)