Regular expression in PHP being too greedy on words

I know I'm just being simple-minded at this point but I'm stumped. Suppose I have a textual target that looks like this:
Johnny was really named for his 1234 grandfather, John Hugenot, but his T5677 id was JH6781 and his little brother's HG766 id was RB1223.
Using this RegExp: \s[A-Z][A-Z]\d\d\d\d\s, how would I extract, individually, the first and second occurrences of the matching strings? "JH6781" and "RB1223", respectively. I guarantee that the matching string will appear exactly twice in the target text.
Note: I do NOT want to change the existing string at all, so str_replace() is not an option.

Erm... how about using this regex:
/\b[A-Z]{2}\d{4}\b/
It means 'match boundary of a word, followed by exactly two capital English letters, followed by exactly four digits, followed by a word boundary'. So it won't match 'TGX7777' (word boundary is followed by three letters - pattern match failed), and it won't match 'TX77777' (four digits are followed by another digit - fail again).
And that's how it can be used:
$str = "Johnny was really named for his 1234 grandfather, John Hugenot, but his T5677 id was JH6781 and his little brother's HG766 id was RB1223.";
preg_match_all('/\b[A-Z]{2}\d{4}\b/', $str, $matches);
var_dump($matches[0]);
// array
// 0 => string 'JH6781' (length=6)
// 1 => string 'RB1223' (length=6)
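Since the question asks for the first and second occurrences individually, here is a minimal sketch of pulling them out of the result array. The variable names $first and $second are only illustrative, and the null fallback is an assumption about how you would want to handle an unexpected match count.

```php
<?php
$str = "Johnny was really named for his 1234 grandfather, John Hugenot, but his T5677 id was JH6781 and his little brother's HG766 id was RB1223.";

// Collect every two-letter, four-digit ID that stands alone as a word.
$count = preg_match_all('/\b[A-Z]{2}\d{4}\b/', $str, $matches);

// The question guarantees exactly two hits; fall back to null otherwise.
[$first, $second] = ($count === 2) ? $matches[0] : [null, null];

echo $first, "\n";
echo $second, "\n";
```

$str itself is never modified, which was the other requirement in the question.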

$s='Johnny was really named for his 1234 grandfather, John Hugenot, but his T5677 id was JH6781 and his little brother\'s HG766 id was RB1223.';
$n=preg_match_all('/\b[A-Z][A-Z]\d\d\d\d\b/',$s,$m);
gives the result $n=2, then
print_r($m);
gives the result
Array
(
[0] => Array
(
[0] => JH6781
[1] => RB1223
)
)

You could use a combination of preg_match with the offset parameter(5th) and strpos to select the first and second occurrence.
Alternatively you could use preg_match_all and just use the first two array entries
<?php
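$first = preg_match($regex, $subject, $match);
// strpos() needs the haystack too; restart the search one byte past the start of the first match
$second = preg_match($regex, $subject, $match2, 0, strpos($subject, $match[0]) + 1);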
?>
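A variation on the same idea, as a sketch only: instead of strpos, the PREG_OFFSET_CAPTURE flag lets preg_match report where the first match started, so the second search can resume right after it. The shortened $subject and the word-boundary pattern are assumptions borrowed from the other answers.

```php
<?php
$subject = "his T5677 id was JH6781 and his little brother's HG766 id was RB1223.";
$regex = '/\b[A-Z]{2}\d{4}\b/';

// With PREG_OFFSET_CAPTURE each entry is a [text, byteOffset] pair.
preg_match($regex, $subject, $m1, PREG_OFFSET_CAPTURE);
[$firstText, $firstPos] = $m1[0];

// Resume the search immediately after the end of the first match.
preg_match($regex, $subject, $m2, 0, $firstPos + strlen($firstText));
$secondText = $m2[0];

echo $firstText, "\n";
echo $secondText, "\n";
```

This avoids a wrong offset when the matched text also appears earlier in the subject in a position the regex itself would not accept.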

Related

Why is preg_match behaving differently to preg_replace (resulting in different matches) in php?

Given the following string and regular expression, the resulting behavior is something I don't understand. preg_match delivers what I am expecting while preg_replace doesn't make sense to me.
$string = 'aaa [Ticket#RS-123456] äüö [xxx] ccc ddd';
$re = '#(.*)?(\[Ticket\#)(.*)(\])(.*)?#siU';
What I finally need in this example is the string RS-123456 (or whatever string would be at this position). This string should match at the 3rd position ($3), if I don't completely misunderstand regular expressions.
preg_match($re, $string, $matches_pm);
Result (as expected):
Array(
[0] => aaa [Ticket#RS-123456]
[1] => aaa
[2] => [Ticket#
[3] => RS-123456 // That's exactly what I would expect
[4] => ]
)
$res_pr = preg_replace($re, "$3", $string);
Result (unexpected):
RS-123456 äüö [xxx] ccc ddd
I hope anyone can open my eyes and show me where my logical failure is hiding.

Both match the same text, but preg_match returns the first match only while preg_replace replaces the match (that is not the entire string) with Group 3 contents leaving äüö [xxx] ccc ddd in the resulting string.
Use
$re = '#(.*)(\[Ticket\#)(.*?)(\])(.*)#si';
to get the same results with preg_match and preg_replace.
However, preg_match is the preferred way here:
if (preg_match('~\[Ticket#\K[^]]+~i', $string, $matches_pm)) {
echo $matches_pm[0];
}
Pattern details
\[Ticket# - a literal [Ticket# substring
\K - match reset operator discarding the currently matched text
[^]]+ - 1 or more chars other than ]

regex find numbers after capital letter php

I am trying to find all the numbers after a capital letter. See the example below:
E1S1 should give me an array containing: [1 , 1]
S123455D1223 should give me an array containing: [123455 , 1223]
I tried the following but didn't get any matches on any of the examples shown above :(
$loc = "E123S5";
$locs = array();
preg_match('/\[A-Z]([0-9])/', $loc, $locs);
Any help is greatly appreciated; I am a newbie to regex.

Your regex \[A-Z]([0-9]) matches a literal [ (as it is escaped), then A-Z] as a char sequence (since the character class [...] is broken) and then matches and captures a single ASCII digit (with ([0-9])). Also, you are using a preg_match function that only returns 1 match, not all matches.
You might fix it with
preg_match_all('/[A-Z]([0-9]+)/', $loc, $locs);
The $locs[1] will contain the values you need.
Alternatively, you may use a [A-Z]\K[0-9]+ regex:
$loc = "E123S5";
$locs = array();
preg_match_all('/[A-Z]\K[0-9]+/', $loc, $locs);
print_r($locs[0]);
Result:
Array
(
[0] => 123
[1] => 5
)
Pattern details
[A-Z] - an upper case ASCII letter (to support all Unicode ones, use \p{Lu} and add u modifier)
\K - a match reset operator discarding all text matched so far
[0-9]+ - any 1 or more (due to the + quantifier) digits.
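The Unicode variant mentioned in the pattern details could look like the sketch below. The sample input with non-ASCII capitals is made up for illustration.

```php
<?php
// Made-up input: the capital letters here are not ASCII.
$loc = 'É12Ж345';

// \p{Lu} matches any Unicode uppercase letter; the u modifier is required.
preg_match_all('/\p{Lu}\K[0-9]+/u', $loc, $unicodeLocs);

// The ASCII-only class finds nothing in this input.
$asciiCount = preg_match_all('/[A-Z]\K[0-9]+/', $loc, $asciiLocs);

print_r($unicodeLocs[0]);
echo $asciiCount, "\n";
```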

Matching all characters except spaces in regex

Right now I have a regex, and I want to change one part of the regex.
(.{3,}?) ~
^---This part of the code, where it says, (any characters that are 3 or more in length, and matches up to the nearest space), I want to change it to (any characters, except spaces , that are 3 or more in length, and matches up to the nearest space). How would I say that in regex?
$text = "my name is to habert";
$regex = "~(?:my name is |my name\\\'s |i am |i\\\'m |it is |it\\\'s |call me )?(.{3,}?) ~i";
preg_match($regex, $text, $match);
print_r($match);
Result:
Array ( [0] => my name [1] => my name )
Need Result:
Array ( [0] => name [1] => name )

Gravedigger here... Since this question does not have an answer yet, I'll post mine.
(\S{3,}) will work for your needs
Regex Explanation:
( Open capture group
\S Everything but whitespaces (same as [^\s], you can use [^ ] too, but the latter works only for spaces.)
{3,} Must contain three or more characters
) Close capture group
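Dropped into the regex from the question, this could look like the sketch below. The list of prefixes is shortened here to keep the example small; that shortening is my own assumption, not part of the original pattern.

```php
<?php
$text = "my name is to habert";

// Same shape as the question's regex, with (.{3,}?) swapped for (\S{3,}).
$regex = '~(?:my name is |i am |call me )?(\S{3,}) ~i';

preg_match($regex, $text, $match);
print_r($match);
```

Note that $match[0] still includes the trailing space the pattern requires, while the capture group in $match[1] holds just the word.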

How do i break string into words at the position of number

I have some string data with alphanumeric values, like us01name, phc01name and others, i.e. letters + number + letters.
I would like to get the leading letters + number in the first string and the remainder in the second.
How can I do it in PHP?

You can use a regular expression:
// if statement checks there's at least one match
if (preg_match('/([A-Za-z]+[0-9]+)([A-Za-z]+)/', $string, $matches) > 0) {
$firstbit = $matches[1];
$nextbit = $matches[2];
}
Just to break the regular expression down into parts so you know what each bit does:
( Begin group 1
[A-Za-z]+ As many alphabet characters as there are, in either case (note that [A-z] would also match [ \ ] ^ _ and `)
[0-9]+ As many numbers as there are
) End group 1
( Begin group 2
[A-Za-z]+ As many alphabet characters as there are, in either case
) End group 2

Try this code:
preg_match('~([^\d]+\d+)(.*)~', "us01name", $m);
var_dump($m[1]); // 1st string + number
var_dump($m[2]); // 2nd string
OUTPUT
string(4) "us01"
string(4) "name"
Even this more restrictive regex will also work for you:
preg_match('~([A-Z]+\d+)([A-Z]+)~i', "us01name", $m);

You could use preg_split on the digits with the pattern capture flag. It returns all pieces, so you'd have to put them back together. However, in my opinion it is more intuitive and flexible than a complete pattern regex. Plus, preg_split() is underused :)
Code:
$str = 'user01jason';
$pieces = preg_split('/(\d+)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($pieces);
Output:
Array
(
[0] => user
[1] => 01
[2] => jason
)
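Putting the pieces back together could be done roughly like this. The function name splitAtNumber is hypothetical, and returning null when the input contains no digits is an assumption about the desired behaviour.

```php
<?php
// Hypothetical helper: returns [letters + number, remainder] or null.
function splitAtNumber(string $str): ?array
{
    $pieces = preg_split('/(\d+)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($pieces) < 3) {
        return null; // no digit run found, nothing to split on
    }
    // First piece plus the first digit run form the leading part;
    // everything after that is glued back into the remainder.
    return [$pieces[0] . $pieces[1], implode('', array_slice($pieces, 2))];
}

print_r(splitAtNumber('us01name'));
print_r(splitAtNumber('phc01name'));
```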

How to get all captures of subgroup matches with preg_match_all()? [duplicate]

Update/Note:
I think what I'm probably looking for is to get the captures of a group in PHP.
Referenced: PCRE regular expressions using named pattern subroutines.
(Read carefully:)
I have a string that contains a variable number of segments (simplified):
$subject = 'AA BB DD '; // could be 'AA BB DD CC EE ' as well
I would like now to match the segments and return them via the matches array:
$pattern = '/^(([a-z]+) )+$/i';
$result = preg_match_all($pattern, $subject, $matches);
This will only return the last match for the capture group 2: DD.
Is there a way that I can retrieve all subpattern captures (AA, BB, DD) with one regex execution? Isn't preg_match_all suitable for this?
This question is a generalization.
Both the $subject and $pattern are simplified. Naturally, with such a simple subject the list of AA, BB, ... is much easier to extract with other functions (e.g. explode) or with a variation of the $pattern.
But I'm specifically asking how to return all of the subgroup matches with the preg_...-family of functions.
For a real-life case, imagine you have multiple (nested) levels with a varying number of subpattern matches.
Example
This is an example in pseudo code to describe a bit of the background. Imagine the following:
Regular definitions of tokens:
CHARS := [a-z]+
PUNCT := [.,!?]
WS := [ ]
$subject gets tokenized based on these. The tokenization is stored inside an array of tokens (type, offset, ...).
That array is then transformed into a string, containing one character per token:
CHARS -> "c"
PUNCT -> "p"
WS -> "s"
So that it's now possible to run regular expressions based on tokens (and not character classes etc.) on the token stream string index. E.g.
regex: (cs)?cp
to express one or more group of chars followed by a punctuation.
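The tokenization step described above might be sketched as follows. This is only an illustration under assumptions: the tokenize function, its return shape and the error handling are invented for the example, and the token patterns are lowercase-only as in the definitions.

```php
<?php
// Token table taken from the definitions above: symbol => pattern.
$tokenTypes = ['c' => '[a-z]+', 'p' => '[.,!?]', 's' => '[ ]'];

// Hypothetical helper: returns [token list, one-character-per-token string].
function tokenize(string $subject, array $tokenTypes): array
{
    $tokens = [];
    $stream = '';
    $offset = 0;
    $length = strlen($subject);
    while ($offset < $length) {
        foreach ($tokenTypes as $symbol => $re) {
            // \G pins the match to the current offset.
            if (preg_match('/\G' . $re . '/', $subject, $m, 0, $offset)) {
                $tokens[] = ['type' => $symbol, 'offset' => $offset, 'text' => $m[0]];
                $stream .= $symbol;
                $offset += strlen($m[0]);
                continue 2; // move on to the next token
            }
        }
        throw new InvalidArgumentException("Unexpected character at offset $offset");
    }
    return [$tokens, $stream];
}

[$tokens, $stream] = tokenize('hello world!', $tokenTypes);
echo $stream, "\n";

// The token-level regex now runs on the stream, not on the text.
$matchesGrammar = preg_match('/^(cs)?cp$/', $stream);
var_dump($matchesGrammar);
```

The offsets stored per token let you map a match on the token stream back to the original text.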
As I now can express self-defined tokens as regex, the next step was to build the grammar. This is only an example, this is sort of ABNF style:
words = word | (word space)+ word
word = CHARS+
space = WS
punctuation = PUNCT
If I now compile the grammar for words into a (token) regex I would like to have naturally all subgroup matches of each word.
words = (CHARS+) | ( (CHARS+) WS )+ (CHARS+) # words resolved to tokens
words = (c+)|((c+)s)+c+ # words resolved to regex
I could code until this point. Then I ran into the problem that the sub-group matches did only contain their last match.
So I have the option to either create an automaton for the grammar on my own (which I would like to avoid, to keep the grammar expressions generic) or to somehow make preg_match work for me so I can spare that.
That's basically all. Probably now it's understandable why I simplified the question.
Related:
pcrepattern man page
Get repeated matches with preg_match_all()

Similar thread: Get repeated matches with preg_match_all()
Check the chosen answer there; mine might be useful too, so I will duplicate it here:
From http://www.php.net/manual/en/regexp.reference.repetition.php :
When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration.
I personally gave up and am going to do this in 2 steps.
EDIT:
I see that in that other thread someone claimed the lookbehind method can do it.
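A two-step approach could look like the sketch below, using the simplified $subject and $pattern from the question. The exact split between the steps is my assumption about what "2 steps" would mean here: validate the whole string first, then collect the segments with a second call.

```php
<?php
$subject = 'AA BB DD ';

// Step 1: validate the overall shape. Group 2 keeps only the final iteration.
$valid = preg_match('/^(([a-z]+) )+$/i', $subject, $m);
$last = $m[2];

// Step 2: collect every segment with an unanchored pattern.
preg_match_all('/([a-z]+) /i', $subject, $all);
$segments = $valid ? $all[1] : [];

echo $last, "\n";
print_r($segments);
```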

Try this:
preg_match_all("'[^ ]+'i",$text,$n);
$n[0] will contain an array of all non-space character groups in the text.
Edit: with subgroups:
preg_match_all("'([^ ]+)'i",$text,$n);
Now $n[1] will contain the subgroup matches, that are exactly the same as $n[0]. This is pointless actually.
Edit2: nested subgroups example:
$test = "Hello I'm Joe! Hi I'm Jane!";
preg_match_all("/(H(ello|i)) I'm (.*?)!/i",$test,$n);
And the result:
Array
(
[0] => Array
(
[0] => Hello I'm Joe!
[1] => Hi I'm Jane!
)
[1] => Array
(
[0] => Hello
[1] => Hi
)
[2] => Array
(
[0] => ello
[1] => i
)
[3] => Array
(
[0] => Joe
[1] => Jane
)
)

Is there a way that I can retrieve all matches (AA, BB, DD) with one regex execution? Isn't preg_match_all suitable for this?
Your current regex seems to be for a preg_match() call. Try this instead:
$pattern = '/[a-z]+/i';
$result = preg_match_all($pattern, $subject, $matches);
Per comments, the ruby regex I mentioned:
sentence = %r{
(?<subject> cat | dog ){0}
(?<verb> eats | drinks ){0}
(?<object> water | bones ){0}
(?<adjective> big | smelly ){0}
(?<obj_adj> (\g<adjective>\s)? ){0}
The\s\g<obj_adj>\g<subject>\s\g<verb>\s\g<obj_adj>\g<object>
}x
md = sentence.match("The cat drinks water");
md = sentence.match("The big dog eats smelly bones");
But I think you'll need a lexer/parser/tokenizer to do the same kind of thing in PHP. :-|

You can't extract the subpatterns because the way you wrote your regex returns only one match (using ^ and $ at the same time, and + on the main pattern).
If you write it this way, you'll see that your subgroups are correctly there:
$pattern = '/(([a-z]+) )/i';
(this still has an unnecessary set of parentheses, I just left it there for illustration)
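If the segments must still be contiguous from the start of the string (which is what ^ and + were enforcing), one option is the \G anchor. This is a different technique from the one in this answer, shown only as a sketch; the second sample subject is made up to show where matching stops.

```php
<?php
// \G matches at the start of the subject and then only where the
// previous match ended, so the segments have to follow each other.
$pattern = '/\G([a-z]+) /i';

preg_match_all($pattern, 'AA BB DD ', $ok);
preg_match_all($pattern, 'AA BB 12 DD ', $broken);

print_r($ok[1]);
print_r($broken[1]);
```

Unlike the anchored original, this does not check that the whole subject was consumed; you would compare the total length of the matches in $ok[0] against strlen($subject) for that.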

Edit
I didn't realize what you had originally asked for. Here is the new solution:
$result = preg_match_all('/[a-z]+/i', $subject, $matches);
$resultArr = ($result) ? $matches[0] : array();

How about:
$str = 'AA BB CC';
$arr = preg_split('/\s+/', $str);
print_r($arr);
output:
(
[0] => AA
[1] => BB
[2] => CC
)

I may have misunderstood what you're describing. Are you just looking for a pattern for groups of letters with whitespace between?
// any subject containing words:
$subject = 'AfdfdfdA BdfdfdB DdD';
$subject = 'AA BB CC';
$subject = 'Af df dfdA Bdf dfdB DdD';
$pattern = '/(([a-z]+)\s)+[a-z]+/i';
$result = preg_match_all($pattern, $subject, $matches);
print_r($matches);
echo "<br/>";
print_r($matches[0]); // this matches $subject
echo "<br/>".$result;

Yes, you're right: the solution is to use preg_match_all. preg_match_all keeps searching after each match, so don't anchor the pattern with ^ (start) and $ (end); that way preg_match_all puts every match it finds into an array.
Each new pair of parentheses adds a new array holding that group's matches.
Use ? for optional matches.
You can separate different groups of patterns with parentheses () so that each group is reported in its own array (this lets you count matches, or categorize each match from the returned array).
Clarification required
Let me try to understand your question, so that my answer matches what you ask.
Is your $subject not a good example of what you're looking for?
Would you like the preg_match search to split what you provided in $subject into 4 categories: words, characters, punctuation and white space? And what about numbers?
Would you also like the returned matches to include their offsets?
Does $subject = 'aa.bb cc.dd EE FFF,GG'; better fit a real-life example?
I will take your basic example in $subject and make it work to give you exactly what you asked for.
So can you edit your $subject so that it better fits all the cases that you want to match?
Original '/^(([a-z]+) )+$/i';
Keep me posted.
You can test your regexes here: http://www.spaweditor.com/scripts/regex/index.php
Partial answer
/([a-z])([a-z]+)/i
AA BB DD CD
Array
(
[0] => Array
(
[0] => AA
[1] => BB
[2] => DD
[3] => CD
)
[1] => Array
(
[0] => A
[1] => B
[2] => D
[3] => C
)
[2] => Array
(
[0] => A
[1] => B
[2] => D
[3] => D
)
)
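If the offsets of the matches are wanted as well, the PREG_OFFSET_CAPTURE flag can be combined with PREG_SET_ORDER. A rough sketch follows; the shape of the $words array is made up for illustration.

```php
<?php
$subject = 'AA BB DD CD';

// SET_ORDER groups results per match; OFFSET_CAPTURE adds byte offsets.
preg_match_all('/[a-z]+/i', $subject, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

$words = [];
foreach ($matches as $set) {
    // $set[0] is the whole match as a [text, offset] pair.
    [$text, $offset] = $set[0];
    $words[] = ['text' => $text, 'offset' => $offset];
}

print_r($words);
```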
