Matching all characters except spaces in regex - php

Right now I have a regex, and I want to change one part of the regex.
(.{3,}?) ~
^--- This part of the pattern currently means: any characters, three or more in length, matched up to the nearest space. I want to change it to: any characters except spaces, three or more in length, matched up to the nearest space. How would I say that in regex?
$text = "my name is to habert";
$regex = "~(?:my name is |my name\\\'s |i am |i\\\'m |it is |it\\\'s |call me )?(.{3,}?) ~i";
preg_match($regex, $text, $match);
print_r($match);
Result:
Array ( [0] => my name [1] => my name )
Need Result:
Array ( [0] => name [1] => name )

Gravedigger here... Since this question does not have an answer yet, I'll post mine.
(\S{3,}) will work for your needs
Regex Explanation:
( Open capture group
\S Any non-whitespace character (same as [^\s]; you can use [^ ] too, but the latter excludes only the space character)
{3,} Must contain three or more characters
) Close capture group
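As a quick check, here is a minimal sketch that drops (\S{3,}) into the regex from the question. Everything except the capture group is copied from the question; the variable names are the asker's.

```php
<?php
// Same alternation as in the question; only the capture group is changed to \S{3,}.
$text  = "my name is to habert";
$regex = "~(?:my name is |my name\\\'s |i am |i\\\'m |it is |it\\\'s |call me )?(\S{3,}) ~i";

preg_match($regex, $text, $match);
print_r($match);
// $match[0] is "name " (including the trailing space required by the pattern)
// $match[1] is "name"
```

The optional prefix cannot be used here because "to" is shorter than three characters, so the engine falls back to the first run of three or more non-space characters that is followed by a space.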

Related

PHP: preg_match for "some multiple words string" + [\s* + "<some.email#address>"]

I need to parse strings that appear in one of the following forms:
Some User Name
Some User Name <user.mail#address>
The username (multiple words) is always present, but the email is optional and enclosed in angle brackets.
I need to capture the following:
The username, as one string with multiple words separated by \s or \h
The email address that follows (if present), without the angle brackets. If no email address is specified, the corresponding submatch should be empty (but should always exist in the result).
I tried some variations of
preg_match('/^(.*?)\s*(?:\<(.*)\>)?$/s', $in, $out)
but this does not work.
Thanks to anybody who can help me.
To get all the separate words separated by \h and an optional email address, you could make use of the \G anchor to get iterative matches, asserting the position at the end of the previous match.
(?|^(\w+)|\G(?!^)\h+(\w+))(?:\h+<([^<>\r\n]+)>$)?
Explanation
(?| Branch reset group (To keep the words in $matches[1])
^(\w+) Start of string, match 1+ word chars in group 1
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start
\h+(\w+) Match 1+ horizontal whitespace chars, then capture 1+ word chars in group 1
) Close branch reset group
(?: Non capture group
\h+ Match 1+ horizontal whitespace chars
<([^<>\r\n]+)>$ Capture the email address between <> in group 2 at the end of string
)? Close non capture group and make it optional
Use preg_match_all to get all the values.
The default flag is PREG_PATTERN_ORDER which:
Orders results so that $matches[0] is an array of full pattern
matches, $matches[1] is an array of strings matched by the first
parenthesized subpattern, and so on.
The words are in $matches[1] and the email is in $matches[2].
If the email is not present, the array will be there, but empty.
You could use array_filter to remove the empty entries from the email array.
Example code
$pattern = "~(?|^(\w+)|\G(?!^)\h+(\w+))(?:\h+<([^<>\r\n]+)>$)?~";
$strings = [
"Some User Name ",
"Some User Name <user.mail#address>"
];
foreach ($strings as $str) {
preg_match_all($pattern, $str, $matches);
print_r($matches[1]);
print_r(array_filter($matches[2]));
}
Output
Array
(
[0] => Some
[1] => User
[2] => Name
)
Array
(
)
Array
(
[0] => Some
[1] => User
[2] => Name
)
Array
(
[2] => user.mail#address
)
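If the name is wanted as one string rather than as separate words, a single preg_match call may be enough. The sketch below is an alternative, not the approach from the answer above: parseNameAndEmail is a hypothetical helper name, and the pattern is an assumption based on the two sample inputs.

```php
<?php
// Hypothetical helper (not from the answer): returns [name, email],
// where email is '' when no <...> part is present.
function parseNameAndEmail(string $in): array
{
    // Group 1: the name, matched lazily; group 2: the text between < and >.
    if (!preg_match('/^(.*?)\h*(?:<([^<>\r\n]+)>)?$/', $in, $out)) {
        return ['', ''];
    }
    // By default preg_match() leaves out trailing groups that did not
    // participate in the match, so fall back to an empty string.
    return [$out[1], $out[2] ?? ''];
}

print_r(parseNameAndEmail('Some User Name'));
print_r(parseNameAndEmail('Some User Name <user.mail#address>'));
```

The ?? fallback is what guarantees that the email slot always exists in the result, which is the part the original attempt was missing.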

Split String With preg_match

I have this string:
$productList="
Saluran Dua(Bothway)-(TAN007);
Speedy Password-(INET PASS);
Memo-(T-Memo);
7-pib r-10/10-(AM);
FBI (R/N/M)-(Rr/R(A));
";
I want a result like this:
Array(
[0]=>TAN007
[1]=>INET PASS
[2]=>T-Memo
[3]=>AM
[4]=>Rr/R(A)
);
I used :
$separator = '/\-\(([A-z ]*)\)/';
preg_match_all($separator, $productList, $match);
$value=$match[1];
but the result:
Array(
[0]=>INET PASS
[1]=>AM
);
There must be something wrong in my code. Can anybody help with this?
Your regex does not include all the characters that can appear in the piece of text you want to capture.
The correct regex is:
$match = array();
preg_match_all('/-\((.*)\);/', $productList, $match);
Explanation (from the inside to outside):
.* matches any sequence of characters (except newlines, by default);
(.*) is the expression above put into parentheses to capture the match in $match[1];
-\((.*)\); is the above in context: it matches if it is preceded by -( and followed by );; the parentheses are escaped to use their literal values and not their special regex interpretation;
there is no need to escape - in regex; it has a special interpretation only when it is used inside character ranges ([A-Z], e.g.), but even there, if the dash character (-) is right after the [ or right before the ] then it has no special meaning; e.g. [-A-Z] means: dash (-) or any capital letter (A to Z).
Now, print_r($match[1]); looks like this:
Array
(
[0] => TAN007
[1] => INET PASS
[2] => T-Memo
[3] => AM
[4] => Rr/R(A)
)
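The pattern above works because each item sits on its own line and . does not cross newlines by default. If the list arrived on a single line (an assumption, not the case in the question), the greedy .* would run to the last ); and a lazy .*? would probably be the safer choice. A minimal sketch:

```php
<?php
// Assumption: the same items as in the question, but all on one line.
$productList = "Saluran Dua(Bothway)-(TAN007);Speedy Password-(INET PASS);"
             . "Memo-(T-Memo);7-pib r-10/10-(AM);FBI (R/N/M)-(Rr/R(A));";

// Greedy: .* extends to the last ");" in the string, giving a single match.
preg_match_all('/-\((.*)\);/', $productList, $greedy);

// Lazy: .*? stops at the first ");" that follows each "-(".
preg_match_all('/-\((.*?)\);/', $productList, $lazy);

echo count($greedy[1]), "\n"; // 1
print_r($lazy[1]);            // TAN007, INET PASS, T-Memo, AM, Rr/R(A)
```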
For the 1st line you need 0-9,
for the 3rd line you need a -, and
in the last line you need ().
Try this:
#\-\(([a-zA-Z/0-9(\)\- ]*)\)#
Try this regex:
$separator = '#\-\(([A-Za-z0-9/\-\(\) ]*)\)#';

Regular expression in PHP being too greedy on words

I know I'm just being simple-minded at this point but I'm stumped. Suppose I have a textual target that looks like this:
Johnny was really named for his 1234 grandfather, John Hugenot, but his T5677 id was JH6781 and his little brother's HG766 id was RB1223.
Using this RegExp: \s[A-Z][A-Z]\d\d\d\d\s, how would I extract, individually, the first and second occurrences of the matching strings? "JH6781" and "RB1223", respectively. I guarantee that the matching string will appear exactly twice in the target text.
Note: I do NOT want to change the existing string at all, so str_replace() is not an option.
Erm... how about using this regex:
/\b[A-Z]{2}\d{4}\b/
It means 'match boundary of a word, followed by exactly two capital English letters, followed by exactly four digits, followed by a word boundary'. So it won't match 'TGX7777' (word boundary is followed by three letters - pattern match failed), and it won't match 'TX77777' (four digits are followed by another digit - fail again).
And that's how it can be used:
$str = "Johnny was really named for his 1234 grandfather, John Hugenot, but his T5677 id was JH6781 and his little brother's HG766 id was RB1223.";
preg_match_all('/\b[A-Z]{2}\d{4}\b/', $str, $matches);
var_dump($matches[0]);
// array
// 0 => string 'JH6781' (length=6)
// 1 => string 'RB1223' (length=6)
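The two rejected cases described above can be checked directly. This is only a small sketch; the sample strings are the ones named in the explanation, plus one valid code for comparison.

```php
<?php
$pattern = '/\b[A-Z]{2}\d{4}\b/';
$samples = ['JH6781', 'TGX7777', 'TX77777'];

$results = [];
foreach ($samples as $sample) {
    // preg_match() returns 1 on a match and 0 when nothing matches.
    $results[$sample] = preg_match($pattern, $sample);
}
print_r($results);
// JH6781  => 1
// TGX7777 => 0 (three letters after the word boundary)
// TX77777 => 0 (a fifth digit where a word boundary is required)
```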
$s='Johnny was really named for his 1234 grandfather, John Hugenot, but his T5677 id was JH6781 and his little brother\'s HG766 id was RB1223.';
$n=preg_match_all('/\b[A-Z][A-Z]\d\d\d\d\b/',$s,$m);
gives the result $n=2, then
print_r($m);
gives the result
Array
(
[0] => Array
(
[0] => JH6781
[1] => RB1223
)
)
You could use a combination of preg_match with the offset parameter (the 5th) and strpos to select the first and second occurrence.
Alternatively, you could use preg_match_all and just use the first two array entries.
<?php
// $regex and $subject are assumed to be defined already.
preg_match($regex, $subject, $match);
$first = $match[0];
// Resume the search one byte past the start of the first match.
preg_match($regex, $subject, $match, 0, strpos($subject, $first) + 1);
$second = $match[0];
?>

PHP Regex pulling text after period and before space

I'm attempting to pull a certain part out of various strings, and am having a really hard time getting the correct regex to do so. Here are a few examples of what I am trying to pull from:
AG055.MA - MAGNUM (Want to return just MA)
WI460.16 - SOMETHING (Want to return 16)
AG055.QB (Want to return QB)
So basically, I just want to pull the characters after the period, but before the space. Nothing else before or after. Can someone give me a hand with getting the correct regex?
This should work:
<?php
preg_match( '/\.([^ ]+)/', $text, $matches );
print_r( $matches );
?>
Output:
Array
(
[0] => .MA
[1] => MA
)
Array
(
[0] => .16
[1] => 16
)
Array
(
[0] => .QB
[1] => QB
)
The regex is saying find a . character, then get any characters after it that are not a space character. The + makes it only return matches where there is a non-space character after the dot.
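To reproduce the three outputs above in one run, the same pattern can be applied in a loop. A minimal sketch, where $samples and $codes are names introduced here for illustration:

```php
<?php
$samples = ['AG055.MA - MAGNUM', 'WI460.16 - SOMETHING', 'AG055.QB'];

$codes = [];
foreach ($samples as $text) {
    // Group 1 holds everything after the dot up to the next space (or the end).
    if (preg_match('/\.([^ ]+)/', $text, $matches)) {
        $codes[] = $matches[1];
    }
}
print_r($codes); // MA, 16, QB
```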
preg_match('/\w+\.(\w{2})\s/', $input, $matches);
echo $matches[1];
\w+ means 1 or more word characters (a-z, A-Z, 0-9 and the underscore).
\. means the period/dot (the backslash is to escape it, because . is used as an operator in regex)
(\w{2}) matches 2 word characters
\s means whitespace
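One caveat with the trailing \s: it requires a whitespace character after the two captured characters, so a string that ends right after them, such as AG055.QB from the question, does not match. A possible variant (an assumption, not part of the answer) replaces \s with a lookahead that also accepts the end of the string:

```php
<?php
$strict  = '/\w+\.(\w{2})\s/';        // pattern from the answer
$relaxed = '/\w+\.(\w{2})(?=\s|$)/';  // assumed variant: whitespace OR end of string

$input = 'AG055.QB';

$strictHit = preg_match($strict, $input);               // 0: nothing follows "QB"
$relaxedHit = preg_match($relaxed, $input, $relaxedMatch);

var_dump($strictHit);
echo $relaxedMatch[1], "\n"; // QB
```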
preg_match('/^[A-Z0-9]{5}\.([A-Z0-9]{2})/', $string, $matches);
var_dump($matches);
Should return the characters in $matches[1].

How to get all captures of subgroup matches with preg_match_all()? [duplicate]

Update/Note:
I think what I'm probably looking for is to get the captures of a group in PHP.
Referenced: PCRE regular expressions using named pattern subroutines.
(Read carefully:)
I have a string that contains a variable number of segments (simplified):
$subject = 'AA BB DD '; // could be 'AA BB DD CC EE ' as well
I would like now to match the segments and return them via the matches array:
$pattern = '/^(([a-z]+) )+$/i';
$result = preg_match_all($pattern, $subject, $matches);
This will only return the last match for the capture group 2: DD.
Is there a way that I can retrieve all subpattern captures (AA, BB, DD) with one regex execution? Isn't preg_match_all suitable for this?
This question is a generalization.
Both the $subject and $pattern are simplified. Naturally, with such a simple subject the list AA, BB, ... is much easier to extract with other functions (e.g. explode) or with a variation of the $pattern.
But I'm specifically asking how to return all of the subgroup matches with the preg_...-family of functions.
For a real-life case, imagine you have multiple (nested) levels with a variable number of subpattern matches.
Example
This is an example in pseudo code to describe a bit of the background. Imagine the following:
Regular definitions of tokens:
CHARS := [a-z]+
PUNCT := [.,!?]
WS := [ ]
$subject gets tokenized based on these. The tokenization is stored in an array of tokens (type, offset, ...).
That array is then transformed into a string, containing one character per token:
CHARS -> "c"
PUNCT -> "p"
WS -> "s"
So that it's now possible to run regular expressions based on tokens (and not character classes etc.) on the token stream string index. E.g.
regex: (cs)?cp
to express one or more group of chars followed by a punctuation.
As I now can express self-defined tokens as regex, the next step was to build the grammar. This is only an example, this is sort of ABNF style:
words = word | (word space)+ word
word = CHARS+
space = WS
punctuation = PUNCT
If I now compile the grammar for words into a (token) regex I would like to have naturally all subgroup matches of each word.
words = (CHARS+) | ( (CHARS+) WS )+ (CHARS+) # words resolved to tokens
words = (c+)|((c+)s)+c+ # words resolved to regex
I could code up to this point. Then I ran into the problem that the subgroup matches only contained their last match.
So I have the option either to create an automaton for the grammar on my own (which I would like to avoid, to keep the grammar expressions generic) or to somehow make preg_match work for me so I can spare myself that.
That's basically all. It is probably understandable now why I simplified the question.
Related:
pcrepattern man page
Get repeated matches with preg_match_all()
Similar thread: Get repeated matches with preg_match_all()
Check the chosen answer there; mine might be useful too, so I duplicate it here:
From http://www.php.net/manual/en/regexp.reference.repetition.php :
When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration.
Personally, I gave up and am going to do this in two steps.
EDIT:
I see that in that other thread someone claimed that a lookbehind method can do it.
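For illustration, here is one way the two-step approach could look, using the simplified $subject and $pattern from the question. It is only a sketch: the first call shows the behaviour quoted from the manual, and the validate-then-collect split is an assumption about what the two steps would be.

```php
<?php
$subject = 'AA BB DD ';

// A repeated capturing group keeps only its final iteration.
preg_match('/^(([a-z]+) )+$/i', $subject, $m);
echo $m[2], "\n"; // DD

// Step 1: validate the overall shape with the anchored pattern.
// Step 2: collect every segment with a separate, unanchored pattern.
$segments = [];
if (preg_match('/^(?:[a-z]+ )+$/i', $subject)) {
    preg_match_all('/[a-z]+/i', $subject, $all);
    $segments = $all[0];
}
print_r($segments); // AA, BB, DD
```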
Try this:
preg_match_all("'[^ ]+'i",$text,$n);
$n[0] will contain an array of all non-space character groups in the text.
Edit: with subgroups:
preg_match_all("'([^ ]+)'i",$text,$n);
Now $n[1] will contain the subgroup matches, that are exactly the same as $n[0]. This is pointless actually.
Edit2: nested subgroups example:
$test = "Hello I'm Joe! Hi I'm Jane!";
preg_match_all("/(H(ello|i)) I'm (.*?)!/i",$test,$n);
And the result:
Array
(
[0] => Array
(
[0] => Hello I'm Joe!
[1] => Hi I'm Jane!
)
[1] => Array
(
[0] => Hello
[1] => Hi
)
[2] => Array
(
[0] => ello
[1] => i
)
[3] => Array
(
[0] => Joe
[1] => Jane
)
)
Is there a way that I can retrieve all matches (AA, BB, DD) with one regex execution? Isn't preg_match_all suitable for this?
Your current regex seems to be for a preg_match() call. Try this instead:
$pattern = '/[a-z]+/i';
$result = preg_match_all($pattern, $subject, $matches);
Per comments, the ruby regex I mentioned:
sentence = %r{
(?<subject> cat | dog ){0}
(?<verb> eats | drinks ){0}
(?<object> water | bones ){0}
(?<adjective> big | smelly ){0}
(?<obj_adj> (\g<adjective>\s)? ){0}
The\s\g<obj_adj>\g<subject>\s\g<verb>\s\g<obj_adj>\g<object>
}x
md = sentence.match("The cat drinks water");
md = sentence.match("The big dog eats smelly bones");
But I think you'll need a lexer/parser/tokenizer to do the same kind of thing in PHP. :-|
You can't extract the subpatterns because the way you wrote your regex returns only one match (using ^ and $ at the same time, and + on the main pattern).
If you write it this way, you'll see that your subgroups are correctly there:
$pattern = '/(([a-z]+) )/i';
(this still has an unnecessary set of parentheses, I just left it there for illustration)
Edit
I didn't realize what you had originally asked for. Here is the new solution:
$result = preg_match_all('/[a-z]+/i', $subject, $matches);
$resultArr = ($result) ? $matches[0] : array();
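The unanchored pattern accepts segments anywhere in the subject. If the segments must also be contiguous from the start of the string, as the anchored pattern in the question implies, the \G anchor may help: it only lets a match begin where the previous one ended. This is a sketch of that idea, not something proposed in the answer above; $consumedAll is a name introduced here.

```php
<?php
$subject = 'AA BB DD ';

// \G pins each match to the end of the previous one (or to the start of
// the subject for the first match), so matching stops at the first gap.
preg_match_all('/\G([a-z]+) /i', $subject, $matches);
print_r($matches[1]); // AA, BB, DD

// The full matches cover the whole subject only if every segment fitted.
$consumedAll = strlen(implode('', $matches[0])) === strlen($subject);
var_dump($consumedAll); // bool(true)
```

Comparing the total matched length with the subject length stands in for the ^...$ anchors, because preg_match_all on its own does not report whether the chain of matches reached the end.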
How about:
$str = 'AA BB CC';
$arr = preg_split('/\s+/', $str);
print_r($arr);
output:
(
[0] => AA
[1] => BB
[2] => CC
)
I may have misunderstood what you're describing. Are you just looking for a pattern for groups of letters with whitespace between?
// any subject containing words:
$subject = 'AfdfdfdA BdfdfdB DdD';
$subject = 'AA BB CC';
$subject = 'Af df dfdA Bdf dfdB DdD';
$pattern = '/(([a-z]+)\s)+[a-z]+/i';
$result = preg_match_all($pattern, $subject, $matches);
print_r($matches);
echo "<br/>";
print_r($matches[0]); // this matches $subject
echo "<br/>".$result;
Yes, you're right: the solution is to use preg_match_all. preg_match_all searches the subject repeatedly, so don't anchor the pattern with ^ (start) and $ (end); that way preg_match_all puts all the matches it finds into an array.
Each new pair of parentheses adds a new array holding that group's matches.
Use ? for optional matches.
You can separate different groups of patterns with parentheses () so that each group is reported in its own array (this lets you count matches or categorize each match in the returned array).
Clarification required
Let me try to understand your question, so that my answer matches what you are asking.
Is your $subject perhaps not a good example of what you're looking for?
Would you like the preg_match search to split what you provided in $subject into 4 categories: words, characters, punctuation and whitespace? And what about numbers?
Would you also like the returned matches to include the offsets of the matches?
Does $subject = 'aa.bb cc.dd EE FFF,GG'; better fit a real-life example?
I will take your basic example in $subject and make it work to give you exactly what you asked for.
So can you edit your $subject so that it better fits all the cases you want to match?
Original '/^(([a-z]+) )+$/i';
Keep me posted,
you can test your regexes here http://www.spaweditor.com/scripts/regex/index.php
Partial answer
/([a-z])([a-z]+)/i
AA BB DD CD
Array
(
[0] => Array
(
[0] => AA
[1] => BB
[2] => DD
[3] => CD
)
[1] => Array
(
[0] => A
[1] => B
[2] => D
[3] => C
)
[2] => Array
(
[0] => A
[1] => B
[2] => D
[3] => D
)
)
