PHP preg_match_all $matches output contains 3 rows - php

Here is my test code:
$test = '#12345 abc #12 #abd engng#geneng';
preg_match_all('/(^|\s)#([^# ]+)/', $test, $matches);
print_r($matches);
And the output $matches:
Array ( [0] => Array ( [0] => #12345 [1] => #12 [2] => #abd ) [1] => Array ( [0] => [1] => [2] => ) [2] => Array ( [0] => 12345 [1] => 12 [2] => abd ) )
My question is why does it have an empty row?
[1] => Array ( [0] => [1] => [2] => )
If I get rid of (^|\s) in the regex, the second row disappears. However, I would then not be able to prevent matching #geneng.
Any answer will be appreciated.

The group (^|\s) in your regular expression is a capturing group, so whatever it matches (an empty string at the start of the input, or a single whitespace character) is stored in $matches[1]. That is the row that looks empty. You can avoid it by using a lookaround. In this case, a positive lookbehind works:
preg_match_all('/(?<=^|\s)#([^# ]+)/', $test, $matches);
This matches # and the text after it only if the # is preceded by whitespace or sits at the beginning of the string. It's important to note that lookarounds do not consume characters. They only assert that the match is preceded or followed by something.
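A minimal sketch of the lookbehind variant, run against the test string from the question. The variable names are the asker's; the expected values in the comments follow from the pattern above.

```php
<?php
$test = '#12345 abc #12 #abd engng#geneng';

// The lookbehind only asserts "start of string or whitespace before #".
// It is not a capture group, so $matches has two rows instead of three.
preg_match_all('/(?<=^|\s)#([^# ]+)/', $test, $matches);

print_r($matches[0]); // whole matches: #12345, #12, #abd (no leading space)
print_r($matches[1]); // group 1: 12345, 12, abd
```

Because the lookbehind does not consume the whitespace, the whole matches in $matches[0] start at the # itself.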

It's because the parentheses around (^|\s) capture whatever they match:
preg_match_all('/(^|\s)#([^# ]+)/', $test, $matches);
                 ^^^^^^
It is stored as capture group 1. To avoid that, you can simply use non-capturing parentheses:
preg_match_all('/(?:^|\s)#([^# ]+)/', $test, $matches);
                  ^^
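A short sketch of what the non-capturing version returns for the question's test string. One thing worth noticing: the whitespace is still consumed, so it shows up in the whole matches even though it is no longer captured.

```php
<?php
$test = '#12345 abc #12 #abd engng#geneng';

// (?:...) groups the alternation without creating a capture group.
preg_match_all('/(?:^|\s)#([^# ]+)/', $test, $matches);

// $matches[0] holds the whole matches, including the consumed leading space:
//   ['#12345', ' #12', ' #abd']
// $matches[1] is now the tag text, since it is the only capture group left.
$tags = $matches[1];
print_r($tags); // 12345, 12, abd
```

If only the tag text is needed, reading $matches[1] is enough and the leading space in $matches[0] does not matter.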

By default, preg_match_all uses the PREG_PATTERN_ORDER flag. This means that you will obtain:
$matches[0] -> all substrings that match the whole pattern
$matches[1] -> all capture groups 1
$matches[2] -> all capture groups 2
etc.
You can change this behavior using the PREG_SET_ORDER flag:
$matches[0] -> array with the whole pattern and the capture groups for the first result
$matches[1] -> same for the second result
$matches[2] -> etc.
In your code (PREG_PATTERN_ORDER by default), $matches[1] contains only empty or whitespace items because it holds the content of capture group 1, (^|\s).
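To make the difference between the two flags concrete, here is a small sketch using the question's original pattern and test string. The variable names $byPattern and $bySet are made up for this illustration.

```php
<?php
$test = '#12345 abc #12 #abd engng#geneng';
$pattern = '/(^|\s)#([^# ]+)/';

// Default: one row per group (whole match, group 1, group 2).
preg_match_all($pattern, $test, $byPattern, PREG_PATTERN_ORDER);

// PREG_SET_ORDER: one row per match, each holding whole match + groups.
preg_match_all($pattern, $test, $bySet, PREG_SET_ORDER);

foreach ($bySet as $set) {
    // $set[1] is '' for the first match (start of string) and ' ' for the
    // others; $set[2] is the tag text.
    echo $set[2], "\n";
}
```

Both calls return the same data, only transposed: $byPattern[2][1] and $bySet[1][2] are the same string.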

There are two sets of parentheses, which is why you get the extra row. Each set is a capture group, and preg_match_all returns one sub-array per group. Removing one set of parentheses (or making it non-capturing) removes one array.
FYI: in this case you cannot use [^|\s] instead of (^|\s). Inside a character class a leading ^ means negation, so [^|\s] matches one character that is neither | nor whitespace.
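A quick sketch of what goes wrong with the character class, using the question's test string. The variable name $wrong is only for illustration.

```php
<?php
$test = '#12345 abc #12 #abd engng#geneng';

// [^|\s] is a negated class: it requires ONE character before '#'
// that is neither '|' nor whitespace.
preg_match_all('/[^|\s]#([^# ]+)/', $test, $wrong);

print_r($wrong[0]); // only 'g#geneng'
print_r($wrong[1]); // only 'geneng'
```

So the class version matches exactly the tag that should be excluded and misses the three that should be found.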

Related

Regular expression to extract a numeric value on a changing position within a variable string

How can I extract the bold numeric part of a string, when the most of the string can change? /data/ is always present and followed by the relevant, variable, numeric part (in this case 123456).
differentcontentLocationhttps://example.com/api/result/13548/data/123456differentstuffincludingwhitespacesandnewlines8484
$str = "differentcontentLocationhttps://example.com/api/result/13548/data/123456differentstuffincludingwhitespacesandnewlines8484";
$str2 = "differentcontentLocationhttps://example.com/api/result/13548/data/123456";
In this example I need 123456. The only constant parts in the string are /data/ and maybe the first part of the URL, like https://.
preg_match("#/data/([0-9]+)([^0-9]+)#siU", $str, $matches);
Results in Array ( [0] => /data/123456d [1] => 123456 [2] => d ), which would be acceptable. But if nothing follows the relevant numeric part, as in $str2, this expression fails. I've tried to make the trailing part optional with preg_match("#/ads/([0-9]+)(([^0-9]+)?)#siU", $x, $matches);, but it fails too, returning only the first digit of the numeric part.
The U modifier swaps greediness, so every greedy subpattern becomes lazy here. You should remove it together with ([^0-9]+). You also do not need the DOTALL modifier, because there is no . in your pattern whose behavior the s flag could change.
preg_match("#/data/([0-9]+)#i", $str, $matches);
Now, the pattern will match:
/data/ - a sequence of literal chars
([0-9]+) - Group 1 capturing 1+ digits (same as (\d+))
See the PHP demo.
$str = "differentcontentLocationhttps://e...content-available-to-author-only...e.com/api/result/13548/data/123456differentstuffincludingwhitespacesandnewlines8484";
$str2 = "differentcontentLocationhttps://e...content-available-to-author-only...e.com/api/result/13548/data/123456";
preg_match("#/data/([0-9]+)#i", $str, $matches);
print_r($matches); // Array ( [0] => /data/123456 [1] => 123456 )
preg_match("#/data/([0-9]+)#i", $str2, $matches2);
print_r($matches2); // Array ( [0] => /data/123456 [1] => 123456 )
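The effect of the U modifier on its own can be seen in a small sketch with $str2 from the question. The names $lazy and $greedy are made up for this comparison.

```php
<?php
$str2 = "differentcontentLocationhttps://example.com/api/result/13548/data/123456";

// With U, [0-9]+ becomes lazy: it stops after the first digit because
// nothing after the group forces it to take more.
preg_match('#/data/([0-9]+)#U', $str2, $lazy);

// Without U, [0-9]+ is greedy and takes the whole run of digits.
preg_match('#/data/([0-9]+)#', $str2, $greedy);

echo $lazy[1], "\n";   // 1
echo $greedy[1], "\n"; // 123456
```

This is the "only the first number" behavior described in the question: once the trailing part is optional, a lazy quantifier has no reason to consume more than one digit.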

Exclude character from being returned in array

I have the following regex function:
function getMatches($string_content) {
$matches = array();
preg_match_all('/#([A-Za-z0-9_]+)/', $string_content, $matches);
return $matches;
}
Right now, it returns an array like this:
Array (
[0] => Array (
[0] => #test
[1] => #test2
)
[1] => Array (
[0] => test
[1] => test2
)
)
How can I make it only return the matches without the # symbol?
Return $matches[1] instead of $matches.
That will give you the first capture group instead of all matches.
With this small tweak (you can inspect the matches in the regex demo):
preg_match_all('~#\K\w+~', $string_content, $matches);
Explanation
In your original regex, the parentheses around ([A-Za-z0-9_]+) create a capture group. This is why the array contains a second element with index #1: this element contains the Group 1 captures.
\w is equivalent to [A-Za-z0-9_]
The \K tells the engine to drop what was matched so far from the final match it returns. It is usually at least as efficient as using a lookbehind (?<=#).
The ~ is just a small aesthetic tweak: you can use any delimiter you like around your regex pattern.
Just use \K in your regex to keep # out of the final result; then you don't need to capture anything:
preg_match_all('~#\K[A-Za-z0-9_]+~', $string_content, $matches);
OR
Use a lookbehind,
preg_match_all('~(?<=#)[A-Za-z0-9_]+~', $string_content, $matches);
Explanation:
(?<=#) Asserts that the match starts immediately after a # symbol, without including the # in the match.
[A-Za-z0-9_]+ Matches one or more word characters.
You don't need any change in your regular expression, simply refer to capturing group #1, which would be $matches[1] to print the match result from your capturing group, excluding # from your array matches.
Your code would look like this:
function getMatches($string_content) {
preg_match_all('/#([A-Za-z0-9_]+)/', $string_content, $matches);
return $matches[1];
}
print_r(getMatches('foo bar #test baz #test2 quz'));
Output
Array
(
[0] => test
[1] => test2
)
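For comparison, here is a rough sketch that runs the three approaches from the answers above (capture group, \K, lookbehind) on the same sample input and shows that they yield the same list. The variable names are illustrative only.

```php
<?php
$input = 'foo bar #test baz #test2 quz';

// 1. Capture group: the tags are in row 1.
preg_match_all('/#(\w+)/', $input, $captured);

// 2. \K: resets the start of the reported match, so row 0 has no '#'.
preg_match_all('~#\K\w+~', $input, $kept);

// 3. Lookbehind: asserts '#' before the match without consuming it.
preg_match_all('~(?<=#)\w+~', $input, $behind);

print_r($captured[1]); // test, test2
print_r($kept[0]);     // test, test2
print_r($behind[0]);   // test, test2
```

Which one to pick is mostly a matter of taste here; the capture-group version needs no change to the original regex.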

Regular expression repeat inside a pattern

I have the following text and I would like to preg_match_all what is within the {'s and }'s if it contains only a-zA-Z0-9 and :
some text,{SOMETHING21} {SOMETHI32NG:MORE}some msdf{TEXT:GET:2}sdfssdf sdf sdf
I am trying to match {SOMETHING21} {SOMETHI32NG:MORE} {TEXT:GET:2} there can be several :'s within the tag.
What I currently have is:
preg_match_all('/\{([a-zA-Z0-9\-]+)(\:([a-zA-Z0-9\-]+))*\}/', $from, $matches, PREG_SET_ORDER);
It works as expected for {SOMETHING21} and {SOMETHI32NG:MORE}, but for {TEXT:GET:2} it only captures TEXT and 2.
So it only matches the first and last word within the tag, and leaves the middle ones out of the $matches array. Is this even possible or should I just match them and then explode on : ?
-- edit --
Well the question isn't if I can get the tags, the question is if I can get them grouped without having to explode the results again. Even though my current regex finds all the results the subpattern does not come back with all the matches in $matches.
I hope the following will clear it up a bit more:
\{ // the match has to start with {
([a-zA-Z0-9\-]+) // after the { the match needs one or more alphanumeric characters
(
\: // if we have : it should be followed by one or more alphanumeric characters
([a-zA-Z0-9\-]+) // <---- !! this is what it is about !! even though this subexpression is in parentheses, it is not put into $matches if more than one of these is found
)* // the previous subexpression may occur zero or more times
\} // the match has to end with }
You can't get all the matched values of a capturing group, you only get the last one.
So you have to match the pattern:
preg_match_all('/{([a-z\d-]+(?::[a-z\d-]+)*)}/i', $from, $matches);
and then split each element in $matches[1] on :.
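A minimal sketch of both points, using the sample text from the question: first that a repeated group keeps only its last iteration, then the match-and-split approach. The names $last and $parts are made up for this example.

```php
<?php
$from = "some text,{SOMETHING21} {SOMETHI32NG:MORE}some msdf{TEXT:GET:2}sdfssdf sdf sdf";

// The asker's pattern on one tag: the repeated group only keeps its last
// iteration, so GET is lost and group 3 ends up as '2'.
preg_match('/\{([a-zA-Z0-9\-]+)(\:([a-zA-Z0-9\-]+))*\}/', '{TEXT:GET:2}', $last);

// Capture the whole colon-separated list once, then split each entry.
preg_match_all('/\{([a-z\d-]+(?::[a-z\d-]+)*)\}/i', $from, $matches);
$parts = array_map(function ($tag) {
    return explode(':', $tag);
}, $matches[1]);

print_r($parts); // [SOMETHING21], [SOMETHI32NG, MORE], [TEXT, GET, 2]
```

The explode step is cheap, and it is the usual way around the "last iteration only" limitation.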
I used non-capture groupings to eliminate the inner groups, and just capture the outer complete colon-separated list.
$from = "some text,{SOMETHING21} {SOMETHI32NG:MORE}some msdf{TEXT:GET:2}sdfssdf sdf sdf";
preg_match_all('/\{((?:[a-zA-Z0-9\-]+)(?:\:(?:[a-zA-Z0-9\-]+))*)\}/', $from, $matches, PREG_SET_ORDER);
print_r($matches);
Result:
Array
(
[0] => Array
(
[0] => {SOMETHING21}
[1] => SOMETHING21
)
[1] => Array
(
[0] => {SOMETHI32NG:MORE}
[1] => SOMETHI32NG:MORE
)
[2] => Array
(
[0] => {TEXT:GET:2}
[1] => TEXT:GET:2
)
)
Maybe I didn't understand the requirement, but...
preg_match_all('/{[A-Za-z0-9:-]+}/', $from, $matches, PREG_PATTERN_ORDER);
results in:
Array
(
[0] => Array
(
[0] => {SOMETHING21}
[1] => {SOMETHI32NG:MORE}
[2] => {TEXT:GET:2}
)
)

Match rest of string with regex

I have a string like this
ch:keyword
ch:test
ch:some_text
I need a regular expression which will match all of the strings, however, it must not match the following:
ch: (ch: is followed by a space, or any number of spaces)
ch: (ch: is followed by nothing)
I am able to deduce the length of the string with the 'ch:' in it.
Any help would be appreciated; I am using PHP's preg_match()
Edit: I have tried this:
preg_match("/^ch:[A-Za-z_0-9]/", $str, $matches)
However, this only matches 1 character after the string. I tried putting a * after the closing square bracket, but this matches spaces, which I don't want.
preg_match('/^ch:(\S+)/', $string, $matches);
print_r($matches);
\S+ is for matching 1 or more non-space characters. This should work for you.
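As a rough check of that pattern against the cases from the question, here is a sketch with a hypothetical helper, extractKeyword (not part of the original answer), that returns the text after ch: or null when there is none.

```php
<?php
// Hypothetical helper: returns the keyword after "ch:", or null if the
// string has nothing (or only spaces) after the prefix.
function extractKeyword($string)
{
    return preg_match('/^ch:(\S+)/', $string, $m) === 1 ? $m[1] : null;
}

$inputs = ['ch:keyword', 'ch:some_text', 'ch:', 'ch:   ', 'ch: red'];
foreach ($inputs as $input) {
    var_dump(extractKeyword($input));
}
// 'keyword', 'some_text', then NULL for the last three inputs
```

Note that 'ch: red' is rejected as well, because \S+ has to start directly after the colon.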
Try this regular expression:
^ch:\S.*$
$str = <<<TEXT
ch:keyword
ch:test
ch:
ch:some_text
ch: red
TEXT;
preg_match_all('|ch\:(\S+)|', $str, $matches);
echo '<pre>'; print_r($matches); echo '</pre>';
Output:
Array
(
[0] => Array
(
[0] => ch:keyword
[1] => ch:test
[2] => ch:some_text
)
[1] => Array
(
[0] => keyword
[1] => test
[2] => some_text
)
)
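Try using this (a lookbehind must have a fixed length in PCRE, so it checks for a single preceding space, which also covers runs of spaces):
preg_match('/(?<! )ch:[^ ].*/', $str);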

Two or more matches in expression

Is it possible to make two matches of text - /123/123/123?edit
I need to match 123, 123 ,123 and edit
For the first(123,123,123): pattern is - ([^\/]+)
For the second(edit): pattern is - ([^\?=]*$)
Is it possible to match both in one preg_match_all call, or do I need to call it twice, once for each pattern?
Thanks !
You can do this with a single preg_match_all call:
$string = '/123/123/123?edit';
$matches = array();
preg_match_all('#(?<=[/?])\w+#', $string, $matches);
/* $matches will be:
Array
(
[0] => Array
(
[0] => 123
[1] => 123
[2] => 123
[3] => edit
)
)
*/
See this in action at http://www.ideone.com/eb2dy
The pattern ((?<=[/?])\w+) uses a lookbehind to assert that either a slash or a question mark must precede a sequence of word characters (\w is a shorthand class equivalent to [A-Za-z0-9_]).
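A small sketch of the same pattern on a mixed-case input, to show that \w also covers upper-case letters. The input string here is a made-up example, not the one from the question.

```php
<?php
// Hypothetical input with upper- and lower-case segments.
$string = '/abc/DEF?Edit';

// Each match must be directly preceded by '/' or '?'.
preg_match_all('#(?<=[/?])\w+#', $string, $matches);

print_r($matches[0]); // abc, DEF, Edit
```

Since the lookbehind consumes nothing, the separators never appear in the results and no capture group is needed.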
