Regex pattern matches fine but output is not complete - php

I am trying this regex pattern:
$string = '<div class="className">AlwaysTheSame:</div>Subtitle <br /><span class="anotherClass">entry1</span><span class="anotherClass">entry2</span><span class="anotherClass">entry3</span>';
preg_match_all('|<div class="className">AlwaysTheSame:</div>(.*?)<br />(<span class="anotherClass">(.*?)</span>)*|', $string, $matches);
print_r($matches);
exit;
The <span class="anotherClass">entry</span> may be absent or may occur multiple times. The pattern seems to match fine in both cases, but the output is:
Array
(
[0] => Array
(
[0] => <div class="className">AlwaysTheSame:</div>Subtitle <br /><span class="anotherClass">entry1</span><span class="anotherClass">entry2</span><span class="anotherClass">entry3</span>
)
[1] => Array
(
[0] => Subtitle
)
[2] => Array
(
[0] => <span class="anotherClass">entry3</span>
)
[3] => Array
(
[0] => entry3
)
)
$matches[0][0] contains the full string, so the pattern is matching everything I need, but in $matches[2] and $matches[3] I only get the last <span...
How can I get all those <span... in the output array and not just the last one?

You can't directly, at least not in PHP. A repeated capturing group keeps only the last text it matched. The exception is .NET, where regex matches have an additional property that allows you to access every single match of a repeated group. Perl 6 can also do something like this, but not PHP.
Solution: Use
~<div class="className">AlwaysTheSame:</div>(.*?)<br />((?:<span class="anotherClass">(.*?)</span>)*)~
Now the second capturing group contains all the <span> tags. With another regex you can then extract all the matches:
~(?<=<span class="anotherClass">).*?(?=</span>)~
I'm using ~ as a regex delimiter, by the way - using | is confusing IMO.
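The two steps can be put together roughly as follows. This is a minimal sketch using the sample string from the question; the variable names ($outer, $inner, $entries, $subtitle) are only illustrative.

```php
<?php
// Sample input from the question.
$string = '<div class="className">AlwaysTheSame:</div>Subtitle <br />'
    . '<span class="anotherClass">entry1</span>'
    . '<span class="anotherClass">entry2</span>'
    . '<span class="anotherClass">entry3</span>';

// Step 1: group 2 captures the whole run of <span> tags in one piece.
$outer = '~<div class="className">AlwaysTheSame:</div>(.*?)<br />((?:<span class="anotherClass">(.*?)</span>)*)~';

// Step 2: pull each entry out of that run.
$inner = '~(?<=<span class="anotherClass">).*?(?=</span>)~';

$subtitle = null;
$entries = [];
if (preg_match($outer, $string, $m)) {
    $subtitle = trim($m[1]);
    preg_match_all($inner, $m[2], $spans);
    $entries = $spans[0]; // empty array when there are no <span> tags
}

print_r($entries);
```

With the sample string, $entries should hold entry1, entry2 and entry3, and it stays empty when no <span> follows the <br />.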

Related

preg_match_all is only matching one

I am trying to get the value after the dots, and I would like to get all of them (each as their own key/value).
The following is what I am running:
$string = "div.cat.dog#mouse";
preg_match_all("/\.(.+?)(\.|#|$)/", $string, $matches);
and when I do a dump of $matches I am getting this:
Array
(
[0] => Array
(
[0] => .cat.
)
[1] => Array
(
[0] => cat
)
[2] => Array
(
[0] => .
)
)
Item [1] only contains one value. I was expecting it to contain two items in this case, cat and dog. Why isn't dog picked up by preg_match_all?
Use lookahead:
\.(.+?)(?=\.|#|$)
The problem with your regex is that it consumes the DOT on the left of the match and also a DOT, HASH, or end of input on the right. After the first match the internal pointer has moved past that second DOT, so no DOT is left to start the match for the next word.
(?=\.|#|$) is a positive lookahead: it checks that one of these follows but doesn't consume it, so the pointer stays right after cat, in front of the DOT, and that DOT can start the next match.
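A small side-by-side run makes the difference visible. This is only a sketch on the string from the question; $consuming and $lookahead are illustrative names.

```php
<?php
$string = "div.cat.dog#mouse";

// Consuming version: the trailing "." becomes part of the first match,
// so "dog" no longer has a leading dot to start from.
preg_match_all('/\.(.+?)(\.|#|$)/', $string, $consuming);

// Lookahead version: the terminator is checked but not consumed.
preg_match_all('/\.(.+?)(?=\.|#|$)/', $string, $lookahead);

print_r($consuming[1]); // only cat
print_r($lookahead[1]); // cat and dog
```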

how not to match pattern inside parenthesis in preg_match

I'm trying to get the src of every image tag on a web page, but I'm confused about how to keep the pattern inside the parentheses out of the results. In this case it is gif|jpg|png|jpeg.
$img_src_pattern = '/src="?.+\.(gif|jpg|png|jpeg)"/';
preg_match_all($img_src_pattern, $contents, $img_matches);
So when printing out $img_matches I get an array like this:
Array (
[0] => Array (
[0] => src="http://s9.addthis.com/button1-bm.gif"
[1] => src="http://s9.addthis.com/button1-bm.gif" )
[1] => Array ( [0] => gif [1] => gif )
)
And here's what I want to get:
Array (
[0] => Array (
[0] => src="http://s9.addthis.com/button1-bm.gif"
[1] => src="http://s9.addthis.com/button1-bm.gif" )
)
This is really the part of preg_match that confuses me. Can you enlighten me on this?
You can just ignore it, since it belongs to another index in the array.
Or you can change the capturing group (pattern) to non-capturing group (?:pattern):
'/src="?.+\.(?:gif|jpg|png|jpeg)"/'
Your current regex, apart from finding a match to the whole regex, also "captures" (i.e. remembers) the text matched by the regex gif|jpg|png|jpeg, due to the effect of capturing group () surrounding it. Non-capturing group will retain the grouping property, but will not capture the text matched by the sub-expression gif|jpg|png|jpeg.
preg_match_all outputs a two-dimensional array by default (PREG_PATTERN_ORDER): the first dimension is the capturing group (index 0 holds the text matched by the whole regex), and the second dimension is the index of each match found.
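To see the effect on the shape of the result, both patterns can be run against the same input. The page fragment below is hypothetical (one tag per line, example.com URLs); only the two patterns come from the question and the answer.

```php
<?php
// Hypothetical page fragment, one <img> tag per line.
$contents = <<<HTML
<img src="http://example.com/a.gif" alt="a">
<img src="http://example.com/b.png" alt="b">
HTML;

// Capturing group: index 0 (whole match) plus index 1 (the extension).
preg_match_all('/src="?.+\.(gif|jpg|png|jpeg)"/', $contents, $capturing);

// Non-capturing group: only index 0 remains.
preg_match_all('/src="?.+\.(?:gif|jpg|png|jpeg)"/', $contents, $nonCapturing);

echo count($capturing), "\n";    // number of first-dimension entries
echo count($nonCapturing), "\n";
print_r($nonCapturing[0]);
```

The whole matches in index 0 are the same either way; the non-capturing version simply has no index 1.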

Regular expression repeat inside a pattern

I have the following text and I would like to preg_match_all what is within the {'s and }'s if it contains only a-zA-Z0-9 and :
some text,{SOMETHING21} {SOMETHI32NG:MORE}some msdf{TEXT:GET:2}sdfssdf sdf sdf
I am trying to match {SOMETHING21} {SOMETHI32NG:MORE} {TEXT:GET:2} there can be several :'s within the tag.
What I currently have is:
preg_match_all('/\{([a-zA-Z0-9\-]+)(\:([a-zA-Z0-9\-]+))*\}/', $from, $matches, PREG_SET_ORDER);
It works as expected for {SOMETHING21} and {SOMETHI32NG:MORE} but for {TEXT:GET:2} it only matches TEXT and 2
So it only matches the first and last word within the tag, and leaves the middle ones out of the $matches array. Is this even possible or should I just match them and then explode on : ?
-- edit --
Well the question isn't if I can get the tags, the question is if I can get them grouped without having to explode the results again. Even though my current regex finds all the results the subpattern does not come back with all the matches in $matches.
I hope the following will clear it up a bit more:
\{ // the match has to start with {
([a-zA-Z0-9\-]+) // after the { the match needs one or more alphanumeric characters
(
\: // if we have : it should be followed by one or more alphanumeric characters
([a-zA-Z0-9\-]+) // <---- !! this is what it is about !! even though this subexpression is between brackets, it is not put into $matches if more than one of these is found
)* // there could be zero or more of the previous subexpression
\} // the match has to end with }
You can't get all the matched values of a capturing group, you only get the last one.
So you have to match the pattern:
preg_match_all('/{([a-z\d-]+(?::[a-z\d-]+)*)}/i', $from, $matches);
and then split each element in $matches[1] on :.
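Put together, matching and then splitting might look like this. It is a sketch on the sample text from the question; $tags is an illustrative name.

```php
<?php
$from = "some text,{SOMETHING21} {SOMETHI32NG:MORE}some msdf{TEXT:GET:2}sdfssdf sdf sdf";

// Group 1 captures the complete colon-separated list inside the braces.
preg_match_all('/{([a-z\d-]+(?::[a-z\d-]+)*)}/i', $from, $matches);

// Split each captured list on ":" to get the individual parts.
$tags = [];
foreach ($matches[1] as $body) {
    $tags[] = explode(':', $body);
}

print_r($tags);
```

For {TEXT:GET:2} this yields TEXT, GET and 2 as separate elements, including the middle one that the repeated group dropped.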
I used non-capture groupings to eliminate the inner groups, and just capture the outer complete colon-separated list.
$from = "some text,{SOMETHING21} {SOMETHI32NG:MORE}some msdf{TEXT:GET:2}sdfssdf sdf sdf";
preg_match_all('/\{((?:[a-zA-Z0-9\-]+)(?:\:(?:[a-zA-Z0-9\-]+))*)\}/', $from, $matches, PREG_SET_ORDER);
print_r($matches);
Result:
Array
(
[0] => Array
(
[0] => {SOMETHING21}
[1] => SOMETHING21
)
[1] => Array
(
[0] => {SOMETHI32NG:MORE}
[1] => SOMETHI32NG:MORE
)
[2] => Array
(
[0] => {TEXT:GET:2}
[1] => TEXT:GET:2
)
)
Maybe I didn't understand the requirement, but...
preg_match_all('/{[A-Za-z0-9:-]+}/', $from, $matches, PREG_PATTERN_ORDER);
results in:
Array
(
[0] => Array
(
[0] => {SOMETHING21}
[1] => {SOMETHI32NG:MORE}
[2] => {TEXT:GET:2}
)
)

Regular Expression with wordpress shortcodes

I'm trying to find all shortcodes within a string which looks like this:
 [a_col] One
 [/a_col]
outside
[b_col]
Two
[/b_col] [c_col] Three [/c_col]
I need the content (eg "Three") and the letter from the col (a, b or c)
Here's the expression I'm using
preg_match_all('#\[(a|b|c)_col\](.*)\[\/\1_col\]#m', $string, $hits);
but $hits contains only the last one.
The content can have any character even "[" or "]"
EDIT:
I would like to get "outside" as well which can be any string (except these cols). How can I handle that or should I parse this in a second step?
This will capture anything in the content, as well as attributes, and will allow any characters in the content.
<?php
$input = '[a_col some="thing"] One[/a_col]
[b_col] Two [/b_col]
[c_col] [Three] [/c_col] ';
preg_match_all('#\[(a|b|c)_col([^\[]*)\](.*?)\[\/\1_col\]#msi', $input, $matches);
print_r($matches);
?>
EDIT:
You may want to then trim the matches, since it appears there may be some whitespace. Alternatively, you can use regex for removing the whitespace in the content:
preg_match_all('#\[(a|b|c)_col([^\[]*)\]\s*(.*?)\s*\[\/\1_col\]#msi', $input, $matches);
OUTPUT:
Array
(
[0] => Array
(
[0] => [a_col some="thing"] One[/a_col]
[1] => [b_col] Two [/b_col]
[2] => [c_col] [Three] [/c_col]
)
[1] => Array
(
[0] => a
[1] => b
[2] => c
)
[2] => Array
(
[0] => some="thing"
[1] =>
[2] =>
)
[3] => Array
(
[0] => One
[1] => Two
[2] => [Three]
)
)
It might also be helpful to use this for capturing the attribute names and values stored in $matches[2]. Consider $atts to be the first element in $matches[2]. Of course, you would iterate over the array of attributes and perform this on each.
preg_match_all('#([^="\'\s]+)[\t ]*=[\t ]*("|\')(.*?)\2#', $atts, $att_matches);
This gives an array where the names are stored in $att_matches[1] and their corresponding values are stored in $att_matches[3].
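For illustration, the names and values can then be combined into one associative array. The attribute string below is hypothetical (it mixes double and single quotes to exercise the \2 backreference), and $parsed is an illustrative name.

```php
<?php
// Hypothetical attribute string, as it might appear in $matches[2][0].
$atts = ' some="thing" other=\'value\'';

preg_match_all('#([^="\'\s]+)[\t ]*=[\t ]*("|\')(.*?)\2#', $atts, $att_matches);

// Names are in group 1, values in group 3; pair them up.
$parsed = array_combine($att_matches[1], $att_matches[3]);

print_r($parsed);
```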
Use ((.|\n)*) instead of (.*) to capture multiple lines:
<?php
$string = "
[a_col] One
[/a_col]
[b_col]
Two
[/b_col] [c_col] Three [/c_col]";
preg_match_all('#\[(a|b|c)_col\]((.|\n)*)\[\/\1_col\]#m', $string, $hits);
echo "<textarea style='width:90%;height:90%;'>";
print_r($hits);
echo "</textarea>";
?>
I don't have an environment I can test with here, but you could use lookbehind and lookahead assertions and a backreference to match the tags around the content. Something like this:
(?<=\[(\w)\]).*(?=\[\/\1\])

Get position of all matches in group

Consider the following example:
$target = 'Xa,a,aX';
$pattern = '/X((a),?)*X/';
$matches = array();
preg_match_all($pattern,$target,$matches,PREG_OFFSET_CAPTURE|PREG_PATTERN_ORDER);
var_dump($matches);
It returns only the last 'a' in the series, but I need all of them.
In particular, I need the position of each 'a' in the string separately, hence PREG_OFFSET_CAPTURE.
The real case is much more complex; see the related question: pattern matching an array, not their elements per se
Thanks
You get a single match because the regex X((a),?)*X matches the entire string, and the groups keep only the last ((a),?) iteration.
What you want to match is an a that is preceded by an X at the start of the string, or is followed by a comma, or is followed by an X at the end of the string.
$target = 'Xa,a,aX';
$pattern = '/(?<=^X)a|a(?=X$|,)/';
preg_match_all($pattern, $target, $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => Array
(
[0] => a
[1] => 1
)
[1] => Array
(
[0] => a
[1] => 3
)
[2] => Array
(
[0] => a
[1] => 5
)
)
)
When your regex includes X, it matches once. It finds one large match with groups in it. What you want is many matches, each with its own position.
So, in my opinion, the best you can do is simply search for /a/ or /a,?/ without any X. Then $matches[0] will contain all appearances of 'a'.
If you need them between X, pre-select this part of the string.
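One way to pre-select the part between the X delimiters and still get positions in the original string is sketched below. The outer pattern and the names $base, $inner and $positions are assumptions made for this example; offsets found in the inner part are shifted by the offset at which that part starts.

```php
<?php
$target = 'Xa,a,aX';

$positions = [];

// Step 1: pre-select the part between the X delimiters and note where it starts.
if (preg_match('/X((?:a,?)*)X/', $target, $outer, PREG_OFFSET_CAPTURE)) {
    [$inner, $base] = $outer[1];

    // Step 2: find every "a" in that part; these offsets are relative to $inner.
    preg_match_all('/a/', $inner, $hits, PREG_OFFSET_CAPTURE);

    foreach ($hits[0] as [$text, $offset]) {
        $positions[] = $base + $offset; // translate back to an offset in $target
    }
}

print_r($positions);
```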
