regex extract/replace values from xml-like tags via named (sub)groups - php

Trying to create a simple text-translator in PHP.
It should match something like:
Bla bla {translator id="TEST" language="de"/}
The language can be optional
Blabla <translator id="TEST"/>
Here is the code:
$result = preg_replace_callback(
'#{translator(\s+(?\'attribute\'\w+)="(?\'value\'\w+)")+/}#i',
array($this, 'translateTextCallback'),
$aText
);
It extracts the "attributes", but only the last one is returned. My first thought was that this is caused by the group naming, with PHP overwriting the (named) array elements on every match. But when I leave out the group names, it still returns only the last match.
Here is an array as returned to the callback as example
Array
(
[0] => {translator id="TEST" language="de"/}
[1] => language="de"
[attribute] => language
[2] => language
[value] => de
[3] => de
)

When you repeat a capturing group, you only get what its last iteration matched. There is no way around this. You need to capture the whole set of attribute/value pairs in one group and then parse that string in code.
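A minimal sketch of that two-step approach, assuming the tag format from the question. The helper name parseTranslatorAttributes and the bracketed replacement text are hypothetical, chosen only for illustration; the real callback would look up a translation instead.

```php
<?php
// Hypothetical helper: turns ' id="TEST" language="de"' into ['id' => 'TEST', 'language' => 'de'].
function parseTranslatorAttributes(string $attributeText): array
{
    // Second step: match every attribute/value pair inside the captured text.
    preg_match_all('#(\w+)="(\w+)"#', $attributeText, $pairs, PREG_SET_ORDER);
    $attributes = [];
    foreach ($pairs as $pair) {
        $attributes[$pair[1]] = $pair[2];
    }
    return $attributes;
}

$text = 'Bla bla {translator id="TEST" language="de"/}';

// First step: one group captures the whole attribute list; the repeated part is non-capturing.
$result = preg_replace_callback(
    '#\{translator((?:\s+\w+="\w+")+)\s*/\}#i',
    function (array $match): string {
        $attributes = parseTranslatorAttributes($match[1]);
        $id = $attributes['id'] ?? '';
        $language = $attributes['language'] ?? 'default'; // the language is optional
        return '[' . $id . ':' . $language . ']';         // placeholder for the real translation
    },
    $text
);

echo $result, "\n";
```

The outer pattern only decides where a tag starts and ends; all knowledge about individual attributes lives in the helper, so adding a new attribute does not require touching the outer regex.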

Related

Split Out Shortcode Bracket from JSON Array Bracket With Compatibility

Generally, I have a hard time with regex patterns beyond basic matching. I have a simplistic shortcode parser, similar to the one WordPress has. An example shortcode is:
~PLUGIN::name_of_plugin["param1"="value1","param2"="value2"]~
I have this splitting and working: I just use explode() on the [ and trim the ] to get the parameters portion. Now I want to pass a nested array, so I am trying to put a JSON object inside those brackets instead. The string to parse would look like this:
$str = '~PLUGIN::name_of_plugin[{
"category":"whatever",
"test":[{
"name":"Title Something",
"desc":"123123-A",
"result":"Confirmation",
"conforms":"true",
"mass_spec":"true"
}]
}]~';
After a series of different pattern attempts with varying degrees of success, this is what I came up with that I would consider useable:
preg_match('/^\~([a-z]+::)([a-z\_\-]+)([^\~]+)\~/i',$str,$match);
print_r($match);
It matches to this (I can trim key 3 for what I need):
Array
(
[0] => ~PLUGIN::name_of_plugin[{
"category":"whatever",
"test":[{
"name":"Title Something",
"desc":"123123-A",
"result":"Confirmation",
"conforms":"true",
"mass_spec":"true"
}]
}]~
[1] => PLUGIN::
[2] => name_of_plugin
[3] => [{
"category":"whatever",
"test":[{
"name":"Title Something",
"desc":"123123-A",
"result":"Confirmation",
"conforms":"true",
"mass_spec":"true"
}]
}]
)
The problem there is, if the shortcode doesn't have parameters like:
$str = '~PLUGIN::name_of_plugin~';
It splits out to this (notice the n in key 3):
Array
(
[0] => ~PLUGIN::name_of_plugin~
[1] => PLUGIN::
[2] => name_of_plugi
[3] => n
)
Is there some sort of forward-looking thing I should be doing that would make this split out to:
Array
(
[0] => ~PLUGIN::name_of_plugin~
[1] => PLUGIN::
[2] => name_of_plugin
)
but also will split the parameters block out when it exists? I am trying not to do some sort of hack where I implode() key 2 and 3 or something.
It seems that this works by just making the last group optional: I only added ? after the last group, right before the final ~.
Also, ~ and _ are not special characters, so there is no need to escape them.
I also added the end-of-string anchor $. It's usually a good idea if you want to make sure the pattern matches the whole string.
^~([a-z]+::)([a-z_\-]+)([^~]+)?~$
See it work: regex101
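A short sketch of how that pattern could be used for both shortcode forms. The function name parseShortcode and the returned array keys are assumptions made for this example, and decoding the parameter block with json_decode() is one possible way to handle it.

```php
<?php
// Hypothetical helper: splits a shortcode into prefix, name, and an optional parameter block.
function parseShortcode(string $shortcode): ?array
{
    if (!preg_match('/^~([a-z]+::)([a-z_\-]+)([^~]+)?~$/i', $shortcode, $match)) {
        return null; // not a shortcode
    }

    return [
        'prefix' => $match[1],
        'name'   => $match[2],
        // Group 3 is missing from $match when the shortcode has no parameter block.
        'params' => isset($match[3]) ? json_decode(trim($match[3]), true) : null,
    ];
}

$plain = parseShortcode('~PLUGIN::name_of_plugin~');
$withParams = parseShortcode("~PLUGIN::name_of_plugin[{\n\"category\":\"whatever\"\n}]~");

print_r($plain);
print_r($withParams);
```

Because the optional group is the last one, PHP leaves index 3 out of the match array entirely when it does not participate, which is why isset() is enough to tell the two forms apart.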

regular expression - how to connect two groups to one value?

I have got problems with trolls once again.
The situation is something like this:
<?php
[...]
$string="IchigoKurosaki FirstRandomTroll NatsuDragneel SecondRandomTroll NarutoUzumaki TrollMaster";
// [do some magic here]
// $outputArray = [
//     1 => "FirstRandomTroll SecondRandomTroll",
//     2 => "TrollMaster",
// ];
?>
I want to use preg_match to capture the trolls in groups. I have 3 groups:
(firstRandomTroll), (secondRandomTroll), and (TrollMaster)
which are saved to $output as
[1] => "first...", [2] => "second...", and [3] => "TrollMaster".
Is it possible to combine [1] and [2] into one value using only one regular expression and nothing more?
I wrote a regex to capture every troll in the string :
$string="IchigoKurosaki FirstRandomTroll NatsuDragneel SecondRandomTroll NarutoUzumaki TrollMaster";
preg_match_all('/[a-zA-Z]*Troll[a-zA-Z]*/', $string, $array);
print_r($array);
Output :
Array
(
[0] => Array
(
[0] => FirstRandomTroll
[1] => SecondRandomTroll
[2] => TrollMaster
)
)
The trolls are now isolated (that's for the better, right?)
Now you can perform operations on the elements of the array to regroup them (sounds like a bad idea, it's trolls we are talking about...). Just be careful with the structure of this array: the matches are in $array[0].
Now if you really, really want to capture two trolls at once, I don't think it's possible with one regex. You could rewrite it to have two capturing groups, but that will not output two trolls in one string right away...
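As a rough sketch of the regrouping step, the matches can be joined in plain PHP after the regex has run. This assumes the last troll found is always the "master" and all earlier ones are random trolls, which is an assumption drawn from the sample string only.

```php
<?php
$string = "IchigoKurosaki FirstRandomTroll NatsuDragneel SecondRandomTroll NarutoUzumaki TrollMaster";

// Collect every word that contains "Troll".
preg_match_all('/[a-zA-Z]*Troll[a-zA-Z]*/', $string, $found);
$trolls = $found[0];

// Assumption: the last match is the master, everything before it is a random troll.
$master = array_pop($trolls);

$outputArray = [
    1 => implode(' ', $trolls),
    2 => $master,
];

print_r($outputArray);
```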

how not to match pattern inside parenthesis in preg_match

I'm trying to get the src of all the image tags from a web page. But I'm confused as to how not to capture the pattern inside the parentheses, in this case gif|jpg|png|jpeg:
$img_src_pattern = '/src="?.+\.(gif|jpg|png|jpeg)"/';
preg_match_all($img_src_pattern, $contents, $img_matches);
So when printing out $img_matches I get an array like this:
Array (
[0] => Array (
[0] => src="http://s9.addthis.com/button1-bm.gif"
[1] => src="http://s9.addthis.com/button1-bm.gif" )
[1] => Array ( [0] => gif [1] => gif )
)
And here's what I want to get:
Array (
[0] => Array (
[0] => src="http://s9.addthis.com/button1-bm.gif"
[1] => src="http://s9.addthis.com/button1-bm.gif" )
)
This is really the part of preg_match that confuses me. Can you enlighten me on this?
You can just ignore it, since it belongs to another index in the array.
Or you can change the capturing group (pattern) to non-capturing group (?:pattern):
'/src="?.+\.(?:gif|jpg|png|jpeg)"/'
Your current regex, apart from finding a match for the whole regex, also "captures" (i.e. remembers) the text matched by the sub-expression gif|jpg|png|jpeg, because of the capturing group () surrounding it. A non-capturing group retains the grouping property, but does not capture the text matched by the sub-expression gif|jpg|png|jpeg.
preg_match_all outputs a two-dimensional array. By default, the first dimension is the capturing group (index 0 contains the text matched by the whole regex), and the second dimension is the index of each match that was found.
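The difference can be seen by running both patterns on the same input. The sample HTML below is made up for the example, and each tag sits on its own line because the greedy .+ in the original pattern would otherwise run across several src attributes on one line.

```php
<?php
// Hypothetical sample input, one <img> tag per line.
$contents = "<img src=\"http://example.com/a.gif\">\n<img src=\"http://example.com/b.png\">";

// Capturing group: index 1 holds the file extensions.
preg_match_all('/src="?.+\.(gif|jpg|png|jpeg)"/', $contents, $capturing);

// Non-capturing group: only index 0 (the full matches) exists.
preg_match_all('/src="?.+\.(?:gif|jpg|png|jpeg)"/', $contents, $nonCapturing);

print_r($capturing);
print_r($nonCapturing);
```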

Regular Expression with wordpress shortcodes

I'm trying to find all shortcodes within a string which looks like this:
 [a_col] One
 [/a_col]
outside
[b_col]
Two
[/b_col] [c_col] Three [/c_col]
I need the content (eg "Three") and the letter from the col (a, b or c)
Here's the expression I'm using
preg_match_all('#\[(a|b|c)_col\](.*)\[\/\1_col\]#m', $string, $hits);
but $hits contains only the last one.
The content can have any character even "[" or "]"
EDIT:
I would like to get "outside" as well which can be any string (except these cols). How can I handle that or should I parse this in a second step?
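This will capture the content as well as any attributes, and will allow any characters in the content. The s modifier lets . match newlines, and the lazy .*? stops at the first matching closing tag instead of the last one.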
<?php
$input = '[a_col some="thing"] One[/a_col]
[b_col] Two [/b_col]
[c_col] [Three] [/c_col] ';
preg_match_all('#\[(a|b|c)_col([^\[]*)\](.*?)\[\/\1_col\]#msi', $input, $matches);
print_r($matches);
?>
EDIT:
You may want to then trim the matches, since it appears there may be some whitespace. Alternatively, you can use regex for removing the whitespace in the content:
preg_match_all('#\[(a|b|c)_col([^\[]*)\]\s*(.*?)\s*\[\/\1_col\]#msi', $input, $matches);
OUTPUT:
Array
(
[0] => Array
(
[0] => [a_col some="thing"] One[/a_col]
[1] => [b_col] Two [/b_col]
[2] => [c_col] [Three] [/c_col]
)
[1] => Array
(
[0] => a
[1] => b
[2] => c
)
[2] => Array
(
[0] => some="thing"
[1] =>
[2] =>
)
[3] => Array
(
[0] => One
[1] => Two
[2] => [Three]
)
)
It might also be helpful to use this for capturing the attribute names and values stored in $matches[2]. Consider $atts to be the first element in $matches[2]. Of course, you would iterate over the array of attributes and perform this on each element.
preg_match_all('#([^="\'\s]+)[\t ]*=[\t ]*("|\')(.*?)\2#', $atts, $att_matches);
This gives an array where the names are stored in $att_matches[1] and their corresponding values are stored in $att_matches[3].
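For illustration, the attribute pattern can be combined with array_combine() to get a name => value map. The sample attribute string and the variable $attributes are made up for this sketch.

```php
<?php
// Hypothetical attribute string, as it might appear in $matches[2].
$atts = ' some="thing" other=\'x y\'';

// Group 1: name, group 2: the opening quote, group 3: value up to the same quote (\2).
preg_match_all('#([^="\'\s]+)[\t ]*=[\t ]*("|\')(.*?)\2#', $atts, $att_matches);

// Pair each name with its value.
$attributes = array_combine($att_matches[1], $att_matches[3]);

print_r($attributes);
```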
Use ((.|\n)*) instead of (.*) to capture multiple lines, since . does not match a newline by default:
<?php
$string = "
[a_col] One
[/a_col]
[b_col]
Two
[/b_col] [c_col] Three [/c_col]";
preg_match_all('#\[(a|b|c)_col\]((.|\n)*)\[\/\1_col\]#m', $string, $hits);
echo "<textarea style='width:90%;height:90%;'>";
print_r($hits);
echo "</textarea>";
?>
I don't have an environment I can test with here but you could use a look behind and look ahead assertion and a back reference to match tags around the content. Something like this.
(?<=\[(\w)\]).*(?=\[\/\1\])
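For the "outside" text from the question's edit, one option is a second step that splits the string on the shortcodes and keeps whatever is left. This is only a sketch: it assumes the cols are not nested, and the pattern is a simplified variant of the ones above rather than a tested production parser.

```php
<?php
$string = "[a_col] One\n[/a_col]\noutside\n[b_col]\nTwo\n[/b_col] [c_col] Three [/c_col]";

// Split on complete [x_col]...[/x_col] blocks; the pieces in between are the "outside" text.
$pieces = preg_split('#\[([abc])_col[^\]]*\].*?\[/\1_col\]#s', $string);

// Trim whitespace and drop the pieces that are empty afterwards.
$outside = array_values(array_filter(
    array_map('trim', $pieces),
    function (string $piece): bool {
        return $piece !== '';
    }
));

print_r($outside);
```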

Regex pattern matches fine but output is not complete

I am trying this regex pattern:
$string = '<div class="className">AlwaysTheSame:</div>Subtitle <br /><span class="anotherClass">entry1</span><span class="anotherClass">entry2</span><span class="anotherClass">entry3</span>';
preg_match_all('|<div class="className">AlwaysTheSame:</div>(.*?)<br />(<span class="anotherClass">(.*?)</span>)*|', $string, $matches);
print_r($matches);
exit;
The <span class="anotherClass">entry</span> may be absent or may appear multiple times. The pattern seems to match fine in both cases, but the output is:
Array
(
[0] => Array
(
[0] => <div class="className">AlwaysTheSame:</div>Subtitle <br /><span class="anotherClass">entry1</span><span class="anotherClass">entry2</span><span class="anotherClass">entry3</span>
)
[1] => Array
(
[0] => Subtitle
)
[2] => Array
(
[0] => <span class="anotherClass">entry3</span>
)
[3] => Array
(
[0] => entry3
)
)
Array[0][0] contains the full string so its matching all I need, but in Array[2] and [3] I only get the last <span...
How can I get all those <span... in the output array and not just the last one?
You can't directly, at least not in PHP. Repeated capturing groups always contain the last expression they matched. The exception is .NET where regex matches have an additional property that allows you to access every single match of a repeated group. Also, Perl 6 can do something like this - but not PHP.
Solution: Use
~<div class="className">AlwaysTheSame:</div>(.*?)<br />((?:<span class="anotherClass">(.*?)</span>)*)~
Now the second capturing group contains all the <span> tags. With another regex you can then extract all the matches:
~(?<=<span class="anotherClass">).*?(?=</span>)~
I'm using ~ as a regex delimiter, by the way - using | is confusing IMO.
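Put together, the two steps might look like this on the string from the question. The variable names $subtitle and $entries are chosen for the example only.

```php
<?php
$string = '<div class="className">AlwaysTheSame:</div>Subtitle <br /><span class="anotherClass">entry1</span><span class="anotherClass">entry2</span><span class="anotherClass">entry3</span>';

// Step 1: group 2 captures the whole run of <span> tags, not just the last one.
preg_match(
    '~<div class="className">AlwaysTheSame:</div>(.*?)<br />((?:<span class="anotherClass">(.*?)</span>)*)~',
    $string,
    $match
);
$subtitle = trim($match[1]);

// Step 2: pull each entry out of the captured run.
preg_match_all('~(?<=<span class="anotherClass">).*?(?=</span>)~', $match[2], $entryMatches);
$entries = $entryMatches[0];

print_r($entries);
```

When there are no spans at all, group 2 is an empty string and $entries ends up as an empty array, so the same code covers both cases.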
