Regex for parsing text between brackets and parenthesis

I want to create a regex that saves all of $text1 and $text2 in two separate arrays. $text1 and $text2 appear in the string in the form ($text1)[$text2].
I wrote this code to parse the text between brackets:
<?php
preg_match_all("/\[[^\]]*\]/", $text, $matches);
?>
It works correctly.
I wrote another snippet to parse the text between parentheses:
<?php
preg_match('/\([^\)]*\)/', $text, $match);
?>
But it only parses the first pair of parentheses, not all of the parentheses in the string :(
So I have two problems:
1) How can I parse the text between all of the parentheses in the string?
2) How can I get $text1 and $text2 as I described at the top?
Please help me. I am confused about regex. If you have a good resource, please share the link. Thanks ;)

Use preg_match_all() with the following regular expression:
/(\[.+?\])(\(.+?\))/i
Details
/ # begin pattern
( # first group, brackets
\[ # literal bracket
.+? # any character, one or more times, lazily (as few as possible)
\] # literal bracket, close
) # first group, close
( # second group, parentheses
\( # literal parentheses
.+? # any character, one or more times, lazily (as few as possible)
\) # literal parentheses, close
) # second group, close
/i # end pattern
This saves each bracketed substring (brackets included) in one array and each parenthesized substring (parentheses included) in another. So, in PHP:
<?php
$s = "[test1](test2) testing the regex [test3](test4)";
preg_match_all("/(\[.+?\])(\(.+?\))/i", $s, $m);
var_dump($m[1]); // bracket group
var_dump($m[2]); // parentheses group
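Note that the groups above include the delimiters themselves. If only the inner text is wanted, one possible variation (a sketch, not part of the original answer) moves the brackets and parentheses outside the capture groups:

```php
<?php
$s = "[test1](test2) testing the regex [test3](test4)";

// Same idea as the pattern above, but the delimiters sit outside the groups,
// so each group holds only the text between them.
preg_match_all('/\[(.+?)\]\((.+?)\)/', $s, $m);

var_export($m[1]); // text between brackets
var_export($m[2]); // text between parentheses
```

The i modifier is left out here because the pattern contains no letters.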

The only reason you were failing to capture multiple ( ) wrapped substrings is that you were calling preg_match() instead of preg_match_all().
A couple of small points:
The ) inside of your negated character class didn't need to be escaped.
The closing square bracket (at the end of your pattern) doesn't need to be escaped; regex will not mistake it to mean the end of a character class.
There is no need to declare the i pattern modifier, you have no letters in your pattern to modify.
Combine your two patterns into one and bake in my small points and you have a fully refined/optimized pattern.
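The difference between the two functions can be seen in a minimal sketch. The input string here is made up for illustration, and the pattern is the asker's own with the unneeded escape removed:

```php
<?php
$text = '(a)[b] and (c)[d]';

// preg_match() stops after the first match.
preg_match('/\([^)]*\)/', $text, $first);

// preg_match_all() keeps scanning and collects every match.
preg_match_all('/\([^)]*\)/', $text, $all);

var_export($first);  // a single full match
var_export($all[0]); // every full match in the string
```

With preg_match() the result array holds one full match; with preg_match_all() index 0 holds the list of all full matches.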
In case you don't know why your patterns are great, I'll explain. When you ask the regex engine to match "greedily", it can move more efficiently (take fewer steps).
By using a negated character class, you can employ greedy matching. If you only use . then you have to use "lazy" matching (*?) to ensure that matching doesn't "go too far".
Pattern: ~\(([^)]*)\)\[([^\]]*)]~ (11 steps)
The above will capture zero or more characters between the parentheses as Capture Group #1, and zero or more characters between the square brackets as Capture Group #2.
If you KNOW that your target strings will obey your strict format, you can even remove the final ] from the pattern to improve efficiency. (10 steps)
Compare this with lazy . matching: ~\((.*?)\)\[(.*?)]~ (35 steps), and that's only on your little 16-character input string. As your text increases in length (I can only imagine that you are targeting these substrings inside a much larger block of text), the performance impact will become greater.
My point is, always try to design patterns that use "greedy" quantifiers in pursuit of making the best / most efficient pattern. (further tips on improving efficiency: avoid piping (|), avoid capture groups, and avoid lookarounds whenever reasonable because they cost steps.)
Code:
$string='Demo #1: (11 steps)[1] and Demo #2: (35 steps)[2]';
var_export(preg_match_all('~\(([^)]*)\)\[([^\]]*)]~',$string,$out)?array_slice($out,1):[]);
Output: (I trimmed off the fullstring matches with array_slice())
array (
0 =>
array (
0 => '11 steps',
1 => '35 steps',
),
1 =>
array (
0 => '1',
1 => '2',
),
)
Or depending on your use: (with PREG_SET_ORDER)
Code:
$string='Demo #1: (11 steps)[1] and Demo #2: (35 steps)[2]';
var_export(preg_match_all('~\(([^)]*)\)\[([^\]]*)]~',$string,$out,PREG_SET_ORDER)?$out:[]);
Output:
array (
0 =>
array (
0 => '(11 steps)[1]',
1 => '11 steps',
2 => '1',
),
1 =>
array (
0 => '(35 steps)[2]',
1 => '35 steps',
2 => '2',
),
)

Related

How can I extract values that have opening and closing brackets with regular expression?

I am trying to extract [[String]] with regular expression. Notice how a bracket opens [ and it needs to close ]. So you would receive the following matches:
[[String]]
[String]
String
If I use \[[^\]]+\] it will just find the first closing bracket it comes across without taking into consideration that a new one has opened in between and it needs the second close. Is this at all possible with regular expression?
Note: This type can either be String, [String] or [[String]] so you don't know upfront how many brackets there will be.

You can use the following PCRE compliant regex:
(?=((\[(?:\w++|(?2))*])|\b\w+))
See the regex demo. Details:
(?= - start of a positive lookahead (necessary to match overlapping strings):
( - start of Capturing group 1 (it will hold the "matches"):
(\[(?:\w++|(?2))*]) - Group 2 (technical, used for recursing): [, then zero or more occurrences of one or more word chars or the whole Group 2 pattern recursed, and then a ] char
| - or
\b\w+ - a word boundary (necessary since all overlapping matches are being searched for) and one or more word chars
) - end of Group 1
) - end of the lookahead.
See the PHP demo:
$s = "[[String]]";
if (preg_match_all('~(?=((\[(?:\w++|(?2))*])|\b\w+))~', $s, $m)){
print_r($m[1]);
}
Output:
Array
(
[0] => [[String]]
[1] => [String]
[2] => String
)
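Since the question notes that the type may be String, [String] or [[String]], here is a small sketch that runs the same pattern over all three forms. The $results array and the loop are only illustrative scaffolding:

```php
<?php
$pattern = '~(?=((\[(?:\w++|(?2))*])|\b\w+))~';
$results = [];

foreach (['String', '[String]', '[[String]]'] as $type) {
    // Every match is zero-width (a lookahead); the text lives in group 1.
    preg_match_all($pattern, $type, $m);
    $results[$type] = $m[1];
}

var_export($results);
```

Each nesting level adds one more overlapping match, from the outermost brackets down to the bare word.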

Regex to filter for no-space string only

I currently have the following regex:
(?<=\[)([^\]]+)
The results are as follows:
text* your-name
email* your-email
text your-subject
textarea your-message
submit "Submit"
your-subject
your-name
your-email
your-message
I'd like to adjust my regex so it filters out the results that have a space in between, so that I'm only left with following results:
your-subject
your-name
your-email
your-message
How can I do this?
Here's how it currently is: https://regex101.com/r/yP3iB0/58

You may use
(?<=\[)[^]\s]+(?=])
See the regex and PHP demo. Note that the $matches structure is cleaner without a capturing group in the pattern, with all context checked using non-consuming lookarounds.
Details
(?<=\[) - a positive lookbehind that requires a [ immediately to the left of the current location
[^]\s]+ - 1+ chars other than ] (no need to escape it as it is the first char in the negated character class) and whitespace
(?=]) - a positive lookahead that requires a ] immediately to the right of the current location (] is not special outside of a character class).
PHP demo:
$arr = ['[text* your-name]','[email* your-email]','[text your-subject]','[textarea your-message]','[submit "Verzenden"]','[your-subject]','[your-name]','[your-email]','[your-message]'];
foreach ($arr as $s) {
if (preg_match_all('~(?<=\[)[^]\s]+(?=])~', $s, $matches)) {
print_r($matches[0]);
}
}
Output:
Array
(
[0] => your-subject
)
Array
(
[0] => your-name
)
Array
(
[0] => your-email
)
Array
(
[0] => your-message
)

/^\[\K([^ \]]+)(?=\])/gm
Test this regex pattern at https://regex101.com/r/yP3iB0/62
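The g flag shown above is regex101 notation; PHP does not accept it as a pattern modifier, and preg_match_all() plays that role instead. A hedged sketch of how this pattern might be used in PHP, assuming each bracketed tag sits at the start of its own line (the sample string is made up):

```php
<?php
$subject = "[text* your-name]\n[your-subject]\n[your-name]";

// The m modifier lets ^ match at the start of every line.
// \K discards the opening bracket from the reported match.
preg_match_all('/^\[\K([^ \]]+)(?=\])/m', $subject, $m);

var_export($m[0]);
```

Because of the ^ anchor, tags that appear in the middle of a line would not be found by this pattern.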

Starting with your current pattern, all you have to do is exclude the space (or all whitespace characters) from your character class and check that a closing square bracket follows: (?<=\[)([^]\s]+)(?=]). The result is then available both as the whole match and in the capture group, which makes the capture group redundant.
But you can write a better pattern that is simpler and more efficient: \[([^]\s]+)]
It is simpler because, with a capture group, you don't need lookarounds to extract the content without the brackets. It's also shorter and easier to understand.
It is more efficient because of two start-up optimizations:
The first and most important: when the pattern starts with a literal string (the opening bracket here), a fast algorithm searches the string for all positions where this literal string occurs in the subject string, and the pattern will be only tested at these positions. Otherwise, and this is the case if you enclose this bracket in a lookbehind, this start-up optimization isn't possible and the pattern will be tested at each position in the subject string.
The second is called auto-possessification. It automatically makes a quantifier possessive at compile time when backtracking could not change the result. For instance, a*b becomes a*+b, whereas a.*b stays a.*b. In our case, since the character class [^]\s] excludes the closing bracket, [^]\s]+] becomes [^]\s]++]. Concretely, when a space is found instead of the closing bracket, the greedy quantifier doesn't give characters back to try other possibilities; the pattern fails, and the regex engine tries the pattern at the next position. Once again, putting the bracket in a lookahead disables this optimization.
But why do lookarounds disable these optimizations? The reason is simple: these optimizations require studying the pattern, so to keep that analysis fast, they are limited to simple cases. (Note that the capture group doesn't interfere with auto-possessification.)
If you absolutely want to avoid the capture group but keep these two optimizations, nothing stops you from writing:
\[\K[^]\s]++(?=])
or, for more fun:
\[(?=[^]\s]++\K])
The two patterns start with a literal [ and the possessive quantifier is added by hand.
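As a rough sketch of how the capture-group pattern and the first \K variant could be used from PHP (the subject string and variable names are made up for illustration):

```php
<?php
$subject = '[text* your-name] [your-subject] [submit "Submit"] [your-email]';

// Capture-group version: the wanted text is in group 1.
preg_match_all('~\[([^]\s]+)]~', $subject, $withGroup);

// \K version: the wanted text is the whole match, no group needed.
preg_match_all('~\[\K[^]\s]++(?=])~', $subject, $withK);

var_export($withGroup[1]);
var_export($withK[0]);
```

Both calls skip the tags that contain whitespace and return the same two names; only the index that holds them differs.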

How to remove text inside brackets and parentheses at the same time with any whitespace before if present?

I am trying to clean a string in PHP using the following code, but I am not sure how to get rid of the text inside brackets and parentheses at the same time with any whitespace before if present.
The code I am using is:
$string = "Deadpool 2 [Region 4](Blu-ray)";
echo preg_replace("/\[[^)]+\]/","",$string);
The output I'm getting is:
Deadpool [](Blu-ray)
However, the desired output is:
Deadpool 2
Using the solutions from related questions, it is not clear how to remove both kinds of matches while also removing the optional whitespace before them.

There are four main points here:
String between parentheses can be matched with \([^()]*\)
String between square brackets can be matched with \[[^][]*] (or \[[^\]\[]*\] if you prefer to escape literal [ and ], in PCRE, it is stylistic, but in some other regex flavors, it might be a must)
You need alternation to match either this or that pattern and account for any whitespaces before these patterns
Since after removing these strings you may get leading and trailing spaces, you need to trim the string.
You may use
$string = "Deadpool 2 [Region 4](Blu-ray)";
echo trim(preg_replace("/\s*(?:\[[^][]*]|\([^()]*\))/","", $string));
See the regex demo and a PHP demo.
The \[[^][]*] part matches strings between [ and ] having no other [ and ] inside and \([^()]*\) matches strings between ( and ) having no other parentheses inside. trim removes leading/trailing whitespace.
Explanation:
\s* - 0+ whitespaces
(?: - start of a non-capturing group:
\[[^][]*] - [, zero or more chars other than [ and ], then a ] (note that in a PCRE pattern you may keep these brackets unescaped inside a character class if ] comes right after the opening [^; in JS, you would have to escape ], as in [^\][]*)
| - or (an alternation operator)
\([^()]*\) - (, any 0+ chars other than ( and ) and a )
) - end of the non-capturing group.
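To illustrate why both the \s* prefix and the final trim() matter, here is a small sketch that applies the same pattern to a few hypothetical titles with the tags at the start, middle, and end:

```php
<?php
// Hypothetical titles; only the first comes from the question.
$titles = [
    'Deadpool 2 [Region 4](Blu-ray)',
    '(2018) Deadpool 2 [Region 4]',
    "Alien (1979) Director's Cut [DVD]",
];

$pattern = '/\s*(?:\[[^][]*]|\([^()]*\))/';
$clean = [];

foreach ($titles as $title) {
    // Remove each tag with its preceding whitespace, then trim what is left.
    $clean[] = trim(preg_replace($pattern, '', $title));
}

var_export($clean);
```

In the second title the tag at the start leaves a leading space behind, which is the case trim() is there to handle.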

Based on just the one sample input there are some simpler approaches.
$string = "Deadpool 2 [Region 4](Blu-ray)";
var_export(preg_replace("~ [[(].*~", "", $string));
echo "\n";
var_export(strstr($string, ' [', true));
Output:
'Deadpool 2'
'Deadpool 2'
These assume that the unwanted substring begins with a space followed by an opening square bracket.
The strstr() technique requires that the space-bracket sequence exists in the string; if it does not, strstr() returns false.
If the unwanted substring marker is not consistently included, then you can use:
var_export(explode(' [', $string, 2)[0]);
This will put the unwanted substring in explode's output array at [1] and the wanted substring in [0].
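The difference between the two non-regex techniques shows up when the marker is missing. A minimal sketch, with variable names chosen only for illustration:

```php
<?php
$withMarker = 'Deadpool 2 [Region 4](Blu-ray)';
$withoutMarker = 'Deadpool 2';

// strstr() returns false when the needle is absent.
$a = strstr($withMarker, ' [', true);
$b = strstr($withoutMarker, ' [', true);

// explode() always returns at least one element, so [0] is safe.
$c = explode(' [', $withMarker, 2)[0];
$d = explode(' [', $withoutMarker, 2)[0];

var_export([$a, $b, $c, $d]);
```

Only the strstr() call on the string without the marker fails to return the title.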

Output only values that do not contain HTML tags in parentheses from PHP array

Here is a sample PHP array that explains my question well
$array = array('1' => 'Cookie Monster (<i>eats cookies</i>)',
'2' => 'Tiger (eats meat)',
'3' => 'Muzzy (eats <u>clocks</u>)',
'4' => 'Cow (eats grass)');
All I need is to return only the values that don't contain any tag enclosed in parentheses:
- Tiger (eats meat)
- Cow (eats grass)
For this I'm going to use the following code:
$array_no_tags = preg_grep("/[A-Za-z]\s\(^((?!<(.*?)(\h*).*?>(.*?)<\/\1>).)*$\)/", $array);
foreach ($array_no_tags as $a_n_t) {echo "- ".$a_n_t."<br />";}
Here [A-Za-z] stands for any letter, \s is a space, \( is the opening parenthesis, ^((?! is the start of the tag denial statement, <(.*?)(\h*).*?>(.*?)<\/\1> is the tag itself, ).)*$ is the end of the tag denial statement and \) is the closing parenthesis.
Nothing works.
print_r($array_no_tags); returns empty array.

You could use the following expression to match strings with HTML tags inside of parentheses:
/\([^)]*<(\w+)>[^<>]*<\/\\1>[^)]*\)/
Then pass the PREG_GREP_INVERT flag in order to return only the items that don't match.
$array_no_tags = preg_grep("/\([^)]*<(\w+)>[^<>]*<\/\\1>[^)]*\)/", $array, PREG_GREP_INVERT);
Explanation:
\( - Match the literal ( character
[^)]* - Negated character class to match zero or more non-) characters
<(\w+)> - Capturing group one that matches the opening element's tag name
[^<>]* - Negated character class to match zero or more non-<> characters
<\/\1> - Back reference to capturing group one to match the closing tag
[^)]* - Negated character class to match zero or more non-) characters
\) - Match the literal ) character
If you don't care about the parentheses around the element tag, then you could also just use the following simplified expression:
/<(\w+)>[^<>]+<\/\\1>/
And likewise, you would use:
$array_no_tags = preg_grep("/<(\w+)>[^<>]+<\/\\1>/", $array, PREG_GREP_INVERT);
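Put together with the array from the question, the first expression could be used roughly as follows. This sketch writes the pattern in single quotes with ~ delimiters, so the backreference is a plain \1 and the slash needs no escaping:

```php
<?php
$array = [
    '1' => 'Cookie Monster (<i>eats cookies</i>)',
    '2' => 'Tiger (eats meat)',
    '3' => 'Muzzy (eats <u>clocks</u>)',
    '4' => 'Cow (eats grass)',
];

// PREG_GREP_INVERT returns the entries that do NOT match the pattern.
$array_no_tags = preg_grep('~\([^)]*<(\w+)>[^<>]*</\1>[^)]*\)~', $array, PREG_GREP_INVERT);

foreach ($array_no_tags as $value) {
    echo "- $value\n";
}
```

preg_grep() preserves the original keys, so wrap the result in array_values() if sequential indexes are needed.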

Your pattern looks a bit overcomplicated. I thought a simple pattern inside a negative lookahead, checking that no <x appears inside ( ), might be sufficient.
$array_no_tags = preg_grep("/^(?!.*?\([^)<]*<\w)/", $array);
The negative lookahead (?! rejects a string if it contains an opening parenthesis (, followed by [^)<]* (any number of characters that are not ) or <), followed by <\w (a less-than sign followed by a word character).
Bear in mind that there are nice regex tools like regex101 available for testing patterns.
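One behavioural difference from the paired-tag approach is worth sketching: the lookahead pattern rejects any entry with something tag-like inside parentheses, even when the tag is never closed. The extra 'Owl' entry below is hypothetical and added only to show this:

```php
<?php
$array = [
    'Cookie Monster (<i>eats cookies</i>)',
    'Tiger (eats meat)',
    'Muzzy (eats <u>clocks</u>)',
    'Cow (eats grass)',
    'Owl (eats <b mice)', // hypothetical entry with an unclosed tag
];

// Lookahead version: fails as soon as "<" plus a word char appears inside ( ).
$lookahead = array_values(preg_grep('/^(?!.*?\([^)<]*<\w)/', $array));

// Paired-tag version: only a complete <x>...</x> pair counts as a tag.
$paired = array_values(
    preg_grep('~\([^)]*<(\w+)>[^<>]*</\1>[^)]*\)~', $array, PREG_GREP_INVERT)
);

var_export($lookahead);
var_export($paired);
```

Which behaviour is preferable depends on whether malformed markup should count as a tag in your data.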

find a string mapped between two string using php

I know this question has been asked many times before, and I have read most of the answers, but I still have an issue with this.
I have a string containing substrings wrapped in [[[ and ]]]. I don't know the position of these substrings, nor how many times they occur.
For example:
$string = '[[[this is a string]]] and this is some other part. [[[this is another]]]and etc.';
Could somebody help me learn how to find this is a string and this is another?
Thanks in advance

You need to use preg_match_all(), and you also need to be sure to escape the square brackets since they are special characters.
$string = '[[[this is a string]]] and this is some other part. [[[this is another]]]and etc.';
preg_match_all('/\[\[\[([^\]]*)\]\]\]/', $string, $matches);
print_r($matches);
Regex logic:
\[\[\[([^\]]*)\]\]\]
Output:
Array
(
[0] => Array
(
[0] => [[[this is a string]]]
[1] => [[[this is another]]]
)
[1] => Array
(
[0] => this is a string
[1] => this is another
)
)

Here is a method using lookbehinds and lookaheads:
$string = '[[[this is a string]]] and this is some other part. [[[this is another]]]and etc.';
preg_match_all('/(?<=\[{3}).*?(?=\]{3})/', $string, $m);
print_r($m);
This outputs the following:
Array
(
[0] => Array
(
[0] => this is a string
[1] => this is another
)
)
Here is the explanation of the REGEX:
(?<=  \[{3}  )  .*?  (?=  \]{3}  )
1     2      3  4    5    6      7
1. (?<= Positive lookbehind - The combination (?<= ... ) tells the regex engine that whatever is inside the parentheses must appear directly before the text we are trying to match. It checks that it's there, but doesn't include it in the match.
2. \[{3} This looks for an opening square bracket [, three times in a row {3}. Because the square bracket is a special character in regex, we have to escape it with a backslash \, so [ becomes \[.
3. ) Closing parenthesis for the lookbehind (Item #1).
4. .*? This matches any character ., any number of times *, and the ? makes the quantifier lazy, so it stops as soon as the next part of the pattern can match. In this case, the next part is a lookahead for three closing square brackets.
5. (?= Positive lookahead - The combination (?= ... ) tells the regex engine that whatever is inside the parentheses must appear directly after (ahead of) what we are currently matching. It checks that it's there, but doesn't include it in the match.
6. \]{3} This looks for a closing square bracket ], three times in a row {3}, and as with item #2, it is escaped with a backslash \.
7. ) Closing parenthesis for the lookahead (Item #5).
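The two answers behave the same on the sample string, but they can differ when the wrapped text itself contains a single ]. A small sketch with a made-up input (the variable names are illustrative only):

```php
<?php
$string = '[[[a]b]]] and [[[plain]]]';

// Negated character class: the content may not contain any ].
preg_match_all('/\[\[\[([^\]]*)\]\]\]/', $string, $negated);

// Lazy dot with lookarounds: the content ends at the first ]]].
preg_match_all('/(?<=\[{3}).*?(?=\]{3})/', $string, $lazy);

var_export($negated[1]);
var_export($lazy[0]);
```

The negated-class pattern skips the first substring entirely, while the lazy pattern returns it, so the choice depends on whether a lone ] can legitimately occur in your data.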
