Simple regex question. I have a string in the following format:
this is a [sample] string with [some] special words. [another one]
What is the regular expression to extract the words within the square brackets, i.e.
sample
some
another one
Note: In my use case, brackets cannot be nested.
You can use the following regex globally:
\[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
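As a minimal sketch of what "use it globally" could look like in practice, here is the pattern in Python. The language and the variable names are assumptions for illustration only, since the question does not name a language. When a pattern contains exactly one capturing group, re.findall returns just the captured text:

```python
import re

text = "this is a [sample] string with [some] special words. [another one]"

# With a single capturing group, findall returns only the group contents,
# so the surrounding brackets are dropped from the results.
words = re.findall(r"\[(.*?)\]", text)
print(words)  # ['sample', 'some', 'another one']
```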
(?<=\[).+?(?=\])
Will capture content without brackets
(?<=\[) - positive lookbehind for [
.+? - non-greedy match for the content (one or more characters)
(?=\]) - positive lookahead for ]
EDIT: for nested brackets the below regex should work:
(\[(?:\[??[^\[]*?\]))
This should work out ok:
\[([^]]+)\]
Can brackets be nested?
If not: \[([^]]+)\] matches one item, including the square brackets. Backreference \1 will contain the matched item without the brackets. If your regex flavor supports lookaround, use
(?<=\[)[^]]+(?=\])
This will only match the item inside brackets.
To match a substring between the first [ and last ], you may use
\[.*\] # Including open/close brackets
\[(.*)\] # Excluding open/close brackets (using a capturing group)
(?<=\[).*(?=\]) # Excluding open/close brackets (using lookarounds)
See a regex demo and a regex demo #2.
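To make the "first [ to last ]" behaviour concrete, here is a small Python sketch (Python and the variable names are my own choice for illustration) that runs the greedy capturing-group pattern next to its lazy counterpart on the sample string from the question:

```python
import re

text = "this is a [sample] string with [some] special words. [another one]"

# Greedy: .* runs on to the LAST closing bracket in the string.
greedy = re.search(r"\[(.*)\]", text).group(1)

# Lazy: .*? stops at the FIRST closing bracket that lets the match succeed.
lazy = re.search(r"\[(.*?)\]", text).group(1)

print(greedy)  # sample] string with [some] special words. [another one
print(lazy)    # sample
```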
Use the following expressions to match strings between the closest square brackets:
Including the brackets:
\[[^][]*] - PCRE, Python re/regex, .NET, Golang, POSIX (grep, sed, bash)
\[[^\][]*] - ECMAScript (JavaScript, C++ std::regex, VBA RegExp)
\[[^\]\[]*] - Java, ICU regex
\[[^\]\[]*\] - Onigmo (Ruby, requires escaping of brackets everywhere)
Excluding the brackets:
(?<=\[)[^][]*(?=]) - PCRE, Python re/regex, .NET (C#, etc.), JGSoft Software
\[([^][]*)] - Bash, Golang - capture the contents between the square brackets with a pair of unescaped parentheses, also see below
\[([^\][]*)] - JavaScript, C++ std::regex, VBA RegExp
(?<=\[)[^\]\[]*(?=]) - Java regex, ICU (R stringr)
(?<=\[)[^\]\[]*(?=\]) - Onigmo (Ruby, requires escaping of brackets everywhere)
NOTE: * matches 0 or more characters, use + to match 1 or more to avoid empty string matches in the resulting list/array.
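A quick way to see the difference between * and + is to include an empty pair of brackets in the input. The sketch below uses Python's re with the capturing-group form listed above; the sample text is made up for illustration:

```python
import re

text = "a [] b [x]"

# * allows zero characters, so the empty pair yields an empty string.
with_star = re.findall(r"\[([^][]*)]", text)

# + requires at least one character, so the empty pair is skipped.
with_plus = re.findall(r"\[([^][]+)]", text)

print(with_star)  # ['', 'x']
print(with_plus)  # ['x']
```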
Whenever lookaround support is available, the above solutions rely on it to exclude the leading/trailing open/close bracket. Otherwise, rely on capturing groups (links to the most common solutions in some languages have been provided).
If you need to match nested parentheses, you may see the solutions in the Regular expression to match balanced parentheses thread and replace the round brackets with the square ones to get the necessary functionality. You should use capturing groups to access the contents with open/close bracket excluded:
\[((?:[^][]++|(?R))*)] - PHP PCRE
\[((?>[^][]+|(?<o>)\[|(?<-o>]))*)] - .NET demo
\[(?:[^\]\[]++|(\g<0>))*\] - Onigmo (Ruby) demo
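The recursive patterns above need an engine with recursion or balancing groups. Python's standard re module has neither, so the sketch below swaps in a different technique: a plain depth counter instead of a regex. The function name bracket_groups is hypothetical, and the choice to ignore unmatched brackets is an assumption, not something the answer specifies:

```python
def bracket_groups(text):
    """Return the contents of each outermost [...] group in text."""
    groups = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "[":
            if depth == 0:
                start = i + 1  # content begins after the opening bracket
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append(text[start:i])
        # A stray "]" at depth 0 is ignored; an unclosed "[" yields nothing.
    return groups


groups_found = bracket_groups("a [b [c] d] e [f]")
print(groups_found)  # ['b [c] d', 'f']
```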
If you do not want to include the brackets in the match, here's the regex: (?<=\[).*?(?=\])
Let's break it down
The . matches any character except for line terminators. The ?= is a positive lookahead. A positive lookahead finds a string when a certain string comes after it. The ?<= is a positive lookbehind. A positive lookbehind finds a string when a certain string precedes it. To quote this,
Look ahead positive (?=)
Find expression A where expression B follows:
A(?=B)
Look behind positive (?<=)
Find expression A where expression B precedes:
(?<=B)A
The Alternative
If your regex engine does not support lookaheads and lookbehinds, then you can use the regex \[(.*?)\] to capture the innards of the brackets in a group and then you can manipulate the group as necessary.
How does this regex work?
The parentheses capture the characters in a group. The .*? gets all of the characters between the brackets (except for line terminators, unless you have the s flag enabled) in a way that is not greedy.
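The remark about line terminators is easy to check. In this Python sketch (re.DOTALL is Python's counterpart to the s flag; the sample text is invented for illustration), the first bracket pair spans a newline and is only found once the flag is set:

```python
import re

text = "[first\nline] and [second]"

# Without the flag, "." refuses to cross the newline, so the first pair fails.
without_flag = re.findall(r"\[(.*?)\]", text)

# With DOTALL, "." also matches "\n" and both pairs are captured.
with_flag = re.findall(r"\[(.*?)\]", text, flags=re.DOTALL)

print(without_flag)  # ['second']
print(with_flag)     # ['first\nline', 'second']
```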
In case you might have unbalanced brackets, you can likely design an expression with recursion, similar to:
\[(([^\]\[]+)|(?R))*+\]
which, of course, depends on the language or regex engine that you are using.
RegEx Demo 1
Other than that,
\[([^\]\[\r\n]*)\]
RegEx Demo 2
or,
(?<=\[)[^\]\[\r\n]*(?=\])
RegEx Demo 3
are good options to explore.
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
Test
const regex = /\[([^\]\[\r\n]*)\]/gm;
const str = `This is a [sample] string with [some] special words. [another one]
This is a [sample string with [some special words. [another one
This is a [sample[sample]] string with [[some][some]] special words. [[another one]]`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
Source
Regular expression to match balanced parentheses
(?<=\[).*?(?=\]) works well, as per the explanation given above. Here's a Python example that applies the same lookaround pattern to parentheses instead of square brackets:
import re
s = "Pagination.go('formPagination_bottom',2,'Page',true,'1',null,'2013')"
re.search(r'(?<=\().*?(?=\))', s).group()
"'formPagination_bottom',2,'Page',true,'1',null,'2013'"
@Tim Pietzcker's answer here
(?<=\[)[^]]+(?=\])
is almost the one I've been looking for. But there is one issue: some legacy browsers fail on positive lookbehind.
So I had to work it out myself :). I managed to write this:
/([^[]+(?=]))/g
Maybe it will help someone.
console.log("this is a [sample] string with [some] special words. [another one]".match(/([^[]+(?=]))/g));
If you want to filter only lowercase letters (a-z) between square brackets:
(\[[a-z]*\])
If you want lowercase and uppercase letters (a-zA-Z):
(\[[a-zA-Z]*\])
If you want lowercase letters, uppercase letters, and digits (a-zA-Z0-9):
(\[[a-zA-Z0-9]*\])
If you want everything between square brackets (text, numbers, and symbols):
(\[.*\])
Note that .* is greedy, so this last pattern matches from the first [ to the last ] on the line.
This code will extract the content between square brackets and parentheses
(?:(?<=\().+?(?=\))|(?<=\[).+?(?=\]))
(?: non capturing group
(?<=\().+?(?=\)) positive lookbehind and lookahead to extract the text between parentheses
| or
(?<=\[).+?(?=\]) positive lookbehind and lookahead to extract the text between square brackets
In R, try (str_replace comes from the stringr package):
library(stringr)
x <- 'foo[bar]baz'
str_replace(x, ".*?\\[(.*?)\\].*", "\\1")
[1] "bar"
([[][a-z \s]+[]])
The above should work, given the following explanation:
Characters within square brackets [] define a character class, which means the pattern matches any one of the characters listed within the brackets.
\s specifies a whitespace character.
+ means one or more of the element that precedes the +.
I needed to match across newlines and include the brackets:
\[[\s\S]+\]
If someone wants to match and select a string containing one or more dots inside square brackets like "[fu.bar]" use the following:
(?<=\[)(\w+\.\w+.*?)(?=\])
Regex Tester
Related
Suppose I have a string that looks like:
"lets refer to [[merp] [that entry called merp]] and maybe also to that entry called [[blue] [blue]]"
The idea here is to replace a block of [[name][some text]] with some text.
So I'm trying to use regular expressions to find blocks that look like [[name][some text]], but I'm having tremendous difficulty.
Here's what I thought should work (in PHP):
preg_match_all('/\[\[.*\]\[.*\]/', $my_big_string, $matches)
But this just returns a single match, the string from '[[merp' to 'blue]]'. How can I get it to return the two matches [[merp][that entry called merp]] and [[blue][blue]]?
The regex you're looking for is \[\[(.+?)\]\s\[(.+?)\]\] and replace it with $2
The parts of the text matched inside the () parentheses are captured and can be back-referenced using $1, $2,...
Example on regex101.com
Quantifiers like the * are by default greedy,
which means, that as much as possible is matched to meet conditions. E.g. in your sample a regex like \[.*\] would match everything from the first [ to the last ] in the string. To change the default behaviour and make quantifiers lazy (ungreedy, reluctant):
Use the U (PCRE_UNGREEDY) modifier to make all quantifiers lazy
Put a ? after a specific quantifier. E.g. .*? as few of any characters as possible
1.) Using the U-modifier a pattern could look like:
/\[\[(.*)]\s*\[(.*)]]/Us
Additionally, the s (PCRE_DOTALL) modifier is used to make the . dot also match newlines. Some \s* whitespace was also added in between ][, since it occurs in your sample string. \s is a shorthand for [ \t\r\n\f].
There are two capturing groups (.*) to be replaced then. Test on regex101.com
2.) Instead, put a ? after each quantifier to make it lazy:
/\[\[(.*?)]\s*\[(.*?)]]/s
Test on regex101.com
3.) Alternative without modifiers, if no square brackets are expected to be inside [...].
/\[\[([^]]*)]\s*\[([^]]*)]]/
Using a ^ negated character class to allow [^]]* any amount of characters, that are NOT ] in between [ and ]. This wouldn't require to rely on greediness. Also no . is used, so no s-modifier is needed.
Test on regex101.com
Replacement for all three examples, according to your sample: \2, where \1 corresponds to the match of the first parenthesized group, \2 to the second, and so on.
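For comparison, here is the third variant (the negated character class) as a Python sketch rather than PHP. The switch of language is only for illustration, and the variable names are made up. re.sub replaces every match, and \2 in the replacement refers to the second capturing group:

```python
import re

text = ("lets refer to [[merp] [that entry called merp]] and maybe also "
        "to that entry called [[blue] [blue]]")

# [^]]* cannot run past a closing bracket, so no lazy quantifier is needed.
result = re.sub(r"\[\[([^]]*)]\s*\[([^]]*)]]", r"\2", text)

print(result)
# lets refer to that entry called merp and maybe also to that entry called blue
```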
I'm currently building a chat system with reply function.
How can I match the numbers inside the '#' symbol and brackets, example: #[123456789]
This one works in JavaScript
/#\[(0-9_)+\]/g
But it doesn't work in PHP as it cannot recognize the /g modifier. So I tried this:
/\#\[[^0-9]\]/
I have the following example code:
$example_message = 'Hi #[123456789] :)';
$msg = preg_replace('/\#\[[^0-9]\]/', '$1', $example_message);
But it doesn't work, it won't capture those numbers inside #[ ]. Any suggestions? Thanks
You have some core problems in your regex, the main one being the ^ that negates your character class. So instead of [^0-9] matching any digit, it matches anything but a digit. Also, the g modifier doesn't exist in PHP (preg_replace() replaces globally and you can use preg_match_all() to match expressions globally).
You'll want to use a regex like /#\[(\d+)\]/ to match (with a group) all of the digits between #[ and ].
To do this globally on a string in PHP, use preg_match_all():
preg_match_all('/#\[(\d+)\]/', 'Hi #[123456789] :)', $matches);
var_dump($matches);
However, your code would be cleaner if you didn't rely on a match group (\d+). Instead you can use "lookarounds" like: (?<=#\[)\d+(?=\]). Also, if you will only have one number per string, you should use preg_match() not preg_match_all().
Note: I left the example vague and linked to lots of documentation so you can read/learn better. If you have any questions, please ask. Also, if you want a better explanation on the regular expressions used (specifically the second one with lookarounds), let me know and I'll gladly elaborate.
Use the preg_match_all function in PHP if you’d like to produce the behaviour of the g modifier in Javascript. Use the preg_match function otherwise.
preg_match_all("/#\\[([0-9]+)\\]/", $example_message, $matches);
Explanation:
/ opening delimiter
# match the hash sign
\\[ match the opening square bracket (metacharacter, so needs to be escaped)
( start capturing
[0-9] match a digit
+ match the previous token one or more times
) stop capturing
\\] match the closing square bracket (metacharacter, so needs to be escaped)
/ closing delimiter
Now $matches[1] contains all the numbers inside the square brackets.
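The split between "all matches" and "first match only" exists in other languages too. As a rough Python analogue (not PHP; the second #[42] in the message is added purely for illustration), re.findall plays the role of preg_match_all and re.search the role of preg_match:

```python
import re

message = "Hi #[123456789] :) and #[42]"

# All numbers inside #[...], similar in spirit to preg_match_all.
all_ids = re.findall(r"#\[(\d+)\]", message)

# Only the first one, similar in spirit to preg_match.
first = re.search(r"#\[(\d+)\]", message)
first_id = first.group(1) if first else None

print(all_ids)   # ['123456789', '42']
print(first_id)  # 123456789
```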
I'm trying to write a simple regular expression that recognizes a sequence of characters that are not colons or are escaped colons.
I.e:
foo:bar //Does not match
but
foo\:bar //Does match
By my knowledge of Regular Languages, such language can be described by the regular expression
/([^:]|\\[:])*/
You can see a graphical representation of this expression in the wonderful tool Regexper
Using php's preg_match (that is based on the PCRE engine), such expression does not match "foo\:bar".
However, if substitute the class with the single char:
/([^:]|\\:)*/
the expression matches.
Do you have an explanation for this? Is this a sort of limitation of the PCRE engine on character classes?
PS: Testing the first expression on RegExr, that is based on AS3 Regexp engine, does not offer a match, while changing the alternation order:
/(\\[:]|[^:])*/
it does match, while the same expression does not match in PCRE.
preg_match() accepts a regular expression pattern as a string, so you need to double escape everything.
^(?:[^:\\\\]|\\\\:)+$
This matches one or more characters that are not colons or escape characters [^:\\\\], or an escaped colon \\\\:.
Why your first regular expression didn't work: /([^:]|\\[:])*/.
This matches a non-colon [^:], or it matches \\[:] which matches a literal [ followed by a literal : and then a literal ].
Why this works : /([^:]|\\:)*/ ?
This matches a non-colon [^:], or it matches \\:, which the regex engine sees as \: (a literal colon) once PHP has unescaped the string, so it effectively matches everything.
Edit: Why /([^:]|E[:])*/ won't match fooE:bar ?
This is what happens: [^:] matches the f, then the o, then the other o, then the E. Next it finds a colon, which it can't match. Since the PCRE engine does not look for the longest possible match by default, it is satisfied with what it has matched so far, stops right there, and returns fooE as the match without ever trying the other alternative E[:] at the E (E[:] is equivalent to E:, by the way).
If you want to match the entire sequence, then you will need to use an expression like this one:
/([^:E]|E[:])*/
This prevents [^:] from consuming that E.
You can try this. It gives the sequence \\: a chance to match before the negated character class [^:].
^(?:\\:|[^:])+$
If you invert the alternatives, as in ^(?:[^:]|\\:)+$, the first alternative consumes the backslash (\) before the second alternative gets a chance to try. The escaped colon \: is then matched only after the engine backtracks, and without the $ anchor the match simply stops right after the backslash.
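The effect of the alternation order can be reproduced without anchors. The following is a sketch in Python rather than PHP (both use backtracking engines that try alternatives from left to right); the variable names are invented for illustration:

```python
import re

subject = r"foo\:bar"

# [^:] listed first: it swallows the backslash, then nothing can match ":".
negated_first = re.match(r"(?:[^:]|\\:)*", subject).group()

# \\: listed first: the escaped colon is consumed as one unit.
escaped_first = re.match(r"(?:\\:|[^:])*", subject).group()

print(negated_first)  # foo\
print(escaped_first)  # foo\:bar
```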
I'm trying to create a regular expression where it replaces words which are not enclosed by brackets.
Here is what I currently have:
$this->parsed = preg_replace('/\b(?<!\[)('.preg_quote($word).')\b/','[$1['.implode(",",array_unique($types)).']]',$this->parsed);
Where $word could be one of the following, "Burkely Mayfair Trunk" or "Trunk".
It would replace the sentence
This Burkely Mayfair Trunk is pretty nice
for
This [Burkely Mayfair [Trunk[productname]][productname]] is pretty nice
Although it should become
This [Burkely Mayfair Trunk[productname]] is pretty nice
Since it replaces in order from the largest string to the smallest string, the smaller strings and/or repeated occurrences of word parts should not be replaced in an already replaced part of the string. It works when it's the first part of the string.
When I try to make a dynamic lookbehind it gives the following error: "Compilation failed: lookbehind assertion is not fixed length at offset 11". And I have no idea on how to fix this.
Anyone who has any ideas?
After another morning of playing with the regex I came up with a quite dirty solution which isn't flexible at all, but works for my use case.
$this->parsed = preg_replace('/\b(?!\[(|((\w+)(\s|\.))|((\w+)(\s|\.)(\w+)(\s|\.))))('.preg_quote($word).')(?!(((\s|\.)(\w+))|((\s|\.)(\w+)(\s|\.)(\w+))|)\[)\b/s','[$10['.implode(",",array_unique($types)).']]',$this->parsed);
What it basically does is check for brackets with no words, 1 word or 2 words in front or behind it in combination with the specified keyword.
Still, it would be great to hear if anyone has a better solution.
You may match any substring inside square brackets with the \[[^][]*] pattern, and then use (*SKIP)(*FAIL) PCRE verbs to drop the match, and only match your pattern in any other context:
\[[^][]*](*SKIP)(*FAIL)|your_pattern_here
See the regex demo. To skip matches inside paired nested square brackets, use a recursion-based regex with a subroutine (note it will have to use a capturing group):
(?<skip>\[(?:[^][]++|(?&skip))*])(*SKIP)(*FAIL)|your_pattern_here
See a regex demo
Also, since you are building the pattern dynamically, you need to preg_quote the $word along with the delimiter symbol (here, /).
Your solution is
$this->parsed = preg_replace(
    '/\[[^][]*\[[^][]*]](*SKIP)(*FAIL)|\b(?:' . preg_quote($word, '/') . ')\b/',
    '[$0[' . implode(",", array_unique($types)) . ']]',
    $this->parsed);
The \[[^][]*\[[^][]*]] regex will match all those occurrences that have been wrapped with your replacement pattern:
\[ - a [
[^][]* - 0+ chars other than [ and ]
\[ - a [ char
[^][]* - 0+ chars other than [ and ]
]] - a ]] substring.
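(*SKIP)(*FAIL) is specific to PCRE. In an engine without those verbs, a common substitute is to match the "skip" alternative anyway and put it back unchanged from a replacement callback. The sketch below shows that substitute in Python; it is not the PHP solution itself, and the helper name tag_word and its parameters are hypothetical:

```python
import re

def tag_word(text, word, label):
    """Wrap `word` as [word[label]] unless it sits inside an earlier wrap."""
    pattern = re.compile(
        r"\[[^][]*\[[^][]*]]"                 # an already wrapped [..[..]] chunk
        r"|\b(" + re.escape(word) + r")\b"    # or the bare word (group 1)
    )

    def replace(match):
        if match.group(1) is None:
            # The "skip" alternative matched: return it untouched.
            return match.group(0)
        return "[" + match.group(1) + "[" + label + "]]"

    return pattern.sub(replace, text)


once = tag_word("This Burkely Mayfair Trunk is pretty nice",
                "Burkely Mayfair Trunk", "productname")
twice = tag_word(once, "Trunk", "productname")

print(once)   # This [Burkely Mayfair Trunk[productname]] is pretty nice
print(twice)  # unchanged: "Trunk" is inside an existing wrap
```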
I am tired of being frightened of regular expressions. The topic of this post is limited to PHP implementation of regular expressions, however, any generic regular expression advice would obviously be appreciated (i.e. don't confuse me with scope that is not applicable to PHP).
The following (I believe) will remove any whitespace between numbers. Maybe there is a better way to do so, but I still want to understand what is going on.
$pat="/\b(\d+)\s+(?=\d+\b)/";
$sub="123 345";
$string=preg_replace($pat, "$1", $sub);
Going through the pattern, my interpretation is:
\b A word boundary
\d+ A subpattern of 1 or more digits
\s+ One or more whitespaces
(?=\d+\b) Lookahead assertion of one or more digit followed by a word boundary?
Putting it all together, search for any word boundary followed by one or more digits and then some whitespace, and then do some sort of lookahead assertion on it, and save the results in $1 so it can replace the pattern?
Questions:
Is my above interpretation correct?
What is that lookahead assertion all about?
What is the purpose of the leading / and trailing /?
Is my above interpretation correct?
Yes, your interpretation is correct.
What is that lookahead assertion all about?
That lookahead assertion is a way for you to match characters that are followed by a certain pattern, without actually including that pattern in the match.
So basically, using the regex abcd(?=e) to match the string abcde will give you the match: abcd.
The reason that this matches is that the string abcde does in fact contain:
An a
Followed by a b
Followed by a c
Followed by a d that has an e after it (this is a single character!)
It is important to note that after the 4th item it also contains an actual "e" character, which we didn't match.
On the other hand, trying to match the string against the regex abcd(?=f) will fail, since the sequence:
"a", followed by "b", followed by "c", followed by "d that has an f after it"
is not found.
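The same behaviour can be checked outside PHP. This is a small Python sketch (the variable names are mine) covering both the abcd(?=e) example and the pattern from the question; the lookahead is required for a match but is never part of the matched text:

```python
import re

# The lookahead must succeed, yet "e" is not consumed by the match.
hit = re.search(r"abcd(?=e)", "abcde")
matched = hit.group()   # 'abcd'
match_end = hit.end()   # 4, i.e. the match stops before the "e"

# No "f" follows "abcd", so there is no match at all.
miss = re.search(r"abcd(?=f)", "abcde")

# The pattern from the question: the whitespace is removed, the digits after
# it are only looked at, so they survive the replacement.
collapsed = re.sub(r"\b(\d+)\s+(?=\d+\b)", r"\1", "123 345")

print(matched, match_end, miss, collapsed)  # abcd 4 None 123345
```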
What is the purpose of the leading / and trailing /
Those are delimiters, and are used in PHP to distinguish the pattern part of your string from the modifier part of your string. A delimiter can be any non-alphanumeric, non-backslash, non-whitespace character, although I prefer # signs myself. Remember that the character you are using as a delimiter needs to be escaped if it is used in your pattern.
It would be a good idea to watch this video, and the 4 that follow this:
http://blog.themeforest.net/screencasts/regular-expressions-for-dummies/
The rest of the series is found here:
http://blog.themeforest.net/?s=regex+for+dummies
A colleague sent me the series and after watching them all I was much more comfortable using Regular Expressions.
Another good idea would be installing RegexBuddy or Regexr. RegexBuddy in particular is very useful for understanding the workings of a regular expression.