I just started with PHP regular expressions. I understand how to read and write them (I need my book though because I haven't memorized any pattern symbols). I really want to use RegExp for BB Code on my site, using preg_replace.
I understand the parameters, but what I don't understand is what defines what is to be replaced in the pattern? What I have so far:
preg_replace('/(\[url=http:\/\/.*\])/','$2',"[url=http://google.com]");
Now, I know it's probably not the best "security" wise, I just want to get something to work. I match the entire string... so I get a link that looks like mysite/[url=http://google.com].
I read over the PHP manual on it, but I still have a headache trying to absorb and comprehend something:
What defines what is replaced in the string because of the pattern?
What TELLS me what my $1 and $2 and so on are?
I don't even know what they are called. Could someone explain this to me?
The same replacement without errors:
$BBlink = '[url=http://google.com]';
$pattern = '~\[url=(http://[^] ]+)]~';
$replacement = '$1';
$result = preg_replace($pattern, $replacement, $BBlink);
explanations:
1) pattern
~ # pattern delimiter
\[ # literal opening square bracket
url=
( # first capturing group
http://
[^] ]+ # all characters that are not ] or a space one or more times
) # close the capturing group
] # literal closing square bracket
~ # pattern delimiter
2) replacement
$1 refers to the first capturing group
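To make the numbering concrete, here is a minimal sketch that turns the BB code into an HTML link. The helper name bbUrlToHtml is made up for illustration, and the output is not escaped, so treat it as a demonstration of $1 rather than production code:

```php
<?php
// Hypothetical helper (not from the answer above): converts [url=http://...] into an <a> tag.
function bbUrlToHtml(string $text): string
{
    // The pattern has a single pair of parentheses, so the URL is available as $1.
    // $0 would refer to the whole match, square brackets included.
    $pattern = '~\[url=(http://[^] ]+)]~';
    return preg_replace($pattern, '<a href="$1">$1</a>', $text);
}

echo bbUrlToHtml('Search: [url=http://google.com]'), PHP_EOL;
// prints: Search: <a href="http://google.com">http://google.com</a>
```

Groups are numbered by the position of their opening parenthesis, left to right, which is why a pattern with only one group has no $2.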
Alternative: http://www.php.net/manual/en/function.bbcode-create.php, see the first example.
Given a text string (a markdown document) I need to achieve one of this two options:
to replace all the matches of a particular expression ((\W)(theWord)(\W)) all across the document EXCEPT the matches that are inside a markdown image syntax ![Blah theWord blah](url).
to replace all the matches of a particular expression ({{([^}}]+)}}\[\[[^\]\]]+\]\]) ONLY inside the markdown images, ie.: ![Blah {{theWord}}[[1234]] blah](url).
Both expressions are currently matching everything, no matter if inside the markdown image syntax or not, and I've already tried everything I could think.
Here is an example of the first option
And here is an example of the second option
Any help and/or clue will be highly appreciated.
Thanks in advance!
I modified the first expression a little, since I thought it had some unnecessary capturing groups, and then made both expressions work by adding a lookahead trick:
-First one (Live demo):
\b(vitae)\b(?![^[]*]\s*\()
-Second one (Live demo):
{{([^}}]+)}}\[\[[^\]\]]+\]\](?=[^[]*]\s*\()
Lookahead part explanations:
(?! # Starting a negative lookahead
[^[]*] # Everything that's between brackets
\s* # Any whitespace
\( # Check if it's followed by an opening parentheses
) # End of lookahead which confirms the whole expression doesn't match between brackets
(?= means a positive lookahead
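A quick way to check both lookahead patterns is to run them through preg_replace. This is only a sketch: the sample strings, the URL and the replacement values are assumptions made up for the test, not taken from the question's demos.

```php
<?php
$text = 'vitae ![alt vitae text](http://example.com/a.png) vitae';

// Negative lookahead: match "vitae" only when it is NOT followed by "...](",
// i.e. when it does not sit inside the [alt text] of an image.
$outside = preg_replace('/\b(vitae)\b(?![^[]*]\s*\()/', 'VITAE', $text);

$text2 = 'x {{w}}[[12]] ![a {{w}}[[34]] b](http://example.com/a.png)';

// Positive lookahead: match the {{...}}[[...]] block only when "...](" follows.
$inside = preg_replace(
    '/{{([^}}]+)}}\[\[[^\]\]]+\]\](?=[^[]*]\s*\()/',
    '{{z}}[[99]]',
    $text2
);

echo $outside, PHP_EOL; // VITAE ![alt vitae text](http://example.com/a.png) VITAE
echo $inside, PHP_EOL;  // x {{w}}[[12]] ![a {{z}}[[99]] b](http://example.com/a.png)
```

Note that the lookahead cannot see past another opening bracket, so text containing other [...] constructs between the match and the closing ]( may need a different approach.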
You can leverage the discard technique, which is really useful for cases like this. It consists of a pattern with the following shape:
patternToSkip1 (*SKIP)(*FAIL)|patternToSkip2 (*SKIP)(*FAIL)| MATCH THIS PATTERN
So, according to your needs:
to replace all the matches of a particular expression ((\W)(theWord)(\W)) all across the document EXCEPT the matches that are inside a markdown image syntax
You can easily achieve this in PCRE with the (*SKIP)(*FAIL) backtracking control verbs, so for your case you can use a regex like this:
\[.*?\](*SKIP)(*FAIL)|\bTheWord\b
Or using your pattern:
\[.*?\](*SKIP)(*FAIL)|(\W)(theWord)(\W)
The idea behind this regex is to tell the regex engine to skip the content within [...] and only match the word everywhere else.
Working demo
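The discard technique above can be tried out with a few lines of PHP. This is a minimal sketch; the sample string and the replacement word are made up for illustration.

```php
<?php
$text = 'theWord ![Blah theWord blah](url) theWord';

// Left branch: consume a whole [...] block, then (*SKIP)(*FAIL) throws that
// match away and prevents the engine from retrying inside the block.
// Right branch: the word we actually want to replace.
$pattern = '/\[.*?\](*SKIP)(*FAIL)|\btheWord\b/';

$result = preg_replace($pattern, 'REPLACED', $text);
echo $result, PHP_EOL;
// prints: REPLACED ![Blah theWord blah](url) REPLACED
```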
The first regex is easily fixed with a SKIP-FAIL trick:
\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b
Replace the match with the word of your choice. This is a perfectly valid way in PHP (PCRE) regex to match something outside of some markers.
See Demo 1
As for the second one, it is harder, but achievable with \G, which ensures we match consecutively inside some markers:
(\!\[.*?|(?<!^)\G)((?>(?!\]\(http).)*?){{([^}]+?)}}\[{2}[^]]+?\]{2}(?=.*?\]\(http[^)]*?\))
To replace with $1$2{{NEW_REPLACED_TEXT}}[[NEW_DIGITS]]
See Demo 2
PHP:
$re1 = "#\!\[.*?\]\(http[^)]*\)(*SKIP)(*FAIL)|\bvitae\b#i";
$re2 = "#(\!\[.*?|(?<!^)\G)((?>(?!\]\(http).)*?){{([^}]+?)}}\[{2}[^]]+?\]{2}(?=.*?\]\(http[^)]*?\))#i";
Suppose I have a string that looks like:
"lets refer to [[merp] [that entry called merp]] and maybe also to that entry called [[blue] [blue]]"
The idea here is to replace a block of [[name][some text]] with some text.
So I'm trying to use regular expressions to find blocks that look like [[name][some text]], but I'm having tremendous difficulty.
Here's what I thought should work (in PHP):
preg_match_all('/\[\[.*\]\[.*\]/', $my_big_string, $matches)
But this just returns a single match, the string from '[[merp' to 'blue]]'. How can I get it to return the two matches [[merp][that entry called merp]] and [[blue][blue]]?
The regex you're looking for is \[\[(.+?)\]\s\[(.+?)\]\] and replace it with $2
The text matched by each part of the pattern inside () parentheses is captured and can be back-referenced using $1, $2, ...
Example on regex101.com
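To see the difference between a greedy and a lazy quantifier on the sample string from the question, something like the following sketch can be used (the variable names are arbitrary):

```php
<?php
$str = 'lets refer to [[merp] [that entry called merp]] and maybe also to that entry called [[blue] [blue]]';

// Greedy .+ grabs as much as it can, so both blocks collapse into one match.
$greedyCount = preg_match_all('/\[\[(.+)\]\s\[(.+)\]\]/', $str, $greedy);

// Lazy .+? stops at the first position that lets the rest of the pattern match.
$lazyCount = preg_match_all('/\[\[(.+?)\]\s\[(.+?)\]\]/', $str, $lazy);

echo $greedyCount, ' vs ', $lazyCount, PHP_EOL; // prints: 1 vs 2
print_r($lazy[2]); // the "some text" parts: "that entry called merp" and "blue"

// Replacing each block with its second group:
$replaced = preg_replace('/\[\[(.+?)\]\s\[(.+?)\]\]/', '$2', $str);
echo $replaced, PHP_EOL;
// prints: lets refer to that entry called merp and maybe also to that entry called blue
```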
Quantifiers like * are greedy by default, which means that as much as possible is matched while still meeting the conditions. E.g. in your sample a regex like \[.*\] would match everything from the first [ to the last ] in the string. To change the default behaviour and make quantifiers lazy (ungreedy, reluctant):
Use the U (PCRE_UNGREEDY) modifier to make all quantifiers lazy
Put a ? after a specific quantifier. E.g. .*? as few of any characters as possible
1.) Using the U-modifier a pattern could look like:
/\[\[(.*)]\s*\[(.*)]]/Us
Additionally, I used the s (PCRE_DOTALL) modifier to make the . dot also match newlines, and added \s* in between ][ to allow for the whitespace that is in your sample string. \s is a shorthand for [ \t\r\n\f].
There are two capturing groups (.*) to be replaced then. Test on regex101.com
2.) Instead, using the ? to make each quantifier lazy:
/\[\[(.*?)]\s*\[(.*?)]]/s
Test on regex101.com
3.) Alternative without modifiers, if no square brackets are expected to be inside [...].
/\[\[([^]]*)]\s*\[([^]]*)]]/
This uses a negated character class: [^]]* allows any number of characters that are NOT ] in between [ and ]. It does not need to rely on laziness, and since no . is used, no s-modifier is needed.
Test on regex101.com
Replacement for all 3 examples according to your sample: \2, where \1 corresponds to the match of the first parenthesized group, \2 to the second, and so on.
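As a quick sanity check, the three variants above can be run side by side. This is just a sketch; the array keys are labels I made up, and '$2' is used in place of \2 (PHP accepts both forms in the replacement).

```php
<?php
$str = 'lets refer to [[merp] [that entry called merp]] and maybe also to that entry called [[blue] [blue]]';

$patterns = [
    'U modifier'       => '/\[\[(.*)]\s*\[(.*)]]/Us',
    'lazy quantifiers' => '/\[\[(.*?)]\s*\[(.*?)]]/s',
    'negated class'    => '/\[\[([^]]*)]\s*\[([^]]*)]]/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // Keep only the second captured group of every [[name] [some text]] block.
    $results[$label] = preg_replace($pattern, '$2', $str);
    echo $label, ': ', $results[$label], PHP_EOL;
}
// Each line ends with:
// lets refer to that entry called merp and maybe also to that entry called blue
```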
I'm currently building a chat system with reply function.
How can I match the numbers inside the '#' symbol and brackets, example: #[123456789]
This one works in JavaScript
/#\[(0-9_)+\]/g
But it doesn't work in PHP as it cannot recognize the /g modifier. So I tried this:
/\#\[[^0-9]\]/
I have the following example code:
$example_message = 'Hi #[123456789] :)';
$msg = preg_replace('/\#\[[^0-9]\]/', '$1', $example_message);
But it doesn't work, it won't capture those numbers inside #[ ]. Any suggestions? Thanks
You have some core problems in your regex, the main one being the ^ that negates your character class. So instead of [^0-9] matching any digit, it matches anything but a digit. Also, the g modifier doesn't exist in PHP (preg_replace() replaces globally and you can use preg_match_all() to match expressions globally).
You'll want to use a regex like /#\[(\d+)\]/ to match (with a group) all of the digits between #[ and ].
To do this globally on a string in PHP, use preg_match_all():
preg_match_all('/#\[(\d+)\]/', 'Hi #[123456789] :)', $matches);
var_dump($matches);
However, your code would be cleaner if you didn't rely on a match group (\d+). Instead you can use "lookarounds" like: (?<=#\[)\d+(?=\]). Also, if you will only have one number per string, you should use preg_match() not preg_match_all().
Note: I left the example vague and linked to lots of documentation so you can read/learn better. If you have any questions, please ask. Also, if you want a better explanation on the regular expressions used (specifically the second one with lookarounds), let me know and I'll gladly elaborate.
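For comparison, here is a small sketch of both approaches on a made-up message with two references (the second #[42] is added only to show the global matching):

```php
<?php
$message = 'Hi #[123456789] :) see also #[42]';

// With a capture group: the digits land in $withGroup[1].
preg_match_all('/#\[(\d+)\]/', $message, $withGroup);

// With lookarounds: "#[" and "]" are asserted but not consumed,
// so the whole match ($withLookarounds[0]) is already just the digits.
preg_match_all('/(?<=#\[)\d+(?=\])/', $message, $withLookarounds);

print_r($withGroup[1]);       // 123456789 and 42
print_r($withLookarounds[0]); // 123456789 and 42
```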
Use the preg_match_all function in PHP if you’d like to reproduce the behaviour of the g modifier in JavaScript. Use the preg_match function otherwise.
preg_match_all("/#\\[([0-9]+)\\]/", $example_message, $matches);
Explanation:
/ opening delimiter
# match the hash sign
\\[ match the opening square bracket (metacharacter, so needs to be escaped)
( start capturing
[0-9] match a digit
+ match the previous once or more
) stop capturing
\\] match the closing square bracket (metacharacter, so needs to be escaped)
/ closing delimiter
Now $matches[1] contains all the numbers inside the square brackets.
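Since the question used preg_replace, here is a short sketch showing the same pattern with both functions. The second sample string is made up to show more than one match.

```php
<?php
$example_message = 'Hi #[123456789] :)';

// preg_replace() already replaces every occurrence, so no "g" flag is needed.
// $1 is the text captured by ([0-9]+).
$msg = preg_replace('/#\[([0-9]+)\]/', '$1', $example_message);
echo $msg, PHP_EOL; // prints: Hi 123456789 :)

// preg_match_all() collects the numbers instead of replacing them.
preg_match_all('/#\[([0-9]+)\]/', 'a #[1] b #[22]', $matches);
print_r($matches[1]); // 1 and 22
```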
So, let's say I want to accept strings as follows
SomeColumn IN||<||>||= [123, 'hello', "wassup"]||123||'hello'||"yay!"
For example: MyValue IN ['value', 123] or MyInt > 123 -> I think you get the idea. Now, what's bothering me is how to phrase this in a regex? I'm using PHP, and this is what I'm doing right now:
$temp = explode(';', $constraints);
$matches = array();
foreach ($temp as $condition) {
preg_match('/(.+)[\t| ]+(IN|<|=|>|!)[\t| ]+([0-9]+|[.+]|.+)/', $condition, $matches[]);
}
foreach ($matches as $match) {
if ($match[2] == 'IN') {
preg_match('/(?:([0-9]+|".+"|\'.+\'))/', substr($match[3], 1, -1), $tempm);
print_r($tempm);
}
}
Really appreciate any help right there, my regex'ing is horrible.
I assume your input looks similar to this:
$string = 'SomeColumn IN [123, \'hello\', "wassup"];SomeColumn < 123;SomeColumn = \'hello\';SomeColumn > 123;SomeColumn = "yay!";SomeColumn = [123, \'hello\', "wassup"]';
If you use preg_match_all there is no need for explode or to build the matches yourself. Note that the resulting two-dimensional array will have its dimensions switched, but that is often desirable. Here is the code:
preg_match_all('/(\w+)[\t ]+(IN|<|>|=|!)[\t ]+((\'[^\']*\'|"[^"]*"|\d+)|\[[\t ]*(?4)(?:[\t ]*,[\t ]*(?4))*[\t ]*\])/', $string, $matches);
$statements = $matches[0];
$columns = $matches[1];
$operators = $matches[2];
$values = $matches[3];
There will also be a $matches[4] but it does not really have a meaning and is only used inside the regular expression. First, a few things you did wrong in your attempt:
(.+) will consume as much as possible, and any character. So if you have something inside a string value that looks like IN 13 then your first repetition might consume everything until there and return it as the column. It also allows whitespace and = inside column names. There are two ways around this. Either making the repetition "ungreedy" by appending ? or, even better, restrict the allowed characters, so you cannot go past the desired delimiter. In my regex I only allow letters, digits and underscores (\w) for column identifiers.
[\t| ] this mixes up two concepts: alternation and character classes. What this does is "match a tab, a pipe or a space". In character classes you simply write all characters without delimiting them. Alternatively you could have written (\t| ) which would be equivalent in this case.
[.+] I don't know what you were trying to accomplish with this, but it matches either a literal . or a literal +. And again it might be useful to restrict the allowed characters, and to check for correct matching of quotes (to avoid 'some string")
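The two character-class pitfalls are easy to verify. A small sketch with made-up test strings:

```php
<?php
// Inside [...] every character stands for itself, so | is a literal pipe here.
$pipeMatches = preg_match('/^[\t| ]+$/', '|');   // 1: the class accepts "|"

// [.+] is a class with two members: a literal dot and a literal plus.
$plusMatches = preg_match('/^[.+]$/', '+');      // 1
$wordMatches = preg_match('/^[.+]$/', 'abc');    // 0

// What was probably intended: tab or space only.
$fixed = preg_match('/^[\t ]+$/', '|');          // 0

var_dump($pipeMatches, $plusMatches, $wordMatches, $fixed);
```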
Now for an explanation of my own regex (you can copy this into your code, as well, it will work just fine; plus you have the explanation as comments in your code):
preg_match_all('/
(\w+) # match an identifier and capture in $1
[\t ]+ # one or more tabs or spaces
(IN|<|>|=|!) # the operator (capture in $2)
[\t ]+ # one or more tabs or spaces
( # start of capturing group $3 (the value)
( # start of subpattern for single-valued literals (capturing group $4)
\' # literal quote
[^\']* # arbitrarily many non-quote characters, to avoid going past the end of the string
\' # literal quote
| # OR
"[^"]*" # equivalent for double-quotes
| # OR
\d+ # a number
) # end of subpattern for single-valued literals
| # OR (arrays follow)
\[ # literal [
[\t ]* # zero or more tabs or spaces
(?4) # reuse subpattern no. 4 (any single-valued literal)
(?: # start non-capturing subpattern for further array elements
[\t ]* # zero or more tabs or spaces
, # a literal comma
[\t ]* # zero or more tabs or spaces
(?4) # reuse subpattern no. 4 (any single-valued literal)
)* # end of additional array element; repeat zero or more times
[\t ]* # zero or more tabs or spaces
\] # literal ]
) # end of capturing group $3
/x',
$string,
$matches);
This makes use of PCRE's recursion feature where you can reuse a subpattern (or the whole regular expression) with (?n) (where n is just the number you would also use for a backreference).
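Here is a minimal sketch of the subpattern reuse in action, using the one-line version of the regex on a shortened input. The input string and variable names are made up for illustration.

```php
<?php
$input = 'a IN [1, \'x\'];b > 2';

// (?4) re-runs the pattern of group 4 (a single literal) without writing it out again.
$pattern = '/(\w+)[\t ]+(IN|<|>|=|!)[\t ]+((\'[^\']*\'|"[^"]*"|\d+)|\[[\t ]*(?4)(?:[\t ]*,[\t ]*(?4))*[\t ]*\])/';

preg_match_all($pattern, $input, $m);

print_r($m[1]); // columns:   a, b
print_r($m[2]); // operators: IN, >
print_r($m[3]); // values:    [1, 'x'] and 2
```

Text matched through (?4) is not stored in group 4; only the part matched by the group itself is, which is why $m[4] is of little use for array values.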
I can think of three major things that could be improved with this regex:
It does not allow for floating-point numbers
It does not allow for escaped quotes (if your value is 'don\'t do this', I would only capture 'don\'). This can be solved using a negative lookbehind.
It does not allow for empty arrays as values (this could be easily solved by wrapping all parameters in a subpattern and making it optional with ?)
I included none of these, because I was not sure whether they apply to your problem, and I thought the regex was already complex enough to present here.
Usually regular expressions are not powerful enough to do proper language parsing anyway. It is generally better to write your own parser.
And since you said your regex'ing is horrible... while regular expressions seem like a lot of black magic due to their uncommon syntax, they are not that hard to understand, if you take the time once to get your head around their basic concepts. I can recommend this tutorial. It really takes you all the way through!
I have something like:
$string1="dog fox [cat]";
I need the contents inside [ ] i.e cat
another question: if you are familiar with regex in one language, will it do for the other languages as well?
$matches = array();
$matchcount = preg_match('/\[([^\]]*)\]/', $string1, $matches);
$item_inside_brackets = $matches[1];
If you want to match multiple bracketed terms in the same string, you'll want to look into preg_match_all instead of just preg_match.
And yes, regular expressions are a fairly cross-language standard (there are some variations in what features are available in different languages, and occasional syntax differences, but for the most part it's all the same).
Explanation of the above regex:
/ # beginning of regex delimiter
\[ # literal left bracket (normally [ is a special character)
( # start capture group - isolate the text we actually want to extract
[^\]]* # match any number of non-] characters
) # end capture group
\] # literal right bracket
/ # end of regex delimiter
The contents of the $matches array are set based on both the entirety of the text matched (which would include the brackets) in [0], and then the contents of each capture group from the matching in [1] and up (first capture group's contents in [1], second in [2], etc).
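A short sketch of what ends up in the array, using a made-up string with two bracketed terms so that preg_match and preg_match_all can be compared:

```php
<?php
$string1 = 'dog fox [cat] and [bird]';

// preg_match stops at the first match:
// $first[0] is the whole match, $first[1] the first capture group.
preg_match('/\[([^\]]*)\]/', $string1, $first);

// preg_match_all keeps going: $all[1] lists group 1 of every match.
preg_match_all('/\[([^\]]*)\]/', $string1, $all);

print_r($first);  // [cat] and cat
print_r($all[1]); // cat and bird
```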
Here is the php code:
preg_match('/\[(.*)\]/', $string1, $matches);
echo $matches[1];
And yes, your knowledge will transfer, although there may be subtle differences between each language's version of regular expressions.
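One caveat about the (.*) version above: it is greedy, so it behaves differently once a string has more than one pair of brackets. A small sketch with a made-up second string:

```php
<?php
$single = 'dog fox [cat]';
$double = 'dog [cat] fox [bird]';

preg_match('/\[(.*)\]/', $single, $a);   // $a[1] is "cat"
preg_match('/\[(.*)\]/', $double, $b);   // greedy: $b[1] is "cat] fox [bird"
preg_match('/\[(.*?)\]/', $double, $c);  // lazy:   $c[1] is "cat"

echo $a[1], PHP_EOL, $b[1], PHP_EOL, $c[1], PHP_EOL;
```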