Using OR (|) with PHP Regex when ORing two expressions - php

I'm trying to combine two regular expressions with an OR condition in PHP so that two different string patterns can be found with one pass.
I have this pattern [\$?{[_A-Za-z0-9-]+[:[A-Za-z]*]*}] which matches strings like this ${product} and ${Product:Test}.
I have this pattern [<[A-Za-z]+:[A-Za-z]+\s*(\s[A-Za-z]+=\"[A-Za-z0-9\s]+\"){0,5}\s*/>] which matches strings like this <test:helloWorld /> and <calc:sum val1="10" val2="5" />.
However when I try to join the two patterns into one
[\$?{[_A-Za-z0-9-]+[:[A-Za-z]*]*}]|[<[A-Za-z]+:[A-Za-z]+\s*(\s[A-Za-z]+=\"[A-Za-z0-9\s]+\"){0,5}\s*/>]
so I can find all the matching strings with one call to
preg_match_all(REGEX_COMBINED, $markup, $results, PREG_SET_ORDER);
I get the following error message Unknown modifier '|'.
Can anyone please tell me where I am going wrong? I've tried multiple variations of the pattern, but nothing I do seems to work.
Thanks

In PHP, regexes have to be enclosed in delimiters, like /abc/ or ~abc~. Almost any ASCII punctuation character will do; it just has to be the same character at both ends in most cases. The exception is when you use "bracketing" characters like () and <>; then they have to be correctly paired.
With your original regexes, the square brackets were being used as regex delimiters. After you glued them together it no longer worked because the compiler was still trying to use the first ] as the closing delimiter.
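The delimiter behaviour described above can be reproduced with a small sketch. The patterns and subject strings here are made up for illustration and are not the asker's; the `@` operator is only there to keep the warning out of the demo output.

```php
<?php
// '[a]' is parsed as delimiter '[', pattern 'a', delimiter ']'.
// The regex is therefore just "a", which does not match "b".
$asDelimiters = preg_match('[a]', 'b');

// Gluing two such patterns together leaves '|[b]' after the closing delimiter,
// and PHP tries to read that remainder as modifiers.
$glued = @preg_match('[a]|[b]', 'b');   // fails with: Unknown modifier '|'

// With explicit delimiters the brackets are ordinary character classes again.
$fixed = preg_match('~[a]|[b]~', 'b');

var_dump($asDelimiters, $glued, $fixed);
```

The middle call is the same situation as the combined pattern in the question: everything after the first `]` is treated as a modifier list.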
Another problem is that you're trying to use square brackets for grouping, which is wrong; you use parentheses for that. If you look below you'll see that I replaced square brackets with parentheses where needed, but the outermost pair I simply dropped; grouping isn't needed at that level. Then I added ~ to serve as the regex delimiter. I also added the i modifier and got rid of some clutter.
~\$?\{[\w-]+(?::[a-z]*)*\}~i
~<[a-z]+:[a-z]+\s*(?:\s[a-z]+=\"[a-z\d\s]+\"){0,5}\s*/>~i
To combine the regexes, just remove the ending ~i from the first regex and the opening ~ from the second, and replace them with a pipe:
~\$?\{[\w-]+(?::[a-z]*)*\}|<[a-z]+:[a-z]+\s*(?:\s[a-z]+=\"[a-z\d\s]+\"){0,5}\s*/>~i
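As a rough check, the combined pattern can be run against some sample markup. The markup string below is an assumption made for illustration. Its attribute names use letters only, because `[a-z]+` in the pattern does not admit digits, so a name such as `val1` would not match.

```php
<?php
// Combined pattern from the answer above, with ~ as delimiter and the i modifier.
$regex = '~\$?\{[\w-]+(?::[a-z]*)*\}|<[a-z]+:[a-z]+\s*(?:\s[a-z]+=\"[a-z\d\s]+\"){0,5}\s*/>~i';

// Hypothetical markup for illustration.
$markup = 'Name: ${product}, ${Product:Test}, <test:helloWorld /> and <calc:sum a="10" b="5" />';

$count = preg_match_all($regex, $markup, $results, PREG_SET_ORDER);

// With PREG_SET_ORDER each entry of $results is one match; index 0 is the full text.
$found = array_column($results, 0);
print_r($found);
```

Single quotes are used for the PHP strings so that `$` and `\` reach the regex engine unchanged.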

Try wrapping the two conditions in an outer set of brackets "(...|...)":
([\$?{[_A-Za-z0-9-]+[:[A-Za-z]*]*}]|[<[A-Za-z]+:[A-Za-z]+\s*(\s[A-Za-z]+=\"[A-Za-z0-9\s]+\"){0,5}\s*/>])
Tested here and it seemed to work

Related

PHP Regex extract numbers inside brackets

I'm currently building a chat system with reply function.
How can I match the numbers inside the '#' symbol and brackets, example: #[123456789]
This one works in JavaScript
/#\[(0-9_)+\]/g
But it doesn't work in PHP as it cannot recognize the /g modifier. So I tried this:
/\#\[[^0-9]\]/
I have the following example code:
$example_message = 'Hi #[123456789] :)';
$msg = preg_replace('/\#\[[^0-9]\]/', '$1', $example_message);
But it doesn't work, it won't capture those numbers inside #[ ]. Any suggestions? Thanks
You have some core problems in your regex, the main one being the ^ that negates your character class. So instead of [^0-9] matching any digit, it matches anything but a digit. Also, the g modifier doesn't exist in PHP (preg_replace() replaces globally and you can use preg_match_all() to match expressions globally).
You'll want to use a regex like /#\[(\d+)\]/ to match (with a group) all of the digits between #[ and ].
To do this globally on a string in PHP, use preg_match_all():
preg_match_all('/#\[(\d+)\]/', 'Hi #[123456789] :)', $matches);
var_dump($matches);
However, your code would be cleaner if you didn't rely on a match group (\d+). Instead you can use "lookarounds" like: (?<=#\[)\d+(?=\]). Also, if you will only have one number per string, you should use preg_match() not preg_match_all().
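To compare the two variants mentioned above, here is a minimal sketch. The message with two references is a made-up example; the variable names are also just for illustration.

```php
<?php
$message = 'Hi #[123456789] and #[42] :)';   // hypothetical message with two references

// Variant 1: capture group; the digits end up in $withGroup[1].
preg_match_all('/#\[(\d+)\]/', $message, $withGroup);

// Variant 2: lookarounds; the whole match is just the digits, in $withLookaround[0].
preg_match_all('/(?<=#\[)\d+(?=\])/', $message, $withLookaround);

var_dump($withGroup[1], $withLookaround[0]);
```

Both calls collect the same numbers; the only difference is which index of the result array holds them.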
Note: I left the example vague and linked to lots of documentation so you can read/learn better. If you have any questions, please ask. Also, if you want a better explanation on the regular expressions used (specifically the second one with lookarounds), let me know and I'll gladly elaborate.
Use the preg_match_all function in PHP if you’d like to produce the behaviour of the g modifier in Javascript. Use the preg_match function otherwise.
preg_match_all("/#\\[([0-9]+)\\]/", $example_message, $matches);
Explanation:
/ opening delimiter
# match the hash sign
\\[ match the opening square bracket (metacharacter, so needs to be escaped)
( start capturing
[0-9] match a digit
+ match the previous token one or more times
) stop capturing
\\] match the closing square bracket (metacharacter, so needs to be escaped)
/ closing delimiter
Now $matches[1] contains all the numbers inside the square brackets.
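If the goal is the replacement from the question rather than a list of matches, the same capture group can be used with preg_replace(). This is a sketch of one possible correction of the asker's call, not the only way to write it.

```php
<?php
$example_message = 'Hi #[123456789] :)';

// (\d+) captures the digits, so $1 in the replacement refers to them.
// preg_replace() already replaces every occurrence; no g modifier is needed.
$msg = preg_replace('/#\[(\d+)\]/', '$1', $example_message);

echo $msg, "\n";
```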

How can I do a massive and complex string replace in a PHP project?

I guess "in a file tree" would suffice but this is my case.
I'm given a task to replace all instances of
some_function('string_parameter'.$some_optional_var.');
to
some_function().'string_parameter'.$some_optional_var.'; in a PHP project.
The parameter can be different in any place the function is being called.
I'm pretty sure I can use regex or something like that but regex is not my strength...
I'm using Eclipse Indigo if that helps. Thanks!
Here is a solution that worked for me in Eclipse search and replace.
Search:
some_function\(([^)]+)\)
Replace:
some_function().$1
(This puts the content inside the call to some_function after it, no matter what that content is. If this isn't specific enough, you will need to describe exactly what kind of cases of some_function you need to handle.)
You need to turn on the regex option, obviously.
Explanation:
some_function matches that exact text.
\( ... \) since parentheses are a special character in regexes, a backslash is needed to escape them so that we match actual parentheses.
(...) the inner parentheses create a capture group to record what this portion of the regex matches.
[^)]+ a character class matching one or more characters that are not a closing parenthesis (the ^ negates a character class; the + requires there to be at least one character, and matches as many as possible).
$1 in the replacement--inserts the value captured in the parentheses above. You can capture many different things with different groups of parentheses, and they will go into $1, $2, $3, etc.
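The same search and replacement can also be tried outside Eclipse. The sketch below applies it with PHP's preg_replace() to a single hypothetical line of source; running it over a whole file tree is left out. Because [^)]+ stops at the first closing parenthesis, arguments that themselves contain a ")" would not be handled by this pattern.

```php
<?php
// Hypothetical line of source code to be rewritten.
$source = 'echo some_function(\'title\'.$suffix);';

// Same search and replacement as above; preg_replace() also uses $1 for group 1.
$rewritten = preg_replace('/some_function\(([^)]+)\)/', 'some_function().$1', $source);

echo $rewritten, "\n";
```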

Php regex with safe delimiters

I thought that PHP's Perl-compatible regular expressions (the preg library) support curly brackets as delimiters. This should be fine:
{ello {world}i // should match on Hello {World
The main point of curly brackets is that it only takes the leftmost and rightmost ones, thus requiring no escaping for the inner ones. As far as I know, PHP requires the escaping
{ello \{world}i // this actually matches on Hello {World
Is this the expected behavior or a bug in PHP's preg implementation?
When in Perl you use for the pattern delimiter any of the four paired ASCII bracket types, you only need to escape unpaired brackets within the pattern. This is indeed the entire purpose of using brackets. This is documented in the perlop manpage under “Quote and Quote-like Operators”, which reads in part:
Non-bracketing delimiters use the same character fore and aft,
but the four sorts of brackets (round, angle, square, curly)
will all nest, which means that
q{foo{bar}baz}
is the same as
'foo{bar}baz'
Note, however, that this does not always work for quoting Perl code:
$s = q{ if($a eq "}") ... }; # WRONG
That’s why you often see people use m{…} or qr{…} in Perl code, especially for multiline patterns used with /x ᴀᴋᴀ (?x). For example:
return qr{
    (?=                    # pure lookahead for conjunctive matching
        \A                 # always from start
        . *?               # going only as far as we need to to find the pattern
        (?:
            ${case_flag}
            ${left_boundary}
            ${positive_pattern}
            ${right_boundary}
        )
    )
}sxm;
Notice how those nested braces are no problem.
Expected behavior as far as I know, otherwise how else would the compiler allow group limiters? e.g.
[a-z]{1,5}
From http://lv.php.net/manual/en/regexp.reference.delimiters.php:
If the delimiter needs to be matched
inside the pattern it must be escaped
using a backslash. If the delimiter
appears often inside the pattern, it
is a good idea to choose another
delimiter in order to increase
readability.
So this is expected behavior, not a bug.
I found that no escaping is required in this case:
'ello {world'i
(ello {world)i
So my theory is, that the problem is with the '{' delimiters only. Also, the following two produce the same error:
{ello {world}i
(ello (world)i
Using starting/ending braces as delimiters may require you to escape those braces when they appear inside the expression.
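The observations above can be checked with a short sketch. PHP tracks nesting for bracket-style delimiters, so balanced inner braces need no escaping, while an unbalanced one does. The third pattern and its subject are made up for illustration, and `@` only hides the warning in the demo.

```php
<?php
$subject = 'Hello {World';

// Escaped inner brace: the delimiter scan skips "\{", so the pattern is "ello \{world".
$escaped = preg_match('{ello \{world}i', $subject);

// An unescaped inner "{" opens a nested level that is never closed,
// so PHP reports that no ending delimiter was found.
$unbalanced = @preg_match('{ello {world}i', $subject);

// Balanced inner braces need no escaping: the pattern here is "^l{2}o$".
$balanced = preg_match('{^l{2}o$}', 'llo');

var_dump($escaped, $unbalanced, $balanced);
```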

recursive regular expression to process nested strings enclosed by {| and |}

In a project I have a text with patterns like that:
{| text {| text |} text |}
more text
I want to get the first part with brackets. For this I use preg_match recursively. The following code works fine already:
preg_match('/\{((?>[^\{\}]+)|(?R))*\}/x',$text,$matches);
But if I add the symbol "|", I got an empty result and I don't know why:
preg_match('/\{\|((?>[^\{\}]+)|(?R))*\|\}/x',$text,$matches);
I can't use the first solution because in the text something like { text } can also exist. Can somebody tell me what I do wrong here? Thx
Try this:
'/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/'
In your original regex you use the character class [^{}] to match anything except a delimiter. That's fine when the delimiters are only one character, but yours are two characters. To not-match a multi-character sequence you need something like this:
(?:(?!\{\||\|\}).)++
The dot matches any character (including newlines, thanks to the (?s)), but only after the lookahead has determined that it's not part of a {| or |} sequence. I also dropped your atomic group ((?>...)) and replaced it with a possessive quantifier (++) to reduce clutter. But you should definitely use one or the other in that part of the regex to prevent catastrophic backtracking.
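A minimal sketch of the regex above in use. The first subject is the sample text from the question; the second, containing plain braces, is an assumption added to show that single { and } are treated as ordinary characters.

```php
<?php
$text = "{| text {| text |} text |}\nmore text";

// \{\| ... \|\} are the two-character delimiters.
// (?:(?!\{\||\|\}).)++ consumes characters that do not start {| or |}.
// (?R) recurses into the whole pattern for a nested {| ... |} block.
$pattern = '/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/';

preg_match($pattern, $text, $matches);
echo $matches[0], "\n";

// Plain braces inside a block are ordinary characters for this pattern.
preg_match($pattern, 'x {| a { b } {| c |} d |} y', $withBraces);
echo $withBraces[0], "\n";
```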
You've got a few suggestions for working regular expressions, but if you're wondering why your original regexp failed, read on. The problem lies when it comes time to match a closing "|}" tag. The (?>[^{}]+) (or [^{}]++) sub expression will match the "|", causing the |} sub expression to fail. With no backtracking in the sub expression, there's no way to recover from the failed match.
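The failure described above can be seen by running the original pattern next to a variant whose character class also excludes "|". The adjusted pattern is only an assumption for illustration: it rejects any lone "|" or brace inside the text, so it is not a full replacement for the lookahead-based regex.

```php
<?php
$text = '{| text {| text |} text |}';

// The asker's pattern: [^\{\}]+ also consumes "|", and the atomic group
// (?>...) will not give it back, so \|\} can never match.
$original = preg_match('/\{\|((?>[^\{\}]+)|(?R))*\|\}/x', $text, $m1);

// Excluding "|" from the class keeps the closing |} available.
$adjusted = preg_match('/\{\|((?>[^\{\}|]+)|(?R))*\|\}/x', $text, $m2);

var_dump($original, $adjusted);
```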
See PHP - help with my REGEX-based recursive function
To adapt it to your use
preg_match_all('/\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/s', $text, $matches);

Match words then match potentially parenthetical string, then match potential square braced string

I have a script where I need to get three parts out of a text string, and return them in an array. After a couple of failed attempts I couldn't get it to work.
The text strings can look like this:
Some place
Some place (often text in parenthesis)
Some place (often text in parenthesis) [even text in brackets sometimes]
I need to split these strings into three:
{Some place} ({often text in parenthesis}) [{even text in brackets sometimes}]
Which should return:
1: Some place
2: often text in parenthesis
3: even text in brackets sometimes
I know this should be an easy task, but I couldn't work out the correct regular expression. This is to be used in PHP.
Thanks in advance!
Try something like this:
$result = preg_match('/
^ ([^(]+?)
(\s* \( ([^)]++) \))?
(\s* \[ ([^\]]++) \])?
\s*
$/x', $mystring, $matches);
print_r($matches);
Note that in this example, you will probably be most interested in $matches[1], $matches[3], and $matches[5].
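The pattern above could be wrapped in a small helper that returns the three parts directly. This is only a sketch: the function name split_place is made up, and the arrow function assumes PHP 7.4 or later.

```php
<?php
// Same pattern as above, wrapped in a helper (the function name is hypothetical).
function split_place(string $s): ?array
{
    $ok = preg_match('/
        ^ ([^(]+?)
        (\s* \( ([^)]++) \))?
        (\s* \[ ([^\]]++) \])?
        \s*
    $/x', $s, $m);
    if ($ok !== 1) {
        return null;
    }
    // Groups that did not take part may be missing or empty; normalise to null.
    $pick = fn(int $i) => (isset($m[$i]) && $m[$i] !== '') ? $m[$i] : null;
    return [$pick(1), $pick(3), $pick(5)];
}

print_r(split_place('Some place'));
print_r(split_place('Some place (often text in parenthesis)'));
print_r(split_place('Some place (often text in parenthesis) [even text in brackets sometimes]'));
```

Groups 2 and 4 exist only to make the parenthesised and bracketed parts optional, which is why the helper reads groups 1, 3 and 5.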
Split the problem into three regular expressions. After the first one, where you get every character before the first parenthesis, save your position - the same as the length of the string you just extracted.
Then in step two, do the same, but grab everything up to the closing parenthesis. (Nested parentheses make this a little more complicated but not too much.) Again, save a pointer to the end of the second string.
Getting the third string is then trivial.
I'd probably do it as three regular expressions, starting with both parenthesis and brackets, and falling back to less items if that fails.
^(.*?)\s+\((.*?)\)\s+\[(.*?)\]\s*$
if it fails then try:
^(.*?)\s+\((.*?)\)\s*$
if that also fails try:
^\s*(.*?)\s*$
I'm sure they can be combined into one regular expression, but I wouldn't try.
Something like this?
^([^(\[]+?)(?: \(([^)]++)\))?(?: \[([^\]]++)\])?$
