Regex include new lines - php

Current regular expression:
"/\[(.*?)\](.+?)\[\/(.*?)\]/"
Now when I have the following:
[test]textextext[/test]
it works just fine but it doesn't find
[test]tesxc
tcxvxcv
[/test]
How do I fix this? Help is greatly appreciated.

You can use the /s flag to make . match new lines. For example:
/\[(\w+)](.+?)\[\/\1]/s
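For what it's worth, here is a minimal PHP sketch of the difference the flag makes, using the multi-line input from the question. The variable names are mine and only there for illustration.

```php
<?php
// Input from the question: the tag body spans two lines.
$text = "[test]tesxc\ntcxvxcv\n[/test]";

// Without /s the dot stops at "\n", so the pattern finds nothing.
$without = preg_match('/\[(\w+)](.+?)\[\/\1]/', $text, $m1);

// With /s the dot also matches newlines.
$with = preg_match('/\[(\w+)](.+?)\[\/\1]/s', $text, $m2);

var_dump($without);   // int(0)
var_dump($with);      // int(1)
echo $m2[1], "\n";    // test
```

Note the single quotes around the pattern: they keep PHP from interpreting \1 itself before the regex engine sees it.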

Actually... the s flag alone may not be such a good idea, because that would allow:
[te
xt]123
456[/tex
t]
I think you might want this:
'(\[((?>[^\]\r\n]*))\](.+?)\[/\1\])s'
This uses a few clever tricks:
Once-only (atomic) subpattern (?>...) for the opening tag, which prevents a possible backtracking explosion. The capturing parentheses wrap the whole atomic group, so \1 holds the full tag name rather than only its last character.
[^\]\r\n]* instead of .*?, so the once-only thing works better and it is explicit about what ends the repetition. The \r\n is needed because a negated class such as [^\]] matches newlines regardless of the s flag.
End tag uses \1 to match the same as the opening tag, assuming you want [abc]...[/abc] and not [abc]...[/def]. The pattern is in single quotes because inside a double-quoted PHP string \1 is an octal escape, not a backreference.
Use () instead of // to delimit the regex - parentheses come in pairs, so the / in the closing tag needs no escaping (you'll notice I just have / instead of \/). This can also serve as a handy reminder that the first index in the match array is the entire match.
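A quick sketch of how this behaves in PHP. It assumes the tag name should be captured whole and must not contain line breaks, so the pattern below spells both of those out; the sample strings are the ones from this thread plus one mismatched pair I made up.

```php
<?php
// Assumed variant: group 1 wraps the atomic group, so \1 is the whole tag
// name, and \r\n are excluded from the name explicitly.
// ( and ) act as the regex delimiters; "s" after the last ) is the flag.
$pattern = '(\[((?>[^\]\r\n]*))\](.+?)\[/\1\])s';

$ok         = "[test]tesxc\ntcxvxcv\n[/test]";
$brokenTag  = "[te\nxt]123\n456[/tex\nt]";
$mismatched = "[abc]text[/def]";

$okResult = preg_match($pattern, $ok, $m);
var_dump($okResult);                           // int(1)
echo $m[1], "\n";                              // test
var_dump(preg_match($pattern, $brokenTag));    // int(0)
var_dump(preg_match($pattern, $mismatched));   // int(0)
```

Here $m[0] is the entire match, $m[1] the tag name and $m[2] the text between the tags.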

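Try this:
/\[(.*?)\][^.]+\[\/(.*?)\]/
Note that [^.] matches any character except a literal period (newlines included), so this only works as long as the text between the tags contains no "." character.
You can test these regexes at regexpal.com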

Just use the /s modifier.
Reference:
s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the
pattern matches all characters, including newlines. Without it,
newlines are excluded. This modifier is equivalent to Perl's /s
modifier. A negative class such as [^a] always matches a newline
character, independent of the setting of this modifier.
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
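The last sentence of that quote is easy to miss, so here is a small sketch of it. The test string and variable names are made up for illustration.

```php
<?php
$text = "a\nb";

// The dot does not match "\n" by default...
$plainDot = preg_match('/a.b/', $text);
// ...but does once the s (PCRE_DOTALL) modifier is set.
$dotAll = preg_match('/a.b/s', $text);
// A negated class matches "\n" with or without the modifier.
$negatedClass = preg_match('/a[^x]b/', $text);

var_dump($plainDot);      // int(0)
var_dump($dotAll);        // int(1)
var_dump($negatedClass);  // int(1)
```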

Related

How to include EOL in this regex? [duplicate]

I have a string that contains normal characters, whitespace characters and newline characters between <div> and </div>.
This regular expression doesn't work: /<div>(.*)<\/div>/. It is because .* doesn't match newline characters. How can I do this?
You need to use the DOTALL modifier (/s).
'/<div>(.*)<\/div>/s'
This might not give you exactly what you want because you are greedy matching. You might instead try a non-greedy match:
'/<div>(.*?)<\/div>/s'
You could also solve this by matching everything except '<' if there aren't other tags:
'/<div>([^<]*)<\/div>/'
Another observation is that you don't need to use / as your regular expression delimiters. Using another character means that you don't have to escape the / in </div>, improving readability. This applies to all the above regular expressions. Here's how it would look if you use '#' instead of '/':
'#<div>([^<]*)</div>#'
However all these solutions can fail due to nested divs, extra whitespace, HTML comments and various other things. HTML is too complicated to parse with Regex, so you should consider using an HTML parser instead.
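As a rough sketch of the parser route, assuming PHP's bundled DOM extension is available (the sample markup is made up):

```php
<?php
$html = "<div>first\nline</div><div>second</div>";

$doc = new DOMDocument();
// The libxml flags keep the parser quiet about imperfect markup.
$doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

// Collect the text of every <div>, line breaks included.
$texts = [];
foreach ($doc->getElementsByTagName('div') as $div) {
    $texts[] = $div->textContent;
}

print_r($texts);
```

Newlines inside the element need no special handling here, since the parser does not treat them differently from any other text.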
To match all characters, you can use this trick:
%\<div\>([\s\S]*)\</div\>%
You can also use the (?s) mode modifier, placed inside the delimiters at the start of the pattern. For example,
/(?s)<div>(.*?)<\/div>/
There shouldn't be any problem with just doing:
(.|\n)
This matches either any character except newline or a newline, so every character. It solved it for me, at least.
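To compare the variants mentioned so far, here is a small sketch that runs each of them against the same made-up input. The array keys are just labels of mine, the inline modifier is written inside the delimiters, and each quantifier is the non-greedy form.

```php
<?php
$html = "<div>one\ntwo</div>";

$patterns = [
    'dotall flag'  => '/<div>(.*?)<\/div>/s',
    'inline (?s)'  => '/(?s)<div>(.*?)<\/div>/',
    '[\s\S] class' => '%<div>([\s\S]*?)</div>%',
    'alternation'  => '/<div>((?:.|\n)*?)<\/div>/',
];

// Every variant should capture the same two-line text.
$captures = [];
foreach ($patterns as $label => $pattern) {
    preg_match($pattern, $html, $m);
    $captures[$label] = $m[1];
    echo $label, ': ', json_encode($m[1]), "\n";
}
```

The alternation form does the same job but makes the engine try two branches per character, so the flag or the character class is usually the nicer choice.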
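An option would be:
'/<div>((?:\n|.)*)<\/div>/i'
which matches any mix of newlines and the characters the dot matches. (Writing it as (\n*|.*) would not work: that alternation accepts either a run of newlines or a run of non-newline characters, never both together.)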
There is usually a flag in the regular expression compiler to tell it that dot should match newline characters.

Improving this regex to include what it matches until it matches a certain character

Can someone please help me improve this regex so that it captures everything that starts with http://, https://, or www and then continues until it reaches a ' or ". It includes punctuation and is case-insensitive.
Here is the regular expression right now:
(www|https?://)
/(?:https?:\/\/|www)[^'"]*/i
I escaped the slashes since they could conflict if you use /.../ notation. [^'"] is an inverted character class that allows everything but quotes.
Edit: I removed the caret to match any occurrence of the pattern, and added ?: to make the group non-capturing.
#(www|https?://).*?(?=['"])#i
The .*? makes the quantifier reluctant so it will stop at the first quote rather than the last.
The following regex will work:
(?:https?:\/\/|www)[^'"]*
You can walk through the details of the match at www.debuggex.com.
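A short PHP sketch of that pattern in use; the sample markup and the example.com / example.org addresses are invented for the demo.

```php
<?php
// Made-up input with one double-quoted and one single-quoted URL.
$html = '<a href="https://example.com/a?b=1">x</a> <img src=\'www.example.org/pic.png\'>';

// Start at http://, https:// or www and run until the next ' or ".
preg_match_all('/(?:https?:\/\/|www)[^\'"]*/i', $html, $m);

print_r($m[0]);
// Array
// (
//     [0] => https://example.com/a?b=1
//     [1] => www.example.org/pic.png
// )
```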

Regular expression for templating captures too much

With PHP I'm trying to get my regular expression to match both template references below. The problem is, it also grabs the </ul> from the first block of text. When I remove the /s flag, it only catches the second reference. What am I doing wrong?
/{{\%USERS}}(.*)?{{\%\/USERS}}/s
Here is my string.
<ul class="users">
{{%USERS}}
<li>{%}</li>
{{%/USERS}}
</ul>
{{%USERS}} hello?!{{%/USERS}}
Why is my expression catching too much or too little?
You probably need to use non-greedy quantifiers.
* and + are "greedy". They'll match as many characters as they can.
*? and +? are "non-greedy". They'll match only as many characters as required to move on to the next part of the regex.
So in the following test string:
<alpha><bravo>
<.+> will capture <alpha><bravo> (because . matches >< as well!).
<.+?> will capture <alpha>.
Why is my expression catching too much or too little?
It's catching too much because the quantifiers are greedy by default (see Li-aung Yip's answer, +1 for that).
If you remove the s modifier it matches only the second occurrence: that modifier makes the . also match newline characters, so without it the first block cannot match, because there are newlines between its tags.
See the non greedy answer
{{\%USERS}}(.*?){{\%\/USERS}}
here on Regexr, a good place to test regular expressions.
Btw., I removed the ? after the capturing group. It's not needed: * also matches the empty string, so there is no need to make the group optional as well.
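Here is a small sketch of the greedy and non-greedy versions run against the string from the question, using preg_match_all to collect every block. The nowdoc and the variable names are my own scaffolding.

```php
<?php
$template = <<<'TPL'
<ul class="users">
{{%USERS}}
<li>{%}</li>
{{%/USERS}}
</ul>
{{%USERS}} hello?!{{%/USERS}}
TPL;

// Greedy: one match running from the first opening marker to the last closing one.
$greedyCount = preg_match_all('/{{\%USERS}}(.*){{\%\/USERS}}/s', $template, $greedy);

// Non-greedy: one match per block.
$lazyCount = preg_match_all('/{{\%USERS}}(.*?){{\%\/USERS}}/s', $template, $lazy);

var_dump($greedyCount);              // int(1)
var_dump($lazyCount);                // int(2)
echo trim($lazy[1][0]), "\n";        // <li>{%}</li>
echo $lazy[1][1], "\n";              // " hello?!" (with the leading space)
```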
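Here is your regexp (PHP has no g modifier, so use it with preg_match_all to get every match; # is the delimiter so the / needs no escaping):
#{{%USERS}}([^{]+({%[^{]+)?){{%/USERS}}#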
