How to make dot match newline characters using regular expressions - php

I have a string that contains normal characters, whitespace and newline characters between <div> and </div>.
This regular expression doesn't work: /<div>(.*)<\/div>/. It fails because .* doesn't match newline characters. How can I make it match them?

You need to use the DOTALL modifier (/s).
'/<div>(.*)<\/div>/s'
This might not give you exactly what you want because you are greedy matching. You might instead try a non-greedy match:
'/<div>(.*?)<\/div>/s'
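To make the difference concrete, here is a minimal sketch comparing the three variants. The sample string and variable names are made up for illustration.

```php
<?php
// Hypothetical input: two <div> blocks, the first spanning two lines.
$html = "<div>first\nblock</div>\n<div>second</div>";

// Without /s the dot stops at newlines, so only the single-line div can match.
preg_match('/<div>(.*)<\/div>/', $html, $noFlag);

// With /s and a greedy quantifier the match runs to the LAST </div>.
preg_match('/<div>(.*)<\/div>/s', $html, $greedy);

// With /s and a lazy quantifier the match stops at the FIRST </div>.
preg_match('/<div>(.*?)<\/div>/s', $html, $lazy);

var_dump($noFlag[1], $greedy[1], $lazy[1]);
```

The greedy capture swallows the closing tag of the first block and the opening tag of the second, which is usually not what you want.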
You could also solve this by matching everything except '<' if there aren't other tags:
'/<div>([^<]*)<\/div>/'
Another observation is that you don't need to use / as your regular expression delimiters. Using another character means that you don't have to escape the / in </div>, improving readability. This applies to all the above regular expressions. Here's how it would look if you use '#' instead of '/':
'#<div>([^<]*)</div>#'
However all these solutions can fail due to nested divs, extra whitespace, HTML comments and various other things. HTML is too complicated to parse with Regex, so you should consider using an HTML parser instead.
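As a rough sketch of the parser route, assuming PHP's bundled DOM extension is available, something like the following copes with a nested div and a comment. The input string is invented for the example.

```php
<?php
// Hypothetical input with a nested div and a comment, which trip up the regexes.
$html = "<div>outer\n<!-- note --><div>inner</div>\ntail</div>";

// Collect libxml warnings internally instead of printing them for a fragment.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The first <div> in document order is the outer one.
$outer = $doc->getElementsByTagName('div')->item(0);

// textContent joins the text of all descendants; comment nodes are left out.
$text = $outer->textContent;

var_dump($text);
```

If you need the inner markup rather than just the text, you would walk $outer->childNodes instead.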

To match all characters, you can use this trick:
%\<div\>([\s\S]*)\</div\>%

You can also use the (?s) mode modifier. In PHP it goes inside the delimiters, at the start of the pattern. For example,
/(?s)<div>(.*?)<\/div>/
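A small sketch of the two modifier-free spellings side by side. In PHP the inline (?s) has to sit inside the delimiters. The sample string is an assumption.

```php
<?php
// Hypothetical input spanning two lines.
$html = "<div>line one\nline two</div>";

// (?s) inside the delimiters switches on DOTALL for the rest of the pattern.
preg_match('/(?s)<div>(.*?)<\/div>/', $html, $inline);

// [\s\S] matches any character at all, so no modifier is needed.
preg_match('%<div>([\s\S]*?)</div>%', $html, $charClass);

var_dump($inline[1] === $charClass[1]);
```

Both capture the same two-line text here.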

There shouldn't be any problem with just doing:
(.|\n)
This matches either any character except newline or a newline, so every character. It solved it for me, at least.

An option would be:
'/<div>(\n*|.*)<\/div>/i'
This matches either a run of newlines or a run of non-newline characters, so it only works when the text between the tags is not a mix of the two.
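The two alternation approaches behave quite differently on mixed content. Here is a quick check, with a made-up sample string:

```php
<?php
// Hypothetical input: text, a newline, then more text.
$html = "<div>a\nb</div>";

// Alternation of whole runs: newlines only OR non-newlines only, never a mix.
$runs = preg_match('/<div>(\n*|.*)<\/div>/i', $html);

// Repeated alternation: each step may be a newline or any other character.
$repeated = preg_match('/<div>((?:.|\n)*?)<\/div>/', $html, $m);

var_dump($runs, $repeated, $m[1]);
```

The first pattern finds nothing because the content mixes text and a newline. The second repeats the (.|\n) idea inside a non-capturing group and captures both lines.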

There is usually a flag in the regular expression compiler to tell it that dot should match newline characters.

Related

Regex include new lines

Current regular expression:
"/\[(.*?)\](.+?)\[\/(.*?)\]/"
Now when I have the following:
[test]textextext[/test]
it works just fine but it doesn't find
[test]tesxc
tcxvxcv
[/test]
How do I fix this? Help is greatly appreciated.
You can use the /s flag to make . match new lines. For example:
/\[(\w+)](.+?)\[\/\1]/s
Actually... the s flag alone may not be such a good idea, because that would allow:
[te
xt]123
456[/tex
t]
I think you might want this:
"(\[((?>[^\]]*))\](.+?)\[/\1\])s"
This uses a few clever tricks:
Once-only subpattern for the opening tag, prevents possible backtracking explosion
[^\]]* instead of .*?, so the once-only thing works better and this is more explicit as to what ends your repeating
End tag uses \1 to match the same as the opening tag, assuming you want [abc]...[/abc] and not [abc]...[/def]
Use () instead of // to delimit the regex - parentheses come in pairs so there's no need to escape anything inside (you'll notice I just have / in the closing tag of the pattern instead of \/), but also this can serve as a handy reminder that the first index in the match array is the entire match.
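A sketch of the backreference approach, written so that the first capturing group holds the complete tag name. The test strings are assumptions.

```php
<?php
// Hypothetical multi-line input from the question.
$text = "[test]tesxc\ntcxvxcv\n[/test]";

// Parentheses act as delimiters, so "/" needs no escaping.
// (?>...) is the once-only (atomic) group; the parentheses around it
// capture the whole tag name so that \1 can require the same closing tag.
$pattern = '(\[((?>[^\]]*))\](.+?)\[/\1\])s';

preg_match($pattern, $text, $m);
// $m[0] is the entire match, $m[1] the tag name, $m[2] the body.
var_dump($m[1], $m[2]);

// Mismatched tags are rejected because of the backreference.
$mismatch = preg_match($pattern, "[abc]x[/def]");
var_dump($mismatch);
```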
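Try with this:
/\[(.*?)\][^.]+\[\/(.*?)\]/
The negated class [^.] matches any character except a literal dot, newlines included, so this only works while the text between the tags contains no period.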
You can test these regex at regexpal.com
Just use the /s modifier.
Reference :
s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the
pattern matches all characters, including newlines. Without it,
newlines are excluded. This modifier is equivalent to Perl's /s
modifier. A negative class such as [^a] always matches a newline
character, independent of the setting of this modifier.
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
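The last sentence of that quote is easy to verify. In this small sketch (the sample string is made up), a negated class crosses the newline without /s, while the dot does not:

```php
<?php
// Hypothetical input with a newline between the tags.
$text = "[b]one\ntwo[/b]";

// No /s: the dot cannot cross the newline, so nothing matches.
$dot = preg_match('/\[b\](.*)\[\/b\]/', $text);

// No /s either, but the negated class [^\[] matches "\n" regardless.
$negated = preg_match('/\[b\]([^\[]*)\[\/b\]/', $text, $m);

var_dump($dot, $negated, $m[1]);
```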

Regular expression for templating captures too much

With PHP I'm trying to get my regular expression to match both template references below. The problem is, it also grabs the </ul> from the first block of text. When I remove the /s flag, it only catches the second reference. What am I doing wrong?
/{{\%USERS}}(.*)?{{\%\/USERS}}/s
Here is my string.
<ul class="users">
{{%USERS}}
<li>{%}</li>
{{%/USERS}}
</ul>
{{%USERS}} hello?!{{%/USERS}}
Why is my expression catching too much or too little?
You probably need to use non-greedy quantifiers.
* and + are "greedy". They'll match as many characters as they can.
*? and +? are "non-greedy". They'll match only as many characters as required to move on to the next part of the regex.
So in the following test string:
<alpha><bravo>
<.+> will capture <alpha><bravo> (because . matches >< as well!).
<.+?> will capture <alpha>.
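The same comparison as a runnable snippet (variable names are arbitrary):

```php
<?php
$subject = '<alpha><bravo>';

// Greedy: .+ grabs as much as possible, then backtracks to the last ">".
preg_match('/<.+>/', $subject, $greedy);

// Lazy: .+? extends one character at a time and stops at the first ">".
preg_match('/<.+?>/', $subject, $lazy);

var_dump($greedy[0], $lazy[0]);
```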
Why is my expression catching too much or too little?
It's catching too much because the quantifiers are greedy by default (see Li-aung Yip's answer, +1 for that).
If you remove the modifier s it matches only the second occurrence, because that modifier makes the . also match newline characters, so without it, it's not possible to match the first part, because there are newlines in between.
See the non greedy answer
{{\%USERS}}(.*?){{\%\/USERS}}
here on Regexr, a good place to test regular expressions.
Btw. I removed the ? after the capturing group. It's not needed: * also matches the empty string, so there is no need to make the group optional as well.
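Put together in PHP, a sketch might look like this. It assumes you want every block, hence preg_match_all. The # delimiters and the escaped braces are my own choices.

```php
<?php
// The template string from the question.
$template = <<<'TPL'
<ul class="users">
{{%USERS}}
<li>{%}</li>
{{%/USERS}}
</ul>
{{%USERS}} hello?!{{%/USERS}}
TPL;

// "#" delimiters avoid escaping "/"; the braces are escaped for clarity.
// /s lets the dot cross newlines, and .*? stops at the nearest closing marker.
$pattern = '#\{\{%USERS\}\}(.*?)\{\{%/USERS\}\}#s';

preg_match_all($pattern, $template, $matches);

// $matches[1] holds the body of each block.
print_r($matches[1]);
```

With the lazy quantifier each block is captured separately instead of everything from the first opening marker to the last closing one.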
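Here is your regexp (use it with preg_match_all, since PHP has no g modifier, and with # delimiters so the / needs no escaping):
#{{%USERS}}([^{]+({%[^{]+)?){{%/USERS}}#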

How to regex match text with different endings?

This is what I have at the moment.
<h2>Information</h2>\n +<p>(.*)<br />|</p>
(The whitespace between \n and + is a tab. I didn't know if there was a better way to represent one or more, but it seems to work.)
I'm trying to match the 'bla bla.' text, but my current regex doesn't quite work: it will match most of the line, but I want it to match only the 'bla bla.' part of either
<h2>Information</h2>
<p>bla bla.<br /><br />google<br />
or
<h2>Information</h2>
<p>bla bla.</p> other code...
Oh and my php code:
preg_match('#h2>Information</h2>\n +<p>(.*)<br />|</p>#', $result, $postMessage);
Don't use regex to parse HTML. PHP provides DOMDocument that can be used for this purpose.
Having said that you have some errors in your regular expression:
You need parentheses around the alternation.
You need lazy modifiers.
You can't type 'header' to match 'Information'.
With these changes it would look like this:
<h2>.*?</h2>\n\t+<p>.*?(<br />|</p>)
Your regular expression is also very fragile. For example, if the input contains spaces instead of tabs or the line ending is Windows-style, your regular expression will fail. Using a proper HTML parser will give a much more robust solution.
Use \s to match any whitespace character (including spaces, tabs, new-line feeds, etc.), e.g.
preg_match('#<h2>header</h2>\s*<p>(.*)<br />|</p>#', $result, $postMessage);
But, as already mentioned, do not use regular expressions to parse HTML.
The .* match should be non-greedy (match the minimum of arbitrary characters instead of the maximum), that is (.*?) I guess in PHP.
Try making your match non-greedy by using (.*?) in place of (.*)
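Combining the suggestions above (grouped alternation, lazy quantifier, \s for whitespace), a sketch could look like this. The two inputs are modelled on the question, and the second deliberately uses spaces and a Windows line ending.

```php
<?php
// Hypothetical inputs: one ends with <br />, the other with </p>.
$inputs = [
    "<h2>Information</h2>\n\t<p>bla bla.<br /><br />google<br />",
    "<h2>Information</h2>\r\n    <p>bla bla.</p> other code...",
];

// \s* tolerates tabs, spaces and either line-ending style;
// (.*?) is lazy; (?:...) groups the two possible endings without capturing.
$pattern = '#<h2>Information</h2>\s*<p>(.*?)(?:<br />|</p>)#s';

$results = [];
foreach ($inputs as $html) {
    if (preg_match($pattern, $html, $m)) {
        $results[] = $m[1];
    }
}

print_r($results);
```

This is still a regex over HTML, so the earlier caveats about fragility apply.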
