How to regex match text with different endings?

This is what I have at the moment.
<h2>Information</h2>\n +<p>(.*)<br />|</p>
(The whitespace before the + is a tab. I didn't know if there was a better way to represent one or more tabs, but it seems to work.)
I'm trying to match the 'bla bla.' text, but my current regex doesn't quite work: it matches most of the line, whereas I want it to stop at the first <br /> or </p>, as in
<h2>Information</h2>
<p>bla bla.<br /><br />google<br />
or
<h2>Information</h2>
<p>bla bla.</p> other code...
Oh, and my PHP code:
preg_match('#h2>Information</h2>\n +<p>(.*)<br />|</p>#', $result, $postMessage);

Don't use regex to parse HTML. PHP provides DOMDocument that can be used for this purpose.
Having said that you have some errors in your regular expression:
You need parentheses around the alternation.
You need lazy modifiers.
You can't type 'header' to match 'Information'.
With these changes, and keeping the capture group around the text you want, it would look like this:
<h2>.*?</h2>\n\t+<p>(.*?)(?:<br />|</p>)
Your regular expression is also very fragile. For example, if the input contains spaces instead of tabs or the line ending is Windows-style, your regular expression will fail. Using a proper HTML parser will give a much more robust solution.
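As a rough illustration of the fixes listed above, here is a minimal sketch that runs a grouped, lazy pattern against the two sample inputs from the question. The names $pattern, $samples and $results are made up for this example, and the inputs assume a Unix newline followed by a single tab before <p>, which the question does not state exactly.

```php
<?php
// Grouped alternation plus a lazy capture.
// (?:...) groups without capturing, so the wanted text stays in group 1.
$pattern = '#<h2>Information</h2>\n\t+<p>(.*?)(?:<br />|</p>)#';

// The two sample inputs from the question (assumed: "\n" then one tab).
$samples = [
    "<h2>Information</h2>\n\t<p>bla bla.<br /><br />google<br />",
    "<h2>Information</h2>\n\t<p>bla bla.</p> other code...",
];

$results = [];
foreach ($samples as $html) {
    if (preg_match($pattern, $html, $m)) {
        $results[] = $m[1]; // text between <p> and the first ending
    }
}

print_r($results);
```

Without the group, the pattern reads as "everything up to <br />" OR "a bare </p>", which is why the original expression behaved oddly.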

Use \s to match any whitespace character (spaces, tabs, newlines, etc.), and group the two endings so the alternation only applies to them, e.g.
preg_match('#<h2>header</h2>\s*<p>(.*?)(?:<br />|</p>)#', $result, $postMessage);
But, as already mentioned, do not use regular expressions to parse HTML.
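To see what \s* buys you, the following sketch tries one pattern against several whitespace variants of the same markup. The variants and the names $variants and $captured are hypothetical; they are only meant to show that tabs, spaces, Windows line endings and no whitespace at all are accepted.

```php
<?php
// \s* accepts any run of whitespace between </h2> and <p>,
// including none at all.
$pattern = '#<h2>Information</h2>\s*<p>(.*?)(?:<br />|</p>)#';

// Hypothetical variants of the markup from the question.
$variants = [
    'tab'    => "<h2>Information</h2>\n\t<p>bla bla.</p>",
    'spaces' => "<h2>Information</h2>\n    <p>bla bla.</p>",
    'crlf'   => "<h2>Information</h2>\r\n\t<p>bla bla.<br />more</p>",
    'none'   => "<h2>Information</h2><p>bla bla.</p>",
];

$captured = [];
foreach ($variants as $name => $html) {
    // Store the captured text, or null when the pattern does not match.
    $captured[$name] = preg_match($pattern, $html, $m) ? $m[1] : null;
}

print_r($captured);
```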

The .* match should be non-greedy (match the minimum of arbitrary characters instead of the maximum), which is written (.*?) in PHP.

Try making your match non-greedy by using (.*?) in place of (.*)

Related

How to include EOL in this regex? [duplicate]

I have a string that contains normal characters, whitespace characters and newline characters between <div> and </div>.
This regular expression doesn't work: /<div>(.*)<\/div>/. It is because .* doesn't match newline characters. How can I do this?

You need to use the DOTALL modifier (/s).
'/<div>(.*)<\/div>/s'
This might not give you exactly what you want because you are greedy matching. You might instead try a non-greedy match:
'/<div>(.*?)<\/div>/s'
You could also solve this by matching everything except '<' if there aren't other tags:
'/<div>([^<]*)<\/div>/'
Another observation is that you don't need to use / as your regular expression delimiter. Using another character means that you don't have to escape the / in </div>, improving readability. This applies to all the above regular expressions. Here's how it would look if you use '#' instead of '/':
'#<div>([^<]*)</div>#'
However all these solutions can fail due to nested divs, extra whitespace, HTML comments and various other things. HTML is too complicated to parse with Regex, so you should consider using an HTML parser instead.
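The differences between these variants are easiest to see side by side. This is a small sketch on a made-up input with two multi-line <div> blocks; the variable names are illustrative only.

```php
<?php
// Hypothetical input: two <div> blocks whose content spans lines.
$html = "<div>first\nblock</div>\n<div>second\nblock</div>";

// Without /s the dot stops at newlines, so nothing matches.
$plain = preg_match('#<div>(.*)</div>#', $html);

// With /s and a greedy quantifier, the capture runs to the LAST </div>.
preg_match('#<div>(.*)</div>#s', $html, $greedy);

// With /s and a lazy quantifier, the capture stops at the FIRST </div>.
preg_match('#<div>(.*?)</div>#s', $html, $lazy);

// [^<]* needs no modifier: a negated class already matches newlines.
preg_match('#<div>([^<]*)</div>#', $html, $negated);

var_dump($plain, $greedy[1], $lazy[1], $negated[1]);
```

The greedy capture swallows the closing tag of the first block and the opening tag of the second, which is usually not what is wanted.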

To match all characters, you can use this trick:
%\<div\>([\s\S]*)\</div\>%

You can also use the (?s) mode modifier, placed inside the pattern delimiters. For example,
/(?s)<div>(.*?)<\/div>/

There shouldn't be any problem with just doing:
(.|\n)
This matches either any character except newline or a newline, so every character. It solved it for me, at least.

An option would be:
'/<div>((?:\n|.)*)<\/div>/i'
Here each character inside the group may be either a newline or anything the dot matches, so content spanning several lines is captured.

There is usually a flag in the regular expression compiler to tell it that dot should match newline characters.
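The alternatives suggested in this thread can be checked against each other. The sketch below is only an illustration on a made-up input, with lazy quantifiers added throughout. It runs the /s flag, the inline (?s) modifier, the [\s\S] class and the newline-or-dot alternation on the same string and collects what each one captures.

```php
<?php
// Hypothetical input with a newline inside the <div>.
$html = "<div>line one\nline two</div>";

$patterns = [
    'flag'        => '#<div>(.*?)</div>#s',        // DOTALL modifier
    'inline'      => '#(?s)<div>(.*?)</div>#',     // inline mode modifier
    'class'       => '#<div>([\s\S]*?)</div>#',    // whitespace or non-whitespace
    'alternation' => '#<div>((?:.|\n)*?)</div>#',  // dot or newline
];

$out = [];
foreach ($patterns as $label => $pattern) {
    preg_match($pattern, $html, $m);
    $out[$label] = $m[1] ?? null; // null if the pattern did not match
}

print_r($out);
```

All four capture the same text here. The flag is generally the most readable choice, and the alternation tends to be the slowest because it backtracks per character.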

PHP regex lookbehind with wildcard

I have two strings in PHP:
$string = '<a href="http://localhost/image1.jpeg" /></a>';
and
$string2 = '[caption id="attachment_5" align="alignnone" width="483"]<a href="http://localhost/image1.jpeg" /></a>[/caption]';
I'm trying to match strings of the first type. That is strings that are not surrounded by '[caption ... ]' and '[/caption]'. So far, I would like to use something like this:
$pattern = '/(?<!\[caption.*\])(?!\[\/caption\])(<a.*><img.*><\/a>)/';
but PHP matches out the first string as well with this pattern, even though it is NOT preceded by '[caption' and zero or more characters followed by ']'. What gives? Why is this, and what's the correct pattern?
Thanks.

Variable length look-behind is not supported in PHP, so this part of your pattern is not valid:
(?<!\[caption.*\])
It should be warning you about this.
In addition, .* always matches the largest possible amount. Thus your pattern may result in a match that overlaps multiple tags. Instead, use [^>]* (match any run of characters that are not a closing bracket), because closing brackets should not occur inside the img tag.
To solve the look-behind problem, why not just check for the closing tag only? This should be sufficient (assuming the caption tags are only used in a way similar to what you have shown).
$pattern = '|(<a[^>]*><img[^>]*></a>)(?!\[/caption\])|';
When matching patterns that contain /, use another character as the pattern delimiter to avoid leaning toothpick syndrome. You can use nearly any non-alphanumeric character around the pattern.
Update: the previous regex is based on the example regex you gave, rather than the example data. If you want to match links that don't contain images, do this:
$pattern = '|(<a[^>]*>[^<]*</a>)(?!\[/caption\])|';
Note that this doesn't allow any tags in the middle of the link. If you allow tags (such as by using .*?), a regex could match something starting within the [caption] and ending elsewhere.
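Here is a short sketch of the lookahead-only approach applied to the two strings from the question. It uses the second pattern above (links without images), since the sample strings contain no <img> tag; the result variable names are made up for the example.

```php
<?php
// '|' is the delimiter, so the / in </a> needs no escaping.
// The negative lookahead rejects links directly followed by [/caption].
$pattern = '|(<a[^>]*>[^<]*</a>)(?!\[/caption\])|';

$string  = '<a href="http://localhost/image1.jpeg" /></a>';
$string2 = '[caption id="attachment_5" align="alignnone" width="483"]'
         . '<a href="http://localhost/image1.jpeg" /></a>[/caption]';

// preg_match returns 1 on a match and 0 when nothing matches.
$plainMatches   = preg_match($pattern, $string, $m1);
$captionMatches = preg_match($pattern, $string2, $m2);

var_dump($plainMatches, $captionMatches);
```

This relies on the assumption stated above that a link inside a caption is always immediately followed by [/caption].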

I don't see how your regexp could match either string, since you're looking for <a.*><img.*><\/a>, and neither anchor contains an <img... tag. Also, the two subexpressions looking for and prohibiting the caption bits look oddly positioned to me. Finally, you need to ensure your tag-matching bits don't act greedy, i.e. use [^>]* instead of .*.
Do you mean something like this?
$pattern = '/(<a[^>]*>(<img[^>]*>)?<\/a>)(?!\[\/caption\])/';

Lookbehind doesn't allow patterns of non-fixed length, i.e. ones using quantifiers such as *, + or ?. I think /<a.*><\/a>(?!\[\/caption\])/ is enough for your requirement.

Using preg_match_all to get items from HTML

I have a number of items in a table, formatted like this
<td class="product highlighted">
Item Name
</td>
and I am using the following PHP code
$regex_pattern = "/<td class=\"product highlighted\">(.*)<\/td>/";
preg_match_all($regex_pattern,$buffer,$matches);
print_r($matches);
I am not getting any output, yet I can see the items in the html.
Is there something wrong with my regexp?

Apart from your using regex to parse HTML, yes, there is something wrong: The dot doesn't match newlines.
So you need to use
$regex_pattern = "/<td class=\"product highlighted\">(.*?)<\/td>/s";
The /s modifier allows the dot to match any character, including newlines. Note the reluctant quantifier .*? to avoid matching more than one tag at once.
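A small sketch of this fix on a made-up $buffer with two highlighted cells. The # delimiter and the trim() step are additions for readability, not part of the original code.

```php
<?php
// Hypothetical page fragment with the cell layout from the question.
$buffer = "<tr>\n<td class=\"product highlighted\">\nItem Name\n</td>\n"
        . "<td class=\"product highlighted\">\nSecond Item\n</td>\n</tr>";

// /s lets the dot cross newlines; .*? stops at the nearest </td>.
$regex_pattern = '#<td class="product highlighted">(.*?)</td>#s';

$count = preg_match_all($regex_pattern, $buffer, $matches);

// Group 1 still contains the surrounding newlines, so trim each entry.
$items = array_map('trim', $matches[1]);

print_r($items);
```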

In order to match your example, you will need to add the dot all flag, s, so the . will match newlines.
Try the following.
$regex_pattern = "/<td class=\"product highlighted\">(.*?)<\/td>/s";
Also note that I changed the capture to non-greedy, (.*?). It's best to do so when matching open ended text.
It's worth noting regular expressions are not the right tool for HTML parsing, you should look into DOMDocument. However, for such a simple match you can get away with regular expressions provided your HTML is well-formed.
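For comparison, a rough sketch of the DOMDocument route, assuming the PHP DOM extension is available. The sample $buffer and the variable names are made up, and the XPath expression matches the class attribute exactly as written in the question, so it would miss cells whose classes appear in a different order.

```php
<?php
// Hypothetical table using the cell layout from the question.
$buffer = '<table><tr>
<td class="product highlighted">
Item Name
</td>
<td class="product">Other</td>
<td class="product highlighted">
Second Item
</td>
</tr></table>';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($buffer);

// Select cells whose class attribute is exactly "product highlighted".
$xpath = new DOMXPath($doc);
$viaDom = [];
foreach ($xpath->query('//td[@class="product highlighted"]') as $td) {
    $viaDom[] = trim($td->textContent);
}

print_r($viaDom);
```

Unlike the regex, this does not depend on attribute quoting, whitespace or line breaks inside the cell.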
