RegEx: First <br> within a paragraph


How to capture and remove the first occurrence of a <br/> tag within a paragraph.
<p><br/>Hello World</p>
Becomes:
<p>Hello World</p>
But importantly the following remains unchanged:
<p><br/></p>
Remove leading <br> tags from paragraphs that contain text
What I have so far:
preg_replace('/(<p>\s*<br *\/?>(.*?)<\/p>)+/si', '<p>$2</p>', $html);
Although this captures <p><br></p> instances...

Here is how you would do it using PHP's built-in DOMDocument and DOMXPath classes:
$html = "<div><p><br/>Hello World</p><p><br/></p><p> <br> </p></div>";
$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);
// find <br> within a <p> that has text content
$breaks = $xpath->query("//p[normalize-space()!='']/br");
$breaks = $xpath->query("//p[text()!='']/br");
// and remove them
foreach ($breaks as $br) {
$br->parentNode->removeChild($br);
}
echo $doc->saveHTML();
Note that two lines assign a value to $breaks; keep only the one that meets your requirements. The first strips <br> only from <p> elements that have non-whitespace characters between the <p> and </p>, while the second also strips them from <p> elements containing only whitespace. Either query selects every <br> child of a matching paragraph, not only the first one.
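If only the leading <br> should go, the XPath query can be narrowed. The following is a minimal sketch, not part of the answer above: the function name removeLeadingBreaks is hypothetical, and it assumes the markup has a single root element, as in the example.

```php
<?php
// Hypothetical helper: strips a <br> only when it is the first thing
// inside a <p> that also contains text.
function removeLeadingBreaks(string $html): string
{
    $doc = new DOMDocument();
    // Assumes $html has a single root element (here: a wrapping <div>).
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($doc);

    // <br> children of a non-empty <p> that have no element and no
    // non-whitespace text before them.
    $query = "//p[normalize-space()!='']"
           . "/br[not(preceding-sibling::*)"
           . " and not(preceding-sibling::text()[normalize-space()!=''])]";

    foreach (iterator_to_array($xpath->query($query)) as $br) {
        $br->parentNode->removeChild($br);
    }
    return $doc->saveHTML();
}

echo removeLeadingBreaks('<div><p><br/>Hello<br/>World</p><p><br/></p></div>');
```

The <br> between "Hello" and "World" is left alone because it has preceding text, and the paragraph that holds nothing but a <br> is not selected at all.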

Parsing HTML with regex is not recommended. But for quick, temporary work you can use this regex to capture a line break <br/> that is preceded by a <p> tag and, optionally, some text, with a negative lookahead ensuring it is not immediately followed by the closing </p> tag.
<p>.*?\K<br\/>(?!<\/p>)
and replace such captured <br/> with empty string hence removing it.
Explanation:
<p>.*? --> Match a paragraph tag followed by any characters in non-greedy way
\K --> Reset whatever matched as we don't intend to replace that
<br\/>(?!<\/p>) --> Match a line break tag that is not immediately followed by closing paragraph tag, which will be replaced with empty string.
Here is some sample PHP code:
$html = '<p><br/>Hello World</p>';
$html = preg_replace('/<p>.*?\K<br\/>(?!<\/p>)/si', '', $html);
echo $html. "\n";
$html = '<p><br/></p>';
$html = preg_replace('/<p>.*?\K<br\/>(?!<\/p>)/si', '', $html);
echo $html. "\n";
which prints the following output:
<p>Hello World</p>
<p><br/></p>
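One caveat with the pattern above: with the s modifier, .*? can run past the end of one paragraph and into the next, so a <br/> further down the document may be matched too. A narrower sketch, under the assumptions that the <p> tag has no attributes and that the paragraph text starts with plain characters rather than another tag (stripLeadingBr is a hypothetical name):

```php
<?php
// Hypothetical helper: removes a <br> that directly follows <p>, but only
// when some non-whitespace, non-tag character comes right after it.
function stripLeadingBr(string $html): string
{
    return preg_replace('~(<p>)\s*<br\s*/?>(?=\s*[^\s<])~i', '$1', $html);
}

echo stripLeadingBr('<p><br/>Hello World</p>'), "\n"; // <p>Hello World</p>
echo stripLeadingBr('<p><br/></p>'), "\n";            // <p><br/></p>
```

Because the lookahead requires a character that is neither whitespace nor <, a paragraph such as <p><br/><b>Hi</b></p> is left unchanged; that is a limitation of this sketch.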

If there are more rules, we can pass arrays to preg_replace. In my solution, the first pattern looks for a <br /> followed by text, and the second looks for a <br /> without text. Both searches are anchored at the beginning of the string (/^...).
preg_replace(['/^(<p>\s*(<br *\/?>)([a-zA-Z0-9 ]+)<\/p>)+/si', '/^(<p>\s*(<br *\/?>)<\/p>)+/si'], ['<p>$3</p>', '$0'], $html);

Related

Strip tags and replace all br and p tags with a single space

What is the regex to strip all html tags and where there are <br> and <p> tags replace with a single space and remove all line breaks?
e.g:
<h1>Heading</h1>
<br>
<br />
hyperlink
<p></p>
<p>paragraph1</p>
<p>paragraph2</p>
Should become:
Heading hyperlink paragraph1 paragraph2
I have tried the following:
$string = preg_replace( ["/<br\s*\/?>/i","/<\/p\s*>/i"]," ",$string);
$string = preg_replace(["/<\/?[^>]+>/", "/\r?\n|\r/"],"",$string);
Which gives me:
Heading hyperlink paragraph1 paragraph2
any ideas of a single line or more elegant solution which actually works?

This is what I would do:
$a = '<h1>Heading</h1>
<br>
<br />
hyperlink
<p></p>
<p>paragraph1</p>
<p>paragraph2</p>';
echo trim(preg_replace(['/<[^>]*>/','/\s+/'],' ', $a));
Output
Heading hyperlink paragraph1 paragraph2
The first regex removes tags replacing them with a space, the second one takes multiple spaces and changes it to one.
This works pretty well, but I can see a way that it could deviate from what was specifically requested.
What is the regex to strip all html tags and where there are <br> and <p> tags replace with a single space and remove all line breaks
So if you want the "full" solution, you can do this:
$a = '<h1>Heading</h1>
<br>
<br />
hyperlink
<p></p>
<p><big>p</big>aragraph1</p><p>paragraph2</p>';
echo preg_replace([
'/<(?:br|p)[^>]*>/i', //replace br p with ' '
'/<[^>]*>/', //replace any tag with ''
'/\s+/', //remove run on space
'/^\s+|\s+$/' //trim
],[
' ', '', ' ', ''
], $a);
Please note that I added a <big> tag and removed the whitespace between the <p> tags. These changes were made to highlight a few things.
For example, if you take the text from the second example and use it in the first, you will get this (because of the <big> tag):
Heading hyperlink p aragraph1 paragraph2
The updated example outputs correctly. But, and this is a big but, I changed the input text, so it may not be necessary to over-complicate it.
The <p> tag thing just shows that it puts space in between them before removing all the HTML tags with ''.
UPDATE
#ArtisticPhoenix how would I accommodate <p>&nbsp;</p>
First I would convert the string using html_entity_decode. However, there are a few sticky points with that, which have to do with encoding. So this is the correct way to do it:
$a = '<h1>Heading</h1>
<br>
<br />
hyperlink
<p>&nbsp;</p>
<p><big>p</big>aragraph1</p><p>paragraph2</p>';
//convert entities using UTF-8
$a = html_entity_decode($a, ENT_QUOTES, 'UTF-8');
echo preg_replace([
'/<(?:br|p)[^>]*>/i', //replace br p with ' '
'/<[^>]*>/', //replace any tag with ''
'/\s+/u', //remove run on space - replace using the unicode flag
'/^\s+|\s+$/u' //trim - replace using the unicode flag
],[
' ', '', ' ', ''
], $a);
Please note the addition of the u flag to the regex above /\s+/u and /^\s+|\s+$/u.
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
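The problem comes from &nbsp; being decoded into a non-breaking space (U+00A0, character code 160) instead of a regular space (ASCII 32), which \s does not reliably match without the u flag. Treating the pattern and subject as UTF-8 sorts it out, as shown above.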
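To see what the decoding step produces, the bytes can be inspected directly. This is a small illustration only; the 'a&nbsp;b' input is made up for the purpose.

```php
<?php
// &nbsp; decodes to U+00A0, which is the two bytes C2 A0 in UTF-8.
$decoded = html_entity_decode('a&nbsp;b', ENT_QUOTES, 'UTF-8');
echo bin2hex($decoded), "\n"; // 61c2a062

// With the u flag, \s also matches the non-breaking space.
$collapsed = preg_replace('/\s+/u', ' ', $decoded);
echo bin2hex($collapsed), "\n"; // 612062
```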

Treating HTML like a string and using regular expressions is never a good idea. The only decent solution that doesn't involve a DOM parser would be using PHP's built-in strip_tags function (which uses a state machine, so is still vulnerable to potential problems with broken HTML) and then you can compact the resulting whitespace with a regex:
<?php
$html = '<h1>Heading</h1>
<br>
<br />
hyperlink
<p></p>
<p>paragraph1</p>
<p>paragraph2</p>';
echo preg_replace("/\s+/", " ", strip_tags($html));
Output:
Heading hyperlink paragraph1 paragraph2
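One thing to watch with strip_tags is that it removes tags without leaving anything in their place, so adjacent paragraphs written without whitespace between them (such as <p>paragraph1</p><p>paragraph2</p>) run together. A possible workaround, sketched here with a hypothetical htmlToText helper, is to put a space in front of every tag before stripping:

```php
<?php
// Hypothetical helper: strips tags but keeps a word boundary where a tag was.
function htmlToText(string $html): string
{
    // Put a space before every "<" so neighbouring blocks cannot merge.
    $spaced = str_replace('<', ' <', $html);
    // Remove the tags, collapse whitespace runs, and trim the ends.
    return trim(preg_replace('/\s+/', ' ', strip_tags($spaced)));
}

echo htmlToText('<p>paragraph1</p><p>paragraph2</p>'), "\n"; // paragraph1 paragraph2
```

The trade-off is the one mentioned in another answer here: inline tags in the middle of a word, such as <big>p</big>aragraph1, also get a space inserted.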

The way to do it is by using two patterns.
P1: <[\/\d\w]+.*?>
which will clean all the tags.
P2: [\n\s]+, to be replaced by a single space.
Example:
// PCRE patterns in PHP need delimiters; without them < > and [ ] would be taken as the delimiters
$string = preg_replace('/<[\/\d\w]+.*?>/', '', $string);
$string = preg_replace('/[\n\s]+/', ' ', $string);

You can use this
<\s*\/?\s*br[^>]*>|<\s*\/?\s*p[^>]*>|\n
Explanation
<\s*\/?\s*br[^>]*> - Matches <br> or </br> or <br/> with any number of white space and matches attributes also.
<\s*\/?\s*p[^>]*> - Matches <p> or </p> or <p/> with any number of white space matches attributes also.
\n - Matches new line.

You can group multiple tags that are surrounded by white spaces and replace them with a single space. The regex to be replaced would be,
(\s*<[^>]+>\s*)+
This gives you a single space in place of each run of tags. Finally, use trim() to get rid of the leading and trailing spaces that you may not need.
Here is the PHP code:
$html = '<h1>Heading</h1>
<br>
<br />
hyperlink
<p></p>
<p>paragraph1</p>
<p>paragraph2</p>';
echo trim(preg_replace("/(\s*<[^>]+>\s*)+/", " ", $html));
Prints,
Heading hyperlink paragraph1 paragraph2

You could keep what you have and remove the extra spaces:
$stripped = preg_replace('/\s+/', ' ', $string);
That returns:
Heading hyperlink paragraph1 paragraph2

PHP: Clean HTML by merging line breaks and removing whitespaces properly

I am using a WYSIWYG editor and have a bunch of regular expressions that take care of dirty HTML. Reason: My users often hit the enter key too often and produce many redundant new lines such as:
<br><br><br> ...
<p> <br /> </p>
<p> <br /><br /> </p>
<p> <br /> </p>
<p> <br /> </p>
<p> <br /> </p>
and many more varieties including p, &nbsp; and br
This is how I try to fight such inputs currently, trying to merge many successive line breaks into 1, using many different regular expressions:
// merge empty p tags into one
// http://stackoverflow.com/q/16809336/1066234
$content = preg_replace('/((<p\s*\/?>\s*)&nbsp;(<\/p\s*\/?>\s*))+/im', "<p>&nbsp;</p>\n", $content);
// remove sceditor's: <p>\n<br>\n</p> from end of string
// http://stackoverflow.com/questions/25269584/how-to-replace-pbr-p-from-end-of-string-that-contain-whitespaces-linebrea
// \s* matches any number of whitespace characters (" ", \t, \n, etc)
// (?:...)+ matches one or more (without capturing the group)
// $ forces match to only be made at the end of the string
$content = preg_replace("/(?:<p>\s*(<br>\s*)+\s*<\/p>\s*)+$/", "", $content);
// remove sceditor's double: http://http://
$content = str_replace('http://http://', 'http://', $content);
// remove spaces from end of string (&nbsp;)
$content = preg_replace('/(&nbsp;)+$/', '', $content);
// remove also <p><br></p> from end of string
$content = preg_replace('/(<p><br><\/p>)+$/', '', $content);
// remove line breaks from end of string ($ anchors the match at the end of the string)
// html with <p>&nbsp;</p>
$content = preg_replace('/(<p>&nbsp;<\/p>)+$/', '', $content);
$content = preg_replace('/(<br>)+$/', '', $content);
// remove line breaks from beginning of string
$content = preg_replace('/^(<p>&nbsp;<\/p>)+/', '', $content);
I am searching for a new solution. Is there any HTML parser that I can tell to merge line breaks and whitespaces? Or maybe someone has another approach to that problem.
The regex solutions above do not seem proper enough because new combinations of line break "attempts" by my users slip through.

I have developed the following snippet, which collapses duplicate <br> tags into one.
<?php
$content = "<h1>Hello World</h1><p>Test\r\n<br>\r\n<br >\r\n<br >\r\n<br/>Test\r\n<br />\r\n<br /></p>";
echo "<code>{$content}</code><hr>\r\n\r\n\r\n\r\n";
$contentStripped = preg_replace('/(<br {0,}\/{0,1}>(\\r|\\n){0,}){2,}/', '<br class="reduced" />', $content);
echo "<code>{$contentStripped}</code>\r\n\r\n\r\n\r\n";
You may have to add more test cases.
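The same idea can be extended to the empty paragraphs from the question. The following is a rough sketch rather than a complete cleaner: collapseBlankMarkup is a hypothetical name, and it assumes the <p> tags carry no attributes.

```php
<?php
// Hypothetical helper: reduces runs of <br> tags and of "empty" paragraphs.
function collapseBlankMarkup(string $content): string
{
    // Two or more <br> tags (any spelling), optionally separated by
    // whitespace, become a single <br>.
    $content = preg_replace('~(?:<br\s*/?>\s*){2,}~i', '<br>', $content);

    // A run of paragraphs holding only whitespace, &nbsp; or <br>
    // becomes one empty paragraph.
    return preg_replace(
        '~(?:<p>(?:\s|&nbsp;|<br\s*/?>)*</p>\s*)+~i',
        "<p>&nbsp;</p>\n",
        $content
    );
}

echo collapseBlankMarkup("<p>Text</p><p>&nbsp;<br />&nbsp;</p>\n<p><br></p><p>More</p>");
```

Note that a single empty paragraph is also normalized to <p>&nbsp;</p>, which may or may not be what you want.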

You can use nl2br(strip_tags($content)) instead of the long code above.

How to remove paragraph By using php

I have a paragraph like this:
<p>
<p>This is new content</p>
</p>
I want to remove the outer paragraph tag using PHP.
How can this be done? Can you please explain?

At face value, you can do
$content = preg_replace('~<p>\s*<p>(.*?)</p>\s*</p>~s','<p>$1</p>',$content);
.. but I have a sneaking suspicion your real content is more complex than this...
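If the real content turns out to be nested more than one level deep, the same replacement can be repeated until nothing changes. A minimal sketch, assuming attribute-free <p> tags (unwrapNestedParagraphs is a hypothetical name):

```php
<?php
// Hypothetical helper: unwraps <p> tags that directly wrap other <p> tags.
function unwrapNestedParagraphs(string $content): string
{
    do {
        // Drop one outer <p>...</p> layer per match; $count receives the
        // number of replacements made in this pass.
        $content = preg_replace(
            '~<p>\s*(<p>.*?</p>)\s*</p>~s',
            '$1',
            $content,
            -1,
            $count
        );
    } while ($count > 0);

    return $content;
}

echo unwrapNestedParagraphs("<p>\n<p>This is new content</p>\n</p>"), "\n";
// <p>This is new content</p>
```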

You can use str_replace function. This may not be the correct answer. But it works.
Try this
$string = "<p><p>This is new content</p></p>";
$string = str_replace("<p><p>","<p>",$string);
$string = str_replace("</p></p>","</p>",$string);
now $string will have "<p>This is new content</p>"

You can try this:
$str = preg_replace('!<p>(.*?)</p>!Uis', '$1', $str);
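If you provide better context I can provide a better answer. This takes what is inside a pair of <p></p> tags and returns it. If the <p> tags are nested, it removes one level of nesting. If they aren't nested, it removes the <p> tag. Note that the U modifier inverts greediness, so (.*?) is greedy here and a single match runs from the first <p> to the last </p> in the string.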

Replace spaces and line breaks between angle braces and then replace two paragraph tags with one tag.
$string = preg_replace('/>\s+</', '><', $string);
$string = str_replace("<p><p>","<p>",$string);
$string = str_replace("</p></p>","</p>",$string);

regex to replace input tags

I have input Tags like this:
<input style="font-size:12px;width:100%" type="text" value="http://www.google.de/ggg">
And want to replace them with nothing.
This is, what I tried:
$pattern = '/<input style="font-size:12px;width:100%" type="text" value="(.+?)">/';
echo preg_replace( $pattern, "", $content )
I did not succeed with that.
What is the error in my function? Maybe the regex?
A function which replace all input tags inside the string would be fine.

This replaces all input tags, even when whitespace follows the opening <. The i modifier ignores case, and the x modifier ignores whitespace inside the pattern, which mainly serves to make the regexp more readable:
echo preg_replace("/<\s* input [^>]+ >/xi", "", $content);

You probably need some pattern modifiers. I am assuming you have a block of text, so try adding sU after your pattern.

Using DOM objects is generally better for handling HTML than regexp.
$content = '<input style="font-size:12px;width:100%" type="text" value="asd"/>' ;
$dom = new DOMDocument ;
$dom->loadHTML($content);
$node = $dom->getElementsByTagName("input")->item(0);
$node->setAttribute("value", "");
echo $dom->saveHTML() ;
This also can explain why: RegEx match open tags except XHTML self-contained tags
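The snippet above blanks the value attribute of the first input. Since the question asks for the tags to be removed altogether, here is a sketch along the same lines that drops every <input> element. The removeInputs name and the wrapping <div> are assumptions made for illustration.

```php
<?php
// Hypothetical helper: removes every <input> element from an HTML fragment.
function removeInputs(string $content): string
{
    $dom = new DOMDocument();
    // Wrapping in a <div> is an assumption so the fragment has one root element.
    $dom->loadHTML(
        '<div>' . $content . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    // Copy the live node list to an array before removing nodes from the tree.
    foreach (iterator_to_array($dom->getElementsByTagName('input')) as $input) {
        $input->parentNode->removeChild($input);
    }
    return $dom->saveHTML();
}

echo removeInputs('<p>Link: <input type="text" value="http://www.google.de/ggg"></p>');
```

The returned markup still includes the wrapping <div>, which you may want to strip again afterwards.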

PHP regular expressions to clean duplicated HTML tags

I am trying to get a regular expression to work, but not having a whole lot of luck.
The source file I am reading (poorly formatted, but there is nothing I can do about that) has the following between elements:
<BR>
<BR>
<BR>
how do I match this with a php regular expression?

Something like this:
preg_match('/(<br>\s*){3}/i', $str, $matches);
This is a bit more lenient than your example - it does a case-insensitive match and matches any whitespace between the <br>s, not just newlines.
To match 3 or more instead of 3:
preg_match('/(<br>\s*){3,}/i', $str, $matches);

If you just want to replace the <BR> instances then you're better off doing a string replacement. It is a lot faster than regex.
$newstr = str_replace('<BR>', 'replacement...', $str);

My take on it
<?php
$html = <<<HTML
<BR>
<BR>
<BR>
<p>^^ Replace 3 consecutive BR tags with nothing</p>
<BR>
<BR>
<p>^^ those should stay, there's only 2 of them</p>
<BR>
<BR>
<BR>
<p>^^ But those should go, whitespace and newlines shouldn't matter
HTML;
echo preg_replace( "/(?:<br>\s*){3}/i", '', $html );
