PHP: Clean HTML by merging line breaks and removing whitespaces properly - php

I am using a WYSIWYG editor and have a bunch of regular expressions that clean up dirty HTML. The reason: my users hit the Enter key too often and produce many redundant line breaks such as:
<br><br><br> ...
<p> <br /> </p>
<p> <br /><br /> </p>
<p> <br /> </p>
<p> <br /> </p>
<p> <br /> </p>
and many more varieties including p, and br
This is how I currently fight such input, trying to merge successive line breaks into one with many different regular expressions:
// merge empty p tags into one
// http://stackoverflow.com/q/16809336/1066234
$content = preg_replace('/((<p\s*\/?>\s*) (<\/p\s*\/?>\s*))+/im', "<p> </p>\n", $content);
// remove sceditor's: <p>\n<br>\n</p> from end of string
// http://stackoverflow.com/questions/25269584/how-to-replace-pbr-p-from-end-of-string-that-contain-whitespaces-linebrea
// \s* matches any number of whitespace characters (" ", \t, \n, etc)
// (?:...)+ matches one or more (without capturing the group)
// $ forces match to only be made at the end of the string
$content = preg_replace("/(?:<p>\s*(<br>\s*)+\s*<\/p>\s*)+$/", "", $content);
// remove sceditor's double: http://http://
$content = str_replace('http://http://', 'http://', $content);
// remove spaces from end of string ( )
$content = preg_replace('/( )+$/', '', $content);
// remove also <p><br></p> from end of string
$content = preg_replace('/(<p><br><\/p>)+$/', '', $content);
// remove line breaks from end of string - $ is end of line, +$ is end of line including \n
// html with <p> </p>
$content = preg_replace('/(<p> <\/p>)+$/', '', $content);
$content = preg_replace('/(<br>)+$/', '', $content);
// remove line breaks from beginning of string
$content = preg_replace('/^(<p> <\/p>)+/', '', $content);
I am searching for a new solution. Is there any HTML parser that I can tell to merge line breaks and whitespaces? Or maybe someone has another approach to that problem.
The regex solutions above do not seem proper enough because new combinations of line break "attempts" by my users slip through.

I have developed the following snippet that removes duplicate br tags.
<?php
$content = "<h1>Hello World</h1><p>Test\r\n<br>\r\n<br >\r\n<br >\r\n<br/>Test\r\n<br />\r\n<br /></p>";
echo "<code>{$content}</code><hr>\r\n\r\n\r\n\r\n";
$contentStripped = preg_replace('/(<br {0,}\/{0,1}>(\\r|\\n){0,}){2,}/', '<br class="reduced" />', $content);
echo "<code>{$contentStripped}</code>\r\n\r\n\r\n\r\n";
You may have to add more test cases.
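As a rough sketch of what "more test cases" could look like, the snippet below runs a simplified variant of the pattern over a few inputs. The helper name collapseBreaks and the sample inputs are made up for illustration, and the replacement is a plain <br> rather than the class="reduced" tag used above.

```php
<?php
// Hypothetical helper: collapse a run of two or more <br> tags
// (any case, optional space, optional slash, whitespace in between) into one.
function collapseBreaks(string $html): string
{
    return preg_replace('/(?:<br\s*\/?>\s*){2,}/i', '<br>', $html);
}

// Input => expected result. Extend this list with whatever users produce.
$cases = [
    'a<br><br>b'               => 'a<br>b',
    "a<br />\r\n<br/>\n<BR >b" => 'a<br>b',
    'a<br>b'                   => 'a<br>b', // a single break is left alone
];

foreach ($cases as $input => $expected) {
    $actual = collapseBreaks($input);
    echo ($actual === $expected ? 'ok  ' : 'FAIL'), ' ', json_encode($input), "\n";
}
```

Note that this variant also swallows whitespace directly after the last <br> in a run, which may or may not be what you want.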

You can use nl2br(strip_tags($content)) instead of the long code above. Note that strip_tags removes every tag, not just the redundant line breaks.
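Since the question asks about a parser-based approach, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath. It is not taken from any of the answers: the function name removeEmptyParagraphs is hypothetical, and it assumes the input is ASCII or uses entities such as &nbsp; (loadHTML does not assume UTF-8 by default).

```php
<?php
// Hypothetical helper: drop <p> elements that contain nothing but
// whitespace, non-breaking spaces and <br> tags.
function removeEmptyParagraphs(string $html): string
{
    $doc = new DOMDocument();
    // Wrapping in a <div> gives the fragment a single root element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($doc);

    // DOMXPath::query returns a static list, so removing nodes in the loop is safe.
    foreach ($xpath->query('//p') as $p) {
        // Text is blank if it is only whitespace or U+00A0 (decoded &nbsp;).
        $onlyBlankText = preg_match('/^[\s\x{A0}]*$/u', $p->textContent) === 1;
        // Any descendant element other than <br> (e.g. <img>) counts as content.
        $otherElements = $xpath->query('.//*[not(self::br)]', $p)->length;
        if ($onlyBlankText && $otherElements === 0) {
            $p->parentNode->removeChild($p);
        }
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo removeEmptyParagraphs('<p>Hello</p><p> <br> </p><p><br><br></p><p>World</p>');
```

The advantage over the regex list is that new combinations of whitespace, <br> and &nbsp; inside a paragraph are covered by one rule instead of one pattern each.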

Related

Strip tags and replace all br and p tags with a single space

What is the regex to strip all html tags and where there are <br> and <p> tags replace with a single space and remove all line breaks?
e.g:
<h1>Heading</h1>
<br>
<br />
hyperlink
<p></p>
<p>paragraph1</p>
<p>paragraph2</p>
Should become:
Heading hyperlink paragraph1 paragraph2
I have tried the following:
$string = preg_replace( ["/<br\s*\/?>/i","/<\/p\s*>/i"]," ",$string);
$string = preg_replace(["/<\/?[^>]+>/", "/\r?\n|\r/"],"",$string);
Which gives me:
Heading hyperlink paragraph1 paragraph2
Any ideas for a single-line or more elegant solution that actually works?
This is what I would do:
$a = '<h1>Heading</h1>
<br>
<br />
hyperlink
<p></p>
<p>paragraph1</p>
<p>paragraph2</p>';
echo trim(preg_replace(['/<[^>]*>/','/\s+/'],' ', $a));
Output
Heading hyperlink paragraph1 paragraph2
The first regex removes tags replacing them with a space, the second one takes multiple spaces and changes it to one.
This works pretty well, but I can see a way that it could deviate from what was specifically requested.
What is the regex to strip all html tags and where there are <br> and <p> tags replace with a single space and remove all line breaks
So if you want the "full" solution, you can do this:
$a = '<h1>Heading</h1>
<br>
<br />
hyperlink
<p></p>
<p><big>p</big>aragraph1</p><p>paragraph2</p>';
echo preg_replace([
'/<(?:br|p)[^>]*>/i', //replace br p with ' '
'/<[^>]*>/', //replace any tag with ''
'/\s+/', //remove run on space
'/^\s+|\s+$/' //trim
],[
' ', '', ' ', ''
], $a);
Please note I added a <big> tag and removed the whitespace between the <p> tags. This was done to highlight a few things.
For example if you take the text from the second example and use it in the first you will get this (because the big tag):
Heading hyperlink p aragraph1 paragraph2
The updated example outputs correctly. But, and this is a big but, I changed the input text, so it may not be necessary to over-complicate it.
The <p> tag thing just shows that it puts space in between them before removing all the HTML tags with ''.
UPDATE
@ArtisticPhoenix how would I accommodate <p>&nbsp;</p>
First I would convert the string using html_entity_decode however there are a few sticky points with that. These have to do with encoding. So this is the correct way to do it:
$a = '<h1>Heading</h1>
<br>
<br />
hyperlink
<p>&nbsp;</p>
<p><big>p</big>aragraph1</p><p>paragraph2</p>';
//convert entities using UTF-8
$a = html_entity_decode($a, ENT_QUOTES, 'UTF-8');
echo preg_replace([
'/<(?:br|p)[^>]*>/i', //replace br p with ' '
'/<[^>]*>/', //replace any tag with ''
'/\s+/u', //remove run on space - replace using the unicode flag
'/^\s+|\s+$/u' //trim - replace using the unicode flag
],[
' ', '', ' ', ''
], $a);
Please note the addition of the u flag to the regex above /\s+/u and /^\s+|\s+$/u.
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
The problem is that &nbsp; decodes to a non-breaking space (U+00A0, code point 160), not a regular space (code point 32), so a plain \s does not match it. Using UTF-8 mode with the u flag sorts it out, as shown above.
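To make the byte-level difference visible, here is a small sketch. The sample string and variable names are made up, and it names U+00A0 explicitly in the character class so the result does not depend on how \s is defined in a given build.

```php
<?php
// Hypothetical sample: &nbsp; decodes to U+00A0, which is the two bytes C2 A0 in UTF-8.
$decoded = html_entity_decode('a&nbsp;b', ENT_QUOTES, 'UTF-8');
echo bin2hex($decoded), "\n"; // 61c2a062

// With the u flag the pattern works on code points, so \x{A0} is one character.
$collapsed = preg_replace('/[\s\x{A0}]+/u', ' ', $decoded);
echo $collapsed, "\n"; // a b
```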
Treating HTML like a string and using regular expressions is never a good idea. The only decent solution that doesn't involve a DOM parser would be using PHP's built-in strip_tags function (which uses a state machine, so is still vulnerable to potential problems with broken HTML) and then you can compact the resulting whitespace with a regex:
<?php
$html = '<h1>Heading</h1>
<br>
<br />
hyperlink
<p></p>
<p>paragraph1</p>
<p>paragraph2</p>';
echo preg_replace("/\s+/", " ", strip_tags($html));
Output:
Heading hyperlink paragraph1 paragraph2
The way to do it is by using two patterns
P1 : <[\/\d\w]+.*?>
which will clean all the tags.
P2 : [\n\s]+ and replace it by single Whitespace
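Example:
// the patterns need delimiters; without them PHP treats < > and [ ] themselves as the delimiters
$string = preg_replace('/<[\/\d\w]+.*?>/', '', $string);
$string = preg_replace('/[\n\s]+/', ' ', $string);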
You can use this
<\s*\/?\s*br[^>]*>|<\s*\/?\s*p[^>]*>|\n
Explanation
<\s*\/?\s*br[^>]*> - Matches <br> or </br> or <br/> with any number of white space and matches attributes also.
<\s*\/?\s*p[^>]*> - Matches <p> or </p> or <p/> with any number of white space matches attributes also.
\n - Matches new line.
You can group multiple tags that are surrounded by white spaces and replace them with a single space. The regex to be replaced would be,
(\s*<[^>]+>\s*)+
This gives you a single space in place of each run of tags; finally, use trim() to get rid of the leading and trailing spaces that you may not need.
Here is the php code for demo,
$html = '<h1>Heading</h1>
<br>
<br />
hyperlink
<p></p>
<p>paragraph1</p>
<p>paragraph2</p>';
echo trim(preg_replace("/(\s*<[^>]+>\s*)+/", " ", $html));
Prints,
Heading hyperlink paragraph1 paragraph2
You could keep what you have and remove the extra spaces:
$stripped = preg_replace('/\s+/', ' ', $string);
That returns:
Heading hyperlink paragraph1 paragraph2

RegEx: First <br> within a paragraph

How to capture and remove the first occurrence of a <br/> tag within a paragraph.
<p><br/>Hello World</p>
Becomes:
<p>Hello World</p>
But importantly the following remains unchanged:
<p><br/></p>
Remove leading <br> tags from paragraphs that contain text
What I have so far:
preg_replace('/(<p>\s*<br *\/?>(.*?)<\/p>)+/si', '<p>$2</p>', $html);
Although this captures <p><br></p> instances...
Here is how you would do it using PHP's built in DOMDocument and DOMXPath classes:
$html = "<div><p><br/>Hello World</p><p><br/></p><p> <br> </p></div>";
$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);
// find <br> within a <p> that has text content
$breaks = $xpath->query("//p[normalize-space()!='']/br");
$breaks = $xpath->query("//p[text()!='']/br");
// and remove them
foreach ($breaks as $br) {
$br->parentNode->removeChild($br);
}
echo $doc->saveHTML();
Note that there are two lines assigning values to $breaks. You should use the one which meets your requirements: the first will only strip <br> from elements which have non-whitespace characters between the <p> and </p>, while the second will also strip them from <p> elements containing only whitespace. The different effects can be seen in this demo.
Parsing HTML with regex is not recommended. But for quick, temporary work you may use this regex, which matches a line break <br/> that is preceded by a <p> tag (and possibly some text) and uses a negative lookahead to ensure it is not immediately followed by the closing </p> tag.
<p>.*?\K<br\/>(?!<\/p>)
and replace such captured <br/> with empty string hence removing it.
Explanation:
<p>.*? --> Match a paragraph tag followed by any characters in non-greedy way
\K --> Reset whatever matched as we don't intend to replace that
<br\/>(?!<\/p>) --> Match a line break tag that is not immediately followed by closing paragraph tag, which will be replaced with empty string.
Here is some sample PHP code:
$html = '<p><br/>Hello World</p>';
$html = preg_replace('/<p>.*?\K<br\/>(?!<\/p>)/si', '', $html);
echo $html. "\n";
$html = '<p><br/></p>';
$html = preg_replace('/<p>.*?\K<br\/>(?!<\/p>)/si', '', $html);
echo $html. "\n";
Which prints following output,
<p>Hello World</p>
<p><br/></p>
If there are more rules, we can pass arrays to preg_replace. In my solution, the first pattern looks for a <br /> followed by text, and the second looks for a <br /> without text. Note that the search is anchored to the beginning of the string (/^...).
preg_replace(['/^(<p>\s*(<br *\/?>)([a-zA-Z0-9 ]+)<\/p>)+/si', '/^(<p>\s*(<br *\/?>)<\/p>)+/si'], ['<p>$3</p>', '$0'], $html);

PHP: How to remove all occurrences of linebreaks at the end of a string

how can I remove all occurrences of linebreaks at the end of a string.
$string = "Hello<br/>my Text<br> <br/> <br /> <br /> <br >";
I would like to get this result: Hello<br/>my Text
This didn't help, as
$string = preg_replace('#(( ){0,}<br( {0,})(/{0,1})>){1,}$#i', '', $string);
didn't work.
Here is a similar post that didn't solve my problem though.
remove <br>'s from the end of a string
Thank you very much
Tom
This is how the regex was built:
#<br># // bare <br>
#<br */?># // <br> with internal spaces and maybe a slash
#(?: *)?<br */?># // maybe there are spaces in front
#(?: *)?<br */?>$# // at the end of the line
#(?:(?: *)?<br */?>)+$# // the whole thing one or more times at the end of the line
so:
echo preg_replace('#(?:(?: *)?<br */?>)+$#', '', 'Hello<br/>my Text<br> <br/> <br /> <br /> <br >');
// Output: Hello<br/>my Text
Debuggex is your [and my] friend.
REGEX is not really intended for DOM-parsing, but if you need/choose to use it, this should do it:
preg_replace('/<br *\/?>(?!\w)/', '', $string);
That removes any <br /> tag (with or without a space or self-closing slash) that is not immediately followed by an alphanumeric character, hence the first one is preserved.
That, though, highlights the limitations of REGEX for problems such as this. You might wish to keep a tag that is not followed by an alphanumeric character. In that case, there's not a lot REGEX can offer. A DOM-parser, in contrast, could iteratively remove the trailing tags by virtue of each being the last-child of the container.
You will still be left with the spaces between the removed tags, but you can then just trim() the result or run a further REGEX on it:
preg_replace('/\s*$/', '', $string);
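The DOM-based alternative mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name stripTrailingBreaks is hypothetical, the fragment is wrapped in a <div> so it has a single container, and the input is assumed to be ASCII. The serializer writes <br> rather than <br/>.

```php
<?php
// Hypothetical helper: repeatedly remove the container's last child
// while it is a <br> element or a whitespace-only text node.
function stripTrailingBreaks(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $root = $doc->documentElement;

    while ($last = $root->lastChild) {
        $isBreak = $last instanceof DOMElement && strtolower($last->nodeName) === 'br';
        $isBlank = $last instanceof DOMText && trim($last->nodeValue) === '';
        if (!$isBreak && !$isBlank) {
            break; // reached real content, stop trimming
        }
        $root->removeChild($last);
    }

    // Serialize the remaining children without the wrapper <div>.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return rtrim($out);
}

echo stripTrailingBreaks('Hello<br/>my Text<br> <br/> <br /> <br /> <br >');
```

Unlike the lookahead regex, this keeps a <br> in the middle of the content even when it is followed by a space or punctuation, because only the last child is ever inspected.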
This did the job.
$string=preg_replace('/\s+/',' ', trim($string));
echo preg_replace('#(?:(?: *)?<br */?>)+$#', '', $string);

remove space from either side of br tag

I'd like to know how to remove a space if it exists on either side of a <br /> tag.
$str = 'remov e <br /><br />from<br /><br /> r<br />';
I tried the following but I can't get it fully working.
preg_replace('/[\s]<br \/>/', '', $str );
preg_replace(":\s*<br \/>\s*:mis", "<br />", $str);
Keep in mind that \s matches any whitespace, including tabs and newlines, not just the space character chr(32).
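If only spaces and tabs should be removed (and newlines kept), a narrower character class can be used instead of \s. A minimal sketch, with the helper name tightenBreaks invented for illustration:

```php
<?php
// Hypothetical helper: remove spaces and tabs directly before and after
// every <br> variant, and normalize the tag itself to "<br />".
function tightenBreaks(string $str): string
{
    return preg_replace('/[ \t]*<br\s*\/?>[ \t]*/i', '<br />', $str);
}

echo tightenBreaks('remov e <br /><br />from<br /><br /> r<br />');
// remov e<br /><br />from<br /><br />r<br />
```

The space inside "remov e" survives because it is not adjacent to a tag.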

regex: change html before saving in database

Before saving to the database I need to:
delete all tags
collapse runs of more than one whitespace character
collapse runs of more than one newline
To do this I use the following:
$content = preg_replace('/<[^>]+>/', "", $content);
// replace newlines with a placeholder so they are not lost when collapsing whitespace
$content = preg_replace('/\n/', "NewLine", $content);
$content = preg_replace('/(\&nbsp\;){1,}/', " ", $content);
$content = preg_replace('/[\s]{2,}/', " ", $content);
Finally, I must delete the repeated "NewLine" words.
After the first two points I get text in this format:
NewLineWordOfText
NewLine
NewLine
NewLine NewLine WordOfText "WordOfText WordOfText" WordOfText NewLine"WordOfText
...
How do I delete the repeated newlines from such content?
Thanks
First of all, while HTML is not regular and thus it is a bad idea to use regular expressions to parse it, PHP has a function that will remove tags for you: strip_tags
To squeeze spaces while preserving newlines:
$content = preg_replace('/[^\n\S]{2,}/', " ", $content);
$content = preg_replace('/\n{2,}/', "\n", $content);
The first line will squeeze all whitespace other than \n ([^\n\S] means all characters that aren't \n and not a non-whitespace character) into one space. The second will squeeze multiple newlines into a single newline.
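Putting the pieces together, the whole clean-up could be sketched as one function. The name normalizeForStorage is hypothetical, and the extra step that drops spaces next to newlines is an addition that the answer above does not include.

```php
<?php
// Hypothetical helper combining strip_tags with the whitespace squeezing above.
function normalizeForStorage(string $content): string
{
    $content = strip_tags($content);                       // remove all tags
    $content = str_replace('&nbsp;', ' ', $content);       // treat &nbsp; as a plain space
    $content = preg_replace('/[^\S\n]+/', ' ', $content);  // squeeze whitespace except \n
    $content = preg_replace('/ ?\n ?/', "\n", $content);   // drop spaces around newlines
    $content = preg_replace('/\n{2,}/', "\n", $content);   // squeeze repeated newlines
    return trim($content);
}

echo normalizeForStorage("<p>Word  of&nbsp;&nbsp;text</p>\n\n\n<p>Next \t line</p>");
// Word of text
// Next line
```

This avoids the "NewLine" placeholder entirely, because newlines are never touched by the whitespace pattern in the first place.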
Why don't you use nl2br(), then preg_replace all <br /><br /> with just <br />, and then turn all <br /> back into \n?
