removing \r \n and escape characters from html


I have the following html, which I have extracted from an email using imap_fetchbody,
<div dir=\"ltr\"><br><div class=\"gmail_quote\"><div dir=\"ltr\"><br><div class=\"gmail_quote\"><div class=\"\">
---------- Forwarded message ----------<br>
<span style=\"font-family:"Helvetica","sans-serif"\"><\/span>
From: <span style=\"font-family:"Helvetica","sans-serif"\">"
<span>xyz<\/span>" <<a href=\"mailto:support#xyz.com\" target=\"_blank\">support#<span>xyz<\/span>.com<\/a>><\/span><br>
\r\n\r\n\r\n\r\nDate: Fri, Apr 18, 2014 at 7:17 PM<br>
Subject: Bla bla xyz<br><\/div><div><div class=\"h5\">To: XYZ <<a href=\"mailto:xyz#gmail.com\" target=\"_blank\">xyz#gmail.com<\/a>><br><br><br>\r\n\r\n<div dir=\"ltr\">\r\n\r\n\r\n\r\n
<div class=\"gmail_quote\"><div><div><div dir=\"ltr\"><div class=\"gmail_quote\"><div dir=\"ltr\"><div><div class=\"gmail_quote\">
<div dir=\"ltr\"><div><div><div class=\"gmail_quote\"><div style=\"word-wrap:break-word\" lang=\"EN-US\">\r\n\r\n\r\n\r\n
<div>
<div>
<div>
<blockquote style=\"margin-top:5pt;margin-bottom:5pt\">
<div><div>
<table style=\"width:100%;background:none repeat scroll 0% 0% rgb(207,207,207)\" cellpadding=\"0\" cellspacing=\"0\" border=\"0\" width=\"100%\">
<tbody>
<tr>\r\n\r\n\r\n\r\n
<td style=\"width:325pt;padding:0in\" width=\"650\">\r\n\r\n<div align=\"center\"><table style=\"width:325pt;background:none repeat scroll 0% 0% rgb(207,207,207)\" cellpadding=\"0\" cellspacing=\"0\" border=\"0\" width=\"650\">\r\n\r\n\r\n\r\n
<tbody><tr>
<td style=\"padding:0in 0in 5.25pt\"><p style=\"text-align:center\" align=\"center\">
<span style=\"font-size:7.5pt;font-family:"Arial","sans-serif";color:rgb(64,64,64)\">If you are unable to see this message,
<a href=\"http:\/\/click.e.xyz.com\/?qs=3771d7c90c958f02a4b2e78494f12a3116ddb15df79b8d04cdf5aeba42012b118\" target=\"_blank\">
<span style=\"color:rgb(64,64,64)\">click here<\/span><\/a> to view.<br>
\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nTo ensure delivery to your inbox, please add <a href=\"mailto:support#xyz.com\" target=\"_blank\">support#xyz.com<\/a> to your address book. <\/span><\/p>
<\/td>
<\/tr>
<\/tbody>
<\/table>
<\/div><\/div><\/div><\/div>
I want to get rid of all the \, \r and \n while keeping the < and > of the HTML as they are.
I have tried stripslashes, stripcslashes, nl2br and htmlspecialchars_decode, but I am not able to achieve what I want.
Here is what I have tried, along with the imap_qprint function:
$text = stripslashes(imap_qprint($text));
$body = preg_replace('/(\v|\s)+/', ' ', $text );
Result: it doesn't remove all the whitespace characters.

Match the following regex:
(\\r|\\n|\\)
globally (the g modifier on regex101; PHP's preg_replace already replaces every match) and replace with
'' (empty string)
Demo: http://regex101.com/r/mS3wM2
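In PHP, the pattern above could be sketched as follows. This is only an illustration: the helper name strip_literal_escapes and the example.com URL are made up, and it assumes the string really contains the literal two-character sequences \r and \n (a backslash followed by a letter) rather than real line breaks.

```php
<?php
// Hypothetical helper: removes the literal sequences \r and \n and any other backslash.
function strip_literal_escapes(string $html): string
{
    // In a single-quoted PHP string, '\\\\' becomes the regex \\ (one literal backslash).
    // The alternatives are tried left to right: \r, then \n, then a lone backslash.
    return preg_replace('/\\\\r|\\\\n|\\\\/', '', $html);
}

// Single quotes keep \" \r \n and \/ as literal backslash sequences.
$raw = '<div dir=\"ltr\">\r\n<a href=\"http:\/\/example.com\/\">x<\/a><\/div>';
echo strip_literal_escapes($raw), "\n";
// prints: <div dir="ltr"><a href="http://example.com/">x</a></div>
```

Note that this removes every backslash, so it is only suitable when the text is not expected to contain backslashes that should survive.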

$html = preg_replace('/[\\\\\r\n]/', '', $html);
Match a single character present in the list below «[\\\r\n]»
A \ character «\\»
A carriage return character «\r»
A line feed character «\n»
UPDATE:
Based on your comment I've updated my answer:
$html = preg_replace('%\\\\/%sm', '/', $html);
$html = preg_replace('/\\\\"/sm', '"', $html);
$html = preg_replace('/[\r\n]/sm', '', $html);

You could use something like this to interpret the escape sequences:
function interpret_escapes($str) {
    return preg_replace_callback('/\\\\(.)/u', function($matches) {
        $map = ['n' => "\n", 'r' => "\r", 't' => "\t", 'v' => "\v", 'e' => "\e", 'f' => "\f"];
        return isset($map[$matches[1]]) ? $map[$matches[1]] : $matches[1];
    }, $str);
}

If string functions can do the trick, always favor string functions over regexes. Performance will be better than with a regex, and they are easier to read in the code:
$message = str_replace("\r\n", '', $message ); // replace all newlines, use double quotes!
$message = stripslashes( $message );
First you have to remove the newlines. As far as I can tell, the \r and \n always come together, so I replace them in one go. After that, stripslashes will remove all the escaping backslashes.
You have to call stripslashes after removing the newlines; otherwise \r\n would turn into rn, making them harder to find.
This works perfectly in my tests:
echo '<textarea style="width:100%; height: 33%;">'.$message.'</textarea>';
echo '<hr />';
$message = str_replace("\r\n", '', $message); // use double quotes!
echo '<textarea style="width:100%; height: 33%;">'.$message.'</textarea>';
echo '<hr />';
$message = stripslashes($message);
echo '<textarea style="width:100%; height: 33%;">'.$message.'</textarea>';
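Whether "\r\n" in double quotes matches depends on what the string actually contains. The following is a minimal sketch, assuming the text may hold the literal characters backslash + r and backslash + n (as the remark about stripslashes producing rn suggests) as well as real line breaks. The helper name clean_escaped_html is hypothetical.

```php
<?php
// Hypothetical helper: removes newlines first, then the escaping backslashes.
function clean_escaped_html(string $message): string
{
    // Single quotes: '\r' is two characters, a backslash followed by r.
    $message = str_replace(['\r', '\n'], '', $message);
    // Double quotes: "\r" and "\n" are the real carriage return and line feed.
    $message = str_replace(["\r", "\n"], '', $message);
    // Only now strip the remaining backslashes (\" becomes ", \/ becomes /).
    return stripslashes($message);
}

echo clean_escaped_html('<p class=\"a\">x<\/p>\r\n' . "\r\n"), "\n";
// prints: <p class="a">x</p>
```

The order matters for the reason given above: running stripslashes first would turn the literal \r\n into rn, which can no longer be told apart from ordinary text.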

If you can open the file in vi, it is as easy as running this in command mode:
:%s/\\r\|\\n//g

Related

preg_replace: UBB Code [textarea] - do not replace \n by <br> within tags [textarea][/textarea]

So I wanted to enable UBB Code on my Website using preg_replace
$text = $dbentry['text'];
$bbformat = array(
"/\\r?\\n/i" => "<br>",
"/\[i\](.*?)\[\/i\]/i" => "<i>$1</i>",
"/\[code\](.*?)\[\/code\]/i" => "<div align=\"center\"><center><table border=\"1\" cellpadding=\"0\" cellspacing=\"0\" width=\"600\"><tr bgcolor=\"magenta\"><td>Codeblock:</td></tr><tr><td><form><textarea rows=\"10\" cols=\"82\">$1</textarea></form></td></tr></table></center></div>",
....
);
foreach($bbformat as $match=>$replacement){
$text = preg_replace($match, $replacement, $text);
}
echo $text;
This works; however, it also replaces text between [code] and [/code], resulting in <br> or <i></i> tags inside the textarea, where they do not belong.
I would like to skip any text between [code][/code] UBB elements and output the raw text as it is stored in the DB.
How can this be achieved in PHP?
I actually tried "/\[code\][^>]+\[\/code\](*SKIP)(*FAIL)|\\r?\\n/i" => "<br>", (which I read about here). The result is that the later rule "/\[code\](.*?)\[\/code\]/i", which should replace [code][/code] itself with <textarea></textarea>, is also skipped: the output on the website is [code]bl \n ah[i]test[/i][/code] instead of <textarea>bl \n ah[i]test[/i]</textarea>. I could work around this by adding these two rules to the array:
"/\[code\]/i" => "<div align=\"center\"><center><table border=\"1\" cellpadding=\"0\" cellspacing=\"0\" width=\"600\"><tr bgcolor=\"magenta\"><td>Codeblock:</td></tr><tr><td><form><textarea rows=\"10\" cols=\"82\">",
"/\[\/code\]/i" => "</textarea></form></td></tr></table></center></div>",
This replaces [code] and [/code] separately, with the unsexy result that the website looks like garbage if someone forgets to write [/code]. It looks as if the pattern "/\[code\](.*?)\[\/code\]/i" no longer matches within the same array because of (*SKIP)(*FAIL). Why does replacing each tag on its own work?
There must be a better solution...what am I missing?

You are dealing with one string at a time, and preg_replace does not know about the UBB rules, so it will replace every \n occurrence with <br />. The only way I can see to do this in PHP is something like preg_replace_callback_array (available since PHP 7.0):
$bbformat = array(
    "/\\r?\\n/i" => function ($m) {
        return "<br>";
    },
    "/\[i\](.*?)\[\/i\]/i" => function ($m) {
        return "<i>".$m[1]."</i>";
    },
    "/\[code\](.*?)\[\/code\]/i" => function ($m) {
        return "<div align=\"center\"><center><table border=\"1\" cellpadding=\"0\" cellspacing=\"0\" width=\"600\"><tr bgcolor=\"magenta\"><td>Codeblock:</td></tr><tr><td><form><textarea rows=\"10\" cols=\"82\">".str_replace("<br>", "\n", $m[1])."</textarea></form></td></tr></table></center></div>";
    },
);
$text = preg_replace_callback_array($bbformat, $text);
echo $text;
But as you can see, you must convert the <br> back to \n again, which may consume a lot of memory in some cases.
However, you can also play around with the regex itself to skip the \n => <br> replacement inside a [code] block, by changing your first pattern from /\\r?\\n/i to /\\r?\\n+(?![^\[].*\])/i, or even /\\r?\\n+(?![^\[code\]].*\[\/code\])/i if you want to keep your [code] block.
So your final code may look like this:
$bbformat = array(
"/\\r?\\n+(?![^\[code\]].*\[\/code\])/i" => "<br>",
"/\[i\](.*?)\[\/i\]/i" => "<i>$1</i>",
"/\[code\](.*?)\[\/code\]/i" => "<div align=\"center\"><center><table border=\"1\" cellpadding=\"0\" cellspacing=\"0\" width=\"600\"><tr bgcolor=\"magenta\"><td>Codeblock:</td></tr><tr><td><form><textarea rows=\"10\" cols=\"82\">$1</textarea></form></td></tr></table></center></div>",
);
foreach($bbformat as $match=>$replacement){
$text = preg_replace($match, $replacement, $text);
}
echo $text;
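Another way to keep [code] blocks untouched is to split the text into code and non-code parts first and apply the formatting rules only to the non-code parts. The sketch below assumes a simplified textarea wrapper instead of the full table markup, and the function name render_ubb is hypothetical. It uses the s modifier so that . also matches newlines; without it, a [code] block that still contains raw line breaks will not match "/\[code\](.*?)\[\/code\]/i", which is one likely reason the (*SKIP)(*FAIL) attempt appeared to skip that rule.

```php
<?php
// Hypothetical helper: formats UBB text but leaves [code]...[/code] contents raw.
function render_ubb(string $text): string
{
    // The capture group makes preg_split return the [code] blocks as well.
    $parts = preg_split('/(\[code\].*?\[\/code\])/is', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $part) {
        if (preg_match('/^\[code\](.*)\[\/code\]$/is', $part, $m)) {
            // Code block: no <br> or <i> conversion, only HTML escaping for the textarea.
            $out .= '<textarea rows="10" cols="82">' . htmlspecialchars($m[1]) . '</textarea>';
        } else {
            // Normal text: apply the usual UBB rules.
            $part = preg_replace('/\[i\](.*?)\[\/i\]/is', '<i>$1</i>', $part);
            $out .= preg_replace('/\r?\n/', '<br>', $part);
        }
    }
    return $out;
}

echo render_ubb("Hello [i]world[/i]\n[code]bl \n ah[i]test[/i][/code]\nBye");
```

An unclosed [code] tag is simply treated as normal text here, so a missing [/code] does not break the rest of the page.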

Append html to all lines not ending with html tag

I'm trying to append <br/> to all lines that do not end with an html tag, but I'm unable to get it working.
I've got this so far, but it seems to match nothing at all (in PHP).
$message=preg_replace("/^(.*[^>])([\n\r])$/","\${1}<br/>\${2}",$message);
Any ideas on how to get this working properly?

I think you need the m modifier on your regex:
$message=preg_replace("/^(.*[^>])$/m", "$1<br/>\n", $message);
// Note the m modifier after the closing delimiter.
m makes ^ and $ match start/end of lines in addition to start/end of string.
No [\n\r] needed.
Also, why match the whole line just to put it back afterwards?
It is actually as simple as
$message = preg_replace ('/([^>])$/m', '$1<br />', $message);
Example code:
<?php
$message = "<strong>Hey</strong>
you,
No you don't have to go !";
$output = preg_replace ('/([^>])$/m', '$1<br />', $message);
echo '<pre>' . htmlentities($output) . '</pre>';
?>

You can use this:
$message = preg_replace('~(?<![\h>])\h*\R~', '<br/>', $message);
where:
`\h` is for horizontal white spaces (space and tab)
`\R` is for newline
(?<!..) is a negative lookbehind (not preceded by ..)
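The replacement above also removes the line break itself. If the line breaks should stay in the source, a small variation is possible; the following is a sketch under that assumption, with a hypothetical helper name br_unless_tag.

```php
<?php
// Hypothetical helper: appends <br/> to lines that do not end with ">" and keeps the line breaks.
function br_unless_tag(string $message): string
{
    // (?<![\h>])  the character before the match must not be ">" or a space/tab
    //             (the latter stops a match from starting inside trailing whitespace)
    // (\h*\R)     optional trailing spaces/tabs plus one line break (\n, \r\n or \r)
    return preg_replace('~(?<![\h>])(\h*\R)~', '<br/>$1', $message);
}

echo br_unless_tag("<strong>Hey</strong>\nyou,\nNo you don't have to go !");
```

Like the original pattern, this only looks at line breaks, so a final line without a trailing newline does not get a <br/>.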

I've found this that works somehow, see http://phpfiddle.org/main/code/259-vvp:
<?php
//
$message0 = "You are OK.
<p>You are good,</p>
You are the universe.
<strong>Go to school</strong>
This is the end.
";
//
if(preg_match("/^WIN/i", PHP_OS))
{
$message = preg_replace('#(?<!\w>)[\r]$#m', '<br />', $message0);
}
else
{
$message = preg_replace('#(?<!\w>)$#m', '<br />', $message0);
}
echo "<textarea style=\"width: 700px; height: 90px;\">";
echo($message);
echo "</textarea>";
//
?>
Gives:
You are OK.<br />
<p>You are good,</p>
You are the universe.<br />
<strong>Go to school</strong>
This is the end.<br /><br />
This adds a <br /> when the line does not end with an HTML tag such as </p>, </strong>, ...
Explanation:
(?<!\w>): negative lookbehind; matches only if the end of the line is not preceded
by the tail of an HTML tag, i.e. a word character followed by a closing >, like a>, or 1> for h1>, ...
[\r]$ (Windows) or $: the end of the line, preceded by a carriage return on Windows.
m: modifier for multiline mode.

How to convert <div style="xyz"> to [div style="xyz"] with preg_replace php

I have an html string
$html = '<div style="background-color:#000;border:1px solid #000">
<b>Some Text</b></div><span>I have amount > 1000 USD</span>';
I want to convert it to this
$html = '[div style="background-color:#000;border:1px solid #000"]
[b]Some Text[/b][/div][span]I have amount > 1000 USD[/span]';
I have searched a lot on Google for a PHP script that converts HTML to BBCode but could not find one. I don't know regex. If you can give me some idea, with example code, it will get me started.
If it could be done with some other php function, please suggest me that.

use this
$html = '<div style="background-color:#000;border:1px solid #000">
<b>Some Text</b></div><span>This is an other text</span>';
echo str_replace(array("<",">"),array("[","]"),$html);
http://codepad.org/kjKVCzjw
output
[div style="background-color:#000;border:1px solid #000"]
[b]Some Text[/b][/div][span]This is an other text[/span]

You could use str_replace:
$html = str_replace(array('<', '>'), array('[', ']'), $html);

All you need is replace < with [ and > with ]. Simply use str_replace().
$newString = str_replace( "<", "[", str_replace(">", "]", $StringInput) );
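A plain str_replace also converts any < or > that is not part of a tag, such as the > in "I have amount > 1000 USD" from the question. As a hedged alternative, a regex can be limited to things that look like tags. The sketch below uses a hypothetical function name and assumes that attribute values do not themselves contain < or >.

```php
<?php
// Hypothetical helper: converts <tag ...> and </tag> to [tag ...] and [/tag],
// leaving a stray < or > in the text alone.
function html_tags_to_bbcode(string $html): string
{
    // <, an optional /, a letter, then anything except < or >, then >.
    return preg_replace('~<(/?[a-zA-Z][^<>]*)>~', '[$1]', $html);
}

$html = '<div style="background-color:#000;border:1px solid #000">
<b>Some Text</b></div><span>I have amount > 1000 USD</span>';
echo html_tags_to_bbcode($html);
```

For anything beyond simple, well-formed markup, a DOM parser is likely to be more robust than a regex.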

Using nl2br with html tags

I use nl2br when displaying some information that is saved somewhere, but when HTML tags are used I want not to add <br> tags for them.
For example if I use
<table>
<th></th>
</table>
it will be transformed to
<table><br />
<th></th><br />
</table><br />
and that makes a lot of spaces for this table.
How can line-break tags be added only for the non-HTML content?
Thanks.

I had the same issue.
I wrote this code, which adds a <br /> at the end of each line unless the line ends with an HTML tag:
function nl2br_save_html($string)
{
    if (! preg_match("#</.*>#", $string)) // no closing tags in the string: plain nl2br is enough
        return nl2br($string);
    $string = str_replace(array("\r\n", "\r", "\n"), "\n", $string); // normalize line endings
    $lines = explode("\n", $string);
    $output = '';
    foreach ($lines as $line)
    {
        $line = rtrim($line);
        if (! preg_match("#</?[^/<>]*>$#", $line)) // the line does not end with an opening or closing HTML tag
            $line .= '<br />';
        $output .= $line . "\n";
    }
    return $output;
}

You could replace the closing tags and newlines by only closing tags:
$str = str_replace('>
', '>', $str);

I think your question is wrong. If you are typing
<table>
<th></th>
</table>
into a text area, then no matter what you do it will include <br /> between them, because that is what nl2br is supposed to do.

How to remove <br /> tags and more from a string?

I need to strip all <br /> and all 'quotes' (") and all 'ands' (&) and replace them with a space only ...
How can I do this? (in PHP)
I have tried this for the <br />:
$description = preg_replace('<br />', '', $description);
But it returned <> in place of every <br />...
Thanks

<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
http://php.net/manual/en/function.strip-tags.php

You can use str_replace like this:
str_replace("<br/>", " ", $orig );
preg_replace etc uses regular expressions and that may not be what you want.

If str_replace() isn't working for you, then something else must be wrong, because
$string = 'A string with <br/> & "double quotes".';
$string = str_replace(array('<br/>', '&', '"'), ' ', $string);
echo $string;
outputs
A string with double quotes .
Please provide an example of your input string and what you expect it to look like after filtering.

To manipulate HTML it is generally a good idea to use a DOM-aware tool instead of plain text manipulation tools (think, for example, what will happen if you encounter variants like <br/>, <br /> with more than one space, or even <br> or <BR/>, which, although not valid XHTML, are often used). See for example here: http://sourceforge.net/projects/simplehtmldom/

To remove all permutations of br:
<br> <br /> <br/> <br >
check out the user contributed strip_only() function in
http://www.php.net/strip_tags
The "Use the DOM instead of replacing" caveat is always correct, but if the task is really limited to these three characters, this should be o.k.

Try this:
$description = preg_replace('/<br \/>/iU', '', $description);

$string = "Test<br>Test<br />Test<br/>";
$string = preg_replace( "/<br>|\n|<br( ?)\/>/", " ", $string );
echo $string;

This worked for me, to remove <br/> :
(&gt; is recognised whereas > isn't)
$temp2 = str_replace('&lt;','', $temp);
// echo ($temp2);
$temp2 = str_replace('/&gt;','', $temp2);
// echo ($temp2);
$temp2 = str_replace('br','', $temp2);
echo ($temp2);
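The original attempt, preg_replace('<br />', ...), returned <> because PHP accepts < and > as a matching pair of regex delimiters, so the actual pattern was just br /. The following sketch pulls the suggestions above together; the function name is hypothetical, and collapsing repeated spaces at the end is an extra step assumed here for tidier output, not something the question asked for.

```php
<?php
// Hypothetical helper: replaces <br> variants, double quotes and ampersands with a space.
function strip_br_quotes_amp(string $description): string
{
    // ~ is used as the delimiter so that < and > are part of the pattern.
    // Matches <br>, <br/>, <br /> and <BR > style variants, case-insensitively.
    $description = preg_replace('~<br\s*/?>~i', ' ', $description);
    // Plain characters need no regex.
    $description = str_replace(['"', '&'], ' ', $description);
    // Assumed extra step: collapse runs of whitespace into a single space.
    return trim(preg_replace('/\s+/', ' ', $description));
}

echo strip_br_quotes_amp('Test<br>Test<br />Test<BR/> "quoted" & done');
// prints: Test Test Test quoted done
```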
