Stripping input to complete plain text - php

I'm currently finalising the code for my comment system, and I want it to work a little like Stack Overflow does with its posts. I would like my users to be able to use bold, italic and underline only, and to do that I would use the following syntax:
_Text_ *BOLD* -Italic-
Firstly, I would like to know a way of stripping a comment completely clean of any tags, HTML entities and such, so that if a user were to use any HTML or PHP tags, they would be removed from the input.
I am currently using strip_tags, but that can leave the output looking quite nasty. Even if an abusive or blatant XSS/injection attempt has been made, I would still like the plain text to be output in full and not chopped up, as strip_tags seems to make an absolute mess of that.
What I will then do is replace the asterisks with bold HTML tags, and so on, AFTER stripping the content clean of HTML tags.
How do people suggest I do this? Currently this is the comment sanitize function:
function cleanNonSQL( $str )
{
    return strip_tags( stripslashes( trim( $str ) ) );
}

PHP tags are surrounded by <? and ?>, or maybe <% and %> on some ages-old installations, so removing PHP tags can be managed with a regex:
$cleaned=preg_replace('/\<\?.*?\?\>/', '', $dirty);
$cleaned=preg_replace('/\<\%.*?\%\>/', '', $cleaned);
Next you take care of the HTML tags: These are surrounded by < and >. Again you can do this with a regex
$cleaned=preg_replace('/\<.*?\>/','',$cleaned);
This will transform
$dirty='blah blah blah <?php echo $this; ?> foo foo foo <some> html <tag> and <another /> bar bar';
into (the spaces on either side of each removed tag remain, so some gaps are doubled)
$cleaned='blah blah blah  foo foo foo  html  and  bar bar';

You could try using regular expressions to strip the tags, such as:
preg_replace("/\<(.+?)\>/", '', $str);
Not sure if that's what you're looking for, but it will remove anything inside < and >. You can also make it a little more foolproof by requiring the first character after the < to be a letter.
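The "first character must be a letter" idea above could be sketched roughly like this. The helper name strip_letter_tags is made up for illustration, and the pattern is only an approximation of what counts as a tag, so it is not a substitute for escaping the output:

```php
<?php
// Hypothetical helper: removes only things that look like real tags,
// i.e. "<" or "</" immediately followed by a letter, up to the next ">".
function strip_letter_tags(string $str): string
{
    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace('/<\/?[a-zA-Z][^>]*>/', '', $str) ?? $str;
}

echo strip_letter_tags('a <b>bold</b> move'), "\n";              // a bold move
echo strip_letter_tags('bad mood :<. happy :>'), "\n";           // bad mood :<. happy :>
echo strip_letter_tags('an arrow <- here, and 1 < 2 > 0'), "\n"; // an arrow <- here, and 1 < 2 > 0
```

Emoticons, arrows and comparisons survive because the character after the < is not a letter.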

The correct way is not to delete HTML tags from your user's comment, but to tell the browser that the following text should not be interpreted as HTML, JavaScript, or whatever. Imagine someone wants to post example code like we do here on Stack Overflow. If you just bluntly remove any parts of a comment that seem to be code, you will mess up the user's comment.
The solution is to use htmlentities, which will escape the symbols used for HTML markup in the comment so that it actually shows up as just text in the browser.
For example, the browser will interpret a < as the beginning of an HTML tag. If you just want the browser to display a <, you have to write &lt; in the source code. htmlentities will convert all the relevant symbols into their HTML entities for you.
Longer Example
echo htmlentities("<b>this text should not be bold</b><?php echo PHP_SELF;?>");
This outputs the following HTML source:
&lt;b&gt;this text should not be bold&lt;/b&gt;&lt;?php echo PHP_SELF;?&gt;
The browser will display
<b>this text should not be bold</b><?php echo PHP_SELF;?>
Consider the following real-life example with the solution you accepted. Imagine a user writing this comment:
i'm in a bad mood today :<. but your blog made me really happy :>
You will now do your preg_replace("/\<(.+?)\>/", '', $comment); on the text and it will remove half the comment:
i'm in a bad mood today :
If that's what you wanted, never mind this answer. If you don't, use htmlentities.
If you want to save the comment as a file and not have the server interpret PHP code inside it, save it with an extension like '.html' or '.txt', so that the web server won't call the PHP interpreter in the first place. There is usually no need to escape PHP code.
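For the original goal (plain text plus three formatting markers), one possible approach is to escape first and add the formatting tags afterwards. This is only a sketch: render_comment is a hypothetical helper, the marker-to-tag mapping (* to <b>, _ to <u>, - to <i>) is assumed from the question, and markers are only recognised at word edges within a single line, without nesting:

```php
<?php
// Hypothetical helper: escape everything, then add our own formatting tags.
function render_comment(string $raw): string
{
    // 1. Neutralise whatever the user typed: "<" becomes "&lt;" and so on.
    $safe = htmlspecialchars(trim($raw), ENT_QUOTES, 'UTF-8');

    // 2. Only now introduce tags for the three supported markers (assumed mapping).
    $markers = ['*' => 'b', '_' => 'u', '-' => 'i'];
    foreach ($markers as $marker => $tag) {
        $m = preg_quote($marker, '/');
        // The opening marker must follow whitespace (or the start) and touch the text;
        // the closing marker must touch the text and not be followed by a word character.
        $pattern = '/(?<!\S)' . $m . '(?=\S)(.+?)(?<=\S)' . $m . '(?!\w)/';
        $safe = preg_replace($pattern, "<$tag>\$1</$tag>", $safe) ?? $safe;
    }
    return $safe;
}

echo render_comment('*bold* _under_ -ital-'), "\n";    // <b>bold</b> <u>under</u> <i>ital</i>
echo render_comment('<b>x</b> :<'), "\n";              // &lt;b&gt;x&lt;/b&gt; :&lt;
echo render_comment('well-known and 2 * 3 * 4'), "\n"; // well-known and 2 * 3 * 4
```

Because the escaping happens before any tags are added, the only HTML in the result is the HTML this function wrote itself.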

Related

Properly rendering stored HTML

A part of my site allows users to create comments in a text box to be stored in an SQL database. Because a lot of people copy/paste things in from word or other places, I have to keep <p> and <br> tags to keep formatting, and also <a> tags to let users create their own links. Everything else gets stripped out. I was accomplishing this like so:
$text = strip_tags( $text, '<br><a><p>' );
But today a user came to me and told me they lost a large portion of their text because they made an arrow <- for visual effect. So now I know strip_tags removes everything after a <.
I can accomplish a similar effect with preg_replace like so:
preg_replace('/((?!<((\/)?p|br|a))<[^>]*>)/', "", $text);
But this still has the downside of only working if the tag spans one line (I think), leaving in HTML comments and probably a few other things that I'm not aware of. What are my options? Is there a catch-all solution? A library I can use? I mostly work alone, so I'm not really aware of industry standards.
Use HTML Purifier. It cleans the submitted HTML and removes unwanted code. For example, if a user adds a script tag that might harm your website (an XSS attack), HTML Purifier will strip it out. It also completes broken HTML: if a user enters <strong>gamer ... without closing the tag, it will close the tag and output cleaner HTML.
"I can accomplish a similar effect with preg_replace... But this still has the downside of only working if the tag spans one line (I think)." Not really! You can use pattern modifiers to make PHP regular expressions span multiple lines. Consider the example below with a multiline HTML string:
<?php
// $s IS A MULTILINE HTML SNIPPET CONTAINING THE FOLLOWING HTML TAGS
// <div>, <a>, <blockquote>, <em>, <strong>, <span>, <br />
$s = "<div class='one'>
<a href='/link.php'>
<blockquote>
There is real Power in the Hearts of men: not just Power but
\"something so much powerful than Power\" that Power itself begs to \"power down\".
</blockquote>
</a>
<p class='lv'>
This Power is not in the Head nor in the Intellect nor in the Skills of Man...
<em class='em1'>but in the deep recess of the Human Heart...</em>
and it speaks volumes yet only very few understand its language -
<strong>The Language of Love</strong>
- The Greatest Power You can have.... The Power to which nothing is Impossible!!!
</p>
<br />
<span>Do you know this Power? <--</span>
<strong>Do you Speak Love???</strong>
</div>";
// THIS CONCISE REGEX PATTERN REMOVES ALL HTML TAGS WITHIN THE MULTILINE STRING
// EXCEPT FOR TAGS LIKE: <a> <p> <br />
// IT WOULD ALSO LEAVE <- OR <-- OR <------ UNTOUCHED
$r = preg_replace("#<(?!\/[ap]|[ap\-]|br).*?>#si", "", $s);
echo ($r);
If you viewed the Source Code, You would observe that all HTML Tags except for <br>, <p>, <a> and Symbols like <-- were stripped out. In effect, the Source would look something like this:
<a href='/link.php'>
There is real Power in the Hearts of men: not just Power but
"something so much powerful than Power" that Power itself begs to "power down".
</a>
<p class='lv'>
This Power is not in the Head nor in the Intellect nor in the Skills of Man...
but in the deep recess of the Human Heart...
and it speaks volumes yet only very few understand its language -
The Language of Love
- The Greatest Power You can have.... The Power to which nothing is Impossible!!!
</p>
<br />
Do you know this Power? <--
Do you Speak Love???
Cheers and Good-Luck...
If your case is as simple as what you showed in your question, I wouldn't go with an external library like HTML Purifier.
strip_tags() has its own way of determining what a tag is. One case where it doesn't consider a < to be a real tag is when it is followed by a space. By space I mean any character from 0x09 to 0x0d as well as 0x20 (this is how the internal isSpace() function works when it is called from php_strip_tags_ex()).
So a workaround could be to put one of those space characters inside <- and then revert it after calling strip_tags(). You'd better take care of not only a < followed by -, though, but any < followed by a [^a-zA-Z!?\s] character (a character that is not a letter, a ! or ? mark, or \s, any kind of white-space character; spaces are already fine!).
I'd choose a carriage return \r, which is 0x0D in hex, as my space character, because it is more specific:
$text = preg_replace( "~<\r([^a-zA-Z!?\s])~", '<$1', strip_tags( preg_replace( '~<([^a-zA-Z!?\s])~', "<\r\$1", $text ), '<p><a><br>' ) );
I recommend encoding the data that the user submits rather than removing the tags you don't allow. This way you won't lose anything that the user typed.
Please note that running a complex regular expression on a big string is not efficient.
Take the input from the user and encode it, so that instead of <p> you will save &lt;p&gt;. You can then insert it into the page as HTML and it will be displayed as text, not interpreted as tags, so you don't need to remove anything.
You can use htmlspecialchars($string) for this.
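A minimal sketch of that encoding step, with an invented sample comment standing in for the user's input. The ENT_QUOTES flag and the explicit 'UTF-8' charset are choices made here, not something the answer specifies:

```php
<?php
// Invented sample input standing in for a submitted comment.
$submitted = '<p>Hello</p> <script>alert("x")</script> & an arrow <-';

// Encode the markup characters; ENT_QUOTES also encodes single quotes.
$encoded = htmlspecialchars($submitted, ENT_QUOTES, 'UTF-8');
echo $encoded, "\n";
// &lt;p&gt;Hello&lt;/p&gt; &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt; &amp; an arrow &lt;-

// Nothing was removed: decoding gives back exactly what the user typed.
$roundTrip = htmlspecialchars_decode($encoded, ENT_QUOTES);
echo $roundTrip === $submitted ? "lossless\n" : "changed\n"; // lossless
```

The arrow at the end is kept, which is the case that strip_tags got wrong in the question.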

how to stop user putting HTML code in text inputs

I've been building a website, and while testing I noticed that if I put <em>bob</em>
or something similar in the text fields on my register/update pages, it is stored in the database as entered ('<em>bob</em>'), but when called back onto the website it displays in italics.
Is there a way to block HTML code from my text inputs?
Or does it only read as HTML when being echoed back on the page from the database?
Mostly just curious to know what's happening here.
The name displaying in italics isn't a major issue, but it seems like something the user shouldn't be able to control.
P.S. I can provide code if needed, but didn't think it would be much help for this question.
You can also just use htmlspecialchars() to output exactly what they typed on the page — as-is.
So if they enter <i>bob</i> then what will show up on the page is literally <i>bob</i> — that way you're "allowing" all the input in the world, but none of it is ever rendered.
If you want to just get rid of the tags, strip_tags() is the better option, so <i>bob</i> would show up as bob. This works if you're sure there's no legitimate scenario where someone would want to enter an HTML tag. (For example, Stack Overflow obviously can't just strip the tags out of stuff we type, since a lot of questions involve typing HTML tags.)
You can use strip_tags to remove all HTML tags from a string: http://php.net/manual/es/function.strip-tags.php
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text); // Output: Test paragraph. Other text
echo "\n";
// Allows <p> and <a>
echo strip_tags($text, '<p><a>'); // Output: <p>Test paragraph.</p> Other text
?>
You can use builtin PHP function strip_tags. It will remove all HTML tags from a string and return the result.
Something like that:
$cleaned_string = strip_tags($_GET['field']);

How to Secure Data Submitted Through CKEditor

I am using CKEditor on my site to let users post their comments. CKEditor has many buttons for composing the comment. Suppose a user makes his comment bold and italic, like this:
This is comment
CKEditor will output the following HTML:
<i><strong>This is comment</strong></i>
Now, if I store this HTML in the MySQL database and output it on the webpage as it is, without wrapping it in htmlspecialchars(), then the comment will be shown on the page bold and italic, which is what I want.
On the other hand, if I wrap the comment in htmlspecialchars() and display it on the webpage, it will be shown as
<i><strong>This is comment</strong></i>
I do not want it shown like this; I want to keep the user's formatting. But if I do not wrap it in htmlspecialchars(), it is risky and can allow XSS attacks and other security problems.
How can I achieve both purposes?
(1). Keep the User Formatting
(2). Also Secure the HTML Contents
You need to draw up a whitelist of what elements and attributes you want to allow your users to include (eg allow <strong> but not <script>; allow <a href> but not <div onmouseover>), and then enforce it by parsing the input, removing all elements and attributes that don't fit your pattern, and serialising the results back into HTML.
This is a hard job that cannot be done with a few simple regexes or strip_tags (which is NOT an adequate solution for XSS even if it did fit your needs). You would be well advised to use an existing library to do it - HTML Purifier is one such for PHP.
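To make the parse / filter / serialise steps concrete, here is a rough sketch using PHP's DOM extension. Everything in it is an assumption for illustration: the ALLOWED_TAGS whitelist, the function names, the rule that only http(s) and mailto links are kept, and the choice to unwrap unknown elements while dropping script and style entirely. It is not a replacement for a vetted library such as HTML Purifier:

```php
<?php
// Illustrative whitelist (assumed values): tag name => allowed attribute names.
const ALLOWED_TAGS = [
    'b' => [], 'i' => [], 'em' => [], 'strong' => [],
    'p' => [], 'br' => [], 'a' => ['href'],
];

// Recursively enforce the whitelist on everything below $node.
function clean_node(DOMNode $node): void
{
    // Snapshot the children first, because the loop modifies the tree.
    foreach (iterator_to_array($node->childNodes, false) as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            continue; // text is escaped again when the tree is serialised
        }
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            $node->removeChild($child); // comments, CDATA, processing instructions
            continue;
        }
        $tag = strtolower($child->nodeName);
        if ($tag === 'script' || $tag === 'style') {
            $node->removeChild($child); // drop these together with their contents
            continue;
        }
        clean_node($child);
        if (!array_key_exists($tag, ALLOWED_TAGS)) {
            // Unknown element: keep its (already cleaned) children, drop the tag.
            while ($child->firstChild !== null) {
                $node->insertBefore($child->firstChild, $child);
            }
            $node->removeChild($child);
            continue;
        }
        foreach (iterator_to_array($child->attributes, false) as $attr) {
            $name = strtolower($attr->nodeName);
            $keep = in_array($name, ALLOWED_TAGS[$tag], true);
            if ($keep && $name === 'href') {
                // Only plain web and mail links; this rejects javascript: URLs.
                $keep = preg_match('~^(https?://|mailto:)~i', $attr->nodeValue) === 1;
            }
            if (!$keep) {
                $child->removeAttributeNode($attr);
            }
        }
    }
}

function sanitize_comment_html(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about sloppy markup
    // The XML prolog is a common trick to make loadHTML() read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0); // our own wrapper element
    if ($root === null) {
        return '';
    }
    clean_node($root);

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo sanitize_comment_html(
    '<i><strong>This is comment</strong></i><script>alert(1)</script>'
    . '<a href="javascript:alert(1)" onclick="x()">link</a>'
), "\n";
```

The point of working on the parsed tree is that the decision is made per element and per attribute, which is what a regex or strip_tags cannot do reliably.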
I think you are looking for strip_tags. It will remove all the HTML and PHP tags from the string and keep only the tags you allow, such as <strong> and <i>.
<?php
$str = "<i><strong>this is a comment</strong></i><script>here is script</script>";
echo $str = strip_tags($str,"<i><strong>");
?>
php.net documentation for strip_tags
strip_tags has an option to allow specific tags; see php.net for more about strip_tags. You must strip unwanted or disallowed tags. If you don't, the page might be vulnerable to JavaScript too.
Use htmlspecialchars when you are storing and htmlspecialchars_decode when you are displaying. This will help you keep the format of the user-formatted content.
Two options spring to mind. First of all you can strip out all HTML and use a BB code parser to allow the user to post BB tags, rather than HTML - http://php.net/manual/en/book.bbcode.php
Secondly, you could strip out all HTML except a few tags. I don't know of any parser that does that personally, however I have seen it in action on sites before (Murphy's law I can't find any right now). You should be able to achieve this with a sophisticated enough RegEx replacement check.
Use this before printing it back on screen:
function html_escape($raw_input)
{
    return htmlspecialchars($raw_input, ENT_QUOTES | ENT_HTML401, 'UTF-8');
}

preg_replace on text only and not inside href's

I have this code to do some ugly inline text style color formatting on html content.
But this breaks anything inside tags as links, and emails.
I've figured out halfways how to prevent it from formatting links, but it still doesn't prevent the replace when the text is info#mytext.com
$text = preg_replace('/(?<!\.)mytext(?!\/)/', '<span style="color:#DD1E32">my</span><span style="color:#002d6a">text</span>', $text);
What will be a better approach to only replace text, and prevent the replacing on links?
A better approach would be to use XML functions instead.
Your lookbehind assertion only tests one character, so it's insufficient to assert matches outside of HTML tags. This is something where regular expressions aren't the best option. You can however get an approximation like:
preg_replace("/(>[^<]*)(?<![#.])(mytext)/", "$1<span>$2</span>",
This would overlook the first occurrence of mytext if it's not preceded by an HTML tag. So it works best if $text = "<div>$text</div>" or something.
[edited] ohh I see you solved the href problem.
To solve your email problem, change every info#mytext.com to [email_safeguard] with str_replace before working on the text, and when you're finished, change it back. :)
$text = str_replace('info#mytext.com','[email_safeguard]',$text);
//work on the text with preg_replace()
$text = str_replace('[email_safeguard]','info#mytext.com',$text);
that should do the trick :)
but as people have mentioned before, you better avoid html and regex, or you will suffer the wrath of Cthulhu.
see this instead

php - preg_match string not within the href attribute

I find regex kind of confusing, so I got stuck with this problem:
I need to insert <b> tags around certain keywords in a given text. The problem is that if the keyword is within the href attribute, it results in a broken link.
The code goes like this:
$text = preg_replace('/(\b'.$keyword.'\b)/i','<b>\1</b>',$text);
so for cases like
<a href="keyword.php">this keyword here</a>
i end up with:
<a href="<b>keyword</b>.php">this <b>keyword</b> here</a>
i tried all sorts of combinations but i still couldn't get the right pattern.
thanks!
You can't do that with regex alone. Regular expressions are powerful, but they can't parse a recursive grammar like HTML.
Instead, you should properly parse the HTML using an existing HTML parser. You just echo the HTML as it is until you encounter a text node; in that case, you run your preg_replace on the text before echoing it.
If your HTML is valid XHTML, you can use the xml_parse function. If it's not, then use whatever HTML parser is available.
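As a lightweight approximation of "only touch the text, echo the tags unchanged", the input can be split on anything that looks like a tag and the replacement applied to the remaining pieces only. This is not a real parser (a > inside an attribute value would confuse it), and highlight_keyword and the sample link are made up for illustration:

```php
<?php
// Hypothetical helper: wraps $keyword in <b> tags, but only outside of tags.
function highlight_keyword(string $html, string $keyword): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates text, tag, text, tag, ...
    // so even indexes hold text and odd indexes hold the captured tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $pattern = '/\b' . preg_quote($keyword, '/') . '\b/i';

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace($pattern, '<b>$0</b>', $part);
        }
    }
    return implode('', $parts);
}

echo highlight_keyword('<a href="keyword.php">this keyword here</a>', 'keyword'), "\n";
// <a href="keyword.php">this <b>keyword</b> here</a>
```

The href is left alone because it lives inside a piece that was captured as a tag, and preg_quote keeps a keyword with special characters from breaking the pattern.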
You can use preg_replace again after the first replacement to remove b tags from href:
$text=preg_replace('#(href="[^"]*)<b>([^"]*)</b>#i',"$1$2",$text);
Yes, you can use regex like that, but the code might become a little convoluted.
Here is a quick example
$string = '<a href="keyword.php">link text with keyword and stuff</a>';
$keyword = 'keyword';
$text = preg_replace(
    '/(<a href=")('.$keyword.')(\.php">)(.*)(<\/a>)/',
    "$1$2$3<b>$4</b>$5",
    $string
);
echo $string."\n";
echo $text."\n";
The contents of each () group are stored in the variables $1, $2 ... $n, so I don't have to type things over again. The match can also be made more generic to match different kinds of URL syntax if needed.
Seeing this solution you might want to rethink the way you plan to do matching of keywords in your code. :)
output:
<a href="keyword.php">link text with keyword and stuff</a>
<a href="keyword.php"><b>link text with keyword and stuff</b></a>
