Whenever I display text in an HTML document, I always put it through htmlentities for a number of reasons. One of the reasons is that if the text contains HTML, I want the browser to display the HTML code, not render it.
The application I am writing requires that I still encode using htmlentities, but hyperlinks need to be left alone.
Is there a way to do this efficiently using existing functions or do I need to implement this functionality?
You can roll your own format (or use bbcode, markdown or others).
You can parse HTML (using a proper library; not regex, please) and selectively keep all the <a> tags.
You can use regex to allow an HTML-like <a>-tag syntax, say in the form of
<a href="..."[ rel="..."]>...</a>
but keep in mind that it will not be HTML. (HTML allows rel to be specified before href, for starters.)
Also see this question; particularly the comments to my answer.
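As a rough illustration of the third option, the sketch below escapes the whole text first and then turns only the escaped form of <a href="...">...</a> back into a real link. The function name escape_except_links, the restriction to http/https URLs, and the added rel="nofollow" are assumptions made for this example, not part of the answer above.

```php
<?php
// Hypothetical helper: escape everything, then revive one strict link syntax.
function escape_except_links(string $text): string
{
    // Escape all markup first, so nothing the user typed is live HTML.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Match only the escaped form of <a href="URL">label</a>.
    // Inside the URL, "&" may appear only as "&amp;", so the match
    // cannot run past the closing (escaped) quote.
    $pattern = '~&lt;a href=&quot;(https?://(?:[^&\s]|&amp;)+)&quot;&gt;(.*?)&lt;/a&gt;~s';

    // The label ($2) is already escaped, so it is safe to reuse as-is.
    return preg_replace($pattern, '<a href="$1" rel="nofollow">$2</a>', $escaped);
}

echo escape_except_links('<b>hi</b> <a href="http://example.com/?a=1&b=2">link</a>');
```

Anything that does not fit the pattern exactly, such as an extra onclick attribute or a javascript: URL, simply stays escaped and is shown as text. As the answer says, this is an HTML-like syntax rather than real HTML.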
The usual way is to pass any "possibly harmful data" through htmlspecialchars() before showing it as part of a webpage. You can do that for user's comment, note, etc.
For any URL that users entered, you can show it on screen using htmlspecialchars(). The URL will be displayed on screen as it is (any & will be escaped to &amp;, but when shown on screen it will become & again). Maybe your concern is when it is linked, as in <a href="...">text</a>, in which case you can escape the 4 characters < > " ' because you don't want the & to be further escaped into &amp;, or you can use filter_var() to sanitize the URL: http://us3.php.net/manual/en/function.filter-var.php
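A minimal sketch of the linked case, assuming a hypothetical helper named render_user_link and an http/https-only policy (both are choices made for this example). It validates the URL with filter_var() and escapes it with htmlspecialchars(); browsers decode entities inside attribute values, so an &amp; in the href is read back as a plain &.

```php
<?php
// Hypothetical helper: build a link from user-supplied URL and label.
function render_user_link(string $url, string $label): string
{
    $safeLabel = htmlspecialchars($label, ENT_QUOTES, 'UTF-8');

    // FILTER_VALIDATE_URL does not restrict the scheme by itself,
    // so the scheme is checked separately (assumed policy: http/https only).
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    $isValid = filter_var($url, FILTER_VALIDATE_URL) !== false
        && in_array($scheme, ['http', 'https'], true);

    if (!$isValid) {
        // Fall back to plain escaped text when the URL is not acceptable.
        return $safeLabel;
    }

    return '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">'
        . $safeLabel . '</a>';
}

echo render_user_link('http://example.com/?a=1&b=2', 'A & B');
// prints <a href="http://example.com/?a=1&amp;b=2">A &amp; B</a>
```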
Related
I am working on a comment system in Codeigniter and would appreciate some advice on what kind of validation rules I should employ. I don't want to allow any images or any other HTML.
So far I just have trim and max_length set. I also run the content through htmlspecialchars before I insert in the database. I have XSS filtering enabled globally.
What other precautions should I take? Is htmlspecialchars enough for preventing Javascript or other malicious code from being entered?
You should probably do a regular form validation on required and max_length, and obviously xss filtering before pushing things to the database. The htmlspecialchars should only be applied to characters that aren't in tags, so you can't just do htmlspecialchars directly. You need to:
1 - strip the tag elements (and store them) like "<br/>" or "<b>", but not their content, that means nothing inside the "<b>" and "</b>". You can probably do this with a preg_match.
2 - execute htmlentities on all the remaining text
3 - remove all unwanted explicit tags (from the stored bunch of tags)
strip_tags ( string $str [, string $allowable_tags ] )
4 - then filter the allowed tags for attributes and content. It's not uncommon for hackers to use code like
<b onMouseOver="window.open(..)"></b>
To fix this, you'll have to do a little bit of extra work, probably with some regexes. If you want me to write some more sample code, let me know.
5 - re-add the tag elements back to the document.
I just basically cooked this up right now. The algorithm can be improved in efficiency (i.e. strip the unwanted tags first, and then proceed with filtering html entities and tag contents) but I'll leave that up to you.
This is as far as I can see the potential hacks right now. There might be other ways to hack your input though, so you might want to check what other comment box systems out there use for their validation, such as the phpbb forum system. Another option might be to use the phpbb square-bracket format to deal with tags so you don't let users input ANY html tags whatsoever, but instead use square-bracket tags that you control.
Does this answer your question?
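The steps above can be approximated with a simpler variant, which is not the exact algorithm described: escape everything, then restore only bare allow-listed tags that carry no attributes at all. The function name escape_with_bare_tags and the default tag list are assumptions made for this sketch.

```php
<?php
// Hypothetical helper: escape all input, then re-enable attribute-free tags.
function escape_with_bare_tags(string $text, array $allowed = ['b', 'i', 'em', 'strong']): string
{
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    if ($allowed === []) {
        return $escaped;
    }

    // Build an alternation such as "b|i|em|strong" from the allow-list.
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '~');
    }, $allowed));

    // Only "<tag>" and "</tag>" with nothing else inside are restored, so
    // <b onMouseOver="..."> never matches and stays visible as escaped text.
    return preg_replace('~&lt;(/?)(' . $names . ')&gt;~i', '<$1$2>', $escaped);
}

echo escape_with_bare_tags('<b onMouseOver="x()">hi</b> <b>ok</b> <script>');
// prints &lt;b onMouseOver=&quot;x()&quot;&gt;hi</b> <b>ok</b> &lt;script&gt;
```

Note that this does not balance tags: in the output above the first </b> is restored even though its opening tag was not. A parser-based approach is needed if well-formed output matters.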
We have our own blog system and the post data is stored as raw HTML, so when it's called from the db we can just echo it and it's formatted completely, no need for BB codes in our situation. Our issue now is that our blog posts are sometimes too long and need to be trimmed.
The problem is that our data contains HTML, mostly <font>, <span>, <p>, <b>, and other styling tags. I made a PHP function that trims the characters, but it doesn't take the HTML tags into account. When the function trims a post it should not cut through tags, because that breaks the whole page, and it needs to close any HTML tags left open by the cut. Is there a function out there that can do this, or one I could start from and build on?
There's a good example here of truncating text while preserving HTML tags.
There is strip_tags which gets rid of all HTML tags but other than that there isn't much.
This is not an easy thing by the way, you have to actually parse the HTML to find out which tags are left open - that's the most robust approach anyway. Also, don't use a regular expression.
The right solution is to not store display information in your database layer.
Failing that, you could use CSS overflow properties: print the whole post, and then have the display layer handle sizing it to fit. This mitigates the problem of having formatting information in your database by putting the resizing (a display issue, not a content issue) into the display layer as well.
Failing that, you could parse the HTML and "round up" or "round down" to the nearest tag boundary, then insert the tag-close characters necessary to finish the block you were in.
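A rough sketch of the parsing approach, assuming PHP's bundled DOM extension is available. The names truncate_html and trim_node are hypothetical. It counts only visible text characters, cuts the text node where the limit is reached, drops everything after it, and lets the serializer emit the closing tags.

```php
<?php
// Walk the tree in document order, spending $remaining on text characters.
function trim_node(DOMNode $node, int &$remaining): void
{
    // Copy the child list first, because children may be removed while looping.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);          // past the limit: drop it
        } elseif ($child instanceof DOMText) {
            if ($child->length > $remaining) {
                // Cut the text inside this node and stop spending.
                $child->deleteData($remaining, $child->length - $remaining);
                $remaining = 0;
            } else {
                $remaining -= $child->length;
            }
        } else {
            trim_node($child, $remaining);       // descend into elements
        }
    }
}

// Hypothetical entry point: $limit counts text characters, not markup.
function truncate_html(string $html, int $limit): string
{
    libxml_use_internal_errors(true);            // tolerate sloppy markup
    $doc = new DOMDocument();
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body>' . $html . '</body></html>'
    );
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    trim_node($body, $limit);

    // Serialize only the body's children, so no wrapper tags leak out.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo truncate_html('<p>Hello <b>bold world</b> and more</p>', 9);
```

Because the open <b> and <p> elements are still in the tree when the cut happens, they are closed automatically on output. The serializer may normalize whitespace or entities, so the result is not guaranteed to be byte-identical to the stored markup.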
Another option is to iframe the content.
I know this isn't the best way to do it programmatically, but have you considered manually specifying where the cut should be? Adding a marker of some kind and cutting there manually would allow you to control where the cut happened, regardless of the number of characters before it. For example, you could always put that below the first paragraph.
Admittedly, you lose the ability to just have it happen automatically, but I bring it up in case that doesn't matter as much to you.
I need to use a wysiwyg editor for handling user input.
How do you process this in php?
If I retrieve the data and use htmlspecialchars then all the characters that were converted to special characters by the wysiwyg editor will be messed up.
For example, a quote will be &quot;
When I use htmlspecialchars in PHP, the & will be converted to &amp;
It will be an obvious problem. Any ideas?
Have you considered keeping a plain-text and an additional HTML record of whatever is being modified? You can display the plain text, and when you save it you could also convert it to HTML and save that in a separate field.
If special chars are being converted to HTML though, wouldn't they still appear properly (to the user) when you are printing text out to editable form fields in html?
Let me know if I've misunderstood
Most editors (CKEditor, CLEditor and NicEdit, to mention a few) support two modes of input: visual and direct input (usually called HTML mode).
When the user is entering text in visual mode, the editor takes care of converting html-like characters to the respective HTML entity while the user is typing his/her content. In this mode, the editor will typically add markup for the user (mostly paragraphs).
Direct input works like you'd expect from the name; The user is exposed to the HTML his or her content is made up of.
How you should handle the input data depends mostly on the user's role.
If the user is trusted (i.e. an administrator for a company website), the user should be able to use both input modes.
If the user is untrusted (an anonymous user posting a comment on a blog post), the user should not be able to input (potentially malicious, think XSS) markup.
If your users need some options for formatting their content, you should probably look into using another type of markup, e.g. BBCode. This prevents the user from injecting any <script> tags into the content that might be shown to other users.
You will still need to strip any HTML tags from the user content though.
I have an HTML table that displays information from a database, and one of the database fields contains a parameter list such as:
id=eff34-435-567rt-65u&notification=5
But when I display this in the table the &not becomes ¬
I know that you can manually force it to print the right way by using
&amp;not
But I would really rather be able to just use something to force the HTML to ignore the code, so I can pull the text straight from the database and print it to the table without having to do a regex to find any & and replace them with &amp;. I tried using the <pre> tag but that did not work.
Is there any way to force the HTML to print exactly what is typed for that specific td field?
Nothing practical (CDATA doesn't have browser support in text/html mode). Write proper HTML instead.
You should be running anything that comes out of the database through a conversion function to make it HTML safe anyway (to protect against XSS if nothing else). PHP has htmlspecialchars(), TT has | html. Whatever you are using should have something other than a regex.
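In PHP, for instance, this might look like the sketch below, applied to the parameter string from the question. The helper name table_cell is made up for the example.

```php
<?php
// Hypothetical helper: wrap a database value in a table cell, HTML-encoded.
function table_cell(string $value): string
{
    // "&" becomes "&amp;", so the browser no longer reads "&not" as an entity.
    return '<td>' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '</td>';
}

echo table_cell('id=eff34-435-567rt-65u&notification=5');
// prints <td>id=eff34-435-567rt-65u&amp;notification=5</td>
```

The browser then displays the cell exactly as the text is stored in the database.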
&amp; is the correct HTML encoding for &. You will need to write &amp;not for it to display correctly.
If you're pulling from a database, you can use whatever programming language is available to you to encode HTML entities for you.
For example, in PHP, you could use htmlentities or htmlspecialchars.
Try using htmlspecialchars().
most frameworks have HTML Encode functions.
in JavaScript: encode
in C# .NET: HttpServerUtility.HtmlEncode
Just run an HTMLEncode on the string before outputting it. Every server-side scripting language I know of has a built in command to do this. Not to mention that you are eventually going to run into another character that causes problems too.
ASP.NET: HttpServerUtility.HtmlEncode
PHP: htmlentities
Regex should definitely NOT be necessary.
After I implemented my sanitize functions (according to the requested specifics), my boss decided to change the accepted input. Now he wants to keep some specific tags and their attributes. I suggested implementing a BBCode-like language, which is safer imho, but he doesn't want to because it would be too much work.
This time I would like to keep it simple so I will not kill him the next time he asks me to change again this thing. And I know he will.
Is it enough to use first the strip_tags with the tag parameter to preserve and then htmlentities?
strip_tags does not necessarily result in safe content. strip_tags followed by htmlentities would be safe, in that anything HTML-encoded is safe, but it doesn't make any sense.
Either the user is inputting plain text, in which case it should be output using htmlspecialchars (in preference to htmlentities), or they're inputting HTML markup, in which case you need to parse it properly, fixing broken markup and removing elements/attributes that aren't in a safe whitelist.
If that's what you want, use an existing library to do it (eg. htmlpurifier). Because it's not a trivial task and if you get it wrong you've given yourself XSS security holes.
You can keep specific tags using strip_tags with this syntax: strip_tags($text, '<p><a>');
That snippet would strip all tags except p and a. Attributes are kept for tags you have allowed (p and a in the above example).
However, this doesn't mean that the attributes are safe. Does he want specific attributes, or does he want to keep all of them on allowed tags? In the first case, you would need to parse each tag, remove the attributes that are not wanted, and sanitize the values of the rest. To keep all attributes on allowed tags, you still need to sanitize them. I would recommend running htmlentities on the attribute values to sanitize them (for display, I would assume).
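A minimal sketch of the first case, assuming PHP's bundled DOM extension: strip_tags() removes the disallowed tags, then every attribute that is not on a per-tag allow-list is dropped. The function name keep_tags_and_attributes, the shape of the $allowed array, and the http/https-only rule for href are all assumptions made for this example. It is an illustration only and not a substitute for a vetted sanitizer such as HTML Purifier.

```php
<?php
// Hypothetical helper. $allowed maps tag name => list of permitted attributes,
// e.g. ['p' => [], 'a' => ['href']].
function keep_tags_and_attributes(string $html, array $allowed): string
{
    // Step 1: drop every tag that is not allow-listed (attributes survive this).
    $allowedTags = '<' . implode('><', array_keys($allowed)) . '>';
    $stripped = strip_tags($html, $allowedTags);

    // Step 2: parse what is left so attributes can be inspected reliably.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body>' . $stripped . '</body></html>'
    );
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    foreach ($body->getElementsByTagName('*') as $element) {
        $keep = $allowed[$element->nodeName] ?? [];
        // Copy the attribute list, because it changes while attributes are removed.
        foreach (iterator_to_array($element->attributes) as $attr) {
            $badHref = $attr->nodeName === 'href'
                && !preg_match('~^https?://~i', $attr->nodeValue);
            if (!in_array($attr->nodeName, $keep, true) || $badHref) {
                $element->removeAttribute($attr->nodeName);
            }
        }
    }

    // Step 3: serialize only the body's children; values are re-escaped on output.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo keep_tags_and_attributes(
    '<p onclick="x()">Hi <a href="http://example.com/" onmouseover="y()">link</a></p>',
    ['p' => [], 'a' => ['href']]
);
```

Here the onclick and onmouseover handlers are removed, while the href on the <a> tag is kept because it is both allow-listed and an http URL.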