handling WYSIWYG data in PHP

I need to use a wysiwyg editor for handling user input.
How do you process this in php?
If I retrieve the data and use htmlspecialchars, then all the characters that the WYSIWYG editor already converted to entities will be messed up.
For example, a quote will be &quot;
When I use htmlspecialchars in PHP, the & will be converted to &amp;
It will be an obvious problem. Any ideas?

Have you considered keeping a plain-text record and an additional HTML record of whatever is being modified? You can display the plain text, and when you save it you could also convert it to HTML and save that in a separate field.
If special chars are being converted to HTML though, wouldn't they still appear properly (to the user) when you are printing text out to editable form fields in html?
Let me know if I've misunderstood

Most editors (CKEditor, CLEditor and NicEdit, to mention a few) support two modes of input: visual and direct input (usually called HTML mode).
When the user is entering text in visual mode, the editor takes care of converting HTML-like characters to the respective HTML entities while the user is typing their content. In this mode, the editor will typically add markup for the user (mostly paragraphs).
Direct input works like you'd expect from the name: the user is exposed to the HTML their content is made up of.
How you should handle the input data depends mostly on the user's role.
If the user is trusted (e.g. an administrator for a company website), the user should be able to use both input modes.
If the user is untrusted (an anonymous user posting a comment on a blog post), the user should not be able to input (potentially malicious, think XSS) markup.
If your users need some options for formatting their content, you should probably look into using another type of markup, e.g. BBCode. This prevents the user from injecting any <script> tags into the content that might be shown to other users.
You will still need to strip any HTML tags from the user content though.
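As a rough illustration of the BBCode idea, here is a minimal sketch. The function name renderBbcode() and the three supported tags are assumptions made for this example, not part of any library. It escapes the whole input first, so raw HTML becomes inert text, and only then turns a small whitelist of BBCode tags into markup.

```php
<?php
// Hypothetical helper: supports only [b], [i] and [url=...] tags.
function renderBbcode(string $input): string
{
    // 1. Escape everything, so any raw HTML the user typed becomes inert text.
    $html = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // 2. Re-introduce a small whitelist of markup from the BBCode tags.
    $html = preg_replace('~\[b\](.*?)\[/b\]~s', '<strong>$1</strong>', $html);
    $html = preg_replace('~\[i\](.*?)\[/i\]~s', '<em>$1</em>', $html);
    $html = preg_replace(
        '~\[url=(https?://[^\s\]]+)\](.*?)\[/url\]~s',
        '<a href="$1" rel="nofollow">$2</a>',
        $html
    );

    // 3. Keep the user's line breaks visible.
    return nl2br($html);
}

echo renderBbcode("[b]Hello[/b] <script>alert(1)</script>");
```

Because the escaping happens before the tag conversion, the only angle brackets in the result are the ones the function itself writes. A production setup would more likely use an existing BBCode or Markdown library.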

Related

Best practice when sanitizing HTML form user input in PHP / CodeIgniter 4

I have a simple app programmed in PHP using CodeIgniter 4 framework and, as a web application, it has some HTML forms for user input.
I am doing two things:
In my Views, all variables from the database that come from user input are sanitized using CodeIgniter 4's esc() function.
In my Controllers, when reading HTTP POST data, I am using PHP filters:
$data = trim($this->request->getPost('field', FILTER_SANITIZE_SPECIAL_CHARS));
I am not sure if sanitizing both when reading data from POST and when printing/displaying to HTML is a good practice or if it should only be sanitized once.
In addition, FILTER_SANITIZE_SPECIAL_CHARS is not working as I need. I want my HTML form text input to prevent users from attacking with HTML but I want to keep some 'line breaks' my database has from the previous application.
FILTER_SANITIZE_SPECIAL_CHARS will NOT delete HTML tags; it stores them in the database in encoded form rather than as HTML, but it also changes my line breaks. Is there a filter that doesn't remove HTML tags (only stores them with proper encoding) but that respects \n line breaks?

You don't need to sanitize User input data as explained in the question below:
How can I sanitize user input with PHP?
It's a common misconception that user input can be filtered. PHP even
has a (now deprecated) "feature", called
magic-quotes,
that builds on this idea. It's nonsense. Forget about filtering (or
cleaning, or whatever people call it).
In addition, you don't need to use FILTER_SANITIZE_SPECIAL_CHARS, htmlspecialchars(...), htmlentities(...), or esc(...) either for most use cases:
-Comment from OP (user1314836)
I definitely think that I don't need to sanitize user-input data
because I am not writing SQL directly but rather using CodeIgniter 4's
functions to create SQL safe queries. On the other hand, I do
definitely need to esc() that same information when showing to avoid
showing html where just text is expected.
The reason why you don't need the esc() method for most use cases is:
Most User form input in an application doesn't expect a User to submit/post HTML, CSS, or JavaScript that you plan on displaying/running later on.
If the expected User input is just plain text (username, age, birth date, etc), images, or files, use form validation instead to disallow unexpected data.
See: Available Rules and Creating Custom Rules
By using the Query Builder for your database queries and rejecting unexpected User input data using validation rules (alpha, alpha_numeric_punct, numeric, exact_length, min_length[8], valid_date, regex_match[/regex/], uploaded, etc.), you can avoid most potential security holes, e.g. SQL injections and XSS attacks.
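The "validate and reject" approach can be sketched in plain PHP. This is not CodeIgniter's validation library. The function name validateSignup() and the field names are made up for illustration. The idea is the same as with the framework rules: unexpected input is rejected with an error message rather than silently altered.

```php
<?php
// Hypothetical validator for a sign-up form; field names are made up.
function validateSignup(array $post): array
{
    $errors = [];
    $username = (string) ($post['username'] ?? '');
    $birthDate = (string) ($post['birth_date'] ?? '');

    // Username: letters, digits and underscore only, 3 to 20 characters.
    if (!preg_match('/\A[A-Za-z0-9_]{3,20}\z/', $username)) {
        $errors['username'] = 'Use 3-20 letters, digits or underscores.';
    }

    // Age: must be an integer in a plausible range.
    $age = filter_var($post['age'] ?? '', FILTER_VALIDATE_INT, [
        'options' => ['min_range' => 0, 'max_range' => 150],
    ]);
    if ($age === false) {
        $errors['age'] = 'Enter a whole number between 0 and 150.';
    }

    // Birth date: must be a real calendar date in YYYY-MM-DD form.
    $date = DateTime::createFromFormat('Y-m-d', $birthDate);
    if ($date === false || $date->format('Y-m-d') !== $birthDate) {
        $errors['birth_date'] = 'Enter a date as YYYY-MM-DD.';
    }

    return $errors;
}
```

An empty array means the form can be processed. Otherwise the form would be shown again with the user's original values and the messages.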

Answer from steven7mwesigwa gets my vote, but here is how you should be thinking about it.
Rules Summary
You should always hold in memory the actual data that you want to process.
You should always convert the data on output into a format that the output can process.
Inputs:
You should strip from all untrusted inputs (user forms, databases that you didn't write to, XML feeds that you don't control etc)
any data that you are unable to process (e.g. if you are not able to handle multi-byte strings as you are not using the right functions, or your DB won't support it, or you can't handle UTF8/16 etc, strip those extra characters you can't handle).
any data that will never form part of the process or output (e.g. if you can only have an integer/bool then convert to int/bool; if you are only showing data on an HTML page, then you may as well trim spaces; if you want a date, strip anything that can't be formatted as a date [or reject*]).
This means that many "traditional" cleaning functions are not needed (e.g. Magic Quotes, strip_tags and so on): but you need to know you can handle the code. You should only strip_tags or escape or so on if you know it is pointless having that data in that field.
*Note: For user input I prefer to hold the data as the user entered it and reject the form, allowing them to try again. E.g. if I'm expecting a number and I get "hello", then I'll reload the form with "hello" and tell the user to try again. steven7mwesigwa has links to the validation functions in CI that make that happen.
Outputs:
Choose the correct conversion for the output: and don't get them muddled up.
htmlspecialchars (or family) for outputting to HTML or XML; although this is usually handled by any templating engine you use.
Escaping for DB input; although this should be left to the DB engine you use (e.g. parameterised queries, query builder etc).
urlencode for outputting a URL
as required for saving images, json, API responses etc
Why?
If you do your output conversion on input, then you can easily double-convert an input, or lose track of whether you need to make it safe before output, or lose data the user wanted to enter. Mistakes happen, but following clean rules will prevent them.
This also means there is no need to reject special characters (forms that reject quote marks are a horrible user experience, for example, and anyone restricting which characters can go in a password field is only weakening security).
In your particular case:
Drop the FILTER_SANITIZE_SPECIAL_CHARS on input, hold the data as the user gave it to you
Output using the template engine as you have it: this will display < > tags as the user entered them, but won't break your output.
You will essentially sanitize each and every output (which you appear to want to avoid), but that's safer than accidentally missing a sanitize on output and a better user experience than losing stuff they typed.
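A small sketch of "store raw, convert per output" in plain PHP. The sample text and the search.php path are invented for this example. The same stored value is converted differently for each output context, and converting twice shows the double-encoding problem described above.

```php
<?php
// The value exactly as the user typed it; this is what gets stored.
$comment = "Tom & Jerry <3\nsecond line";

// HTML body: escape, then turn newlines into <br /> for display.
$forHtml = nl2br(htmlspecialchars($comment, ENT_QUOTES, 'UTF-8'));

// URL query component (search.php is a hypothetical script name).
$forUrl = 'search.php?q=' . rawurlencode($comment);

// JSON API response.
$forJson = json_encode(['comment' => $comment]);

// Converting twice is the classic mistake: the page now shows "&amp;".
$doubleEscaped = htmlspecialchars(htmlspecialchars($comment, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'UTF-8');

echo $forHtml, "\n", $forUrl, "\n", $forJson, "\n", $doubleEscaped, "\n";
```

Since the stored value is never modified, the line breaks from the old application stay intact and nl2br() only affects how they are displayed.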

From my understanding,
FILTER_SANITIZE_SPECIAL_CHARS is used to sanitize the user input before you act on it or store it.
Whereas esc is used to escape HTML etc in the string so they don't interfere with normal html, css etc. It is used for viewing the data.
So, you need both, one for input and the other for output.
Following from codeigniter.com. Note, it uses the Laminas Escaper library.
esc($data[, $context = 'html'[, $encoding]])
Parameters
$data (string|array) – The information to be escaped.
$context (string) – The escaping context. Default is ‘html’.
$encoding (string) – The character encoding of the string.
Returns
The escaped data.
Return type
mixed
Escapes data for inclusion in web pages, to help prevent XSS attacks. This uses the Laminas Escaper library to handle the actual filtering of the data.
If $data is a string, then it simply escapes and returns it. If $data is an array, then it loops over it, escaping each ‘value’ of the key/value pairs.
Valid context values: html, js, css, url, attr, raw
From docs.laminas.dev
What laminas-Escaper is not
laminas-escaper is meant to be used only for escaping data for output, and as such should not be misused for filtering input data. For such tasks, laminas-filter, HTMLPurifier or PHP's Filter functionality should be used.
Some of the things they do are similar. For example, both may convert < to &lt;. However, your stored data may not have come only from user input, and it may have < in it. It is perfectly safe to store it this way,
but it needs to be escaped for output, otherwise the browser could get confused, thinking it's HTML.

I think for this situation using esc is sufficient. FILTER_SANITIZE_SPECIAL_CHARS is a PHP sanitize filter that encodes '"<>& and characters with an ASCII value below 32 (which includes line breaks), and optionally strips or encodes other special characters according to the flag. To do that you need to set the flag. It is the third parameter of the getPost() method. Here is an example:
$this->request->getPost('field', FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_ENCODE_HIGH)
This flag can be changed according to your requirements. You can use any PHP filter with a flag. Please refer to the PHP documentation for more info.
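To see why the line breaks from the question get altered, the following sketch compares the filter with htmlspecialchars() on the same input, using filter_var() directly instead of CodeIgniter's getPost(). The sample string is made up. As far as the PHP manual describes it, FILTER_SANITIZE_SPECIAL_CHARS also encodes characters below ASCII 32, which includes "\n".

```php
<?php
$input = "line one\nline <b>two</b>";

// FILTER_SANITIZE_SPECIAL_CHARS also encodes characters below ASCII 32,
// so the newline no longer survives as a literal "\n".
$filtered = filter_var($input, FILTER_SANITIZE_SPECIAL_CHARS);

// htmlspecialchars() only touches & < > " ' and leaves "\n" alone.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

var_dump(strpos($filtered, "\n") !== false); // newline kept by the filter?
var_dump(strpos($escaped, "\n") !== false);  // newline kept by htmlspecialchars()?
```

If a filter is wanted on input anyway, FILTER_SANITIZE_FULL_SPECIAL_CHARS is documented as equivalent to htmlspecialchars() with ENT_QUOTES. Storing the raw text and escaping only on output avoids the question altogether.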

Using HTML Purifier on a site with only plain text input

I would appreciate an answer to settle a disagreement between me and some co-workers.
We have a typical PHP / LAMP web application.
The only input we want from users is plain text. We do not invite or want users to enter HTML at any point. Form elements are mostly basic input text tags. There might be a few textareas, checkboxes etc.
There is currently no sanitizing of output to pages. All dynamic content, some of which came from user input, is simply echoed to the page. We obviously need to make it safe.
My solution is to use htmlspecialchars on all output at the time it is echoed on the page.
My co-workers' solution is to add HTML Purifier to the database layer. They want to pass all user entered input through HTML Purifier before it is saved to the database. Apparently they've used it like this on other projects but I think that is a misunderstanding of what HTML Purifier is for.
My understanding is that it only makes sense to use HTML Purifier on a site which allows the user to enter HTML. It takes HTML and makes it safer and cleaner based on a whitelist and other rules.
Who's right and who's wrong?
There's also the whole "escape on input or output" issue but I guess that's a debate for another time and place.
Thanks

As a general rule, escaping should be done for context and for use-case.
If what you want to do is output plain text in an HTML context (and you do), then you need to use escaping functionality that will ensure that you will always output plain text in an HTML context. Given basic PHP, that would indeed be htmlspecialchars($yourString, ENT_QUOTES, 'yourEncoding');.
If what you want to do is output HTML in an HTML context (you don't), then you would want to sanitise the HTML when you output it to prevent it from doing damage - here you would $purifier->purify($yourString); on output.
If you want to store plain text user input in a database (again, you do) by executing SQL statements, then you should either use prepared statements to prevent SQL injection, or an escaping function specific to your DB, such as mysql_real_escape_string($yourString) (on current PHP versions mysqli_real_escape_string(), since the mysql_* functions were removed in PHP 7).
You should not:
escape for HTML when you are putting data into the database
sanitise as HTML when you are putting data into the database
sanitise as HTML when you are outputting data as plain text
Of those, all are outright harmful, albeit to different degrees. Note that the following assumes the database is your only or canonical storage medium for the data (it also assumes you have SQL injection taken care of in some other way - if you don't, that'll be your primary issue):
if you escape for HTML when you put the data into the database, you rely on the guarantee that you will always be outputting the data into an HTML context; suddenly if you want to just put it into a plaintext file for printing as-is, you need to decode the data before you output it.
if you sanitise as HTML when you put the data into the database, you are destroying information that your user put there. Is it a messaging system and your user wanted to tell someone else about <script> tags? Your user can't do that - you'll destroy that part of his message!
Sanitising as HTML when you're outputting data as plain text (without also escaping it) may have confusing, page-breaking results if you don't set your sanitising module to strip all HTML (which you shouldn't, since then you clearly don't want to be outputting HTML).
Did you sanitise for a <div> context, but are putting your data into an inline element? Your user might put a <div> into your inline element, forcing a layout break into your page layout (how annoying this is depends on your layout), or to influence user perception of metadata (for example to make phishing easier), e.g. like this:
Name: John Doe(Site admin)
Did you sanitise for a <span> context? The user could use other tags to influence user perception of metadata, e.g. like this:
Name: John Doe (this user is an administrator)
Worst-case scenario: Did you sanitise your HTML with a version of HTML Purifier that later turns out to have a bug that does allow a certain kind of malicious HTML to survive? Now you're outputting untrusted data and putting users that view this data on your web page at risk.
Sanitising as HTML and escaping for HTML (in that order!) does not have this problem, but it means the sanitising step is unnecessary, meaning this constellation will just cost you performance. (Presumably that's why your colleague wanted to do the sanitising when saving the data, not when displaying it - presumably your use-case (like most) will display the data more often than the data will be submitted, meaning you would avoid having to deal with the performance hit frequently.)
tl;dr
Sanitising as HTML when you're outputting as plain text is not a good idea.
Escape / sanitise for use-case and context.
In your situation, you want to escape plain text for an HTML context (= use htmlspecialchars()).
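A short sketch of the difference between escaping on output and escaping on input. The helper name e() and the sample text are assumptions for this example. It shows that the escaped-at-storage variant needs an extra decoding step as soon as the data leaves the HTML context.

```php
<?php
// Hypothetical helper name; wraps htmlspecialchars() with explicit settings.
function e(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$raw = 'How do I use <script> tags? "Ask" & learn';

// Escape on output: the database keeps $raw, the page gets e($raw).
$page = '<p>' . e($raw) . '</p>';

// Escape on input: the stored value is only correct for HTML.
$stored = e($raw);
$plainTextExport = $stored;                                   // wrong: shows entities
$fixedExport = htmlspecialchars_decode($stored, ENT_QUOTES);  // extra step needed

echo $page, "\n", $plainTextExport, "\n", $fixedExport, "\n";
```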

Displaying user entered text for re-editing won't pass w3c validation

I am trying to wrap my head around how to have an enter/save/re-display/re-edit/save cycle and pass w3c validation for html5.
The data can contain HTML special characters (e.g. links or other HTML code).
The data is purified with HTMLPurify() before being saved to the database.
The problem:
If I display the data for re-editing as is, w3c will generate errors
for '&' and other special characters entered in the text.
If I use htmlspecialchars() before displaying, I can't use
htmlspecialchars_decode() after editing without potentially messing
up what was entered. For example, if I enter &amp; manually in the
text while re-editing, I don't want it decoded back to &.
Is there any way around this? This is not an issue of security, but how to follow the rules and retain functionality.
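One possible way to look at this cycle, sketched under the assumption that the text is printed inside a textarea: escape once with htmlspecialchars() when writing the form, and do not decode on submit, because the browser already decodes the entities once when it parses the page. In the sketch, html_entity_decode() is only a stand-in for that browser step, and the sample text is made up.

```php
<?php
// Text as stored; the user literally typed the five characters "&amp;".
$stored = 'Write &amp; to show an ampersand, e.g. <a href="?a=1&b=2">';

// Escape once when printing into the form, so the markup stays valid.
$textarea = '<textarea name="body">'
    . htmlspecialchars($stored, ENT_QUOTES, 'UTF-8')
    . '</textarea>';

// Stand-in for the browser: it decodes entities once when parsing the page,
// so the edit box shows, and later submits, the original text.
$submitted = html_entity_decode(
    htmlspecialchars($stored, ENT_QUOTES, 'UTF-8'),
    ENT_QUOTES,
    'UTF-8'
);

var_dump($submitted === $stored);
```

Under that assumption a manually typed &amp; survives the round trip unchanged, and no call to htmlspecialchars_decode() is needed on the server.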

How do you apply htmlentities selectively?

Whenever I display text in an HTML document I always put it through htmlentities, for a number of reasons. One of the reasons is that if the text contains HTML, I want the browser to display the HTML code, not render it.
The application I am writing requires that I still encode using htmlentities, but hyperlinks need to be left alone.
Is there a way to do this efficiently using existing functions or do I need to implement this functionality?

You can roll your own format (or use bbcode, markdown or others).
You can parse HTML (using a proper library; not regex, please) and selectively keep all the <a> tags.
You can use regex to allow an HTML-like <a>-tag syntax, say in the form of
<a href="..."[ rel="..."]>...</a>
but keep in mind that it will not be HTML. (HTML allows rel to be specified before href, for starters.)
Also see this question; particularly the comments to my answer.
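The regex option could look roughly like the sketch below. The function name escapeExceptLinks() is an assumption for this example, and the accepted syntax is deliberately narrower than real HTML: only an href attribute, only http or https. Everything is escaped first, and only the fixed delimiters of that one link form are turned back into markup, so whatever sits inside the link stays escaped.

```php
<?php
// Hypothetical helper: escape everything, then restore one strict link form.
function escapeExceptLinks(string $text): string
{
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Only <a href="http(s)://...">label</a> with no other attributes is restored.
    $result = preg_replace(
        '~&lt;a href=&quot;(https?://\S+?)&quot;&gt;(.*?)&lt;/a&gt;~s',
        '<a href="$1">$2</a>',
        $escaped
    );

    // preg_replace() returns null on a regex error; fall back to the escaped text.
    return $result ?? $escaped;
}

echo escapeExceptLinks('See <a href="https://example.com/?a=1&b=2">this</a> and <b>bold</b>');
```

Links with extra attributes or other schemes simply stay visible as escaped text, which matches the point that this syntax only looks like HTML.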

The usual way is to pass any "possibly harmful data" through htmlspecialchars() before showing it as part of a webpage. You can do that for user's comment, note, etc.
For any URL that users entered, you can show it on screen using htmlspecialchars(). The URL will be displayed on screen as it is (any & will be escaped to &amp;, but when shown on screen it will become & again). Maybe your concern is when it is linked, as in <a href="...">text</a>, in which case you can escape the 4 characters < > " ' because you don't want the & to be further escaped into &amp;, or you can use filter_var() to sanitize the URL: http://us3.php.net/manual/en/function.filter-var.php
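For the linked case, one option is to validate the URL and then escape it like any other attribute value. The sketch below uses a made-up helper name, linkOrText(), and assumes only http and https links are wanted. Note that inside an href attribute the browser decodes &amp; back to &, so escaping the ampersand there does not break the link.

```php
<?php
// Hypothetical helper: returns an <a> element, or escaped plain text when
// the value is not an http(s) URL.
function linkOrText(string $url): string
{
    $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');

    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    $isWebUrl = filter_var($url, FILTER_VALIDATE_URL) !== false
        && in_array($scheme, ['http', 'https'], true);

    return $isWebUrl ? '<a href="' . $safe . '">' . $safe . '</a>' : $safe;
}

echo linkOrText('https://example.com/?a=1&b=2');
```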

working with user input data in php. What's better?

I am trying to figure out what is the best way to manage the data a user inputs concerning non desirable tags he might insert:
strip_tags() - the tags are removed and they are not inserted in the database
the tags are inserted in the database, but when reading that field and displaying it to the user we would use htmlspecialchars()
Which is better, and are there any disadvantages to either of these?
Regards

This depends on what your priority is:
if it's important to display special characters from user input (like on StackOverflow, for example), then you'll need to store this information in the database and sanitize it on display - in this case, you'll want to at least use htmlspecialchars() to display the output (if not something more sophisticated)
if you just want plain text comments, use strip_tags() before you stick it in the database - this way you'll reduce the amount of data that you need to store, and reduce processing time when displaying the data on the screen

the tags are inserted in the database, but when reading that field and displaying it to the user we would use htmlspecialchars()
This. You usually want people to be able to type less-than signs and ampersands and have them displayed as such on the page. htmlspecialchars on every text-to-HTML output step (whether that text came directly from user input, or from the database, or from somewhere else entirely) is the right way to achieve this. Messing about with the input is a not-at-all-appropriate tactic for dealing with an output-encoding issue.
Of course, you will need a different escape — or parameterisation — for putting text in an SQL string.
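The difference between the two options from the question can be seen on a single made-up input string. strip_tags() changes what the user wrote and still leaves a raw ampersand that needs escaping, while htmlspecialchars() on output keeps the text as typed.

```php
<?php
$input = 'Wrap code in <script> tags & test it';

// Option 1: strip on input. The word the user wanted to mention is gone,
// and the remaining "&" would still need escaping when printed as HTML.
$stripped = strip_tags($input);

// Option 2: store $input unchanged and escape when printing it as HTML.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $stripped, "\n", $escaped, "\n";
```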

The measures taken to secure user input depend entirely on the context in which the data is being used. For instance:
If you're inserting it into a SQL database, you should use parameterized statements. PHP's mysql_real_escape_string() works decently as well (on current PHP versions use mysqli_real_escape_string(), since the mysql_* functions were removed in PHP 7).
If you're going to display it on an HTML page, then you need to strip or escape HTML tags.
In general, any time you're mixing user input with another form of mark-up or another language, that language's elements need to be escaped or stripped from the input before put into that context.
The last point above segues into the next point: Many feel that the original input should always be maintained. This makes a lot of sense when, later, you decide to use the data in a different way and, for instance, HTML tags aren't a big deal in the new context. Also, if your site is in some way compromised, you have a record of the exact input given.
Specifically related to HTML tags in user input intended for display on an HTML page: If there is any conceivable reason for a user to input HTML tags, then simply escape them. If not, strip them before display.
