I have a website which runs on PHP and a MySQL database. I was wondering how best to treat user input with regard to HTML encoding (I am well aware that I should store it as received and encode on output: that's what I do), and this cycle in particular:
user registers filling in a form with a username field, the content of the field is validated and sent and stored in the DB as is (no HTML encoding) as it will be required to output HTML, XML, JSON, plaintext and other formats;
on any page requiring the username to be shown, it will be fetched from the database, HTML-encoded and displayed in the page;
on a particular page the username is placed in the "value" field of an HTML text input: obviously this means that the username must be HTML-encoded (otherwise XSS and all those fantastic things...). However, this also means that if the original username was "però" the text field will be <input value="per&ograve;"> and when the user submits it the server will receive per&ograve; instead of però.
Now my question is: should the server decode all the received inputs so that per&ograve; gets decoded to the original però?
My doubt is that this would mean that if a user inputs &egrave; as his username it will be registered as è and not as he actually intended...
I know this is not such a big problem (I don't know of many users who would want to use HTML character entity literals in their usernames...), but it puzzled me and I could not find a completely satisfying solution.
Unless I've misunderstood what you're asking, you seem to have the wrong impression about the effect of outputting HTML encoded strings into text inputs. Here's a basic example of what will happen. Let's say you have a user who wants to be named PB&J. Sure, it's weird, but not everyone can pick a nice non-weird username like "Bonvi" or "Don't Panic".
So you save that in your database as is.
Later, when you're using it in another form, you escape it for output.
<input type="text" name="username" value="<?= htmlspecialchars($username) ?>">
In your page source, you'll see
<input type="text" name="username" value="PB&amp;J">
with the ampersand converted to an HTML entity. (Which is what you want, in case they really wanted to be named bob"><script>alert("però!")</script><p class="ha or something worse.)
But the value displayed in the text box will be PB&J, and when the user submits the form, the value in $_POST['username'] will be PB&J, not PB&amp;J. It will not be changed to the encoded value.
(I used htmlspecialchars in this example, but the same would apply with your example using però with htmlentities.)
I'm trying to explain it basically, so I apologize if I did misunderstand you - I don't intend to sound condescending.
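If it helps, here is a minimal sketch of that round trip. The html_entity_decode() call is only a rough stand-in for what the browser does when it parses the attribute value; it is an assumption made for illustration, not something your server needs to do.

```php
<?php
// Username stored in the database exactly as the user typed it.
$username = 'PB&J';

// Escape for output inside a double-quoted HTML attribute.
$escaped = htmlspecialchars($username, ENT_QUOTES, 'UTF-8');
$html = '<input type="text" name="username" value="' . $escaped . '">';

// Stand-in for the browser: character references in the attribute are
// decoded back to plain characters before the form is ever submitted.
$submitted = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');

echo $html, "\n";      // the page source contains PB&amp;J
echo $submitted, "\n"; // PB&J

// Same idea with htmlentities() and an accented character.
$encodedName = htmlentities('però', ENT_QUOTES, 'UTF-8');
$decodedName = html_entity_decode($encodedName, ENT_QUOTES, 'UTF-8');
```

So no decoding step is needed on the receiving side: what arrives in $_POST is already the plain text.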
I have text fetched from the db and output inside an input type text and a textarea for the user to edit; my question is, do I still need to use htmlentities?
It seems that code will not run inside an input type text or a textarea.
ex.
$data="<h1>efijfie</h1>";
<textarea><?PHP echo $data;?></textarea>
The short answer is yes you need to sanitize data displayed on the web that originates from database data that was input by your users. Here is some more high level information: http://diovo.com/2008/09/sanitizing-user-data-how-and-where-to-do-it/. But a good rule of thumb is to never trust data at any point that is provided by users. Never Ever. Ever.
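A rough sketch of why escaping still matters inside a textarea: markup in the value is not rendered there, but a closing </textarea> tag in the data ends the element early. The hostile value below is a made-up example, not something from the question.

```php
<?php
// Hypothetical value read from the database; the second half is what an
// attacker might have stored to break out of the textarea.
$data = '<h1>efijfie</h1></textarea><script>alert(1)</script>';

// Escaping turns every < > & and quote into a character reference,
// so the stored text can no longer close the element.
$safe = htmlspecialchars($data, ENT_QUOTES, 'UTF-8');

echo '<textarea>' . $safe . '</textarea>', "\n";
```

The user still sees and edits the original text, because the browser decodes the references when it fills the textarea.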
I have a form in which I want to enter elements that will later be assembled into HTML documents. What I am entering, and what I need to end up with, often includes things such as é, —, and related elements. The editor went and converted those for me, exactly what I'm trying to avoid! What I typed as examples were the HTML codes for a non-breaking space (ampersand-n-b-s-p-semicolon), the letter e with an acute accent (ampersand-e-a-c-u-t-e-semicolon), and an em-dash (ampersand-m-d-a-s-h-semicolon).
I need to have those strings preserved. I want them saved in the database, which they are once, but when I resubmit the page with 20 or so fields on it, because I've made a change to some other field, then I end up with the code being rendered. But I don't want it rendered, I want it preserved so that the browser will render my final document correctly. After submitting it a second or third time, I invariably end up with garbage where my entities had been.
I've tried mysql_real_escape_string(), htmlentities(), htmlspecialchars(), even html_entity_decode(htmlentities()) and nothing works. I end up with various levels of nonsense.
I do not need the system to take an em-dash or an accented character and turn it into the entity, although that wouldn't hurt. I just want it to preserve the codes that I've put in.
How do I do this? (And why is it so much work?)
Van
Here's the form field:
<textarea name="qih_quote" cols="75" rows="5" wrap="soft"><?php echo $s['qih_quote'];?></textarea>
Here's the line in the submit script that reads that:
$qih_quote = $_POST['qih_quote'];
I've wrapped the $_POST variable in just about everything I can think of as mentioned above. All I want is for the exact string that I put in that textarea to be saved in the table, to be displayed in the textarea when I come back to it, and to be saved to the table again without any modifications at any time.
Try to ensure you have the correct collation in the MySQL table you are saving the data in so that the special characters are preserved, such as utf8_general_ci, which should handle unicode.
Then try using htmlspecialchars() when saving the data into the database and htmlspecialchars_decode() when reading the data.
Okay, the issue was in the form textarea and I needed to encode HTML entities there. This is the final solution:
<textarea name="qih_quote" cols="75" rows="5" wrap="soft"><?php echo htmlspecialchars ($s['qih_quote'], ENT_QUOTES);?></textarea>
Van
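For anyone wondering why that works, here is a small sketch of the two round trips. The sample string is invented, and html_entity_decode() is used only to imitate the browser decoding the textarea contents before resubmitting them.

```php
<?php
// Text the author wants preserved literally, entities included.
$stored = 'caf&eacute; &mdash; d&eacute;j&agrave; vu';

// Unescaped echo: the browser decodes the entities, so the form
// resubmits real characters instead of the original codes.
$roundTripRaw = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');

// Escaped echo: &eacute; is sent as &amp;eacute;, the browser shows
// and resubmits the literal text &eacute;.
$escaped = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
$roundTripEscaped = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');

echo $roundTripRaw, "\n";     // entities have been turned into characters
echo $roundTripEscaped, "\n"; // identical to $stored
```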
I think anyone with this issue might want to have a look at html_entity_decode instead of htmlspecialchars. The former renders ALL html entities as strings, whereas the latter only works on a small subset, at least according to the documentation I read.
I am importing data into a database to a text field. However when I try to input
<strong> Hi There </strong>
I find it in the table (using php myadmin) as
"&lt;strong&gt; Hi There &lt;/strong&gt;"
That displays it on my front webpage as
<strong> Hi There </strong>
Clearly not the desired result.
Any ideas here? I am using a regular text form.
When you are entering the data, it is probably being scrubbed - likely with htmlspecialchars() or htmlentities()
To decode the tags, use html_entity_decode()
http://php.net/manual/en/function.html-entity-decode.php
Yeah. What's happening here is simple encoding, so that the stored form is safe. Before displaying it on the webpage, pass it through the PHP builtin html_entity_decode().
Note that if this didn't happen, it would be very easy for someone to input their own HTML to a field that shouldn't have HTML (like username) and they could then modify your website.
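A quick sketch of the decode step, assuming the stored value really was run through htmlspecialchars() or htmlentities() on the way in. Only do this for fields where HTML from that author is meant to be rendered; for anything else it reopens the hole described above.

```php
<?php
// What the table holds after the input was encoded on the way in.
$stored = '&lt;strong&gt; Hi There &lt;/strong&gt;';

// Decode once, just before printing, so the browser sees real tags.
$decoded = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n"; // <strong> Hi There </strong>
```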
When handling different user inputs that are held in the database or displayed back within your content, you should always be aware of XSS attacks. Better safe than sorry...
Usernames:
Check the minimum and maximum length, allow no characters outside the ASCII range and strictly no HTML or special characters like <>;'"%, and trim spaces from the start and end. If outputting to a form, always use htmlspecialchars().
Passwords:
Check the minimum and maximum length, and make them more secure by requiring at least one capital letter and one alpha character. Always hash them before saving to the database (for example with password_hash()), and don't use md5. If outputting to a form, always use htmlspecialchars() if not using the type="password" attribute.
Emails:
Check that it is a valid email address.
Main Comments,Posts Submission areas:
Strip all javascript, html and/or allow user to insert BBcode if needed for images, links, formatting then convert the BBcode to valid html when displaying.
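A minimal sketch of the username, password and email checks, under assumptions of my own: the 3 to 20 character limit, the allowed character set and the function names are all hypothetical, so adjust them to your own policy.

```php
<?php
// Hypothetical policy: 3-20 ASCII letters, digits, underscores or hyphens.
function validate_username(string $raw): ?string {
    $name = trim($raw);
    return preg_match('/\A[A-Za-z0-9_-]{3,20}\z/', $name) === 1 ? $name : null;
}

// filter_var() returns the address on success and false on failure.
function validate_email(string $raw): ?string {
    $email = filter_var(trim($raw), FILTER_VALIDATE_EMAIL);
    return $email === false ? null : $email;
}

// Passwords are hashed, never stored in a reversible form.
$hash = password_hash('Correct-Horse-1', PASSWORD_DEFAULT);
$ok = password_verify('Correct-Horse-1', $hash);
```

Each validator returns the cleaned value or null, so the caller can show an error instead of silently storing bad input.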
I need to use a wysiwyg editor for handling user input.
How do you process this in php?
If I retrieve the data and use htmlspecialchars then all the characters that were converted to special characters by the wysiwyg editor will be messed up.
For example, a quote will be &quot;
When I use htmlspecialchars in PHP, the & will be converted to &amp;
It will be an obvious problem. Any ideas?
Have you considered keeping a plain-text and an additional HTML record of whatever is being modified? You can display the plain text, and when you save it you could also convert it to HTML and save that in a separate field.
If special chars are being converted to HTML though, wouldn't they still appear properly (to the user) when you are printing text out to editable form fields in html?
Let me know if I've misunderstood
Most editors (CKEditor, CLEditor and NicEdit, to mention a few) support two modes of input: visual and direct input (usually called HTML mode).
When the user is entering text in visual mode, the editor takes care of converting html-like characters to the respective HTML entity while the user is typing his/her content. In this mode, the editor will typically add markup for the user (mostly paragraphs).
Direct input works like you'd expect from the name; The user is exposed to the HTML his or her content is made up of.
How you should handle the input data depends mostly on the user's role.
If the user is trusted (i.e. an administrator for a company website), the user should be able to use both input modes.
If the user is untrusted (an anonymous user posting a comment on a blog post), the user should not be able to input (potentially malicious, think XSS) markup.
If your users need some options for formatting their content, you should probably look into using another type of markup, e.g. BBCode. This prevents the user from injecting any <script> tags into the content that might be shown to other users.
You will still need to strip any HTML tags from the user content though.
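As a rough sketch of the BBCode route: the function name and the two-tag whitelist are hypothetical, and note that this version escapes raw HTML rather than stripping it, which is a substitution on my part that gives the same protection while keeping what the user typed visible.

```php
<?php
// Hypothetical helper: escape everything, then translate a tiny
// whitelist of BBCode tags into real markup.
function render_bbcode(string $input): string {
    // Any raw HTML the user typed becomes inert text here.
    $html = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Only these two tags are recognised in this sketch.
    $rules = [
        '#\[b\](.*?)\[/b\]#s' => '<strong>$1</strong>',
        '#\[i\](.*?)\[/i\]#s' => '<em>$1</em>',
    ];
    foreach ($rules as $pattern => $replacement) {
        // preg_replace() returns null on error; keep the escaped text then.
        $html = preg_replace($pattern, $replacement, $html) ?? $html;
    }
    return $html;
}

echo render_bbcode('[b]hi[/b] <script>alert(1)</script>'), "\n";
```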
I would like to clarify what is the proper way to filter user input with php. For example I have a web form that a user enters information into. When submitted the data from the form will be entered into a database.
My understanding is you don't want to sanitize the data going into the database, except for escaping it such as mysql_escape_string, you want to sanitize it when displaying it on the front end with something like htmlentities or htmlspecialchars. However if you want you can validate/filter the user input when they submit the form to make sure the data is in the proper format such as if a field is for an email address you want to validate that it has the proper email format. Is that correct?
My next question is: what do you do with the data when you re-display it in a web form? Let's say the user is allowed to edit the information in that form after they filled it out and the information was added to the database. They then go back in and see the data in the fields they originally entered. Do you have to sanitize the data for it to show correctly in the form fields? For example, there is a field called My Title, and the person enters My title is "Manager". You see the quotations around manager; when you display it as is in the form field, it breaks because of the quotations:
<input type="text" name="title" value="My title is "Manager"">
So don't you have to do something like htmlentities to turn the quotations into its html entities? Otherwise the value of the field would look like My title is
Hope this makes sense.
Nothing says you can't sanitize data before database insertion. After all, if your script/site/company has a certain policy regarding what's acceptable in a form field, it's best to strip out anything that's not allowed before saving it. That way you only sanitize once, before data insertion/update, rather than EVERY TIME you retrieve the data.
If you allow HTML entities for (say) accented characters, but not HTML tags, then you have to check for both invalid entities (&foobar;?) and HTML tags. Since you don't allow them, don't bother storing them. If you require a valid email address, then check that it's RFC 5322 compliant and only store it once the user has entered proper data. (Whether that email address actually exists is another matter.)
Now, let's get one thing straight. There's a difference between sanitization and escaping. Sanitization means literally to clean up - you're removing anything you don't want from the data. You can either silently drop it, or present an error to the user and tell them to fix it. On the other hand, escaping is just a means of encoding data so it's displayed properly.
With your My title is "Manager" string, you don't need to sanitize it, as there's nothing really wrong or offensive about it. What you do need to do is escape it, with at least htmlspecialchars(), so that the embedded double quotes don't "break" your form. If you embed it verbatim, most browsers will see it as having value="My title is" and some bogus attribute/garbage Manager"". So, you run it through htmlspecialchars and end up with My title is &quot;Manager&quot;, which embeds into the value="" perfectly with no trouble. No sanitization, just proper encoding.
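In code, that escaping step looks roughly like this (the variable names are just for illustration):

```php
<?php
// The title exactly as the user entered it and as it sits in the database.
$title = 'My title is "Manager"';

// Escape for a double-quoted attribute; ENT_QUOTES also covers single quotes.
$escaped = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
$field = '<input type="text" name="title" value="' . $escaped . '">';

echo $field, "\n";
```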
Now, when that form is submitted, then you do have to sanitize/validate again, as the data's been in the hands of a potentially malicious user, and the data could have been changed to My title is <script>document.location='http://attacksite.com';</script>pwn me.
Basically, the workflow should be:
1. present the form to the user
2. get the submitted data
3. sanitize the data
4. if the form is not correctly filled out, display errors and go to 1
5. escape the data for the SQL query
6. insert into the database
then later
1. retrieve the data from the database
2. escape/encode as appropriate for however it will be displayed
3. display the data. If the data is going into a form, do 1-6 as before.
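The workflow above can be sketched as follows. Everything named here is an assumption for illustration: the function names, the 100-byte limit, and the users table with its title column. A PDO prepared statement stands in for manual SQL escaping.

```php
<?php
// Sanitize and validate a submitted title (hypothetical policy:
// no tags, 1 to 100 bytes). Returns null when the input is rejected.
function clean_title(string $raw): ?string {
    $title = trim(strip_tags($raw));
    $len = strlen($title);
    return ($len >= 1 && $len <= 100) ? $title : null;
}

// Store the value; bound parameters keep it out of the SQL text.
// Table and column names are made up for this sketch.
function save_title(PDO $db, int $userId, string $title): void {
    $stmt = $db->prepare('UPDATE users SET title = :title WHERE id = :id');
    $stmt->execute([':title' => $title, ':id' => $userId]);
}

// Later: escape for the context the value is printed into.
function title_field(string $title): string {
    return '<input type="text" name="title" value="'
        . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . '">';
}

$title = clean_title('  My title is "Manager"  ');
if ($title === null) {
    echo "Please enter a title.\n"; // back to showing the form
} else {
    echo title_field($title), "\n";
}
```

Note that strip_tags() removes the tags but keeps the text between them, so pair it with whatever further validation your policy requires.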