Preserving HTML entities in PHP/MySQL - php

I have a form in which I want to enter elements that will later be assembled into HTML documents. What I am entering, and what I need to end up with, often includes things such as é, —, and related elements. The editor went and converted those for me, exactly what I'm trying to avoid! What I typed as examples were the HTML codes for a non-breaking space (ampersand-n-b-s-p-semicolon), the letter e with an acute accent (ampersand-e-a-c-u-t-e-semicolon), and an em-dash (ampersand-m-d-a-s-h-semicolon).
I need to have those strings preserved. I want them saved in the database, which they are once, but when I resubmit the page with 20 or so fields on it, because I've made a change to some other field, then I end up with the code being rendered. But I don't want it rendered, I want it preserved so that the browser will render my final document correctly. After submitting it a second or third time, I invariably end up with garbage where my entities had been.
I've tried mysql_real_escape_string(), htmlentities(), htmlspecialchars(), even html_entity_decode(htmlentities()) and nothing works. I end up with various levels of nonsense.
I do not need the system to take an em-dash or an accented character and turn it into the entity, although that wouldn't hurt. I just want it to preserve the codes that I've put in.
How do I do this? (And why is it so much work?)
Van
Here's the form field:
<textarea name="qih_quote" cols="75" rows="5" wrap="soft"><?php echo $s['qih_quote'];?></textarea>
Here's the line in the submit script that reads that:
$qih_quote = $_POST['qih_quote'];
I've wrapped the $_POST variable in just about everything I can think of as mentioned above. All I want is for the exact string that I put in that textarea to be saved in the table, to be displayed in the textarea when I come back to it, and to be saved to the table again without any modifications at any time.

Make sure the MySQL table you are saving the data in uses a suitable collation, such as utf8_general_ci, so that special (Unicode) characters are preserved.
Then try using htmlspecialchars() when saving the data into the database and htmlspecialchars_decode() when reading the data.

Okay, the issue was in the form textarea and I needed to encode HTML entities there. This is the final solution:
<textarea name="qih_quote" cols="75" rows="5" wrap="soft"><?php echo htmlspecialchars ($s['qih_quote'], ENT_QUOTES);?></textarea>
Van
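For anyone wondering why escaping at output time preserves the typed entities, here is a minimal sketch. The sample string and variable names are made up for illustration, and html_entity_decode() only stands in for the single level of decoding a browser performs when it parses the textarea content.

```php
<?php
// What the user typed and what sits in the database, unchanged.
$stored = 'caf&eacute; &mdash; ok';

// Escape only at output time, when placing the value inside the textarea.
$inTextarea = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
echo $inTextarea, "\n"; // caf&amp;eacute; &amp;mdash; ok

// The browser undoes one level of entities when it parses the page, so the
// textarea shows (and later resubmits) the original string.
$resubmitted = html_entity_decode($inTextarea, ENT_QUOTES, 'UTF-8');
var_dump($resubmitted === $stored); // bool(true)
```

Without the htmlspecialchars() call, the browser would decode &eacute; itself, and the next submit would send the rendered character instead of the entity.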

I think anyone with this issue might want to have a look at html_entity_decode instead of htmlspecialchars. The former renders ALL html entities as strings, whereas the latter only works on a small subset, at least according to the documentation I read.

Related

Is it safe to unescape ampersand for user input?

After a few hours of bug searching, I found out the cause of one of my most annoying bugs.
When users are typing out a message on my site, they can title it with plaintext and html entities.
This means in some instances, users will type a title with common html entity pictures like this face. ( ͡° ͜ʖ ͡°).
To prevent HTML injection, I use htmlspecialchars() on the title, and annoyingly it would convert the picture to its HTML entity format when output onto the page later on.
( ͡° ͜ʖ ͡°)
I realized the problem here was that the title was being encoded as in the example above, and htmlspecialchars(), as well as doing what I wanted and encoding possible HTML injection, was turning the ampersand in the entities into &amp;.
By un-escaping all the ampersands, and changing them back to & this fixed my problem and the face would come out as expected.
However, I am unsure whether this is still safe from malicious HTML. Is it safe to decode the ampersands in user-inputted titles? If not, how can I go about fixing this issue?
If your entities are displayed as text, then you're probably calling htmlspecialchars() twice.
If you are not calling htmlspecialchars() twice explicitly, then it's probably a browser-side auto-escaping that may occur if the page containing the form is using an obsolete single-byte encoding like Windows-1252. Such automatic escaping is the only way to correctly represent characters not present in character set of the specific single-byte encoding. All current browsers (including Firefox, Opera, and IE) do this.
Make sure you are using Unicode (UTF-8 in particular) encoding.
To use Unicode as the encoding, add the <meta charset="utf-8" /> element to the HEAD section of the HTML page that contains the form. And don't forget to save the HTML page itself in UTF-8 encoding. To use Unicode in PHP, it's typically enough to use the multibyte (mb_ prefixed) string functions. Finally, database engines like MySQL have supported UTF-8 for a long time.
As a temporary workaround, you can disable reencoding existing entities by setting 4th parameter ($double_encode) of the htmlspecialchars() function to false.
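A small sketch of what that 4th parameter changes. The sample title is invented for illustration:

```php
<?php
$title = 'Tom &amp; Jerry <b>';

// Default behaviour: the existing entity is encoded a second time.
$default = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
echo $default, "\n"; // Tom &amp;amp; Jerry &lt;b&gt;

// With $double_encode = false, existing entities are left alone,
// while the raw < and > are still escaped.
$single = htmlspecialchars($title, ENT_QUOTES, 'UTF-8', false);
echo $single, "\n"; // Tom &amp; Jerry &lt;b&gt;
```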
There is no straight answer. You may unescape &lt;script...&gt; into <script...> and end up in trouble. However, it looks like the text has been double encoded, probably once on input and then again when you output it to the screen. If you can guarantee it has been double encoded, then it should be safe to undo one of those.
However, the best solution is to keep the "raw" value in memory, and sanitize/encode for outputting into databases, html, JSON etc.
So - when you get input, sanitise it for anything you don't want, but don't actually convert it into HTML or escape it or anything else at this stage. Escape it into a database, html encode it when output to screen / xml etc.
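As a rough sketch of "keep it raw, encode per output", assuming a made-up title value (for the database side, a prepared statement would play the same role):

```php
<?php
// Raw value as received; kept unescaped in memory (sample input).
$raw = 'Tom & "Jerry" <script>';

// Encode separately for each output context, at the point of output.
$forHtml = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
$forJson = json_encode(['title' => $raw]);

echo $forHtml, "\n"; // Tom &amp; &quot;Jerry&quot; &lt;script&gt;
echo $forJson, "\n";

// The JSON form decodes back to the untouched raw value.
var_dump(json_decode($forJson, true)['title'] === $raw); // bool(true)
```

Because $raw itself is never modified, nothing gets encoded twice no matter how many outputs are added later.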

real_escape_string not cleaning up entered text

I thought the proper way to "sanitize" incoming data from an HTML form before entering it into a mySQL database was to use real_escape_string on it in the PHP script, like this:
$newsStoryHeadline = $_POST['newsStoryHeadline'];
$newsStoryHeadline = $mysqli->real_escape_string($newsStoryHeadline);
$storyDate = $_POST['storyDate'];
$storyDate = $mysqli->real_escape_string($storyDate);
$storySource = $_POST['storySource'];
$storySource = $mysqli->real_escape_string($storySource);
// etc.
And once that's done you could just insert the data to the DB like this:
$mysqli->query("INSERT INTO NewsStoriesTable (Headline, Date, DateAdded, Source, StoryCopy) VALUES ('".$newsStoryHeadline."', '".$storyDate."', '".$dateAdded."', '".$storySource."', '".$storyText."')");
So I thought doing this would take care of cleaning up all the invisible "junk" characters that may be coming in with your submitted text.
However, I just pasted some text I copied from a web-page into my HTML form, clicked "submit" - which ran the above script and inserted that text into my DB - but when I read that text back from the DB, I discovered that this piece of text did still have junk characters in it, such as â€“.
And those junk characters of course caused the PHP script I wrote that retrieves the information from the DB to crash.
So what am I doing wrong?
Is using real_escape_string not the way to go here? Or should I be using it in conjunction with something else?
OR, is there something I should be doing (like more escaping) when reading data back out from the mySQL database?
(I should mention that I'm an Objective-C developer, not a PHP/mySQL developer, but I've unfortunately been given this task to do some DB stuff - hence my question...)
thanks!
Your assumption is wrong. mysqli_real_escape_string’s only intention is to escape certain characters so that the resulting string can be safely used in a MySQL string literal. That’s it, nothing more, nothing less.
The result should be that exactly the passed data is retained, including ‘junk’. If you don’t want that ‘junk’ in your database, you need to detect, validate, or filter it before passing it to MySQL.
In your case, the ‘junk’ seems to be due to different character encodings: your input data seems to be encoded in UTF-8 while it’s later displayed using Windows-1250. In this scenario, the character – (U+2013) would be encoded as 0xE28093 in UTF-8, which would represent the three characters â, €, and “ in Windows-1250. Properly declaring the document’s encoding would probably fix this.
Sanitization is a tricky subject, because it means different things depending on the context. :)
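The byte arithmetic can be checked with a short sketch. It assumes PHP 7+ (for the "\u{...}" syntax) and that the iconv extension is available and knows the charset name 'CP1250'; both are assumptions about your setup:

```php
<?php
$enDash = "\u{2013}";          // EN DASH, held as UTF-8 bytes
echo bin2hex($enDash), "\n";   // e28093

// Misreading those three bytes as Windows-1250 and re-encoding the result
// as UTF-8 reproduces the three 'junk' characters.
$misread = iconv('CP1250', 'UTF-8', $enDash);
var_dump($misread === "\u{00E2}\u{20AC}\u{201C}");
```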
real_escape_string just makes sure your data can be included in a request (inside quotes, of course) without having the possibility to change the "meaning" of the request.
The manual page explains what the function really does: it escapes NUL characters, line feeds, carriage returns, backslashes, single quotes, double quotes, and "Control-Z" (ASCII 26, the SUBSTITUTE character). So it just inserts a backslash before those characters.
That's it. It "sanitizes" the string so it can be passed unchanged in a request. But it doesn't sanitize it under any other point of view: users can still pass for instance HTML markers, or "strange" characters. You need to make rules depending on what your output format is (most of the time HTML, but HTTP isn't restricted to HTML documents), and what you want to let your users do.
If your code can't handle some characters, or if they have a special meaning in the output format, or if they cause your output to appear "corrupted" in some way, you need to escape or remove them yourself.
You will probably be interested in htmlspecialchars. Control characters generally aren't a problem with HTML. If your output encoding is the same as your input encoding, they won't be displayed and thus won't be an issue for your users (well, maybe for the W3C validator). If you think it is, make your own function to check and remove them.
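If you do decide to remove control characters, a helper along these lines is one possible sketch. The function name and the choice to keep tab, line feed, and carriage return are my assumptions, not a standard API:

```php
<?php
// Hypothetical helper: remove ASCII control characters, keeping tab (0x09),
// line feed (0x0A) and carriage return (0x0D). Bytes >= 0x80 are untouched,
// so UTF-8 multi-byte characters survive.
function strip_control_chars(string $text): string
{
    return (string) preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $text);
}

$clean = strip_control_chars("Head\x00line\x1A\twith\ntext");
var_dump($clean === "Headline\twith\ntext"); // bool(true)
```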

CodeIgniter 2.0 input library character issue

I am using codeigniter in an app. There is a form. In the textarea element, I wrote something including
%Features%
However, when I want to echo this by $this->input->post(key), I get something like
�atures%
The '%Fe' has vanished.
In main index.php file of CI, I tried var_dump($_POST) and I see the above word is fully ok. but when I am fetching it with the input library (xss filtering is on) I get the problem.
When XSS filtering is off, it appears OK initially. However, if I store it in the database and show it the next time, I see the same problem (even with XSS filtering off).
%Fe happens to look like a URL-encoded sequence %FE, representing character 254. It's being munched into the Unicode "I have no idea what that sequence means" glyph, �.
It's clear that the "XSS filter" is being over-zealous when decoding the field on submission.
It's also very likely that a URL-decode is being run again at some point later in the process, when you output the result from the database. Check the database to make sure that the actual string is being represented properly.
First: escape the variables before storing them in the database. % has a special meaning in SQL (it is a wildcard in LIKE patterns).
Second: % also has a special meaning in URLs, e.g. %20 is a space. %FE will map to some character, which will be decoded by input().
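The effect of an unwanted URL-decode on this input can be reproduced in plain PHP. This is only a sketch of the mechanism; it does not show what CodeIgniter's filter does internally:

```php
<?php
$input = '%Features%';

// "%Fe" is read as a percent-encoded byte; the trailing "%" has no hex
// digits after it and is left alone.
$decoded = rawurldecode($input);
echo bin2hex(substr($decoded, 0, 1)), "\n"; // fe
echo substr($decoded, 1), "\n";             // atures%

// A lone 0xFE byte is not valid UTF-8, hence the replacement glyph.
var_dump(preg_match('//u', $decoded)); // bool(false)
```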

Preserve user comments format in database

Hi,
I am creating a comments form where users' comments will be stored in a MySQL database. The problem I am facing is that each comment is stored as a single line in the database. It should be stored in exactly the format the user entered in the form (new lines and everything). I am using PHP to store it in the MySQL database.
First store it as text or longtext. Second, when showing the comment, use a function like nl2br to convert newlines to html <br> elements. This way, linebreaks are preserved.
Your text is stored just fine in the database if you are putting it into a long enough text-type field (e.g. TEXT), including the newlines in the user input.
Your problem is how to display the text formatted the way the user was seeing it when entering it. This is a more generic problem, and it only has to do with how HTML treats whitespace.
One approach would be to call nl2br on the comments, as Ikke says. This would replace all newlines (which the browser disregards) with <br> tags which have a visible effect on the rendered output.
Another option would be to put the text inside a <pre>...</pre> tag. This will force the browser to render it with whitespace preserved.
It's really up to what's more convenient/suitable for you.
Update: Just to be clear: do not modify the user input before inserting it in the database (unless it's part of your input validation, like e.g. stripping HTML tags from the input). Store it in an "untouched" format, and only do some processing on it before you output the data. This way, you always have the option of performing the correct processing if your output channel changes in the future (e.g. export comments to a text file vs displaying them as HTML).
You can store the comments as they are in the MySQL database. The one difference is that when you retrieve comments that contain new lines, your code should look for \r\n and interpret it. Also, when you insert the data into MySQL, you will have to escape the ' and \ characters in the comment.
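Putting the answers above together, a minimal output-side sketch might look like this. The sample comment is made up, and the stored value is assumed to be the untouched user input:

```php
<?php
// Comment exactly as stored in the database, newlines included.
$comment = "First line\r\nSecond <b>line</b>";

// Escape first, then convert newlines; doing it the other way round would
// escape the <br /> tags too.
$html = nl2br(htmlspecialchars($comment, ENT_QUOTES, 'UTF-8'));
echo $html, "\n";
// First line<br />
// Second &lt;b&gt;line&lt;/b&gt;
```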

What's the best practice method for storing raw php, javascript, html or similar in a mysql database?

The example web page has 2 fields and allows a user to enter a title and code. Both fields would later be embed and displayed in an HTML page for viewing and/or editing but not execution. In other words, any PHP or javascript or similar should not run but be displayed for editing and copying.
In this case, what is the best way to escape these fields before database insertion and after (for HTML display)
You need to use the function htmlspecialchars() in PHP.
That will change any special characters (e.g. < and >) into their HTML-encoded equivalents (e.g. &lt; and &gt;). When you get these from the database and output them as HTML, they will display as code, but won't harm your script or execute.
I faced the same problem a few days back. To put the code (JavaScript or PHP) in the HTML in a non-executable way, I used a textarea, and it served the purpose.
The problem, however, was with the database. I could not use the typical escape functions on the data, as they were affecting my data; for example, the tags were getting messed up.
To solve this problem, I encoded the data in base64 format before putting it in the database. So my JavaScript code is encoded, the resulting string is no longer JavaScript code, and I can use the escape functions on it and store it in the database.
I am open to suggestions, feel free to comment.
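One suggestion, as a sketch rather than a definitive recipe: a prepared statement stores the code byte for byte without base64, and htmlspecialchars() is applied only when displaying it. The example below assumes the pdo_sqlite extension is available and uses an in-memory SQLite database as a stand-in for MySQL; the table and column names are hypothetical.

```php
<?php
// Assumption: pdo_sqlite is installed; SQLite stands in for MySQL here.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE snippets (title TEXT, code TEXT)');

// Sample user-submitted code containing quotes and tags.
$code = '<script>alert("it\'s")</script>';

// Bound parameters are sent as data, so no manual escaping is needed.
$insert = $pdo->prepare('INSERT INTO snippets (title, code) VALUES (?, ?)');
$insert->execute(['Demo', $code]);

$stored = $pdo->query('SELECT code FROM snippets')->fetchColumn();
var_dump($stored === $code); // bool(true)

// Escape only when displaying, so the code is shown but never executed.
$display = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
echo '<pre>', $display, '</pre>', "\n";
```

With MySQL the same pattern should apply through PDO or mysqli prepared statements; only the connection line would differ.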
