HTML Purifier selectively eating special characters


Using PHP against a UTF-8 compliant database. Here's how input goes in.
user types input into textarea
textarea encoded with javascript escape()
passed via HTTP post
decoded with PHP rawurldecode()
passed through HTMLPurifier with default settings
escaped for MySQL and stored in database
It comes out in the usual way, and I run unescape() on page load. This is to allow people to, say, copy and paste directly from a Word document and have the smart quotes show up.
But HTMLPurifier seems to be clobbering non-UTF-8 special characters, ones that escape() to a simple % expression, like Ö, which escapes to %D6, whereas smartquotes escape to %u2024 or something and go into the database that way. It takes out both the special character and the one immediately following.
I need to change something in this process. Perhaps I need to change multiple things.
What can I do to not get special characters clobbered?

textarea encoded with javascript escape()
escape() isn't safe for non-ASCII characters. Use encodeURIComponent() instead.
passed via HTTP post
I assume that you use XMLHttpRequest? If not, make sure that the page containing the form is served as UTF-8.
decoded with PHP rawurldecode()
If you access the value through $_POST, you should not decode it, since that has already been done. Doing so will mess up data.
escaped for MySQL and stored in database
Make sure magic quotes are turned off. Make sure the database stores tables as UTF-8 (both the character set and the collation must be UTF-8). Make sure the connection between PHP and MySQL is UTF-8 (use SET NAMES utf8 if you don't use PDO).
Finally, make sure that the page is served as utf-8 when you output the string again.
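To illustrate why the %D6 form gets clobbered, here is a minimal sketch (not the original poster's code) comparing what rawurldecode() produces from the output of JavaScript escape() and of encodeURIComponent() for Ö. The helper name is_valid_utf8 is made up for this example; it relies on preg_match() failing on invalid UTF-8 when the u modifier is set.

```php
<?php
// Hypothetical helper: true when $s is well-formed UTF-8.
// preg_match() with the /u modifier returns false on invalid UTF-8.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// escape('Ö') in JavaScript yields "%D6": a single Latin-1 byte.
$fromEscape = rawurldecode('%D6');

// encodeURIComponent('Ö') yields "%C3%96": the two-byte UTF-8 sequence.
$fromEncodeUriComponent = rawurldecode('%C3%96');

var_dump(is_valid_utf8($fromEscape));             // bool(false)
var_dump(is_valid_utf8($fromEncodeUriComponent)); // bool(true)
```

A lone 0xD6 byte is a UTF-8 lead byte without its continuation byte, which is consistent with a UTF-8 cleaner discarding it together with the character that follows.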

Related

Is it safe to unescape ampersand for user input?

After a few hours of bug searching, I found out the cause of one of my most annoying bugs.
When users are typing out a message on my site, they can title it with plaintext and html entities.
This means in some instances, users will type a title with common html entity pictures like this face. ( ͡° ͜ʖ ͡°).
To prevent HTML injection, I use htmlspecialchars() on the title, and annoyingly it would convert the picture to its HTML entity format when output onto the page later on.
( &#865;° &#860;&#662; &#865;°)
I realized the problem here was that the title was being encoded as in the example above, and htmlspecialchars(), as well as doing what I wanted and encoding possible HTML injection, was turning the ampersand in the entities into &amp;.
By un-escaping all the ampersands and changing them back to &, I fixed my problem and the face came out as expected.
However, I am unsure whether this is still safe from malicious HTML. Is it safe to decode the ampersands in user-entered titles? If not, how can I go about fixing this issue?

If your entities are displayed as text, then you're probably calling htmlspecialchars() twice.
If you are not calling htmlspecialchars() twice explicitly, then it's probably a browser-side auto-escaping that may occur if the page containing the form is using an obsolete single-byte encoding like Windows-1252. Such automatic escaping is the only way to correctly represent characters not present in character set of the specific single-byte encoding. All current browsers (including Firefox, Opera, and IE) do this.
Make sure you are using Unicode (UTF-8 in particular) encoding.
To use Unicode as the encoding, add the <meta charset="utf-8" /> element to the HEAD section of the HTML page that contains the form, and don't forget to save the HTML page itself in UTF-8. To use Unicode in PHP, it's typically enough to use the multibyte (mb_-prefixed) string functions. Finally, database engines like MySQL have supported UTF-8 for a long time.
As a temporary workaround, you can disable reencoding existing entities by setting 4th parameter ($double_encode) of the htmlspecialchars() function to false.
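As a rough illustration of that workaround, the sketch below runs the same title through htmlspecialchars() with and without $double_encode. The sample title is an assumption: it stands for input in which the browser has already turned a character into a numeric entity.

```php
<?php
// A title as it might arrive when the browser has already turned
// an unrepresentable character into a numeric entity (assumed input).
$title = '&#865; <script>alert(1)</script>';

// Default behaviour: the ampersand of the existing entity is escaped again.
$default = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

// With $double_encode = false, existing entities are left alone,
// while <, > and quotes are still escaped.
$noDouble = htmlspecialchars($title, ENT_QUOTES, 'UTF-8', false);

echo $default, "\n";  // &amp;#865; &lt;script&gt;alert(1)&lt;/script&gt;
echo $noDouble, "\n"; // &#865; &lt;script&gt;alert(1)&lt;/script&gt;
```

In both cases the script tag is neutralised; only the treatment of the existing entity differs.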

There is no straight answer. You may unescape &lt;script...&gt; into <script...> and end up in trouble. However, it looks like the text has been double-encoded, probably once on input and again when you output it to the screen. If you can guarantee it has been double-encoded, then it should be safe to undo one of those encodings.
However, the best solution is to keep the "raw" value in memory, and sanitize/encode for outputting into databases, html, JSON etc.
So - when you get input, sanitise it for anything you don't want, but don't actually convert it into HTML or escape it or anything else at this stage. Escape it into a database, html encode it when output to screen / xml etc.

Convert special character to html number/name and again to special character in php

I am facing a problem with storing special characters in database and retrieving again as symbol.
For example, I have a string like Côte d'Ivoire
What I want to do is convert the special character ô to its HTML number &#244; or name &ocirc;, and at retrieval time convert the HTML entity back to the special symbol.
I also need to pass this string as JSON response of a web service.
I tried some php functions like htmlspecialchars() and htmlspecialchars_decode() but not getting the desired output.
Any help will be appreciated. If there is any other way to do it then it will also be very helpful.
Thanks in advance

You can use the htmlentities function to transform the special characters.
json_encode requires UTF-8 input, so if your data is ISO-8859-1 you can convert it with utf8_encode before encoding. (utf8_encode is deprecated as of PHP 8.2; mb_convert_encoding is the usual replacement.)
http://php.net/manual/en/function.htmlentities.php
http://php.net/manual/en/function.utf8-encode.php
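A minimal sketch of that round trip, assuming the string is already UTF-8 (so no utf8_encode step is needed). The array key country is made up for the example.

```php
<?php
$name = "Côte d'Ivoire"; // assumed to be UTF-8, like this source file

// Replace every character that has an HTML entity with that entity.
$encoded = htmlentities($name, ENT_QUOTES, 'UTF-8');
echo $encoded, "\n"; // C&ocirc;te d&#039;Ivoire

// Convert the entities back to the original characters.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

// json_encode() expects UTF-8 and escapes non-ASCII as \uXXXX by default.
$json = json_encode(['country' => $decoded]);
echo $json, "\n"; // {"country":"C\u00f4te d'Ivoire"}
```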

Use 'utf8_unicode_ci' as the collation when saving data to the database, and retrieve the data in the usual way. Also check that the exact data is actually being saved in the database.

This problem is much easier to solve, when you use UTF-8 for the whole site including your database. Escaping should be done as late as possible and only for the needed target system.
An example:
Your HTML page is UTF-8 encoded, so the user input you receive is also UTF-8. You can store this value in the database as it is; just use prepared statements or call mysqli_real_escape_string() before building the SQL string. This escapes the input only to make it safe for SQL statements, and the database will contain the original user input.
When you read the value back from the database you get the original UTF-8 input, and you can then call htmlspecialchars() to escape it for display in HTML output. I wrote a small article about using UTF-8 for the whole site; you can find more information there.
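The flow described above could look roughly like this. It is only a sketch: the table and column names (countries, name) and both function names are made up, and error handling is left out.

```php
<?php
// Store the raw UTF-8 value. The prepared statement keeps the data
// out of the SQL text, so no manual escaping is needed.
// Table and column names are hypothetical.
function save_country(mysqli $db, string $name): bool
{
    $stmt = $db->prepare('INSERT INTO countries (name) VALUES (?)');
    $stmt->bind_param('s', $name);
    return $stmt->execute();
}

// Escape only at output time, and only for the target format (HTML here).
function html_escape(string $raw): string
{
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

echo html_escape("Côte d'Ivoire <b>"), "\n"; // Côte d&#039;Ivoire &lt;b&gt;
```

The ô passes through untouched because htmlspecialchars() only escapes the handful of characters that are special in HTML.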

real_escape_string not cleaning up entered text

I thought the proper way to "sanitize" incoming data from an HTML form before entering it into a mySQL database was to use real_escape_string on it in the PHP script, like this:
$newsStoryHeadline = $_POST['newsStoryHeadline'];
$newsStoryHeadline = $mysqli->real_escape_string($newsStoryHeadline);
$storyDate = $_POST['storyDate'];
$storyDate = $mysqli->real_escape_string($storyDate);
$storySource = $_POST['storySource'];
$storySource = $mysqli->real_escape_string($storySource);
// etc.
And once that's done you could just insert the data to the DB like this:
$mysqli->query("INSERT INTO NewsStoriesTable (Headline, Date, DateAdded, Source, StoryCopy) VALUES ('".$newsStoryHeadline."', '".$storyDate."', '".$dateAdded."', '".$storySource."', '".$storyText."')");
So I thought doing this would take care of cleaning up all the invisible "junk" characters that may be coming in with your submitted text.
However, I just pasted some text I copied from a web-page into my HTML form, clicked "submit" - which ran the above script and inserted that text into my DB - but when I read that text back from the DB, I discovered that this piece of text did still have junk characters in it, such as â€“.
And those junk characters of course caused the PHP script I wrote that retrieves the information from the DB to crash.
So what am I doing wrong?
Is using real_escape_string not the way to go here? Or should I be using it in conjunction with something else?
Or is there something I should be doing (like more escaping) when reading data back out from the MySQL database?
(I should mention that I'm an Objective-C developer, not a PHP/mySQL developer, but I've unfortunately been given this task to do some DB stuff - hence my question...)
thanks!

Your assumption is wrong. mysqli_real_escape_string’s only intention is to escape certain characters so that the resulting string can be safely used in a MySQL string literal. That’s it, nothing more, nothing less.
The result should be that exactly the passed data is retained, including ‘junk’. If you don’t want that ‘junk’ in your database, you need to detect, validate, or filter it before passing it to MySQL.
In your case, the ‘junk’ seems to be due to different character encodings: Your input data seems to be encoded with UTF-8 while it’s later displayed using Windows-1250. In this scenario, the character – (U+2013) would be encoded as 0xE28093 in UTF-8, which would represent the three characters â, €, and “ in Windows-1250. Properly declaring the document’s encoding would probably fix this.
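The byte-level claim can be checked with a short sketch. It assumes the iconv extension is available and that it knows the encoding name WINDOWS-1250.

```php
<?php
$dash = "\u{2013}"; // EN DASH as a UTF-8 string

echo bin2hex($dash), "\n"; // e28093

// Treat those three bytes as Windows-1250 and convert them to UTF-8,
// which is effectively what a mis-declared page does when displaying them.
$misread = iconv('WINDOWS-1250', 'UTF-8', $dash);
echo $misread, "\n"; // â€“
```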

Sanitization is a tricky subject, because it means different things depending on the context. :)
real_escape_string just makes sure your data can be included in a request (inside quotes, of course) without having the possibility to change the "meaning" of the request.
The manual page explains what the function really does: it escapes NUL characters, line feeds, carriage returns, backslashes, single quotes, double quotes, and "Control-Z" (probably the SUBSTITUTE character). So it just inserts a backslash before those characters.
That's it. It "sanitizes" the string so it can be passed unchanged in a request. But it doesn't sanitize it from any other point of view: users can still pass, for instance, HTML markup or "strange" characters. You need to make rules depending on what your output format is (most of the time HTML, but HTTP isn't restricted to HTML documents), and what you want to let your users do.
If your code can't handle some characters, or if they have a special meaning in the output format, or if they cause your output to appear "corrupted" in some way, you need to escape or remove them yourself.
You will probably be interested in htmlspecialchars. Control characters generally aren't a problem with HTML. If your output encoding is the same as your input encoding, they won't be displayed and thus won't be an issue for your users (well, maybe for the W3C validator). If you think it is, make your own function to check and remove them.
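If you do decide to remove control characters yourself, one possible shape for such a function is sketched below. The name strip_control_chars and the choice to keep tab, line feed and carriage return are assumptions, not something prescribed by PHP or MySQL.

```php
<?php
// Hypothetical helper: drop ASCII control characters except tab, LF and CR.
// These byte values never occur inside multi-byte UTF-8 sequences,
// so the pattern is safe on UTF-8 input.
function strip_control_chars(string $text): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $text);
}

$input = "Head\x00line <b>\x07bold</b>\n";
$clean = strip_control_chars($input);

// Escape for HTML only when writing the output.
echo htmlspecialchars($clean, ENT_QUOTES, 'UTF-8');
// Headline &lt;b&gt;bold&lt;/b&gt;
```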

Store special character in mysql database that can be read by JavaScript and HTML

I'm storing data in a MySQL database that may have some special characters. I'm wondering how to store it so that these characters are preserved if they're either output to HTML via PHP OR via JavaScript, e.g. createTextNode.
For example, the division symbol (÷) has the HTML entity &divide;, and when I store it as that it shows up fine when put directly into HTML by PHP, but when I pull it into JavaScript using $.getJSON and then insert it with createTextNode it shows up as the literal text &divide;.
I also tried storing the symbol in the SQL directly, but my understanding is that the column would need to be changed from VARCHAR to NVARCHAR and that would cause a performance hit that doesn't seem necessary.
Given that I can modify the SQL, the PHP, or the JavaScript, is there an easy fix here? Maybe a way to unescape the HTML entity in JavaScript?

As answered by Yogesh, you should switch your collation of the DB to utf8_general_ci

So there's probably two things going on:
JSON escapes special characters.
Somewhere, something in your code flow is URL encoding the strings too.
So you just need to decode the string in your JavaScript, or you need to find what part of your code is URL encoding those strings and fix it.
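Another option, sketched here under the assumption that the column holds an HTML entity such as &divide; rather than the character itself, is to decode the entity in PHP before building the JSON response. createTextNode inserts its argument as literal text and does not parse entities, so the script needs the real character. The key name symbol is made up.

```php
<?php
// Assumed: the column holds an HTML entity rather than the character itself.
$stored = '&divide;';

// Decode to the real character before building the JSON response.
$symbol = html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// json_encode() emits the character as a \uXXXX escape, which the
// browser's JSON parser turns back into ÷.
$json = json_encode(['symbol' => $symbol]);
echo $json, "\n"; // {"symbol":"\u00f7"}
```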

MySQL Database has non escaped single quotes in entries ... How to show them?

The database has a ton of entries that were not escaped because they were entered manually, so they look like Don't inside the entry, but when I try to display them in PHP I get weird characters. Normally I would use mysqli_real_escape_string before putting anything into the database and then do the same when retrieving the data, but since the data is already stored without real_escape, how do I display it properly?
The character being displayed instead of the single quotes looks like this: �
If it helps the data is stored as 'text'.
Thanks!
For future users of the same problem here's the steps:
Check your website headers to see what the encoding is
Check your mysql table columns and make sure they match.
If they don't, change them to match. utf8_general in MySQL and utf8 in my HTML worked for me.
You will have to go back through the old mysql tables and update them so the new encoding is set properly.
New entries should work fine
When you output your results in PHP (or I guess whatever language you use), depending on if you are using any validation, you may have to use mysqli_real_escape_string or a similar function, such as stripslashes()

You'll need to read up on text encoding.
The usual solution is to make sure everything (the content-type encoding on your pages and your MySQL database) is set to UTF-8.
Chances are your data is Latin1 and you're displaying it as UTF-8, or vice versa.
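A small sketch of that mismatch, assuming the stored apostrophe is the Windows-1252 curly quote (byte 0x92) and that the iconv extension is available:

```php
<?php
// Assumed: the stored apostrophe is the Windows-1252 curly quote, byte 0x92.
$stored = "Don\x92t";

// As UTF-8 the lone 0x92 byte is invalid, which is why a page served
// as UTF-8 shows the replacement character instead.
var_dump(preg_match('//u', $stored) === 1); // bool(false)

// Converting from the encoding the bytes are really in repairs the text.
$utf8 = iconv('WINDOWS-1252', 'UTF-8', $stored);
echo $utf8, "\n"; // Don’t
```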
