After a few hours of bug searching, I found out the cause of one of my most annoying bugs.
When users are typing out a message on my site, they can title it with plaintext and html entities.
This means that in some instances, users will type a title containing common HTML entity pictures like this face: ( ͡° ͜ʖ ͡°).
To prevent HTML injection, I use htmlspecialchars() on the title, and annoyingly it would convert the picture to its HTML entity format when output onto the page later on:
( &#865;° &#860;&#662; &#865;°)
I realized the problem here was that the title was being encoded as in the example above, and htmlspecialchars(), as well as doing what I wanted and encoding possible HTML injection, was turning the ampersand in the entities into &amp;.
By un-escaping all the ampersands and changing them back to &, I fixed my problem and the face came out as expected.
However, I am unsure if this is still safe from malicious HTML. Is it safe to decode the ampersands in user-inputted titles? If not, how can I go about fixing this issue?
If your entities are displayed as text, then you're probably calling htmlspecialchars() twice.
If you are not calling htmlspecialchars() twice explicitly, then it's probably browser-side auto-escaping, which may occur if the page containing the form uses an obsolete single-byte encoding like Windows-1252. Such automatic escaping is the only way to correctly represent characters not present in the character set of the specific single-byte encoding. All current browsers (including Firefox, Opera, and IE) do this.
Make sure you are using Unicode (UTF-8 in particular) encoding.
To use Unicode as the encoding, add the <meta charset="utf-8" /> element to the HEAD section of the HTML page that contains the form, and don't forget to save the HTML page itself in UTF-8 encoding. To use Unicode in PHP, it's typically enough to use multibyte (mb_ prefixed) string functions. Finally, database engines like MySQL have supported UTF-8 for a long time.
As a temporary workaround, you can disable re-encoding of existing entities by setting the fourth parameter ($double_encode) of the htmlspecialchars() function to false.
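A minimal sketch of that workaround, assuming a made-up title that already contains numeric entities (the sample string and variable names are hypothetical). It compares the default behaviour with $double_encode set to false:

```php
<?php
// Hypothetical title as it might arrive, already containing numeric entities.
$title = '( &#865;° &#860;&#662; &#865;°) <b>hi</b> & more';

// Default: every ampersand is escaped, so existing entities end up shown as text.
$double = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

// Fourth argument false: existing entities are left alone,
// while tags and bare ampersands are still escaped.
$single = htmlspecialchars($title, ENT_QUOTES, 'UTF-8', false);

echo $double, "\n";
echo $single, "\n";
```

The tags are escaped in both cases; only the treatment of text that already looks like an entity differs.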
There is no straight answer. You may unescape &lt;script...&gt; into <script...> and end up in trouble. However, it looks like the code has been double encoded, probably once on input and then again when you output to the screen. If you can guarantee it has been double encoded, then it should be safe to undo one of those.
However, the best solution is to keep the "raw" value in memory, and sanitize/encode for outputting into databases, html, JSON etc.
So - when you get input, sanitise it for anything you don't want, but don't actually convert it into HTML or escape it or anything else at this stage. Escape it into a database, html encode it when output to screen / xml etc.
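As a rough sketch of "keep the raw value, encode per output", the helper names forHtml and forJson below are hypothetical, and the database step is left out because it needs a live connection:

```php
<?php
// Hypothetical raw value, kept exactly as the user typed it.
$rawTitle = 'Tom & "Jerry" <script>alert(1)</script>';

// Encode for HTML only at the moment the value is written into a page.
function forHtml(string $raw): string
{
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

// Encode for JSON only at the moment the value is written into a JSON response.
function forJson(string $raw): string
{
    return json_encode($raw, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR);
}

echo forHtml($rawTitle), "\n";
echo forJson($rawTitle), "\n";
```

Because the stored value is never pre-encoded, each output format gets exactly one round of escaping.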
I plan on using HTML Purifier for the outputs of my webservice. I did not see any integrated "logging" functionality to check what is replaced, so I wrote it myself.
However, the purifier() function automatically transforms my special character "entities".
For example:
& -> &amp;
&oslash; -> ø
The problem is now, that these will also be "logged" as my logging function compares the differences between the "purified" string and the original one. Is there a way to avoid this automatic encoding/decoding, or does anyone have better idea of how to check what is actually replaced?
Thank you!
The two examples you cite are actually two different use-cases: one is because HTML Purifier is making your output safe (& -> &amp;), the other is HTML Purifier using UTF-8 instead of entities because that's its internal representation.
Generally speaking, if your HTML is safe, HTML Purifier will output semantically equivalent HTML, but it's not guaranteed to keep, for example, all whitespace or the exact representation. Its focus is entirely on security, not idempotence for safe HTML, and it transforms incoming HTML quite heavily in the interest of thorough analysis.
You could force it to always turn all non-ASCII characters into entities with Core.EscapeNonASCIICharacters, but I doubt that's what you want: it will also change any UTF-8 that's not currently an entity into an entity. It also doesn't change the fact that unescaped HTML special characters will be escaped (& -> &amp;). HTML Purifier doesn't take chances, so even those HTML special characters that are coincidentally or contextually safe will always be encoded.
Instead, take a look at Core.CollectErrors. That should enable checking for the changes that you're looking for. Despite the warning in the docs, it is a solid feature. You can see an example usage of that feature here. The tl;dr is that to get the error collector, you use $purifier->context->get('ErrorCollector');, and to get your list of errors (which includes replacements), $errorCollector->getRaw(). Try that and see if it works?
I am using mysqli_real_escape_string to parse characters in PHP. When I go to databases, I see:
&#2361;&#2366;&#2305;&#2360;&#2381;&#2344; &#2360;&#2325;&#2367;&#2344;
instead of:
हाँस्न सकिन
I know these codes represent the Unicode code points of the characters. Is there a way to see the actual content without the Unicode codes?
Table Collation is utf16_unicode_ci.
Those are HTML character references. mysqli_real_escape_string doesn't do this, something else is.
That thing could be a web browser, if the data got in there from form input on a page that wasn't marked as <meta charset="utf-8"/>. In this case the browser has to guess what encoding the page is, and may wrongly guess it is Western European (Windows code page 1252). In that case the characters हाँस्न सकिन are not present in the form's encoding, so browsers panic and do a last-ditch-fallback to HTML-encoding. This is a data mangling which you can't reliably undo. You should avoid this by making sure your pages are served as UTF-8, which allows all characters.
What does your web application show on-page for this value? You should see &#2361;&#2366;... literally, with the ampersands and everything. If you see हाँस्न सकिन, that would imply you are not HTML-escaping your database contents when outputting them, which is bad news as it would likely mean you have HTML-injection (XSS) vulnerabilities.
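To illustrate both points, here is a small sketch assuming the stored value consists of decimal character references for that text (the $stored string is a reconstruction for illustration, not data from the question):

```php
<?php
// Assumed value as it might sit in the database after the browser's fallback encoding.
$stored = '&#2361;&#2366;&#2305;&#2360;&#2381;&#2344; &#2360;&#2325;&#2367;&#2344;';

// Properly escaped output: the visitor sees the character references literally.
$escaped = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');

// Decoding turns the references back into UTF-8 text, but it would also rewrite
// any reference a user typed on purpose, so it is not a reliable undo.
$decoded = html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $escaped, "\n";
echo $decoded, "\n";
```

Decoding like this may help with a one-off cleanup of known-mangled rows, but serving the form page as UTF-8 is what prevents the mangling in the first place.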
I thought the proper way to "sanitize" incoming data from an HTML form before entering it into a mySQL database was to use real_escape_string on it in the PHP script, like this:
$newsStoryHeadline = $_POST['newsStoryHeadline'];
$newsStoryHeadline = $mysqli->real_escape_string($newsStoryHeadline);
$storyDate = $_POST['storyDate'];
$storyDate = $mysqli->real_escape_string($storyDate);
$storySource = $_POST['storySource'];
$storySource = $mysqli->real_escape_string($storySource);
// etc.
And once that's done you could just insert the data to the DB like this:
$mysqli->query("INSERT INTO NewsStoriesTable (Headline, Date, DateAdded, Source, StoryCopy) VALUES ('".$newsStoryHeadline."', '".$storyDate."', '".$dateAdded."', '".$storySource."', '".$storyText."')");
So I thought doing this would take care of cleaning up all the invisible "junk" characters that may be coming in with your submitted text.
However, I just pasted some text I copied from a web page into my HTML form and clicked "submit", which ran the above script and inserted that text into my DB. When I read that text back from the DB, I discovered that it still had junk characters in it, such as â€“.
And those junk characters of course caused the PHP script I wrote that retrieves the information from the DB to crash.
So what am I doing wrong?
Is using real_escape_string not the way to go here? Or should I be using it in conjunction with something else?
OR, is there something I should be doing (like more escaping) when reading data back out from the mySQL database?
(I should mention that I'm an Objective-C developer, not a PHP/mySQL developer, but I've unfortunately been given this task to do some DB stuff - hence my question...)
thanks!
Your assumption is wrong. mysqli_real_escape_string’s only intention is to escape certain characters so that the resulting string can be safely used in a MySQL string literal. That’s it, nothing more, nothing less.
The result should be that exactly the passed data is retained, including ‘junk’. If you don’t want that ‘junk’ in your database, you need to detect, validate, or filter it before passing it to MySQL.
In your case, the ‘junk’ seems to be due to different character encodings: your input data seems to be encoded with UTF-8 while it’s later displayed using Windows-1250. In this scenario, the character – (U+2013) would be encoded as 0xE28093 in UTF-8, which represents the three characters â, €, and “ in Windows-1250. Properly declaring the document’s encoding would probably fix this.
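The byte-level effect described above can be reproduced with a short sketch. It assumes the iconv extension is available and that it accepts the name CP1250 for Windows-1250:

```php
<?php
// EN DASH (U+2013) as a UTF-8 string.
$dash = "\u{2013}";

// The three bytes that UTF-8 uses for it.
echo bin2hex($dash), "\n";

// Misread those same bytes as Windows-1250, then convert to UTF-8 for display.
$misread = iconv('CP1250', 'UTF-8', $dash);
echo $misread, "\n";
```

The second line of output shows the three "junk" characters, which is what a page declared (or guessed) as Windows-1250 would display for the dash.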
Sanitization is a tricky subject, because it means different things depending on the context. :)
real_escape_string just makes sure your data can be included in a query (inside quotes, of course) without being able to change the "meaning" of the query.
The manual page explains what the function really does: it escapes NUL characters, line feeds, carriage returns, backslashes, single quotes, double quotes, and "Control-Z" (ASCII 26, the SUBSTITUTE character). So it just inserts a backslash before those characters.
That's it. It "sanitizes" the string so it can be passed unchanged in a query. But it doesn't sanitize it from any other point of view: users can still pass, for instance, HTML tags or "strange" characters. You need to make rules depending on what your output format is (most of the time HTML, but HTTP isn't restricted to HTML documents), and what you want to let your users do.
If your code can't handle some characters, or if they have a special meaning in the output format, or if they cause your output to appear "corrupted" in some way, you need to escape or remove them yourself.
You will probably be interested in htmlspecialchars. Control characters generally aren't a problem with HTML. If your output encoding is the same as your input encoding, they won't be displayed and thus won't be an issue for your users (well, maybe for the W3C validator). If you think it is, make your own function to check and remove them.
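A possible shape for such a function, as a sketch only: the name cleanForHtml and the choice to keep tab, line feed, and carriage return are assumptions, not something prescribed above.

```php
<?php
// Hypothetical helper: drop ASCII control characters (except tab, LF and CR),
// then escape the result for HTML output.
function cleanForHtml(string $input): string
{
    // These byte values never occur inside UTF-8 multibyte sequences,
    // so removing them byte-wise does not damage other characters.
    $withoutControls = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);

    return htmlspecialchars($withoutControls, ENT_QUOTES, 'UTF-8');
}

echo cleanForHtml("Hello\x00 <b>world</b>\x1A"), "\n";
```

This runs at output time, so the value stored in the database stays untouched.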
I need to pull content from the database onto the page, but some of this content contains a whole HTML page, with CSS, head, etc.
What would be the best way to avoid getting all the HTML tags, scripts, and CSS? Would an iframe help here?
The most bothering thing is that I'm getting strange characters on the page: �
and as found out it is due to different encoding.
The site has utf-8 encoding and if the content contains different encoding, these signs come out and I cannot replace them.
The only thing that removed them was changing my encoding, but this is not a real solution.
If someone could tell me how to remove them, that would be really great.
Solution: with your help I checked the encoding, but couldn't change it. I set names in mysql_query to UTF-8, and stripped the unneeded tags. Now it seems OK.
Thanks to all of you.
I think you have no option apart from an ugly iframe. About encoding, you should check the db encoding and the connection encoding, and convert as needed. Use iconv for full control over conversion, for example:
$html = iconv("UTF-8", "ISO-8859-15//TRANSLIT//IGNORE", $html);
In this case, you're going to lose some characters not mapped in ISO-8859-15. Consider moving your whole site to UTF-8 encoding.
The � characters might in fact not be due to encoding; the problem might be the content that is stored in the database.
Check for double quotes like “ which are supposed to be ", more so if the data in the table was copy pasted.
I need to store special characters and symbols in a MySQL database. So either I can store them as they are, like 'ü', or convert them to HTML code such as '&uuml;'.
I am not sure which would be better.
Also I am having symbols like '♥', '„' .
Please suggest which one is better? Also suggest if there is any alternative method.
Thanks.
HTML entities have been introduced years ago to transport character information over the wire when transportation was not binary safe and for the case that the user-agent (browser) did not support the charset encoding of the transport-layer or server.
As a HTML entity contains only very basic characters (&, ;, a-z and 0-9) and those characters have the same binary encoding in most character sets, this is and was very safe from those side-effects.
However when you store something in the database, you don't have these issues because you're normally in control and you know what and how you can store text into the database.
For example, if you allow Unicode for text inside the database, you can store all characters; none is actually special. Note that you need to know your database here, as there are some technical details you can run into. For instance, if you don't know the charset encoding of your database connection, you can't tell your database exactly which text you want to store in there. But generally, you just store the text and retrieve it later. Nothing special to deal with.
In fact there are downsides when you use HTML entities instead of the plain character:
HTML entities consume more space: &uuml; is much larger than ü in LATIN-1, UTF-8, UTF-16 or UTF-32.
HTML entities need further processing. They need to be created, and when read, they need to be parsed. Imagine you need to search for a specific text in your database, or any other action would need additional handling. That's just overhead.
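Both downsides can be seen in a small sketch, assuming UTF-8 strings and a made-up sample word (the variable names are hypothetical):

```php
<?php
// The same word stored raw (UTF-8) and stored with an HTML entity.
$raw    = "M\u{00FC}ller";   // "Müller"
$entity = 'M&uuml;ller';

// Storage size in bytes.
echo strlen($raw), "\n";
echo strlen($entity), "\n";

// A plain search for ü fails on the entity form...
$foundInEntity = strpos($entity, "\u{00FC}") !== false;

// ...unless the stored text is decoded first, which is the extra processing.
$decoded = html_entity_decode($entity, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$foundAfterDecode = strpos($decoded, "\u{00FC}") !== false;

var_dump($foundInEntity, $foundAfterDecode);
```

With raw storage, the search works directly and no decode step is needed.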
The real fun starts when you mix both concepts. You come to a place you really don't want to go into. So just don't do it because you ain't gonna need it.
Leave your data raw in the database. Don't use HTML entities for these until you need them for HTML. You never know when you may want to use your data elsewhere, not on a web page.
My suggestion mirrors that of the other contributors: don't convert the special characters to entities when saving them to your database.
Some reasons against conversion:
K.I.S.S principle (my biggest reason not to do it)
most entities will end up consuming more space than they did prior to being converted
you lose the ability to search easily: to find ü in a word, [word]+ü+[/word], you would have to do a string comparison against its HTML equivalent, [word]+&uuml;+[/word].
your output may change from HTML to, say, an API for mobile, which makes the conversion unnecessary.
need to convert on input of data, and on output (again if your output changes from plain HTML to something else).