I am using mysqli_real_escape_string to escape characters in PHP. When I look in the database, I see:
&#2361;&#2366;&#2305;&#2360;&#2381;&#2344; &#2360;&#2325;&#2367;&#2344;
instead of:
हाँस्न सकिन
I know these character references are the Unicode code points of the characters. Is there a way to see the actual content instead of the numeric codes?
Table Collation is utf16_unicode_ci.
Those are HTML character references. mysqli_real_escape_string doesn't do this, something else is.
That thing could be a web browser, if the data got in there from form input on a page that wasn't marked as <meta charset="utf-8"/>. In this case the browser has to guess what encoding the page is, and may wrongly guess it is Western European (Windows code page 1252). In that case the characters हाँस्न सकिन are not present in the form's encoding, so browsers panic and do a last-ditch-fallback to HTML-encoding. This is a data mangling which you can't reliably undo. You should avoid this by making sure your pages are served as UTF-8, which allows all characters.
What does your web application show on-page for this value? You should see &#2361;&#2366;... literally, with the ampersands and everything. If you see हाँस्न सकिन, that would imply you are not HTML-escaping your database contents when outputting them, which is bad news as it would likely mean you have HTML-injection (XSS) vulnerabilities.
After a few hours of bug searching, I found out the cause of one of my most annoying bugs.
When users are typing out a message on my site, they can title it with plaintext and html entities.
This means in some instances, users will type a title with common html entity pictures like this face. ( ͡° ͜ʖ ͡°).
To prevent HTML injection, I use htmlspecialchars() on the title, and annoyingly it would convert the picture to its HTML entity format when output onto the page later on.
( &#865;° &#860;&#662; &#865;°)
I realized the problem here was that the title was being encoded as in the example above, and htmlspecialchars(), as well as doing what I wanted and encoding possible HTML injection, was turning the ampersand in the entities into &amp;.
By un-escaping all the ampersands, changing &amp; back to &, I fixed my problem and the face came out as expected.
However, I am unsure if this is still safe from malicious HTML. Is it safe to decode the ampersands in user-inputted titles? If not, how can I go about fixing this issue?
If your entities are displayed as text, then you're probably calling htmlspecialchars() twice.
If you are not calling htmlspecialchars() twice explicitly, then it's probably browser-side auto-escaping, which may occur if the page containing the form uses an obsolete single-byte encoding like Windows-1252. Such automatic escaping is the only way to represent characters that are not present in the character set of that single-byte encoding. All current browsers (including Firefox, Opera, and IE) do this.
Make sure you are using Unicode (UTF-8 in particular) encoding.
To use Unicode as the encoding, add the <meta charset="utf-8" /> element to the HEAD section of the HTML page that contains the form, and don't forget to save the HTML page itself in UTF-8 encoding. To handle Unicode in PHP, it's typically enough to use the multibyte (mb_ prefixed) string functions. Finally, database engines like MySQL have supported UTF-8 for a long time.
As a temporary workaround, you can disable reencoding existing entities by setting 4th parameter ($double_encode) of the htmlspecialchars() function to false.
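As a rough illustration of that workaround, the sketch below compares the default behaviour with $double_encode set to false. The $title value is a made-up example, not taken from the question.

```php
<?php
// Hypothetical title that already contains character references plus some markup.
$title = '&#865; &#860;&#662; <b>hi</b>';

// Default: every "&" is escaped again, so the references show up as literal text.
$double = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

// Fourth argument false: existing entities are left alone, markup is still escaped.
$single = htmlspecialchars($title, ENT_QUOTES, 'UTF-8', false);

echo $double, "\n";
echo $single, "\n";
```

Note that with $double_encode disabled a user can no longer display a literal "&#865;" as text, which is one reason to treat this as a stopgap rather than a fix.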
There is no straight answer. You may unescape &lt;script...&gt; into <script...> and end up in trouble; however, it looks like the text has been double encoded, probably once on input and then again when you output it to the screen. If you can guarantee it has been double encoded, then it should be safe to undo one of those encodings.
However, the best solution is to keep the "raw" value in memory, and sanitize/encode for outputting into databases, html, JSON etc.
So - when you get input, sanitise it for anything you don't want, but don't actually convert it into HTML or escape it or anything else at this stage. Escape it into a database, html encode it when output to screen / xml etc.
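A minimal sketch of that "escape at the point of output" idea follows. The helper names forHtml() and forJson() and the sample value are hypothetical; for the database side, a prepared statement (or the driver's escaping function) would play the same role.

```php
<?php
// Hypothetical raw value, kept unmodified in memory after validation.
$rawTitle = 'Tom & "Jerry" <script>alert(1)</script>';

// Escape for an HTML text node or a quoted attribute only when writing HTML.
function forHtml(string $raw): string
{
    return htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Encode for JSON only when emitting JSON; the HEX flags avoid raw <, >, &, ' and ".
function forJson(string $raw): string
{
    return json_encode(
        $raw,
        JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_THROW_ON_ERROR
    );
}

echo forHtml($rawTitle), "\n";
echo forJson($rawTitle), "\n";
```

Because $rawTitle itself is never modified, each output context gets exactly one layer of encoding.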
I'm having some trouble with the dreaded UTF-8 Character Encoding! It's driving me insane, no matter which way I approach it or how many online guides I follow, I can never get it to return the desired results. Here's what's going on:
My whole website uses a simple text-file database that is UTF-8 encoded, and it correctly shows all manner of special characters, latin, arabic, japanese, you name it, they all show correctly, with one exception:
When the user uses the "Search" input box I have on my website, I use $search = $_REQUEST['search']; to get the input data on the results page and show results accordingly. When a user inserts special characters in the search box, they get percent-encoded in the URL (for example, "ï" becomes "%C3%AF"). When showing $search in the actual website, any special character appears as � (black diamond with question mark).
I have tried everything it says here http://malevolent.com/weblog/archive/2007/03/12/unicode-utf8-php-mysql/ with the exception of the header(). I have set the charset as UTF-8 in my head section with an http-equiv meta but for some reason whenever I set it as a header() my PHP stylesheet stops working (and the character problem remains). Maybe this is a clue?
I have tried urldecode and rawurldecode too, but they don't change anything.
Keep in mind special characters appear correctly elsewhere on the site, it's only with the $search string where this problem appears. As a side-note, even though the characters are not visualizing correctly, my search engine does actually interpret the special characters correctly when filtering the results. This makes me understand that the special character is actually there and correctly encoded, but it's just a matter of making it visualize correctly with the correct charset. However... everything appears to be UTF-8.
To be honest I'm so confused about this that this question might also appear to be confusing and the information I'm giving you might not be very well structured either, so I apologize and will try to provide more detailed information for any questions.
Thank you!
Make sure not to have any function which alters your $_REQUEST. Some functions are not aware of special encodings.
The best way to investigate is checking the state of the variables before and after they are altered.
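One way to do that kind of before/after check is to dump the raw bytes with bin2hex() and test them with mb_check_encoding(). This is only a sketch: describe() is a hypothetical helper, the sample value stands in for $_REQUEST['search'], and the mbstring extension is assumed to be loaded.

```php
<?php
// Stand-in for $_REQUEST['search']: the UTF-8 bytes of "ï" (C3 AF).
$search = rawurldecode('%C3%AF');

// Report the raw bytes and whether they still form valid UTF-8.
function describe(string $label, string $value): string
{
    return sprintf(
        '%s: bytes=%s valid_utf8=%s',
        $label,
        bin2hex($value),
        mb_check_encoding($value, 'UTF-8') ? 'yes' : 'no'
    );
}

echo describe('before', $search), "\n";

// A byte-oriented function such as substr() can cut a character in half.
$cut = substr($search, 0, 1);
echo describe('after substr', $cut), "\n";

// The multibyte equivalent keeps the character intact.
$kept = mb_substr($search, 0, 1, 'UTF-8');
echo describe('after mb_substr', $kept), "\n";
```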
I would like to add one more point regarding UTF-8 string manipulation.
When manipulating UTF-8 strings, always use the multibyte string functions.
For example, use mb_strtolower() in place of strtolower().
http://php.net/manual/en/ref.mbstring.php.
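To make the difference concrete, here is a small sketch comparing byte-oriented and multibyte functions. It assumes the mbstring extension is available; the sample word is written with explicit byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// "ÉCOLE" as explicit UTF-8 bytes (É is C3 89).
$word = "\xC3\x89COLE";

$byteLength = strlen($word);                 // counts bytes
$charLength = mb_strlen($word, 'UTF-8');     // counts characters
$lower      = mb_strtolower($word, 'UTF-8'); // lowercases É as well as the ASCII letters

printf("%d bytes, %d characters, lowercased: %s\n", $byteLength, $charLength, $lower);
```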
I am somewhat confused by this whole character set thingy. Everything seems fine when the data is entered manually into the web sites and database tables. But when data is entered by copying and pasting, the character sets begin to get screwy.
I asked several clients where they are getting this data from, and the majority seems to come either from another web site or from an MS Word document.
The characters that seem to be messing up are common characters like the following:
‘ © "
What is being inserted instead is the black diamond with the dreaded question mark! On my server I have the following settings.
PHP TIDY to clean the text before input to web page or database - output-encoding > UTF-8
Each web page has meta tag > charset=UTF-8
The database tables default > latin1_swedish_ci
I assumed at first that it was a database problem, until I noticed that the same issue occurs with static web pages that are not database driven.
Help?
It's not really a good solution to replace away the smart quotes. If you can't cope with smart quotes or the copyright symbol, you can't cope with any other non-ASCII characters either, leaving you with an ASCII-only application (which these days is a pretty sad thing).
Instead, you should ideally ensure that your web application uses UTF-8 throughout, which means:
Serve all your pages as UTF-8 using a header('Content-Type: text/html; charset=utf-8'); and/or a <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>.
Ensure your .php source files are saved as UTF-8, if they contain any non-ASCII characters themselves.
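Use mysql_set_charset('utf8') when connecting to the database (MySQL spells the character set name without a hyphen; with the mysqli extension the equivalent call is mysqli_set_charset($link, 'utf8')).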
Ensure your MySQL tables are created with a UTF-8 CHARACTER SET/COLLATION. They won't be by default if you didn't specify one when you created them. In this case you would need to ALTER TABLE on each text column to change it.
If you use htmlentities() to HTML-escape database content when putting it into the page, you need to pass in utf-8 for the $charset argument or it will mangle all non-ASCII characters by treating them as ISO-8859-1 (which is never the proper encoding). Better: use htmlspecialchars() instead, which doesn't touch non-ASCII characters so doesn't care.
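The PHP side of that checklist could look roughly like the sketch below. It is written against mysqli rather than the old mysql_ functions, and the helper names useUtf8Connection() and e(), as well as the choice of 'utf8mb4' as the connection character set, are assumptions made for illustration. $db stands for an already-open connection.

```php
<?php
// 1. Declare the encoding in the HTTP header, before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// 2. Tell MySQL which encoding the connection uses.
//    $db is a hypothetical, already-open mysqli connection.
function useUtf8Connection(mysqli $db): bool
{
    return $db->set_charset('utf8mb4');
}

// 3. Escape database content on output; htmlspecialchars() leaves
//    non-ASCII characters such as smart quotes and the copyright sign alone.
function e(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// Smart quotes and a copyright sign, written as explicit UTF-8 bytes.
echo '<p>', e("\xE2\x80\x98quoted\xE2\x80\x99 \xC2\xA9 <b>"), '</p>', "\n";
```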
I have a website that declares its output as UTF-8, but I never verify that it is. Should I use a regular expression or the iconv library to convert UTF-8 to UTF-8 (leaving out invalid sequences)? Is this a security issue if I do not do it?
First of all I would never just blindly encode it as UTF-8 (possibly) a second time because this would lead to invalid chars as you say. I would certainly try to detect if the charset of the content is not UTF-8 before attempting such a thing.
Secondly, if the content in question comes from a source which you control and whose charset you control, such as a UTF-8 file or a database that uses UTF-8 in its tables and on the connection, I would trust that source unless something hints that I can't and that something funky is going on. If the content is coming from more or less random places outside your control, well, all the more reason to inspect it and possibly try to re-encode or transform it from other charsets if you can detect them. So the bottom line is: it depends.
As to whether this is a security issue or not, I wouldn't think so (at least I can't think of any scenarios where this could be exploitable), but I'll leave it to others to be definitive about that.
Not a security issue, but your users (especially non-English-speaking ones) will be very annoyed if you send invalid UTF-8 byte streams.
In the best case (what most browsers do) all invalid strings just disappear or show up as gibberish. The worst case is that the browser quits interpreting your page and says something like "invalid encoding". That is what, e.g., some text editors (namely gedit) on Linux do.
OK, to keep it realistic: if you have an English-centered website that does not rely heavily on maths characters or Unicode arrows, it will make almost no difference. But if you serve, e.g., a Chinese site, you can totally screw it up.
Cheers,
Everybody gets charsets messed up, so generally you can't trust any outside source. It's a good practise to verify that the provided input is indeed valid for the charset that it claims to use. Luckily, with UTF-8, you can make a fairly safe assertion about the validity.
If it's possible for users to send in arbitrary bytes, then yes, there are security implications of not ensuring valid utf8 output. Depending on how you're storing data, though, there are also security implications of not ensuring valid utf8 data on input (e.g., it's possible to create a variant of this SQL injection attack that works with utf8 input if the utf8 is allowed to be invalid utf8), so you really should be using iconv to convert utf8 to utf8 on input, and just avoid the whole issue of validating utf8 on output.
The two main security reasons to check that the output is valid UTF-8 are to avoid "overlong" byte sequences (byte sequences that mean some character like '<' but are encoded in more bytes than necessary) and to avoid invalid byte sequences. The overlong encoding issue is obvious: if your filter changes '<' into '&lt;', it might not convert a sequence that means '<' but is written differently. Note that all current-generation browsers will mark overlong sequences as invalid, but some people may be using old browsers.
The issue with invalid sequences is that some utf-8 parsers will allow an invalid sequence to eat some number of valid bytes that follow the invalid ones. Again, not an issue if everyone always has a current browser, but...
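A validity check along these lines can be sketched with mb_check_encoding(). This swaps in mbstring for the iconv approach mentioned above and assumes the extension is loaded; isValidUtf8() is a hypothetical helper name and the sample byte strings are made up.

```php
<?php
// True only if the whole string is well-formed UTF-8.
function isValidUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

$good       = "\xE3\x81\x82";  // a well-formed three-byte character
$overlongLt = "\xC0\xBC";      // overlong two-byte form of "<"
$truncated  = "abc\xE3\x81";   // three-byte sequence missing its last byte

foreach (['good' => $good, 'overlong' => $overlongLt, 'truncated' => $truncated] as $name => $value) {
    printf("%s: %s\n", $name, isValidUtf8($value) ? 'valid' : 'invalid');
}
```

Rejecting the request when such a check fails is the simplest policy; replacing the bad bytes is an alternative if rejecting is too strict for the application.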
I'm looking at encoding strings to prevent XSS attacks. Right now we want to use a whitelist approach, where any characters outside of that whitelist will get encoded.
Right now, we're taking things like '(' and outputting '&#40;' instead. As far as we can tell, this will prevent most XSS.
The problem is that we've got a lot of international users, and when the whole site's in Japanese, encoding becomes a major bandwidth hog. Is it safe to say that any character outside of the basic ASCII set isn't a vulnerability and doesn't need to be encoded, or are there characters outside the ASCII set that still need to be encoded?
Might be (a lot) easier if you just pass the encoding to htmlentities()/htmlspecialchars
echo htmlspecialchars($string, ENT_QUOTES, 'utf-8');
But if this is sufficient or not depends on what you're printing (and where).
see also:
http://shiflett.org/blog/2005/dec/googles-xss-vulnerability
http://jimbojw.com/wiki/index.php?title=Sanitizing_user_input_against_XSS
http://www.erich-kachel.de/?p=415 (in german. If I find something similar in English -> update) edit: well, I guess you can get the main point without being fluent in german ;)
The string
javascript:eval(String.fromCharCode(97,108,101,114,116,40,39,88,83,83,39,41))
passes htmlentities() unchanged. Now consider something like
<a href="<?php echo htmlentities($_GET['homepage']); ?>"
which will send
<a href="javascript:eval(String.fromCharCode(97,108,101,114,116,40,39,88,83,83,39,41))">
to the browser. And that boils down to
href="javascript:eval(\"alert('XSS')\")"
While htmlentities() gets the job done for the contents of an element, it's not so good for attributes.
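For URL attributes, one possible defence is to check the scheme against an allow-list before escaping. The sketch below is an illustration, not a complete URL validator: safeHref() is a hypothetical helper, the allowed schemes are an assumption, and relative URLs are rejected for simplicity.

```php
<?php
// Returns the escaped URL if its scheme is on the allow-list, otherwise "#".
function safeHref(string $url): string
{
    $url = trim($url);
    $scheme = parse_url($url, PHP_URL_SCHEME);

    // Missing scheme, malformed URL, or anything other than http/https is refused.
    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return '#';
    }

    // Still HTML-escape the value for use inside a quoted attribute.
    return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}

echo '<a href="', safeHref('javascript:alert(1)'), '">blocked</a>', "\n";
echo '<a href="', safeHref('https://example.com/?a=1&b=2'), '">allowed</a>', "\n";
```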
In general, yes, you can depend on anything non-ascii to be "safe", however there are some very important caveats to consider:
Always ensure that what you're sending to the client is tagged as UTF-8. This means having a header that explicitly says "Content-Type: text/html; charset=utf-8" on every single page, including all of your error pages if any of the content on those error pages is generated from user input. (Many people forget to test their 404 page, and have that page include the not-found URL verbatim.)
Always ensure that what you're sending to the client is valid UTF-8. This means you cannot simply pass through bytes received from the user back to the user again. You need to decode the bytes as UTF-8, apply your HTML-encoding XSS prevention, and then encode them as UTF-8 as you write them back out.
The first of those two caveats is to keep the client's browser from seeing a bunch of stuff including high-bit characters and falling back to some local multibyte character set. That local multibyte character set may have multiple ways of specifying harmful ASCII characters that you won't have defended against. Related to this, some older versions of certain browsers (cough, IE, cough) were a bit overeager in detecting that a page was UTF-7; this opens up no end of XSS possibilities. To defend against this, you might want to make sure you HTML-encode any outgoing "+" sign; this is excessive paranoia when you're generating proper Content-Type headers, but will save you when some future person flips a switch that turns off your custom headers. (For example, by putting a poorly configured caching reverse proxy in front of your app, or by doing something to insert an extra banner header; PHP won't let you set any HTTP headers if any output is already written.)
The second of those is because it is possible in UTF-8 to specify "overlong" sequences that, while invalid under current specs, will be interpreted by older browsers as ASCII characters. (See what Wikipedia has to say.) Also, it is possible that someone may insert a single bad byte into a request; if you pass this back to the user, it can cause some browsers to replace both the bad byte and one or more bytes after it with "?" or some other "couldn't understand this" character. That is, a single bad byte could cause some good bytes to also be swallowed up. If you look closely at what you're outputting, there's probably a spot somewhere where an attacker who was able to wipe a byte or two out of the output could do some XSS. Decoding the input as UTF-8 and then re-encoding it prevents this attack vector.
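Put together, the decode, escape, and re-encode steps might look something like this sketch. renderUserText() is a hypothetical helper, mbstring is assumed to be loaded, and encoding "+" is the optional extra precaution described above rather than something htmlspecialchars() does on its own.

```php
<?php
// Turn untrusted bytes into text that is valid UTF-8 and safe inside HTML.
function renderUserText(string $bytes): string
{
    // Decode and re-encode: invalid sequences are replaced with a substitute character.
    $clean = mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');

    // HTML-escape; ENT_SUBSTITUTE is a second line of defence against stray bad bytes.
    $escaped = htmlspecialchars($clean, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Optional paranoia: encode "+" so the output cannot be read as UTF-7.
    return str_replace('+', '&#43;', $escaped);
}

if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Sample input with an overlong sequence and a dangling lead byte.
echo renderUserText("ok\xC0\xBCscript\xE3 1+1 <b>"), "\n";
```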