Allow only hexadecimal html entities - php

I have a forum on which posting HTML is forbidden.
However, some users would like to be able to post symbols as hexadecimal HTML entities, such as:
💗
See: http://graphemica.com/%F0%9F%92%97 for more info.
My questions are:
Is this safe to allow them such symbols at all (XSS, etc..)?
What's the best function to use, to allow it? Actually the symbolic html entities appear as plain text.
I want to disallow named entities such as &amp;amp; or &amp;raquo; and so on, and allow only HTML entities that start with &amp;#, followed by a number, with a semicolon at the end.
Any idea how to solve this?

Another option is to use jQuery's .text() method to add the message to your forum's message element,
although you will have to change how your forum builds the message structure.
You can then safely add any sequence of characters, and none of them will be interpreted by the browser as HTML.
Example:
$('#message_text').text(naughty_msg_string);

Is this safe to allow them such symbols at all (XSS, etc..)?
No, this is never safe. For example, &amp;amp; is just a convenient alias for &amp;#38;, which is still an ampersand. Similarly, &amp;#60; is a less-than sign, so naively allowing numeric HTML entities can still open up an XSS attack surface if you forget this during processing.
You could consider allowing only numeric entities outside the main ASCII table (128+), which would be safer.
What's the best function to use, to allow it? Actually the symbolic html entities appear as plain text.
Considering the above, preg_replace_callback is a good candidate, as it allows you to test the content before (dis)allowing it.
This also answers the third question, as you can just test for numbers in the regexp.
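A minimal sketch of the preg_replace_callback approach described above. The function name escapeAllowingHighNumericEntities is made up for illustration, and the 128+ cutoff is simply the one suggested in this answer. The idea is to escape the whole message first and then turn back only those escaped numeric entities whose code point lies outside ASCII:

```php
<?php
// Hypothetical helper (not from any library): escape everything, then
// re-enable only numeric character references above the ASCII range.
function escapeAllowingHighNumericEntities(string $text): string
{
    // Escape first, so every "&" typed by the user becomes "&amp;".
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Find escaped decimal (&amp;#128151;) or hex (&amp;#x1F497;) references.
    return preg_replace_callback(
        '/&amp;#(?:(\d{1,7})|[xX]([0-9a-fA-F]{1,6}));/',
        function (array $m): string {
            // Group 2 is only present (and non-empty) for the hex form.
            $codePoint = (isset($m[2]) && $m[2] !== '') ? hexdec($m[2]) : (int) $m[1];
            $isSurrogate = $codePoint >= 0xD800 && $codePoint <= 0xDFFF;
            if ($codePoint < 128 || $codePoint > 0x10FFFF || $isSurrogate) {
                return $m[0]; // ASCII or invalid: leave it escaped
            }
            // Re-emit in normalized decimal form.
            return '&#' . $codePoint . ';';
        },
        $escaped
    );
}

echo escapeAllowingHighNumericEntities('I &#128151; <b>you</b> &#60;'), "\n";
// prints: I &#128151; &lt;b&gt;you&lt;/b&gt; &amp;#60;
```

Because the entity is rebuilt from the parsed number rather than copied from the input, nothing the user typed is passed through verbatim.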

Related

HTML Purifier - Character Encoding

I plan on using HTML Purifier for the outputs of my webservice. I did not see an integrated "logging" functionality to check what is replaced, so I wrote it myself.
However, the purify() function automatically transforms my special character "entities".
For example:
& -> &amp;amp;
&amp;oslash; -> ø
The problem is now, that these will also be "logged" as my logging function compares the differences between the "purified" string and the original one. Is there a way to avoid this automatic encoding/decoding, or does anyone have better idea of how to check what is actually replaced?
Thank you!
The two examples you cite are actually two different use-cases: the first happens because HTML Purifier is making your output safe (& -> &amp;amp;), the second because HTML Purifier uses UTF-8 instead of entities, as that is its internal representation.
Generally speaking, if your HTML is safe, HTML Purifier will output semantically equivalent HTML. It is not guaranteed to preserve e.g. all whitespace or the exact representation, because its focus is entirely on security, not on idempotence for safe HTML, and it transforms incoming HTML quite heavily in the interest of thorough analysis.
You could force it to always turn all non-ASCII characters into entities with Core.EscapeNonASCIICharacters, but I doubt that's what you want: it will also change any UTF-8 that's not currently an entity into an entity. It also doesn't change the fact that unescaped HTML special characters will be escaped (& -> &amp;amp;). HTML Purifier doesn't take chances, so even those HTML special characters that are coincidentally/contextually safe will always be encoded.
Instead, take a look at Core.CollectErrors. That should enable checking for the changes that you're looking for. Despite the warning in the docs, it is a solid feature. You can see an example usage of that feature here. The tl;dr is that to get the error collector, you use $purifier->context->get('ErrorCollector');, and to get your list of errors (which includes replacements), $errorCollector->getRaw(). Try that and see if it works?

What unicode character groups should we limit the user to, to create Beautiful URLs?

I recently started looking at adding untrusted usernames in prettied urls, eg:
mysite.com/
mysite.com/user/sarah
mysite.com/user/sarah/article/my-home-in-brugge
mysite.com/user/sarah/settings
etc..
Note the username 'sarah' and the article name 'my-home-in-brugge'.
What I would like to achieve, is that someone could just copy-paste the following url somewhere:
(1)
mysite.com/user/Björk Guðmundsdóttir/articles
mysite.com/user/毛泽东/posts
...and it would just be very clear, before clicking on the link, what to expect to see. The following two exact same urls, where the usernames have been encoded using PHP rawurlencode() (considered the proper way of doing this):
(2)
mysite.com/user/Bj%C3%B6rk%20Gu%C3%B0mundsd%C3%B3ttir/articles
mysite.com/user/%E6%AF%9B%E6%B3%BD%E4%B8%9C/posts
...are a lot less clear.
There are three ways to securely (to some level of guarantee) pass an untrusted name containing readable utf8 characters into a url path as a directory:
A. You reparse the string into allowable characters whilst still keeping it uniquely associated in your database to that user, eg:
(3)
mysite.com/user/bjork-guomundsdottir/articles
mysite.com/user/mao-ze-dong12/posts
B. You limit the user's input at string creation time to acceptable characters for url passing (you ask eg. for alphanumeric characters only):
(4)
mysite.com/user/bjorkguomundsdottir/articles
mysite.com/user/maozedong12/posts
using eg. a regex check (for simplicity sake)
if(!preg_match('/^[\p{L}\p{N}\p{P}\p{Zs}\p{Sm}\p{Sc}]+$/u', trim($sUserInput))) {
//...
}
C. You escape them in full using PHP rawurlencode(), and get the ugly output as in (2).
Question:
I want to focus on B, and push this as far as is possible within KNOWN errors/concerns, until we get the beautiful urls as in (1). I found out that passing many unicode characters in urls is possible in modern browsers. Modern browsers automatically convert unicode characters or non-url parseable characters into encoded characters, so the user can, e.g., copy-paste the nice-looking unicode urls as in (1), and the browser will get the actual final url right.
For some characters, the browser will not get it right without encoding: e.g. ?, #, / or \ will definitely and clearly break the url.
So: Which characters in the (non-alphanumeric) ascii range, and across the entire unicode spectrum, can we allow at creation time to be injected into a url without escaping? Or better: Which groups of Unicode characters can we allow? Which characters are definitely always blacklisted? There will be special cases: spaces look fine, except at the end of the string, where they could be mis-selected. Is there a reference out there that shows which browsers interpret which unicode character ranges correctly?
PS: I am very well aware that using improperly encoded strings in urls will almost never provide a security guarantee. This question is certainly not recommended practice, but I do not see the difference of asking this question, and the done-so-often matter of copy-pasting a url from a website and pasting it into the browser, without thinking it through whether that url was correctly encoded or not (the novice user wouldn't). Has someone looked at this before, and what was their code (regex, conditions, if-statement..) solution?
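For what it's worth, a conservative starting point for option B could look like the sketch below. It is only an illustration under assumptions: the function name isUrlPathFriendlyName and the exact set of allowed groups (letters, combining marks, digits, hyphen, underscore, single inner spaces) are my own choices, not an established list. It is deliberately much narrower than the regex in the question, so that ?, #, /, \ and % can never get through:

```php
<?php
// Hypothetical whitelist check for a username that will be placed in a URL path.
// Allows Unicode letters (\p{L}), combining marks (\p{M}), digits (\p{N}),
// "-" and "_", with single spaces only between words (never leading/trailing).
function isUrlPathFriendlyName(string $name): bool
{
    // \z (instead of $) refuses a trailing newline; /u enables UTF-8 mode.
    // preg_match returns false on invalid UTF-8, so "=== 1" rejects that too.
    return preg_match(
        '/^[\p{L}\p{M}\p{N}_-]+(?: [\p{L}\p{M}\p{N}_-]+)*\z/u',
        $name
    ) === 1;
}

var_dump(isUrlPathFriendlyName('Björk Guðmundsdóttir')); // bool(true)
var_dump(isUrlPathFriendlyName('毛泽东'));                // bool(true)
var_dump(isUrlPathFriendlyName('sarah/settings'));       // bool(false)
var_dump(isUrlPathFriendlyName('sarah '));               // bool(false)
```

Even with such a check at creation time, the value written into an href can still go through rawurlencode(); browsers generally show the decoded form to the user.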

Help Implement Tags in PHP

In my recent PHP project, I need to implement Tags (searchable) separated by comma (similar to this site or something like in WordPress). What is the smart way to detect and remove unnecessary characters or tags? Putting the XSS concern aside, first of all I need to clean and extract only text if user inputs HTML(or other tags) instead of the plain text.
For example:
If user inputs <b>sdfasdf</b>, sdfsdfsdf, <sdfsdfsdf
It should strip out all the unnecessary characters and tags and only plain text should be saved in database.
I have tried it in WordPress and it is very smart to figure out this plus automatically extracts text only.
My question:
Is there an open source library available for this task that I can integrate into my project? I have done some homework on this, but htmlentities(), strip_tags(), HTML Purifier etc. don't seem suitable for the task. Or do I need to build my own library on top of these?
Can somebody guide me on this?
Thanks!
In addition to removing "complete" tags (markup language elements) such as found in <b>sdfasdf</b>, sdfsdfsdf, you can also remove "forbidden" characters such as "<", ">", and "&" (using preg_replace and the like), and collapse multiple spaces into a single space (also using preg_replace).
Remember, they're used only as tags (keywords), so it's acceptable here to use a somewhat restricted character set. In Stack Overflow, for instance, only letters, numbers, and hyphens are allowed in tags.
I would look at this the other way around. What input is legal? Which characters are allowed in tag names? Once those questions are answered, I would build a server-side whitelist of legal characters using regex, state the rules in the UI, and simply reject input that does not comply.
Massaging invalid input into valid input is rarely a good idea.
Characters allowed in tags are usually alphanumeric + dashes and underscores. Some sites also allow spaces.
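The whitelist-and-reject approach can be sketched roughly as follows. The function name parseTags, the 30-character limit and the lowercasing are assumptions made for the example. Spaces inside tags are not allowed here; add a space to the character class if your site wants them.

```php
<?php
// Hypothetical helper: split a comma-separated tag string, keep tags that
// match the whitelist, and report the rest so the UI can reject them.
function parseTags(string $input): array
{
    $valid = [];
    $rejected = [];
    foreach (explode(',', $input) as $raw) {
        $tag = trim($raw);
        if ($tag === '') {
            continue; // ignore empty entries such as "a,,b"
        }
        // Whitelist: ASCII letters, digits, dash, underscore; 1 to 30 chars.
        if (preg_match('/^[A-Za-z0-9_-]{1,30}\z/', $tag) === 1) {
            $valid[] = strtolower($tag);
        } else {
            $rejected[] = $tag;
        }
    }
    // Drop duplicates and reindex.
    return ['tags' => array_values(array_unique($valid)), 'rejected' => $rejected];
}

print_r(parseTags('<b>sdfasdf</b>, php, PHP , html-entities, <sdfsdfsdf'));
// tags:     php, html-entities
// rejected: <b>sdfasdf</b>, <sdfsdfsdf
```

If 'rejected' is non-empty, show the rules to the user again instead of silently rewriting what they typed.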

How do I create a regular expression that disallows symbols?

I have a question regarding regexps in general. I'm currently building a registration form where you can enter your full name (given name and family name); however, I can't use [a-zA-Z] as a validation check because that would exclude everyone with a "foreign" character in their name.
What is the best way to make sure that they don't enter a symbol, in both php and javascript?
Thanks in advance!
The correct solution to this problem (in general) is POSIX character classes. In particular, you should be able to use [:alpha:] (or [:alnum:]) to do this.
Though why do you want to prevent users from entering their name exactly as they type it? Are you sure you're in a position to tell them exactly what characters are allowed to be in their names?
You first need to conceptually distinguish between a "foreign" character and a "symbol." You may need to clarify here.
Accounting for other languages means accounting for other code pages and that is really beyond the scope of a simple regexp. It can be done, but on a higher level, the codepages have to work.
If you strictly wanted your regexp to fail on punctuation and symbols, you could use [^[:punct:]], but I'm not sure how the [:punct:] POSIX class reacts to some of the weird unicode symbols. This would of course stop someone from putting in "John Smythe-Jones" as their name though (as '-' is a punctuation character), so I would probably advise against using it.
I don’t think that’s a good idea. See How to check real names and surnames - PHP
I don't know how you would account for what is valid or not, and depending on your global reach, you will probably not be able to remove anything without locking out somebody. But a Google search turned this up which may be helpful.
http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_symbol_characters_web_page
You could loop through the input string and use the String.charCodeAt() function to get the integer character code for each character. Set yourself up with a range of acceptable characters and do your comparison.
As noted POSIX character classes are likely the best bet. But the details of their support (and alternatives) vary very much with the details of the specific regex variant.
PHP apparently does support them, but JavaScript does not.
This means that for JavaScript you will need to use character ranges: /[\u0400-\u04FF]/ matches any one Cyrillic character. Clearly this will take some writing, but note that the XML 1.0 Recommendation (from W3C) includes a listing of a lot of ranges, albeit one that is a few years old now.
One approach might be to have a limited check on the client in JavaScript, and the full check only server side.
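For the server-side check in PHP, here is a sketch that swaps the POSIX classes mentioned above for PCRE's Unicode property escapes (\p{L} for letters, \p{M} for combining marks), which work when the pattern uses the /u modifier. The function name looksLikeName and the small set of extra characters (apostrophe, space, dot, hyphen) are assumptions. As several answers point out, any such list will lock out somebody, so treat it as a loose sanity check at most.

```php
<?php
// Hypothetical sanity check: the name must start with a letter and may then
// contain letters, combining marks, apostrophes, spaces, dots and hyphens.
function looksLikeName(string $name): bool
{
    $name = trim($name);
    return preg_match('/^\p{L}[\p{L}\p{M}\' .-]*\z/u', $name) === 1;
}

var_dump(looksLikeName('John Smythe-Jones'));    // bool(true)
var_dump(looksLikeName('Björk Guðmundsdóttir')); // bool(true)
var_dump(looksLikeName("O'Brien"));              // bool(true)
var_dump(looksLikeName('Robert<script>'));       // bool(false)
```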

How do I HTML Encode all the output in a web application?

I want to prevent XSS attacks in my web application. I found that HTML-encoding the output can really prevent XSS attacks. Now the problem is: how do I HTML-encode every single output in my application? Is there a way to automate this?
I appreciate answers for JSP, ASP.net and PHP.
One thing that you shouldn't do is filter the input data as it comes in. People often suggest this, since it's the easiest solution, but it leads to problems.
Input data can be sent to multiple places, besides being output as HTML. It might be stored in a database, for example. The rules for filtering data sent to a database are very different from the rules for filtering HTML output. If you HTML-encode everything on input, you'll end up with HTML in your database. (This is also why PHP's "magic quotes" feature is a bad idea.)
You can't anticipate all the places your input data will travel. The safe approach is to prepare the data just before it's sent somewhere. If you're sending it to a database, escape the single quotes. If you're outputting HTML, escape the HTML entities. And once it's sent somewhere, if you still need to work with the data, use the original un-escaped version.
This is more work, but you can reduce it by using template engines or libraries.
You don't want to encode all HTML, you only want to HTML-encode any user input that you're outputting.
For PHP: htmlentities and htmlspecialchars
For JSPs, you can have your cake and eat it too, with the c:out tag, which escapes XML by default. This means you can bind to your properties as raw elements:
<input name="someName.someProperty" value="<c:out value='${someName.someProperty}' />" />
When bound to a string, someName.someProperty will contain the XML input, but when being output to the page, it will be automatically escaped to provide the XML entities. This is particularly useful for links for page validation.
A nice way I used to escape all user input is by writing a modifier for Smarty which escapes all variables passed to the template, except for the ones that have |unescape attached to them. That way you only give HTML access to the elements you explicitly give access to.
I don't have that modifier any more; but about the same version can be found here:
http://www.madcat.nl/martijn/archives/16-Using-smarty-to-prevent-HTML-injection..html
In the new Django 1.0 release this works exactly the same way, yay :)
My personal preference is to diligently encode anything that's coming from the database, business layer or from the user.
In ASP.Net this is done by using Server.HtmlEncode(string).
The reason to encode everything is that even properties which you might assume to be boolean or numeric could contain malicious code. For example, checkbox values, if handled improperly, could come back as strings; if you don't encode them before sending the output to the user, you have a vulnerability.
You could wrap echo / print etc. in your own methods which you can then use to escape output. i.e. instead of
echo "blah";
use
myecho('blah');
you could even have a second param that turns off escaping if you need it.
In one project we had a debug mode in our output functions which made all the output text going through our method invisible. Then we knew that anything left on the screen HADN'T been escaped! Was very useful tracking down those naughty unescaped bits :)
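A bare-bones version of such a wrapper might look like this. The name myecho comes from the answer above; the parameter name and the htmlspecialchars flags are my own assumptions:

```php
<?php
// Sketch of an escaping echo wrapper: escapes by default, with an explicit
// opt-out for the rare places where trusted markup must be emitted.
function myecho(string $text, bool $escape = true): void
{
    echo $escape
        ? htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8')
        : $text;
}

myecho('<script>alert("x")</script>');
// outputs: &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;

myecho('<em>trusted markup</em>', false);
// outputs: <em>trusted markup</em>
```

Making escaping the default means an unescaped value needs a visible `false` at the call site, which is easy to grep for in a review.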
If you do actually HTML encode every single output, the user will see plain text of <html> instead of a functioning web app.
EDIT: If you HTML encode every single input, you'll have problems accepting external passwords containing < etc.
The only way to truly protect yourself against this sort of attack is to rigorously filter all of the input that you accept, specifically (although not exclusively) from the public areas of your application. I would recommend that you take a look at Daniel Morris's PHP Filtering Class (a complete solution) and also the Zend_Filter package (a collection of classes you can use to build your own filter).
PHP is my language of choice when it comes to web development, so apologies for the bias in my answer.
Kieran.
OWASP has a nice API to encode HTML output, either to use as HTML text (e.g. paragraph or <textarea> content) or as an attribute's value (e.g. for <input> tags after rejecting a form):
encodeForHTML($input) // Encode data for use in HTML using HTML entity encoding
encodeForHTMLAttribute($input) // Encode data for use in HTML attributes.
The project (the PHP version) is hosted under http://code.google.com/p/owasp-esapi-php/ and is also available for some other languages, e.g. .NET.
Remember that you should encode everything (not only user input), and as late as possible (not when storing in DB but when outputting the HTTP response).
Output encoding is by far the best defense. Validating input is great for many reasons, but it is not a 100% defense. If a database becomes infected with XSS via attack (e.g. ASPROX), mistake, or malice, input validation does nothing. Output encoding will still work.
There was a good essay from Joel on Software ("Making Wrong Code Look Wrong", I think; I'm on my phone, otherwise I'd have a URL for you) that covered the correct use of Hungarian notation. The short version would be something like:
Var dsFirstName, uhsFirstName : String;
Begin
uhsFirstName := request.queryfields.value['firstname'];
dsFirstName := dsHtmlToDB(uhsFirstName);
Basically prefix your variables with something like "us" for unsafe string, "ds" for database safe, "hs" for HTML safe. You only want to encode and decode where you actually need it, not everywhere. But by using these prefixes, which carry a useful meaning, you'll see real quick when looking at your code if something isn't right. And you're going to need different encode/decode functions anyway.
