Sanitising form content in an internationalised scenario

Sanitising form content in an internationalised scenario - php

In the good old days when I was a web developer (using PHP), I used to run all submitted form data through a regex before commencing any processing. For most cases, I would allow alphanumerics along with a small set of punctuation characters which would satisfy 99% of people 99% of the time whilst providing a defense against SQL injection and cross site scripting (yes I used PDO prepared statements as well).
More recently I've had to deal with input in an internationalised context, specifically, where the input can be in quite a few different western and eastern European languages as well as Arabic. In these cases, I resorted to removing potentially dangerous characters and letting everything else in. The application had a very small number of users (less than 10) and was only deployed on their internal network so I wasn't overly concerned about the security of the system but I wouldn't be comfortable taking this approach on a publicly accessible website.
In summary, I would like the input to be filtered so that what is left, is "plain text" but I'm not sure how to define the concept of plain text in an internationalised context. Are there any PHP libraries that address this?

Everything is "plain text". Even "' DROP TABLE users --" is plain text. Even "<script>" is just plain text.
What you're worried about are "special characters", i.e. plain text which has special meanings in certain contexts. For that, you need to escape theses special characters to "defuse" them in the given context. For HTML, escape them to HTML entities. For SQL, SQL-escape the string (or use prepared statements to avoid this problem in general). For CSV, CSV-escape the values... You get the idea. There are always functions or libraries available which will do this for you, don't try to reinvent the wheel here.
If you want to sanitize, i.e. remove content, you need to define better what you want to remove. Removing content also always runs the risk of removing legitimate content your users may want to use. So it's usually the annoying option.
For more on this topic, see The Great Escapism (Or: What You Need To Know To Work With Text Within Text).

Give strip_tags() a try. http://php.net/manual/en/function.strip-tags.php. It has worked for me for most english cases and might work for different languages.

Related

Displaying user input as foreign languages

I just recently encountered a problem that I wasn't really expecting. Up to this point I use htmlentites() to protect against XSS attacks. I don't make it an option for users to enter html so it works perfectly, except for when foreign users want to use the system, and htmlentites() displays their letters as gibberish.
I've used HTML purifier classes that are used along with WYSIWYG editors to protect against XSS, but it's a full class and it does make effect performance, and I don't really have to worry about allowing some html while blocking others because html isn't the problem, it
s just the special characters. If I have to I'll use one, but I was wondering if there was a built in php function for escaping html characters, but displaying foreign languages. I'd imagine their has to be, because PHP isn't only used by English speaking people.
Thanks!

rawurlencode for storing data

I have always used rawurlencode to store user entered data into my mysql databases. The main reason I do this is so that stroing foreign characters is very simple I find. I'd then use rawurldecode to retrieve and display the data.
I read somewhere that rawurlencode was not meant for this purpose. Are there any disadvantages to what I'm doing?
So let's say I have a German address with many characters like umlauts etc. What is the simplest way to store this in a mysql database with no risks of it coming out wrong and being searchable using a search script? So far rawurelencode has been excellent for our system. Perhaps the practise can be improved upon by only encoding foreign letters and not common characters like spaces etc, which is a waste of space I totally agree.

Sure there are.
Let's start with the practical: for a large class of characters you are spending 3 bytes of storage for every byte of data. The description of rawurlencode (and of course the RFC) say that those characters are
all non-alphanumeric characters except -_.~
This means that there is a total of 26 + 26 + 10 (alphanumeric) + 4 (special exceptions) = 66 characters for which you do not waste space.
Then there are also the logical drawbacks: You are not storing the data itself, but rather a representation of the data tailored to URLs. Unless the data itself is URLs, that's not what you should be doing.

Drawbacks I can think of:
Waste of disk space.
Waste of CPU cycles encoding and decoding on every read and every write.
Additional complexity (you can't even inspect data with a MySQL client).
Impossibility to use full text searches.
URL encoding is not necessarily unique (there're at least two RFCs). It may not lead to data loss but it can lead to duplicate data (e.g., unique indexes where two rows actually contain the same piece of data).
You can accidentally encode a non-string piece of data such as a date: 2012-04-20%2013%3A23%3A00
But the main consideration is that such technique is completely arbitrary and unnecessary since MySQL doesn't have the least problem storing the complete Unicode catalogue. You could also decide to swap e's and o's in all strings: Holle, werdl!. Your app would run fine but it would not provide any added value.
Update: As Your Common Sense points out, a SQL clause as basic as ORDER BYis no longer usable. It's not that international chars will be ignored; you'll basically get an arbitrary sort order based on the ASCII code of the % and hexadecimal characters. If you can't SELECT * FROM city ORDER BY city_name reliably, you've rendered your DB useless.

I am using a fork to eat a soup
I am using money bills to fire the coals for BBQ
I am using a kettle to boil eggs.
I am using a microscope to hammer the nails.
Are there any disadvantages to what I'm doing?
YES
You are using a tool not on purpose. This is always a disadvantage.
A sane human being alway using a tool that is intended for the certain job. Not some randomly picked one. Especially if there is no shortage in the right tool supply.
URL encoding is not intended to be used with database, as one can tell from the name. That's alone reason enough for the sane developer. Take a look around: find the proper tool.
There is a thing called "common sense" - a thing widely used in the regular life but for some reason always absent in the php world.
A common sense can warn us: if we're using a wrong tool, it may spoil the work. Sooner or later it will spoil it. No need to ask for the certain details - it's a general rule. We are learning this rule at about age of 5.
Why not to use it while playing with some web thingies too?
Why not to ask yourself a question:
What's wrong with storing foreign characters at all?
urlencode makes stroing foreign characters very simple
Any hardships you encountered without urlencode?
Although I feel that common sense should be enough to answer the question, people always look for the "omen", the proof. Here you are:
Database's job is not limited to just storing and retrieving data. A plain text file can handle such a primitive task as well.
Data manipulations is what we are using databases for.
Most widely used ones are sorting and filtering.
Such a quite intelligent thing as a database can sort and filter data character-insensitive, which is very handy feature. But of course it can be done only if characters being saved as is, not as some random codes.
Sorting texts also may use ordering other than just binary order in the character table. Some umlaut characters may be present at the other parts of the table but database collation will put them in the right place. Of course it can be done only if characters being saved as is, not as some random codes.
Sometimes we have to manipulate the data that already stored in the database. Say, cut some piece from the string and compare with the entered value. How it is supposed to be done with urlencoded data?

Preventing server-side scripting, XSS

Are there any pre-made scripts that I can use for PHP / MySQL to prevent server-side scripting and JS injections?
I know about the typical functions such as htmlentities, special characters, string replace etc. but is there a simple bit of code or a function that is a failsafe for everything?
Any ideas would be great. Many thanks :)
EDIT: Something generic that strips out anything that could be hazardous, ie. greater than / less than signs, semi-colons, words like "DROP", etc?
I basically just want to compress everything to be alphanumeric, I guess...?

Never output any bit of data whatsoever to the HTML stream that has not been passed through htmlspecialchars() and you're done. Simple rule, easy to follow, completely eradicates any XSS risk.
As a programmer it's your job to do it, though.
You can define
function h(s) { return htmlspecialchars(s); }
if htmlspecialchars() is too long to write 100 times per PHP file. On the other hand, using htmlentities() is not necessary at all.
The key point is: There is code, and there is data. If you intermix the two, bad things ensue.
In the case of HTML, code is elements, attribute names, entities, comments. Data is everything else. Data must be escaped to avoid being mistaken for code.
In case of URLs, code is the scheme, the host name, the path, the mechanism of the query string (?, &, =, #). Data is everything in the query string: parameter names and values. They must be escaped to avoid being mistaken for code.
URLs embedded in HTML must be doubly escaped (by URL-escaping and HTML-escaping) to ensure proper separation of code and data.
Modern browsers are capable of parsing amazingly broken and incorrect markup into something useful. This capability should not be stressed, though. The fact that something happens to work (like URLs in <a href> without proper HTML-escaping applied) does not mean that it's good or correct to do it. XSS is a problem that roots in a) people unaware of data/code separation (i.e. "escaping") or those that are sloppy and b) people that try to be clever about what part of data they don't need to escape.
XSS is easy enough to avoid if you make sure you don't fall into categories a) and b).

I think Google-caja maybe a solution. I write a taint analyzer for java web application to detect and prevent XSS automatically. But not for PHP. I think Learning to using caja not bad for web developer.

No, there isn't. Risks depend on what you do with data, you can't write something that makes data safe for everything (unless you want to discard most of the data)

is there a simple bit of code or a function that is a failsafe for everything?
No.
The representation of data leaving PHP must be converted / encoded specifically according where it is going. And therefore should only be converted/encoded at the point where it leaves PHP.
C.

You can refer to OWASP to get more understanding of XSS attacks:
https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet
To avoid js attacks, you can try this project provided by open source excellence:
https://www.opensource-excellence.com/shop/ose-security-suite.html
my website had the js attacks before, and this tool helps me block all attacks everyday. i think it can help you guys to avoid the problem.
Further, you can add a filter in your php script to filter all js attacks, here is one pattern that can do job:
if (preg_match('/(?:[".]script\s*()|(?:\$\$?\s*(\s*[\w"])|(?:/[\w\s]+/.)|(?:=\s*/\w+/\s*.)|(?:(?:this|window|top|parent|frames|self|content)[\s*[(,"]\s[\w\$])|(?:,\s*new\s+\w+\s*[,;)/ms', strtolower($POST['VARIABLENAME'])))
{
filter_variable($POST['VARIABLENAME']);
}

To answer to your edition: everything except <> symbols has nothing to do with XSS.
And htmlspecialchars() can deal with them.
There is no harm in the word DROP table in the page's text ;)

for clean user data use
html_special_chars(); str_replace() and other funcs to cut unsafe data.

When to filter/sanitize data: before database insertion or before display?

As I prepare to tackle the issue of input data filtering and sanitization, I'm curious whether there's a best (or most used) practice? Is it better to filter/sanitize the data (of HTML, JavaScript, etc.) before inserting the data into the database, or should it be done when the data is being prepared for display in HTML?
A few notes:
I'm doing this in PHP, but I suspect the answer to this is language agnostic. But if you have any recommendations specific to PHP, please share!
This is not an issue of escaping the data for database insertion. I already have PDO handling that quite well.
Thanks!

When it comes to displaying user submitted data, the generally accepted mantra is to "Filter input, escape output."
I would recommend against escaping things like html entities, etc, before going into the database, because you never know when HTML will not be your display medium. Also, different types of situations require different types of output escaping. For example, embedding a string in Javascript requires different escaping than in HTML. Doing this before may lull yourself into a false sense of security.
So, the basic rule of thumb is, sanitize before use and specifically for that use; not pre-emptively.
(Please note, I am not talking about escaping output for SQL, just for display. Please still do escape data bound for an SQL string).

i like to have/store the data in original form.
i only escape/filter the data depending on the location where i'm using it.
on a webpage - encode all html
on sql - kill quotes
on url - urlencoding
on printers - encode escape commands
on what ever - encode it for that job

There are at least two types of filtering/sanitization you should care about :
SQL
HTML
Obviously, the first one has to be taken care of before/when inserting the data to the database, to prevent SQL Injections.
But you already know that, as you said, so I won't talk about it more.
The second one, on the other hand, is a more interesting question :
if your users must be able to edit their data, it is interesting to return it to them the same way they entered it at first ; which means you have to store a "non-html-specialchars-escaped" version.
if you want to have some HTML displayed, you'll maybe use something like HTMLPurifier : very powerful... But might require a bit too much resources if you are running it on every data when it has to be displayed...
So :
If you want to display some HTML, using a heavy tool to validate/filter it, I'd say you need to store an already filtered/whatever version into the database, to not destroy the server, re-creating it each time the data is displayed
but you also need to store the "original" version (see what I said before)
In that case, I'd probably store both versions into database, even if it takes more place... Or at least use some good caching mecanism, to not-recreate the clean version over and over again.
If you don't want to display any HTML, you will use htmlspecialchars or an equivalent, which is probably not that much of a CPU-eater... So it probably doesn't matter much
you still need to store the "original" version
but escaping when you are outputing the data might be OK.
BTW, the first solution is also nice if users are using something like bbcode/markdown/wiki when inputting the data, and you are rendering it in HTML...
At least, as long as it's displayed more often than it's updated -- and especially if you don't use any cache to store the clean HTML version.

Sanitize it for the database before putting it in the database, if necessary (i.e. if you're not using a database interactivity layer that handles that for you). Sanitize it for display before display.
Storing things in a presently unnecessary quoted form just causes too many problems.

I always say escape things immediately before passing them to the place they need to be escaped. Your database doesn't care about HTML, so escaping HTML before storing in the database is unnecessary. If you ever want to output as something other than HTML, or change which tags are allowed/disallowed, you might have a bit of work ahead of you. Also, it's easier to remember to do the escaping right when it needs to be done, than at some much earlier stage in the process.
It's also worth noting that HTML-escaped strings can be much longer than the original input. If I put a Japanese username in a registration form, the original string might only be 4 Unicode characters, but HTML escaping may convert it to a long string of "〹𐤲䡈穩". Then my 4-character username is too long for your database field, and gets stored as two Japanese characters plus half an escape code, which also probably prevents me from logging in.
Beware that browsers tend to escape some things like non-English text in submitted forms themselves, and there will always be that smartass who uses a Japanese username everywhere. So you may want to actually unescape HTML before storing.

Mostly it depends on what you are planning to do with the input, as well as your development environment.
In most cases you want original input. This way you get the power to tweak your output to your heart's content without fear of losing the original. This also allows you to troubleshoot issues such as broken output. You can always see how your filters are buggy or customer's input is erroneous.
On the other hand some short semantic data could be filtered immediately. 1) You don't want messy phone numbers in database, so for such things it could be good to sanitize. 2) You don't want some other programmer to accidentally output data without escaping, and you work in multiprogrammer environment. However, for most cases raw data is better IMO.

What is the correct/safest way to escape input in a forum?

I am creating a forum software using php and mysql backend, and want to know what is the most secure way to escape user input for forum posts.
I know about htmlentities() and strip_tags() and htmlspecialchars() and mysql_real_escape_string(), and even javascript's escape() but I don't know which to use and where.
What would be the safest way to process these three different types of input (by process, I mean get, save in a database, and display):
A title of a post (which will also be the basis of the URL permalink).
The content of a forum post limited to basic text input.
The content of a forum post which allows html.
I would appreciate an answer that tells me how many of these escape functions I need to use in combination and why.
Thanks!

When generating HTLM output (like you're doing to get data into the form's fields when someone is trying to edit a post, or if you need to re-display the form because the user forgot one field, for instance), you'd probably use htmlspecialchars() : it will escape <, >, ", ', and & -- depending on the options you give it.
strip_tags will remove tags if user has entered some -- and you generally don't want something the user typed to just disappear ;-)
At least, not for the "content" field :-)
Once you've got what the user did input in the form (ie, when the form has been submitted), you need to escape it before sending it to the DB.
That's where functions like mysqli_real_escape_string become useful : they escape data for SQL
You might also want to take a look at prepared statements, which might help you a bit ;-)
with mysqli - and with PDO
You should not use anything like addslashes : the escaping it does doesn't depend on the Database engine ; it is better/safer to use a function that fits the engine (MySQL, PostGreSQL, ...) you are working with : it'll know precisely what to escape, and how.
Finally, to display the data inside a page :
for fields that must not contain HTML, you should use htmlspecialchars() : if the user did input HTML tags, those will be displayed as-is, and not injected as HTML.
for fields that can contain HTML... This is a bit trickier : you will probably only want to allow a few tags, and strip_tags (which can do that) is not really up to the task (it will let attributes of the allowed tags)
You might want to take a look at a tool called HTMLPUrifier : it will allow you to specify which tags and attributes should be allowed -- and it generates valid HTML, which is always nice ^^
This might take some time to compute, and you probably don't want to re-generate that HTML each time is has to be displayed ; so you can think about storing it in the database (either only keeping that clean HTML, or keeping both it and the not-clean one, in two separate fields -- might be useful to allow people editing their posts ? )
Those are only a few pointers... hope they help you :-)
Don't hesitate to ask if you have more precise questions !

mysql_real_escape_string() escapes everything you need to put in a mysql database. But you should use prepared statements (in mysqli) instead, because they're cleaner and do any escaping automatically.
Anything else can be done with htmlspecialchars() to remove HTML from the input and urlencode() to put things in a format for URL's.

There are two completely different types of attack you have to defend against:
SQL injection: input that tries to manipulate your DB. mysql_real_escape_string() and addslashes() are meant to defend against this. The former is better, but parameterized queries are better still
Cross-Site scripting (XSS): input that, when displayed on your page, tries to execute JavaScript in a visitor's browser to do all kinds of things (like steal the user's account data). htmlspecialchars() is the definite way to defend against this.
Allowing "some HTML" while avoiding XSS attacks is very, very hard. This is because there are endless possibilities of smuggling JavaScript into HTML. If you decided to do this, the safe way is to use BBCode or Markdown, i.e. a limited set of non-HTML markup that you then convert to HTML, while removing all real HTML with htmlspecialchars(). Even then you have to be careful not to allow javascript: URLs in links. Actually allowing users to input HTML is something you should only do if it's absolutely crucial for your site. And then you should spend a lot of time making sure you understand HTML and JavaScript and CSS completely.

The answer to this post is a good answer
Basically, using the pdo interface to parameterize your queries is much safer and less error prone than escaping your inputs manually.

I have a tendency to escape all characters that would be problematic in page display, Javascript and SQL all at the same time. It leaves it readable on the web and in HTML eMail and at the same time removes any problems with the code.
A vb.NET Line Of Code Would Be:
SafeComment = Replace( _
Replace(Replace(Replace( _
Replace(Replace(Replace( _
Replace(Replace(Replace( _
Replace(Replace(Replace( _
HttpUtility.HtmlEncode(Trim(strInput)), _
":", ":"), "-", "-"), "|", "|"), _
"`", "`"), "(", "("), ")", ")"), _
"%", "%"), "^", "^"), """", """), _
"/", "/"), "*", "*"), "\", "\"), _
"'", "'")

First of all, general advice: don't escape variables literally when inserting in the database. There are plenty of solutions that let you use prepared statements with variable binding. The reason to not do this explicitly is because it is only a matter of time then before you forget it just once.
If you're inserting plain text in the database, don't try to clean it on insert, but instead clean it on display. That is to say, use htmlentities to encode it as HTML (and pass the correct charset argument). You want to encode on display because then you're no longer trusting that the database contents are correct, which isn't necessarily a given.
If you're dealing with rich text (html), things get more complicated. Removing the "evil" bits from HTML without destroying the message is a difficult problem. Realistically speaking, you'll have to resort to a standardized solution, like HTMLPurifier. However, this is generally too slow to run on every page view, so you'll be forced to do this when writing to the database. You'll also have to ensure that the user can see their "cleaned up" html and correct the cleaned up version.
Definitely try to avoid "rolling your own" filter or encoding solution at any step. These problems are notoriously tricky, and you run a large risk of overlooking some minor detail that has big security implications.

I second Joeri, do not roll your own, go here to see some of the the many possible XSS attacks
http://ha.ckers.org/xss.html
htmlentities() -> turns text into html, converting characters to entities. If using UTF-8 encoding then use htmlspecialchars() instead as the other entities are not needed. This is the best defence against XSS. I use it on every variable I output regardless of type or origin unless I intend it to be html. There is only a tiny performance cost and it is easier than trying to work out what needs escaping and what doesn't.
strip_tags() - turns html into text by removing all html tags. Use this to ensure that there is nothing nasty in your input as a adjunct to escaping your output.
mysql_real_escape_string() - escapes a string for mysql and is your defence against SQL injections from little Bobby tables (better to use mysqli and prepare/bind as escaping is then done for you and you can avoid lots of messy string concatenations)
The advice given obve re avoiding HTML input unless it is essential and opting for BBCode or similar (make your own up if needs be) is very sound indeed.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.