HTML Data exceeds field length after being hex-sanitized

The problem is you can't tell the user how many characters are allowed in the field because the escaped value has more characters than the unescaped one.
I see a few solutions, but none looks very good:
One whitelist for each field (too much work and doesn't quite solve the problem)
One blacklist for each field (same as above)
Use a field length that could hold the data even if all characters are escaped (bad)
Uncap the size for the database field (worse)
Save the data hex-unescaped and pass the responsibility entirely to output filtering (not very good)
Let the user guess the maximum size (worst)
Are there other options? Is there a "best practice" for this case?
Sample code:
$string = 'javascript:alert("hello!");';
echo strlen($string);
// outputs 27
$escaped_string = filter_var('javascript:alert("hello!");', FILTER_SANITIZE_ENCODED);
echo strlen($escaped_string);
// outputs 41
If the length of the database field is, say, 40, the escaped data will not fit.

Don't build your application around the database - build the database for the application!
Design how you want the interface to work for the user first, work out the longest acceptable field length, and use that.
In general, don't escape before storing in the database - store raw data in the database and format it for display.
If something is going to be output many times, then store the processed version.
Remember disk space is relatively cheap - don't waste effort trying to make your database compact.
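A minimal sketch of the "store raw, escape on output" idea, assuming a user-visible limit of 40 characters and the mbstring extension. MAX_COMMENT_CHARS, isValidComment() and renderComment() are hypothetical names, not anything from the question:

```php
<?php
// Hypothetical limit shown to the user; the column would be sized to match it.
const MAX_COMMENT_CHARS = 40;

// Validate the raw, unescaped input: the user-visible limit applies here.
function isValidComment(string $raw): bool
{
    return mb_check_encoding($raw, 'UTF-8')
        && mb_strlen($raw, 'UTF-8') <= MAX_COMMENT_CHARS;
}

// Escape only when rendering into HTML; the stored value stays raw.
function renderComment(string $raw): string
{
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

$input = 'javascript:alert("hello!");';
var_dump(isValidComment($input));   // 27 characters, within the limit
echo renderComment($input), "\n";   // quotes are encoded only in the output
```

With this split the number you tell the user and the column size are the same number, because escaping never touches what gets stored.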

making some wild assumptions about the context here:
if the field can hold 32 characters, that is 32 unescaped characters
let the user enter 32 characters
escape/unescape is not the user's problem
why is this an issue?
if this is form data-entry it won't matter, and
if you are for some reason escaping the data and passing it back then unescape it before storage
without further context, it looks like you are fighting a problem that doesn't really exist, or that doesn't need to exist

This is an interesting problem.
I think any solution that pushes responsibility onto the users because of the sanitization will be a problem. If they are responsible for guessing the maximum length, they may well give up and pick something else, without understanding why their input was invalid.
Here's my idea: make the database field 150% of the size of the input. The extra space serves as "padding" for the characters added by hex-sanitization, while the maximum size shown to the user and enforced by the validator is the actual desired size. If you check the input length before sanitization and it is within that limit (about 67% of the field), the sanitized data should usually fit. If the escaped data still overflows the remaining third reserved as a buffer, the input probably should not be accepted.
The only trouble is that your database tables will be larger. If you want to avoid this, well, you could always escape only the SQL sensitive characters and handle everything else on output.
Edit: Given your example, I think you're escaping far too much. Either use a smaller range of sanitization with HTMLSpecialChars() on output, or make your database fields as much as 200% of their present size. That's just bloated if you ask me.
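To get a feel for how much padding would really be needed, here is a rough sketch comparing the raw length, the escaped length and the worst case for FILTER_SANITIZE_ENCODED. It assumes the worst case is every byte turning into a three-character %XX sequence; worstCaseEncodedLength() is a hypothetical helper:

```php
<?php
// Worst case for URL-style encoding: every byte becomes "%XX" (3 characters).
function worstCaseEncodedLength(int $rawBytes): int
{
    return $rawBytes * 3;
}

$raw     = 'javascript:alert("hello!");';
$escaped = filter_var($raw, FILTER_SANITIZE_ENCODED);

printf(
    "raw: %d, escaped: %d, worst case: %d\n",
    strlen($raw),
    strlen($escaped),
    worstCaseEncodedLength(strlen($raw))
);

// A string made only of characters that get escaped hits the worst case.
$allSpecial = str_repeat('"', 10);
var_dump(strlen(filter_var($allSpecial, FILTER_SANITIZE_ENCODED)));
```

So a fixed percentage of padding is a heuristic rather than a guarantee: the sample string grows from 27 to 41 characters, but a string of nothing but quotes triples in size.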

Why are you allowing users to type in escaped characters?
If you do need to allow explicitly escaped characters, then interpolate the escaped characters before sanity-checking the input.
You should pretty much never do any significant work on a string while it is still encoded. Decode it first, then do your work.
I find some people have a tendency to use escaping functions like addslashes() too early, or to decode stuff (like removing HTML entities) too late. Decode first, do your stuff, then apply any encoding you need to store/output/etc.
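The "decode first, do your stuff, then encode" order could look roughly like this. It is only a sketch: normalizeField() is a hypothetical helper, and it assumes the incoming value is URL-encoded and that mbstring is available:

```php
<?php
// Hypothetical helper: decode, validate the decoded text, encode later.
function normalizeField(string $urlEncoded, int $maxChars): ?string
{
    $decoded = rawurldecode($urlEncoded);            // 1. decode first
    if (mb_strlen($decoded, 'UTF-8') > $maxChars) {  // 2. work on real characters
        return null;                                 //    too long: reject
    }
    return $decoded;                                 // 3. encode only at store/output time
}

var_dump(normalizeField('hello%21', 6));  // accepted: the decoded text has 6 characters
var_dump(normalizeField('hello%21', 5));  // rejected: NULL
```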

Related

Is it worth it to check the length of the input (too large) before querying the database or the db takes care of it?

I have a dynamic PHP web app which gets input params in the URL (no surprise here). However, bingbot sometimes requests extremely long URLs from the site, e.g. URLs more than 10000 characters long. One of the inputs is a UTF-8 name, and bingbot somehow submits sketchy names thousands of characters long, like this: \xc2\x83\xc3\x86... (goes on for thousands of characters).
Obviously, it gets a 404, because there is no such name in the database (and therefore no such page), but I wondered whether it's worth checking the input length before querying the DB (e.g. a name cannot be more than 100 characters long) and returning a 404 instantly if it's too long. Is that standard practice? Or is it not worth the trouble, because the DB handles it?
I'm thinking of not putting extra load on the DB unnecessarily. Is this long input submitted as is by the DB client interface (two calls: first a prepare for sanitizing the input and then the actual query), or does the PHP DB client know the column size and truncate the input string before sending it down the wire?

Not only is what you're asking more than legitimate, I'd say it's something you should be doing as part of input filtering/validation. If you expect your input to always be shorter than 100 characters, everything longer should be filtered out.
Also, it appears that you're getting UTF-8 strings: if you're not expecting them, you could simply filter out all characters that are not part of the standard ASCII set (or even a reduced set, with all control characters filtered away). For example: $string = filter_var($input, FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);
This is not just a matter of DB performance, but also security!
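As a sketch of the length check before the query, assuming the 100-character limit from the question and the mbstring extension (MAX_NAME_CHARS and isPlausibleName() are hypothetical names):

```php
<?php
// Hypothetical limit matching the longest name the application accepts.
const MAX_NAME_CHARS = 100;

// Returns true only if the name is worth looking up in the database.
function isPlausibleName(string $name): bool
{
    // Cheap byte-length check first: a UTF-8 character is at most 4 bytes,
    // so anything above 4 * MAX_NAME_CHARS bytes cannot be within the limit.
    if ($name === '' || strlen($name) > 4 * MAX_NAME_CHARS) {
        return false;
    }
    return mb_check_encoding($name, 'UTF-8')
        && mb_strlen($name, 'UTF-8') <= MAX_NAME_CHARS;
}

$name = str_repeat("\xC2\x83\xC3\x86", 3000);  // shaped like the bot input
if (!isPlausibleName($name)) {
    // In a real page this would be: http_response_code(404); exit;
    echo "404 without touching the database\n";
}
```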
PS: I highly doubt that bot is actually Bing. It seems like a bot trying to hack your website.
Addendum: some suggestions about input validation
As I wrote above in some comments (and as others have written too), you should always validate every input. No matter what it is or where it comes from: if it comes from outside, it has to be validated.
The general idea is to validate your input according to what you're expecting. Below, $input stands for any input variable: anything coming from $_GET, $_POST, $_COOKIE, from external APIs, and from many $_SERVER variables as well, plus anything else that could be altered by a user. Use your judgement and, when in doubt, be overly cautious.
If you're requesting an integer or float number, then it's easy: just cast the input to (int) or (float)
$filtered = (int)$input;
$filtered = (float)$input;
If you're requesting a string, then it's more complicated. You should think about what kind of string you are requesting, and filter it accordingly. For example:
If you're expecting a string like a hexadecimal id (like some databases use), then filter all characters outside the 0-9A-Fa-f range: $filtered = preg_replace('/[^0-9A-Fa-f]/', '', $input);
If you're expecting an alphanumeric ID, filter it, removing all characters that are not part of that ASCII range, for example: $filtered = preg_replace('/[^0-9A-Za-z]/', '', $input);. This one removes all control characters too.
If you're expecting your input to be Unicode UTF-8, validate it. For example, see this function: https://stackoverflow.com/a/1523574/192024
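Two of the string cases above as a small sketch. filterHexId() and validateUtf8() are hypothetical names, and the UTF-8 check assumes mbstring's mb_check_encoding() rather than the function from the linked answer:

```php
<?php
// Keep only hexadecimal digits (same pattern as in the list above).
function filterHexId(string $input): string
{
    return preg_replace('/[^0-9A-Fa-f]/', '', $input);
}

// Accept the string only if it is well-formed UTF-8; otherwise return null.
function validateUtf8(string $input): ?string
{
    return mb_check_encoding($input, 'UTF-8') ? $input : null;
}

var_dump(filterHexId('12ab-zz-34'));    // non-hex characters are dropped
var_dump(validateUtf8('Zoë'));          // valid UTF-8 passes through unchanged
var_dump(validateUtf8("\xC3\x28"));     // invalid byte sequence: NULL
```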
In addition to this:
Always encode HTML tags. FILTER_SANITIZE_FULL_SPECIAL_CHARS will do that as well on filter_var. If you don't do that, you risk XSS (Cross-Site Scripting) attacks.
If you want to remove control characters and encode HTML entities but without removing the newline characters (\n and \r), then you can use: $filtered = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/u', '', htmlspecialchars($input, ENT_COMPAT, 'UTF-8'));
And much more. Use your judgement always.
PS: My approach to input filtering is to prefer sanitization. That is, remove everything "dangerous" and accept the sanitized input as if that was what the user wrote. Others will argue instead that input should only be accepted or refused.
Personally, I prefer the "sanitize and use" approach for web applications, as your users still may want to see something more than an error web page; on desktop/mobile apps I go with the "accept or refuse" method instead.
However, that's just a matter of personal preference, backed only by what my gut tells me about UX. You're free to follow the approach you prefer.

There should be some sort of validation done on any data before it is used in a query. If you have a limit on the length of the name, you can use it as part of the validation when checking the input. If the input is over the limit, it can't be in the database, so handle it accordingly, whether with a 404 or with a page that displays an error message.
The load will go down if you bypass queries because a name is too long. How much it goes down depends on how you query the database (LIKE or MATCH AGAINST) and how your indexes are set up.

Zend\Escaper: escape data before inserting into database

I'm adding some XSS protection to the website I'm working on. The platform is Zend Framework 2 and therefore I'm using Zend\Escaper. From the Zend documentation I learned that:
Zend\Escaper is meant to be used only for escaping data that is to be
output, and as such should not be misused for filtering input data.
For such tasks, the Zend\Filter component, HTMLPurifier.
But what are the risks if I escape the data before inserting it into the database? Am I so wrong to do that? Please explain, as I'm somewhat new to this topic.
thanks

If you encode data before storing it, you will have to decode it before you can do anything sensible with it, and then encode it again before outputting it. That's why I wouldn't do it.
Let's say you have an international application and you want to store the escaped value of a form field that might contain non-ASCII characters, which might get escaped into HTML entities. What if you have to quantify the content of that field, such as counting the characters? You will always have to de-escape the content before counting it, and then re-escape it again. Much work done but nothing gained.
The same applies to search operations in your database. You will have to escape the search phrase the same way as your input so the database understands what you are looking for.
I'd use one character set throughout the application and database (I prefer UTF-8; beware of the MySQL connection...) and only escape content on output. That way I can do whatever I like with the data and am on the safe side on output. Escaping is done automatically in my view layer, so I don't have to think about it every time I handle data. That way you can't forget it.
That does not prevent me from filtering and sanitizing the input. And it doesn't prevent me from escaping the database-content using the appropriate database-escaping mechanisms like mysqli_real_escape_string or similar or using prepared statements!
But that's just my opinion, others might think otherwise!
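To illustrate the counting and searching problem, here is a small sketch using htmlentities() as a stand-in for whatever escaper would be applied before storage (that choice is an assumption, the same effect shows up with other escapers):

```php
<?php
$raw    = 'Müller';
// What an "escape before storing" design would put into the database.
$stored = htmlentities($raw, ENT_QUOTES, 'UTF-8');

var_dump($stored);                    // 'M&uuml;ller'
var_dump(mb_strlen($raw, 'UTF-8'));   // 6 characters as the user typed them
var_dump(strlen($stored));            // 11 characters in the stored form
var_dump(strpos($stored, 'ü'));       // false: a raw search term no longer matches
```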

"Output" here refers to the web page. A form field (an <input> HTML tag) is an INPUT (from the web page); any text is an OUTPUT (to the web page). You need to ensure that any output (to the web page) does not contain dangerous characters that could be used to build XSS attack vectors.
This said, if you have DANGEROUS_INPUT_X given by the user and then
$NOT_DANGEROUS_ANYMORE = ZED.HtmlPurifier(DANGEROUS_INPUT_X)
DBSave($NOT_DANGEROUS_ANYMORE)
and somewhere else
$OUTPUT = DBLoad($NOT_DANGEROUS_ANYMORE)
echo $OUTPUT
you should be fine, as long as you do not apply any additional encoding/decoding to this output. It will be displayed the way it was saved, which was safe.
I would suggest looking at output encoding more than validation: HtmlPurifier cleans the HTML, whereas you could accept any kind of bad characters if you ensure your output is encoded in the page.
Some general rules are at https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet and here is the PHP example:
echo htmlspecialchars($DANGEROUS_INPUT_X_NOW_OUTPUT, ENT_QUOTES, "UTF-8");
Remember to set the Character Set and be consistent with the same one throughout your pages/scripts/binaries and in the database as well.

Should I check length of Name on Registration page?

I have more of a general question, not exactly about my code. I was adding validation to the registration code for my site (checking length and stripping out illegal characters). If I ask for the person's first and last name, should I check the length of the name fields?
And if so, what would be good minimum and maximum lengths for names? I was thinking 3-20 characters, but I really don't want to limit names if someone really does have a name longer than 20 characters. Any feedback would be great, thanks :)

I find VARCHAR(32) sufficient for first-name and last-name fields. The only thing I would suggest doing to them is a run through htmlspecialchars() before inserting, because some people legitimately have "strange" accented or non-ASCII characters in their names that most people don't take into account with checks like preg_match('/[a-z]+/i', $name);. A combination of parameterized queries [you're using those, right?] and htmlspecialchars() should protect you from first- and second-order injection attacks.
You can use iconv() to transliterate/mangle their names into ASCII, but that's hit and miss, plus a pain in the ass, plus some people might take offense at that.
On the other hand if you're worried about space-efficiency with your field length you should know that the VAR in VARCHAR means 'variable length' and a VARCHAR(32) field containing 'Ted' will only take up 4 bytes of storage.
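On the point about accented names and ASCII-only checks, a Unicode-aware pattern is one possible alternative. This is just a sketch: isValidName() is a hypothetical helper, and the allowed character set (letters, combining marks, spaces, apostrophes, dots, hyphens) and the 1-32 length are assumptions you would adjust:

```php
<?php
// Hypothetical validator: 1 to 32 characters of letters, combining marks,
// spaces, apostrophes, dots and hyphens. The /u modifier makes the pattern
// work on UTF-8 characters instead of bytes.
function isValidName(string $name): bool
{
    return preg_match('/^[\p{L}\p{M}\' .-]{1,32}\z/u', $name) === 1;
}

var_dump(isValidName('José'));                  // accented letter is accepted
var_dump(isValidName("O'Connor"));              // apostrophe is accepted
var_dump(isValidName('<script>'));              // markup characters are rejected
var_dump(preg_match('/^[a-z]+$/i', 'José'));    // an anchored ASCII-only check rejects it
```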

real_escape_string not cleaning up entered text

I thought the proper way to "sanitize" incoming data from an HTML form before entering it into a MySQL database was to use real_escape_string on it in the PHP script, like this:
$newsStoryHeadline = $_POST['newsStoryHeadline'];
$newsStoryHeadline = $mysqli->real_escape_string($newsStoryHeadline);
$storyDate = $_POST['storyDate'];
$storyDate = $mysqli->real_escape_string($storyDate);
$storySource = $_POST['storySource'];
$storySource = $mysqli->real_escape_string($storySource);
// etc.
And once that's done you could just insert the data to the DB like this:
$mysqli->query("INSERT INTO NewsStoriesTable (Headline, Date, DateAdded, Source, StoryCopy) VALUES ('".$newsStoryHeadline."', '".$storyDate."', '".$dateAdded."', '".$storySource."', '".$storyText."')");
So I thought doing this would take care of cleaning up all the invisible "junk" characters that may be coming in with your submitted text.
However, I just pasted some text I copied from a web-page into my HTML form, clicked "submit" - which ran the above script and inserted that text into my DB - but when I read that text back from the DB, I discovered that this piece of text did still have junk characters in it, such as â€“.
And those junk characters of course caused the PHP script I wrote that retrieves the information from the DB to crash.
So what am I doing wrong?
Is using real_escape_string not the way to go here? Or should I be using it in conjunction with something else?
Or is there something I should be doing (like more escaping) when reading data back out from the MySQL database?
(I should mention that I'm an Objective-C developer, not a PHP/MySQL developer, but I've unfortunately been given this task to do some DB stuff, hence my question...)
thanks!

Your assumption is wrong. mysqli_real_escape_string’s only intention is to escape certain characters so that the resulting string can be safely used in a MySQL string literal. That’s it, nothing more, nothing less.
The result should be that exactly the passed data is retained, including ‘junk’. If you don’t want that ‘junk’ in your database, you need to detect, validate, or filter it before passing it to MySQL.
In your case, the ‘junk’ seems to be due to different character encodings: your input data seems to be encoded as UTF-8 while it’s later displayed using Windows-1250. In this scenario, the character – (U+2013) would be encoded as 0xE28093 in UTF-8, which would represent the three characters â, €, and “ in Windows-1250. Properly declaring the document’s encoding would probably fix this.
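The effect can be reproduced in a few lines. This sketch assumes the mbstring extension and uses Windows-1252 as the single-byte code page, since that is an encoding name mbstring is known to accept; it is an assumption standing in for the code page named above:

```php
<?php
$dash = "\u{2013}";               // EN DASH, stored as UTF-8
echo bin2hex($dash), "\n";        // e28093: three bytes

// Misread those three bytes as a single-byte Windows code page and
// convert the result back to UTF-8 so it can be printed.
$misread = mb_convert_encoding($dash, 'UTF-8', 'Windows-1252');
echo $misread, "\n";              // three separate characters instead of one dash
```

Nothing in the data is damaged here, it is only being interpreted with the wrong encoding. The usual counterpart in an application is to declare the same encoding everywhere, for example with mysqli's set_charset() on the connection and a charset declaration in the HTML page. Whether that is the actual cause in this case is not confirmed by the question.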

Sanitization is a tricky subject, because it means different things depending on the context. :)
real_escape_string just makes sure your data can be included in a query (inside quotes, of course) without being able to change the "meaning" of the query.
The manual page explains what the function really does: it escapes NUL characters, line feeds, carriage returns, backslashes, single quotes, double quotes, and "Control-Z" (probably the SUBSTITUTE character). So it just inserts a backslash before those characters.
That's it. It "sanitizes" the string so it can be passed unchanged in a query. But it doesn't sanitize it from any other point of view: users can still pass, for instance, HTML markup or "strange" characters. You need to make rules depending on what your output format is (most of the time HTML, but HTTP isn't restricted to HTML documents) and what you want to let your users do.
If your code can't handle some characters, or if they have a special meaning in the output format, or if they cause your output to appear "corrupted" in some way, you need to escape or remove them yourself.
You will probably be interested in htmlspecialchars. Control characters generally aren't a problem with HTML. If your output encoding is the same as your input encoding, they won't be displayed and thus won't be an issue for your users (well, maybe for the W3C validator). If you think it is, make your own function to check and remove them.

rawurlencode for storing data

I have always used rawurlencode to store user-entered data in my MySQL databases. The main reason is that I find it makes storing foreign characters very simple. I then use rawurldecode to retrieve and display the data.
I read somewhere that rawurlencode was not meant for this purpose. Are there any disadvantages to what I'm doing?
So let's say I have a German address with many characters like umlauts. What is the simplest way to store this in a MySQL database with no risk of it coming out wrong, while keeping it searchable by a search script? So far rawurlencode has been excellent for our system. Perhaps the practice can be improved by encoding only foreign letters and not common characters like spaces, which I totally agree is a waste of space.

Sure there are.
Let's start with the practical: for a large class of characters you are spending 3 bytes of storage for every byte of data. The description of rawurlencode (and of course the RFC) says that those characters are
all non-alphanumeric characters except -_.~
This means that there is a total of 26 + 26 + 10 (alphanumeric) + 4 (special exceptions) = 66 characters for which you do not waste space.
Then there are also the logical drawbacks: You are not storing the data itself, but rather a representation of the data tailored to URLs. Unless the data itself is URLs, that's not what you should be doing.
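The storage overhead is easy to see on a short German sample (the sample string is just an illustration):

```php
<?php
$raw     = 'Straße 5';
$encoded = rawurlencode($raw);

echo $encoded, "\n";   // Stra%C3%9Fe%205
printf("%d bytes raw, %d bytes encoded\n", strlen($raw), strlen($encoded));
```

The two bytes of ß and the single space each triple in size, so 9 bytes of UTF-8 become 15 bytes of stored text.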

Drawbacks I can think of:
Waste of disk space.
Waste of CPU cycles encoding and decoding on every read and every write.
Additional complexity (you can't even inspect data with a MySQL client).
Impossibility to use full text searches.
URL encoding is not necessarily unique (there're at least two RFCs). It may not lead to data loss but it can lead to duplicate data (e.g., unique indexes where two rows actually contain the same piece of data).
You can accidentally encode a non-string piece of data such as a date: 2012-04-20%2013%3A23%3A00
But the main consideration is that such technique is completely arbitrary and unnecessary since MySQL doesn't have the least problem storing the complete Unicode catalogue. You could also decide to swap e's and o's in all strings: Holle, werdl!. Your app would run fine but it would not provide any added value.
Update: As Your Common Sense points out, a SQL clause as basic as ORDER BY is no longer usable. It's not that international chars will be ignored; you'll basically get an arbitrary sort order based on the ASCII code of the % and hexadecimal characters. If you can't SELECT * FROM city ORDER BY city_name reliably, you've rendered your DB useless.
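The sorting problem can be simulated in PHP without a database. This is a sketch with made-up city names, using a plain byte-order sort as a stand-in for what ORDER BY sees when the column holds URL-encoded text:

```php
<?php
$cities = ['Zagreb', 'Zürich', 'Zwolle'];

// Simulate the stored form and sort it the way the encoded bytes compare.
$encoded = array_map('rawurlencode', $cities);
sort($encoded, SORT_STRING);

// Decode again to see the order the application would display.
$asStored = array_map('rawurldecode', $encoded);
print_r($asStored);   // Zürich comes first, because '%' sorts before any letter
```

A proper collation on unencoded text would typically place Zürich between Zagreb and Zwolle instead.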

I am using a fork to eat a soup
I am using money bills to fire the coals for BBQ
I am using a kettle to boil eggs.
I am using a microscope to hammer the nails.
Are there any disadvantages to what I'm doing?
YES
You are using a tool for something it was not made for. That is always a disadvantage.
A sane person always uses a tool intended for the job at hand, not a randomly picked one, especially when there is no shortage of the right tools.
URL encoding is not intended to be used with a database, as one can tell from the name. That alone is reason enough for a sane developer. Take a look around and find the proper tool.
There is a thing called "common sense", widely used in everyday life but for some reason always absent in the PHP world.
Common sense warns us: if we use the wrong tool, it may spoil the work, and sooner or later it will. There is no need to ask for specific details; it's a general rule, one we learn at about the age of 5.
Why not use it when playing with web things too?
Why not ask yourself a question:
What's wrong with storing foreign characters at all?
urlencode makes storing foreign characters very simple
Any hardships you encountered without urlencode?
Although I feel that common sense should be enough to answer the question, people always look for the "omen", the proof. Here you are:
A database's job is not limited to just storing and retrieving data. A plain text file can handle such a primitive task as well.
Data manipulation is what we use databases for.
The most widely used operations are sorting and filtering.
A database is intelligent enough to sort and filter data case-insensitively, which is a very handy feature. But of course that can only be done if characters are saved as is, not as some random codes.
Sorting text may also use an ordering other than plain binary order in the character table. Some umlaut characters may sit in other parts of the table, but the database collation will put them in the right place. Again, that can only be done if characters are saved as is, not as some random codes.
Sometimes we have to manipulate data that is already stored in the database. Say, cut a piece from a string and compare it with an entered value. How is that supposed to be done with urlencoded data?
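Both the cutting and the case-insensitive comparing can be tried out in PHP as a rough stand-in for what the database would have to do. The sketch assumes the mbstring extension and a made-up sample name:

```php
<?php
$raw     = 'Müller';
$encoded = rawurlencode($raw);                  // M%C3%BCller

// Cutting the first two characters.
echo mb_substr($raw, 0, 2, 'UTF-8'), "\n";      // works on real characters
echo substr($encoded, 0, 2), "\n";              // cuts an escape sequence in half

// Case-insensitive comparison.
var_dump(mb_strtolower('MÜLLER', 'UTF-8') === mb_strtolower('müller', 'UTF-8'));
var_dump(strcasecmp(rawurlencode('MÜLLER'), rawurlencode('müller')) === 0);
```

On the raw text both operations behave as expected; on the encoded text the substring is a broken fragment and the two spellings no longer compare as equal.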
