rawurlencode for storing data - php

I have always used rawurlencode to store user-entered data in my MySQL databases. The main reason is that I find it makes storing foreign characters very simple. I then use rawurldecode to retrieve and display the data.
I read somewhere that rawurlencode was not meant for this purpose. Are there any disadvantages to what I'm doing?
So let's say I have a German address with many characters like umlauts. What is the simplest way to store this in a MySQL database with no risk of it coming out wrong, while keeping it searchable by a search script? So far rawurlencode has been excellent for our system. Perhaps the practice can be improved by encoding only foreign letters and not common characters like spaces, which I totally agree is a waste of space.

Sure there are.
Let's start with the practical: for a large class of characters you are spending 3 bytes of storage for every byte of data. The description of rawurlencode (and of course the RFC) says that those characters are
all non-alphanumeric characters except -_.~
This means that there is a total of 26 + 26 + 10 (alphanumeric) + 4 (special exceptions) = 66 characters for which you do not waste space.
Then there are also the logical drawbacks: You are not storing the data itself, but rather a representation of the data tailored to URLs. Unless the data itself is URLs, that's not what you should be doing.
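To put a number on the storage overhead, here is a minimal sketch. The address is a made-up sample, and the "\u{...}" escapes (PHP 7+) are only there so the result does not depend on how the file itself is saved.

```php
<?php
// Hypothetical sample value: "Müllerstraße 5" as UTF-8.
$address = "M\u{00FC}llerstra\u{00DF}e 5";

$encoded = rawurlencode($address);

$rawBytes     = strlen($address);   // bytes stored for the raw UTF-8 text
$encodedBytes = strlen($encoded);   // bytes stored for the URL-encoded form

echo $encoded, PHP_EOL;
echo "raw: $rawBytes bytes, encoded: $encodedBytes bytes", PHP_EOL;
```

Every byte outside the 66 unreserved characters, including each byte of a multibyte letter and the plain space, is stored as three ASCII characters.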

Drawbacks I can think of:
Waste of disk space.
Waste of CPU cycles encoding and decoding on every read and every write.
Additional complexity (you can't even inspect data with a MySQL client).
Impossibility to use full text searches.
URL encoding is not necessarily unique (there are at least two RFCs). It may not lead to data loss, but it can lead to duplicate data (e.g., unique indexes where two rows actually contain the same piece of data).
You can accidentally encode a non-string piece of data such as a date: 2012-04-20%2013%3A23%3A00
But the main consideration is that such a technique is completely arbitrary and unnecessary, since MySQL doesn't have the least problem storing the complete Unicode catalogue. You could also decide to swap e's and o's in all strings: Holle, werdl!. Your app would run fine but it would not provide any added value.
Update: As Your Common Sense points out, a SQL clause as basic as ORDER BY is no longer usable. It's not that international chars will be ignored; you'll basically get an arbitrary sort order based on the ASCII code of the % and hexadecimal characters. If you can't SELECT * FROM city ORDER BY city_name reliably, you've rendered your DB useless.
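Two of the drawbacks above can be reproduced without a database. This is only a sketch with made-up city names: PHP's byte-wise sort stands in for what ORDER BY can do with the encoded column, and the two built-in encoders stand in for the competing URL-encoding conventions.

```php
<?php
// Hypothetical city names; the escape keeps the file encoding-independent ("Öhringen").
$cities = ["Zwickau", "\u{00D6}hringen", "Aachen"];

// What the table would contain if every value went through rawurlencode().
$stored = array_map('rawurlencode', $cities);

// Byte-wise sort: all that is possible once the text is opaque ASCII codes.
sort($stored, SORT_STRING);
$sortedNames = array_map('rawurldecode', $stored);
print_r($sortedNames);

// The two PHP URL encoders disagree on a space, so one value has two stored forms.
$a = rawurlencode("New York");
$b = urlencode("New York");
var_dump($a === $b);
```

The encoded "Ö" starts with %, which sorts before every letter, so the name jumps to the front of the list instead of landing wherever the column's collation would place it.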

I am using a fork to eat a soup
I am using money bills to fire the coals for BBQ
I am using a kettle to boil eggs.
I am using a microscope to hammer the nails.
Are there any disadvantages to what I'm doing?
YES
You are using a tool for a purpose it was not made for. This is always a disadvantage.
A sane human being always uses the tool intended for the job at hand, not some randomly picked one. Especially when there is no shortage of the right tools.
URL encoding is not intended for use with a database, as one can tell from the name. That alone is reason enough for a sane developer. Take a look around and find the proper tool.
There is a thing called "common sense", widely used in regular life but for some reason always absent in the PHP world.
Common sense can warn us: if we use the wrong tool, it may spoil the work. Sooner or later it will. There is no need to ask for specific details; it's a general rule. We learn this rule at about the age of 5.
Why not use it while playing with some web thingies too?
Why not ask yourself a question:
What's wrong with storing foreign characters at all?
urlencode makes storing foreign characters very simple
Any hardships you encountered without urlencode?
Although I feel that common sense should be enough to answer the question, people always look for the "omen", the proof. Here you are:
A database's job is not limited to storing and retrieving data. A plain text file can handle such a primitive task as well.
Data manipulation is what we use databases for.
The most widely used manipulations are sorting and filtering.
Something as intelligent as a database can sort and filter data case-insensitively, which is a very handy feature. But of course that can be done only if the characters are saved as is, not as some random codes.
Sorting text may also use an order other than the plain binary order of the character table. Some umlaut characters may sit in other parts of the table, but the database collation will put them in the right place. Again, this works only if the characters are saved as is, not as some random codes.
Sometimes we have to manipulate data that is already stored in the database: say, cut a piece out of a string and compare it with an entered value. How is that supposed to be done with URL-encoded data?
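The last point can be tried in plain PHP. In this rough sketch, strpos() and substr() stand in for the filtering and substring operations a database would run on the column, and the name is a made-up sample.

```php
<?php
$name    = "M\u{00FC}ller";       // "Müller" as UTF-8
$encoded = rawurlencode($name);   // the form that would sit in the column

// Searching the encoded value for the text a user typed finds nothing...
$foundInEncoded = strpos($encoded, "M\u{00FC}") !== false;
// ...while the raw value can be searched directly.
$foundInRaw = strpos($name, "M\u{00FC}") !== false;

// Cutting the encoded string by position can split an escape sequence in half.
$cut = substr($encoded, 0, 3);

var_dump($foundInEncoded, $foundInRaw, $cut);
```

Every search term, prefix, and offset would have to be translated into the encoded form first, which is exactly the work the database is supposed to do for you.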

Related

Cassandra columns order by UTF8 encoded string

I'm not sure whether this question is specific to Cassandra or also belongs to PHP, so I'm sorry for tagging PHP.
So basically I'm ordering the columns of some long rows by their column names, which look like this:
2012-01-01_aa_99999 | 2012-01-01_aaa | 2012-01-12_aaaaa
This works the way I want it to, but I don't understand how it actually orders those strings.
What is not clear to me is that the first string, 2012-01-01_aa_99999, seems to be way bigger than the other two, and I'm concerned that at some point it might ignore the first part of the string, which is a date, and put some strings where they don't belong.
In my case those strings consist of quite a few parts, so I'm really concerned about this. Basically I need an explanation of how this ordering happens internally.
I don't understand how it actually orders those strings.
The strings you provided appear to be lexicographically ordered.
I had the same question as I want to construct a composite primary key index with well-understood sorting abilities. It turns out Cassandra appears to compare UTF-8 strings using a byte-by-byte binary comparison... this is indeed a completely broken sort function from a logical perspective. If you had mixed ASCII and Kanji characters in your string, for example, your sort order would follow raw byte values (that is, Unicode code point order) rather than any linguistic collation. However, as long as this sort order is known, you can design your usage patterns around it.
This could be easily fixed, of course, and it would be nearly a single-line change of code to patch in a "real" sort function. This would require a bit extra CPU time, of course.
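The byte-by-byte comparison described above can be imitated in PHP with strcmp(). This is only an illustration of the ordering rule applied to the column names from the question, not Cassandra's actual comparator code.

```php
<?php
$columns = ['2012-01-12_aaaaa', '2012-01-01_aaa', '2012-01-01_aa_99999'];

// strcmp() compares byte by byte, like the comparison the answer describes.
usort($columns, 'strcmp');
print_r($columns);

// The first differing byte decides: '_' (0x5F) sorts before 'a' (0x61).
// Length only matters when one string is a prefix of the other.
$longFirst = strcmp('2012-01-01_aa_99999', '2012-01-01_aaa') < 0;
var_dump($longFirst);
```

Because the date prefix comes first and has a fixed width, it always dominates the comparison; the longer string does not "win" just by being longer.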

Sanitising form content in an internationalised scenario

In the good old days when I was a web developer (using PHP), I used to run all submitted form data through a regex before commencing any processing. For most cases, I would allow alphanumerics along with a small set of punctuation characters which would satisfy 99% of people 99% of the time whilst providing a defense against SQL injection and cross site scripting (yes I used PDO prepared statements as well).
More recently I've had to deal with input in an internationalised context, specifically, where the input can be in quite a few different western and eastern European languages as well as Arabic. In these cases, I resorted to removing potentially dangerous characters and letting everything else in. The application had a very small number of users (less than 10) and was only deployed on their internal network so I wasn't overly concerned about the security of the system but I wouldn't be comfortable taking this approach on a publicly accessible website.
In summary, I would like the input to be filtered so that what is left, is "plain text" but I'm not sure how to define the concept of plain text in an internationalised context. Are there any PHP libraries that address this?
Everything is "plain text". Even "' DROP TABLE users --" is plain text. Even "<script>" is just plain text.
What you're worried about are "special characters", i.e. plain text which has special meanings in certain contexts. For that, you need to escape these special characters to "defuse" them in the given context. For HTML, escape them to HTML entities. For SQL, SQL-escape the string (or use prepared statements to avoid this problem in general). For CSV, CSV-escape the values... You get the idea. There are always functions or libraries available which will do this for you; don't try to reinvent the wheel here.
If you want to sanitize, i.e. remove content, you need to define better what you want to remove. Removing content also always runs the risk of removing legitimate content your users may want to use. So it's usually the annoying option.
For more on this topic, see The Great Escapism (Or: What You Need To Know To Work With Text Within Text).
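As a small sketch of "escape for the context", the same untouched input can be prepared for three different targets with built-in functions. The script name search.php is invented for the example, and the SQL case is left out because it needs a live connection (a prepared statement would cover it).

```php
<?php
$input = "<script>alert('hi')</script> & co";

// HTML context: neutralise markup characters when printing into a page.
$forHtml = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// URL context: percent-encode when placing the value in a query string.
$forUrl = 'search.php?q=' . rawurlencode($input);   // search.php is a made-up name

// JSON context: let the encoder quote and escape the value.
$forJson = json_encode(['q' => $input]);

echo $forHtml, PHP_EOL, $forUrl, PHP_EOL, $forJson, PHP_EOL;
```

Nothing is removed from the input in any of the three cases; each context just gets its own representation of the same text.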
Give strip_tags() a try: http://php.net/manual/en/function.strip-tags.php. It has worked for me in most English cases and might work for other languages.

Is it safe to turn a UUID into a short code? (only use first 8 chars)

We use UUIDs for our primary keys in our db (generated by php, stored in mysql). The problem is that when someone wants to edit something or view their profile, they have this huge, scary, ugly uuid string at the end of the url. (edit?id=.....)
Would it be safe (read: still unique) if we only used the first 8 characters, everything before the first hyphen?
If it is NOT safe, is there some way to translate it into something else shorter for use in the url that could be translated back into the hex to use as a lookup? I know that I can base64 encode it to bring it down to 22 characters, but is there something even shorter?
EDIT
I have read this question and it said to use base64. again, anything shorter?
Shortening the UUID increases the probability of a collision. You can do it, but it's a bad idea. Using only 8 characters means just 4 bytes of data, so you'd expect a collision once you have about 2^16 IDs - far from ideal.
Your best option is to take the raw bytes of the UUID (not the hex representation) and encode it using base64. Or, just don't worry much, because I seriously doubt your users care what's in the URL.
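A minimal sketch of that suggestion follows. The helper names uuidToShort and shortToUuid are made up for illustration, the UUID is an arbitrary sample, and the URL-safe alphabet (- and _ instead of + and /) is an extra assumption so the result can go into a query string unescaped.

```php
<?php
// Hypothetical helpers, not part of PHP or any library.
function uuidToShort(string $uuid): string
{
    $bytes = hex2bin(str_replace('-', '', $uuid));   // 32 hex digits -> 16 raw bytes
    // URL-safe alphabet, trailing "==" padding dropped.
    return rtrim(strtr(base64_encode($bytes), '+/', '-_'), '=');
}

function shortToUuid(string $short): string
{
    $b64  = strtr($short, '-_', '+/');
    $b64 .= str_repeat('=', (4 - strlen($b64) % 4) % 4);   // restore the padding
    $hex  = bin2hex(base64_decode($b64));
    return sprintf('%s-%s-%s-%s-%s',
        substr($hex, 0, 8), substr($hex, 8, 4), substr($hex, 12, 4),
        substr($hex, 16, 4), substr($hex, 20, 12));
}

$uuid  = '123e4567-e89b-12d3-a456-426614174000';   // arbitrary example value
$short = uuidToShort($uuid);
echo $short, ' (', strlen($short), ' chars)', PHP_EOL;
echo shortToUuid($short), PHP_EOL;
```

All 128 bits survive the round trip, so uniqueness is unchanged; only the spelling gets shorter.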
Don't cut a single bit out of that UUID: you have no control over the algorithm that produced it, there are multiple possible implementations, and the implementation is subject to change (for example, with the version of PHP you're using).
If you ask me, a UUID in the address bar doesn't look scary or difficult at all. Even a simple Google search for "UUID" produces worse-looking URLs, and everybody's used to looking at Google URLs!
If you want nicer-looking URLs, take a look at the address bar of this stackoverflow.com article. They're using the article ID followed by the title of the question. Only the ID part is relevant; everything else is there to make it easy on the eyes of readers (go ahead and try it: you can delete anything after the ID, or replace it with junk, and it doesn't matter).
It is not safe to truncate UUIDs. Also, they are designed to be globally unique, so you aren't going to have luck shortening them. Your best bet is to either assign each user a unique number, or let users pick a custom (unique) string (like a username or nickname) that can be decoded. So you could have edit?id=.... or edit?name=blah and you then decode name into the UUID in your script.
It depends on how you're generating the UUID - if you're using PHP's uniqid then it's the right-most digits that are more "unique". However, if you're going to truncate the data, then there's no real guarantee that it'll be unique anyway.
Irrespective, I'd say that this is a somewhat sub-optimal approach - is there no way you can use a unique (and ideally meaningful) textual reference string instead of an ID in the query string? (Hard to know without more knowledge of the problem domain, but it's always a better approach in my opinion, even if SEO, etc. isn't a factor.)
If you were using this approach, you could also let MySQL generate the unique IDs, which is probably a considerably more sane approach than attempting to handle this in PHP.
If you're worried about scaring users with the UUID in the URL, why not write it out to a hidden form field instead?

Store html entities in database? Or convert when retrieved?

Quick question, is it a better idea to call htmlentities() (or htmlspecialchars()) before or after inserting data into the database?
Before: The new longer string will cause me to have to change the database to hold longer values in the field. (maxlength="800" could change to a 804 char string)
After: This will require a lot more server processing, and hundreds of calls to htmlspecialchars() could be made on every page load or AJAX load.
SOOO. Will converting when results are retrieved slow my code significantly? Should I change the DB?
I'd recommend storing the most raw form of the data in the database. That gives you the most flexibility when choosing how and where to output that data.
If you find that performance is a problem, you could cache the HTML-formatted version of this data somehow. Remember that premature optimization is a bad thing.
I have no experience of php but generally I always convert or escape nearest to output. You don't know when your output requirements will change, for example you may want to spit out data as XML, or JSON arrays and so escaping for HTML and then storing means you're limited to using the data as HTML alone.
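A short sketch of "escape nearest to output", using a made-up stored value: the raw string is kept as is and each output format applies its own escaping. The last line shows what a non-HTML consumer would receive if the HTML-escaped form had been stored instead.

```php
<?php
// Raw value as kept in the database (invented sample).
$stored = 'Steve & Fred <steve@example.com>';

// Escape at the edge, once per output format.
$html = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
$json = json_encode(['title' => $stored]);

// Had the escaped form been stored, JSON consumers would inherit the entities.
$jsonFromEscaped = json_encode(['title' => $html]);

echo $html, PHP_EOL, $json, PHP_EOL, $jsonFromEscaped, PHP_EOL;
```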
In a php/MySQL web app, data flows in two ways
Database -> scripting language (PHP) -> HTML output -> browser -> screen
and
Keyboard -> browser -> $_POST -> PHP -> SQL statement -> database.
Data is defined as everything provided by the user.
ALWAYS ALWAYS ALWAYS....
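A) process data through mysql_real_escape_string as you move it into an SQL statement (the mysql_* functions were removed in PHP 7, so on current versions use mysqli_real_escape_string or, better, prepared statements), and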
B) process data through htmlspecialchars as you move it into the HTML output.
This will protect you from sql injection attacks, and enable html characters and entities to display properly (unless you manage to forget one place, and then you have opened up a security hole).
Did I mention that this has to be done for every single piece of data any user could ever have touched, altered or provided via a script?
p.s. For performance reasons, use UTF-8 encoding everywhere.
It's best to store text raw and encode it as needed. To be honest, you always need to HTML-encode your data anyway when you output it to the web page, to prevent XSS.
You shouldn't encode your data before you put it in the database. The main reasons are:
If such data is near the column size limit, say 32 chars, and the title is "Steve & Fred blah blah", you might go over that limit because the 1-char & becomes the 5-char &amp;
You are assuming the data will always be displayed in a web page. In the future you never know where you'll be looking at the data, and you might not want it encoded. Then you have to decode it, and you might not have access to PHP's decode function.
It is the way of the craftsman to "measure twice, optimize once".
If you don't need high performance for your website, store it as raw data and when you output it do what you want.
If you need performance then consider storing it twice: raw data to do what you want with it and another field with the filtered data. It could be seen as redundant, but CPU is expensive, while data storage is really cheap.
The easiest way is store the data "as is" and then convert to htmlentities wherever it is needed.
The safest solution is to filter the data before it goes into the database, as this prevents attacks on your server and database that exploit missing security measures, and then convert it however you need when needed. Also, if you are using PDO with prepared statements, the SQL escaping happens automatically.
http://php.net/PDO
We had this debate at work recently. We decided to store the escaped values in the database, because before (when we were storing it unescaped) there were corner cases where data was being displayed without being escaped. This can lead to XSS. So we decided to store it escaped to be safe, and if you want it unescaped you have to do the work yourself.
Edit: So to everyone who disagrees, let me add some backstory for my case. Let's say you're working in a team of 50+ people... and data from the database is not guaranteed to be HTML-Encoded on the way out - there's no built-in mechanism for it so the developer has to write the code to do it. And this data is shown all over the place so it's not going through 1 developer's code it's going through 30's - most of whom have no clue about this data (or that it could even contain angle brackets which is rare) and merely want to get it shown on the page, move on, and forget about it.
Do you still think it's better to put the data, in HTML, into the database and rely on random people who are not-you to do things properly? Because frankly, while it certainly may not seem warm-fuzzy-best-practicey, I prefer to fail closed (meaning when the data comes through in a Word Doc it looks like Value&lt;Stock rather than Value<Stock) rather than open (so the Word Doc looks right with no work, but some corner of the platform may/likely-is vulnerable to XSS). You can't have both.

HTML Data exceeds field length after being hex-sanitized

The problem is you can't tell the user how many characters are allowed in the field because the escaped value has more characters than the unescaped one.
I see a few solutions, but none looks very good:
One whitelist for each field (too much work and doesn't quite solve the problem)
One blacklist for each field (same as above)
Use a field length that could hold the data even if all characters are escaped (bad)
Uncap the size for the database field (worse)
Save the data hex-unescaped and pass the responsibility entirely to output filtering (not very good)
Let the user guess the maximum size (worst)
Are there other options? Is there a "best practice" for this case?
Sample code:
$string = 'javascript:alert("hello!");';
echo strlen($string);
// outputs 27
$escaped_string = filter_var('javascript:alert("hello!");', FILTER_SANITIZE_ENCODED);
echo strlen($escaped_string);
// outputs 41
If the length of the database field is, say, 40, the escaped data will not fit.
Don't build your application around the database - build the database for the application!
Design how you want the interface to work for the user first, work out the longest acceptable field length, and use that.
In general, don't escape before storing in the database - store raw data in the database and format it for display.
If something is going to be output many times, then store the processed version.
Remember disk space is relatively cheap - don't waste effort trying to make your database compact.
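The advice above could look roughly like this, reusing the string from the question. MAX_TITLE_LENGTH is a hypothetical limit chosen for the example, and strlen() counts bytes, so for multibyte input a character-aware count such as mb_strlen() would be the natural swap.

```php
<?php
const MAX_TITLE_LENGTH = 40;   // hypothetical column size, also shown to the user

$input = 'javascript:alert("hello!");';

// Validate the raw value, exactly as the user typed it.
$fits = strlen($input) <= MAX_TITLE_LENGTH;

// Store $input unchanged; escape only when printing it.
$display = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

var_dump($fits);
echo $display, PHP_EOL;
```

The limit the user sees and the limit the column enforces are then the same number, and the longer escaped form never touches the database.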
making some wild assumptions about the context here:
if the field can hold 32 characters, that is 32 unescaped characters
let the user enter 32 characters
escape/unescape is not the user's problem
why is this an issue?
if this is form data-entry it won't matter, and
if you are for some reason escaping the data and passing it back then unescape it before storage
without further context, it looks like you are fighting a problem that doesn't really exist, or that doesn't need to exist
This is an interesting problem.
I think any solution will be a problem if it puts responsibility on the users because of the sanitization. If they have to guess the maximum length, they may well give up and pick something else (and not understand why their input was invalid).
Here's my idea: make the database field 150% the size of the input. This extra size serves as "padding" for the space taken by the hex-sanitization, and the maximum size shown to the user and the validator is the actual desired size. Thus, if you check the input length before sanitization and it is within that limit (about 66% of the field), your sanitized data should be good to go. If the sanitized value would exceed the extra 34% of buffer space, the input probably should not be accepted.
The only trouble is that your database tables will be larger. If you want to avoid this, well, you could always escape only the SQL sensitive characters and handle everything else on output.
Edit: Given your example, I think you're escaping far too much. Either use a smaller range of sanitization with HTMLSpecialChars() on output, or make your database fields as much as 200% of their present size. That's just bloated if you ask me.
Why are you allowing users to type in escaped characters?
If you do need to allow explicitly escaped characters, then interpolate the escaped character before sanity-checking it
You should pretty much never do any significant work on any string if it is somehow still encoded. Decode it first, then do your work.
I find some people have a tendency to use escaping functions like addSlashes() (or whatever it is in PHP) too early, or decode stuff (like removing HTML entities) too late. Decode first, do your stuff, then apply any encoding you need to store/output/etc.
