I have more of a general question, not exactly about my code. I am adding validation to the registration code for my site (checking length and stripping out illegal characters). I ask for the person's first and last name; should I check the length of the name fields?
And if so, what would be good minimum and maximum lengths for names? I was thinking 3-20 characters, but I really don't want to limit the names if someone really does have a name longer than 20 characters. Any feedback would be great, thanks :)
I find VARCHAR(32) sufficient for first and last real-name fields. The only thing I would suggest doing to them is a run through htmlspecialchars() before inserting, because some people legitimately have "strange" accented or non-ASCII characters in their names that most people don't take into account with checks like preg_match('/[a-z]+/i', $name);. A combination of parameterized queries [you're using those, right?] and htmlspecialchars() should protect you from first- and second-order injection attacks.
You can use iconv() to transliterate/mangle their names into ASCII, but that's hit and miss, plus a pain in the ass, plus some people might take offense at that.
On the other hand, if you're worried about space efficiency with your field length, you should know that the VAR in VARCHAR means 'variable length', and a VARCHAR(32) field containing 'Ted' will only take up 4 bytes of storage (one length byte plus three bytes of data).
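For what it's worth, here is a minimal sketch of a length check that counts characters rather than bytes, so accented names are not penalised. The function name and the 1-32 limits are my own assumptions for illustration (32 simply mirrors the VARCHAR(32) above), and it assumes the mbstring extension is available and the input is UTF-8.

```php
<?php
// Hypothetical helper; the name and the 1-32 limits are illustrative assumptions.
function is_valid_name_length(string $name, int $min = 1, int $max = 32): bool
{
    // Ignore surrounding whitespace so "  Ted  " counts as three characters.
    $name = trim($name);

    // mb_strlen() counts characters; strlen() counts bytes and would
    // report 4 for "Zoë" in UTF-8.
    $length = mb_strlen($name, 'UTF-8');

    return $length >= $min && $length <= $max;
}

var_dump(is_valid_name_length('Ted'));                // bool(true)
var_dump(is_valid_name_length('Zoë'));                // bool(true)
var_dump(is_valid_name_length(str_repeat('x', 33)));  // bool(false)
var_dump(is_valid_name_length('   '));                // bool(false)
```

A low minimum is deliberate here: two-letter names exist, so a floor of 3 would reject real people.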
Hi I am trying to validate some user input and I'm not sure what is the correct way to go about it.
I want to validate first name and last name fields for a user creation form. I am using PHP with the Zend framework so I will be writing a validator. After doing some research I think I should really be allowing all UTF8 characters, with no spaces at beginning or end. I'm not sure on the regex but I can find that out later, I will most likely be using php's preg_match.
I'm storing these details in OpenLDAP using the sn field for surname and the givenName for firstname. How should I be restricting the names? Is there a limit to the length in OpenLDAP, do I need to check the characters it accepts or does it accept all characters?
Should I even validate the first name and last name or should I just let the user input what they want?
I am using a separate field for username which will consist of "text.text", text being A-Za-z chars.
I'm not posting code as I just need a bit of guidance; I'm not really sure what the best practice is here.
Validating names is a bad idea. There are no universal rules for what is allowed in a name, especially when you want to allow non-western text as well.
This presents a challenge when it comes to LDAP, since there is no guarantee that the directory will understand the charset of the user input. There is an easy solution to this though: base64_encode() the values before storing them in LDAP, and base64_decode() them on the way out.
There is no need to validate the names. The directory server schema specifies the syntax of all attributes, including givenName, and sn. The server will properly encode any names that are presented. If a DirectoryString value begins or ends with a space, the server will base64 encode the value for storage.
Though there is a practical limit, LDAP clients should not assume that servers have the capability of enforcing the non-zero length of an attribute value. Some attribute syntaxes are defined to be non-zero in length (such as DirectoryString), but no upper limit is defined; therefore, clients should not assume the server enforces an upper limit.
see also
LDAP: Syntaxes and Matching Rules
I have always used rawurlencode to store user-entered data in my MySQL databases. The main reason I do this is that I find it makes storing foreign characters very simple. I then use rawurldecode to retrieve and display the data.
I read somewhere that rawurlencode was not meant for this purpose. Are there any disadvantages to what I'm doing?
So let's say I have a German address with many characters like umlauts etc. What is the simplest way to store this in a MySQL database with no risk of it coming out wrong, while keeping it searchable by a search script? So far rawurlencode has been excellent for our system. Perhaps the practice can be improved upon by only encoding foreign letters and not common characters like spaces etc., which is a waste of space, I totally agree.
Sure there are.
Let's start with the practical: for a large class of characters you are spending 3 bytes of storage for every byte of data. The description of rawurlencode (and of course the RFC) says that those characters are
all non-alphanumeric characters except -_.~
This means that there is a total of 26 + 26 + 10 (alphanumeric) + 4 (special exceptions) = 66 characters for which you do not waste space.
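To make the overhead concrete, here is a small sketch that compares the byte length of a string before and after rawurlencode(). The sample address is made up for illustration, and the script assumes the source file is saved as UTF-8, where each umlaut takes two bytes and therefore becomes six characters once encoded.

```php
<?php
// Made-up sample address; assumes this file is saved as UTF-8.
$address = 'Königstraße 5, München';
$encoded = rawurlencode($address);

// Every byte outside A-Z, a-z, 0-9 and -_.~ becomes a three-character %XX triplet.
echo $encoded, "\n";
echo 'raw bytes:     ', strlen($address), "\n";
echo 'encoded bytes: ', strlen($encoded), "\n";

// The encoding is reversible, so nothing is lost; the cost is only space.
var_dump(rawurldecode($encoded) === $address);
```

Note that even the spaces and the comma triple in size, not just the umlauts.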
Then there are also the logical drawbacks: You are not storing the data itself, but rather a representation of the data tailored to URLs. Unless the data itself is URLs, that's not what you should be doing.
Drawbacks I can think of:
Waste of disk space.
Waste of CPU cycles encoding and decoding on every read and every write.
Additional complexity (you can't even inspect data with a MySQL client).
Impossibility to use full text searches.
URL encoding is not necessarily unique (there're at least two RFCs). It may not lead to data loss but it can lead to duplicate data (e.g., unique indexes where two rows actually contain the same piece of data).
You can accidentally encode a non-string piece of data such as a date: 2012-04-20%2013%3A23%3A00
But the main consideration is that such technique is completely arbitrary and unnecessary since MySQL doesn't have the least problem storing the complete Unicode catalogue. You could also decide to swap e's and o's in all strings: Holle, werdl!. Your app would run fine but it would not provide any added value.
Update: As Your Common Sense points out, a SQL clause as basic as ORDER BY is no longer usable. It's not that international chars will be ignored; you'll basically get an arbitrary sort order based on the ASCII code of the % and hexadecimal characters. If you can't SELECT * FROM city ORDER BY city_name reliably, you've rendered your DB useless.
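The sorting problem can be imitated in plain PHP without a database. This is only a sketch: a byte-wise sort() stands in for what an ORDER BY on URL-encoded values would effectively do, and the city names are sample data I picked for illustration (the file is assumed to be UTF-8).

```php
<?php
// Sample data chosen for illustration; assumes this file is saved as UTF-8.
$cities = ['Oslo', 'Paris', 'Örebro', 'Berlin'];

// Simulate storing the values URL-encoded, then sorting the stored form.
$encoded = array_map('rawurlencode', $cities);
sort($encoded, SORT_STRING);

// "%C3%96rebro" starts with "%" (0x25), which sorts before every letter,
// so Örebro jumps to the front instead of landing near Oslo.
$decoded = array_map('rawurldecode', $encoded);
print_r($decoded);
```

With the raw text stored and a suitable collation, the database would place Örebro among the O's on its own.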
I am using a fork to eat a soup
I am using money bills to fire the coals for BBQ
I am using a kettle to boil eggs.
I am using a microscope to hammer the nails.
Are there any disadvantages to what I'm doing?
YES
You are using a tool for something other than its purpose. This is always a disadvantage.
A sane human being always uses a tool that is intended for the job at hand, not some randomly picked one, especially when there is no shortage of the right tools.
URL encoding is not intended to be used with a database, as one can tell from the name. That alone is reason enough for a sane developer. Take a look around: find the proper tool.
There is a thing called "common sense": a thing widely used in regular life but for some reason always absent in the PHP world.
Common sense warns us: if we're using the wrong tool, it may spoil the work. Sooner or later it will spoil it. No need to ask for specific details; it's a general rule. We learn this rule at about the age of 5.
Why not use it while playing with some web thingies too?
Why not ask yourself a question:
What's wrong with storing foreign characters at all?
urlencode makes storing foreign characters very simple
Any hardships you encountered without urlencode?
Although I feel that common sense should be enough to answer the question, people always look for the "omen", the proof. Here you are:
A database's job is not limited to just storing and retrieving data. A plain text file can handle such a primitive task as well.
Data manipulation is what we use databases for.
The most widely used manipulations are sorting and filtering.
Something as intelligent as a database can sort and filter data case-insensitively, which is a very handy feature. But of course it can be done only if the characters are saved as is, not as some random codes.
Sorting text may also use an ordering other than plain binary order in the character table. Some umlaut characters may sit in other parts of the table, but the database collation will put them in the right place. Of course, this too can be done only if the characters are saved as is, not as some random codes.
Sometimes we have to manipulate data that is already stored in the database. Say, cut some piece from a string and compare it with an entered value. How is that supposed to be done with URL-encoded data?
I already have a minimum character limit, but was wondering whether also adding a maximum is a good idea?
For example: I have a post topic form which creates a forum topic (and stores the info in a MySQL database; the data type of the affected columns is text). I have a minimum character limit for both the topic title and the topic body, but I've seen that several other sites have a maximum character limit as well.
Is there any specific reason why it would be a bad idea in my situation? Why do sites commonly have this restriction (other than the obvious; could it affect functionality?), and if so, what would be a typical/average maximum character limit for a topic title and a topic body (is there a general rule of thumb to determine this)?
Thank You.
The biggest reason I can think of for setting a maximum character limit is what happens when you insert data into a MySQL database and the input is larger than the maximum length the column supports: with strict SQL mode enabled the insert fails with an error, and with strict mode off the data is simply truncated with only a warning.
You can set the limits on the text field in HTML, but a user with the right tools can remove this restriction, so for very important data, you may want to check the length on the server side and make sure it can fit where it is going.
For your situation, I guess the other reason would be to prevent a user from creating a thread subject that is extremely long.
The title of your post here is good, but it may not be good if you were to add much more to it i.e.: Is adding maximum character restrictions a good idea on form fields which store data in a database because I am not doing it on my site and think I should, but don't know if I should or not. So I think it serves as a limitation in that sense too. Also, since your display may truncate the subject to a certain length in order to prevent breaking a layout or looking bad, it helps the user come up with a concise subject.
You may look at some free forum software to see what limits they put on the subjects, a reasonable limit seems like maybe 128 characters. Your subject is 100.
As for limiting the body, 65,535 bytes (the maximum size of a MySQL TEXT column) is pretty reasonable. On forums, people who make really long posts tend to break them up into multiple posts because of the limitations imposed.
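A server-side check along these lines might look like the sketch below. The constant names and the title/body minimums are assumptions of mine; 128 is the title limit suggested above, and 65535 is the byte capacity of a MySQL TEXT column. The title is measured in characters with mb_strlen() (mbstring assumed), while the body's upper bound is measured in bytes with strlen(), because the TEXT limit is a byte limit.

```php
<?php
// Hypothetical limits; adjust to your own forum.
const TITLE_MIN = 5;
const TITLE_MAX = 128;          // characters
const BODY_MIN = 10;            // characters
const BODY_MAX_BYTES = 65535;   // capacity of a MySQL TEXT column, in bytes

// Returns a list of error messages; an empty list means the topic is acceptable.
function validate_topic(string $title, string $body): array
{
    $errors = [];

    $titleLength = mb_strlen(trim($title), 'UTF-8');
    if ($titleLength < TITLE_MIN || $titleLength > TITLE_MAX) {
        $errors[] = sprintf('Title must be %d-%d characters long.', TITLE_MIN, TITLE_MAX);
    }

    if (mb_strlen(trim($body), 'UTF-8') < BODY_MIN) {
        $errors[] = sprintf('Body must be at least %d characters long.', BODY_MIN);
    } elseif (strlen($body) > BODY_MAX_BYTES) {
        // Multi-byte characters use up the column faster, so compare bytes here.
        $errors[] = 'Body is too long to store.';
    }

    return $errors;
}

print_r(validate_topic('Hi', str_repeat('a', 70000)));
```

The HTML maxlength attribute can mirror the title limit for convenience, but this server-side check is the one that actually protects the column.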
What is the best way to validate a string as not gibberish using PHP?
For example, if I get a string input from a user that must be at least 250 characters long, how can I tell whether they entered legitimate text (e.g. real words) or just gibberish to comply with the minimum characters (e.g. asdlfkjefksjlfkjldskfjelkef)?
I've thought about counting the number of words as one option, but the user could still space out their gibberish (e.g. asdlf kjef ksjlf kjl dskfje lkef), so it needs another kind of check on top of that.
Is there any way to check that at least half of a string contains real dictionary words, or something to that effect?
What is the best solution to this problem?
Thanks.
You cannot do that properly because Colorless green ideas sleep furiously.
You could try a Bloom filter
You can walk through your dictionary and delete all dictionary words from user input and then check the length of the rest
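A rough sketch of a variation on that idea: instead of deleting dictionary words and measuring the remainder, it measures what share of the words in the input appear in a word list, which maps directly onto the "at least half real words" requirement. The function name and the tiny inline word list are placeholders of mine; a real check would load a proper dictionary file, and this version only handles unaccented a-z words.

```php
<?php
// Hypothetical helper: fraction of words in $text that appear in $dictionary.
function dictionary_ratio(string $text, array $dictionary): float
{
    // Split on anything that is not a letter; drop empty pieces.
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return 0.0;
    }

    // Flip the list so each lookup is a cheap isset() on an array key.
    $known = array_flip($dictionary);

    $hits = 0;
    foreach ($words as $word) {
        if (isset($known[$word])) {
            $hits++;
        }
    }

    return (float) $hits / count($words);
}

// Placeholder word list; a real one would come from a dictionary file.
$dictionary = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog'];

var_dump(dictionary_ratio('The quick brown fox jumps over the lazy dog', $dictionary));
var_dump(dictionary_ratio('asdlf kjef ksjlf kjl dskfje lkef', $dictionary));
```

You would then reject input whose ratio falls below some threshold such as 0.5. As the "colorless green ideas" remark implies, this only catches keyboard mashing, not meaningless text made of real words.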
You could look at Markov chains. Simply put, the idea is that this algorithm determines whether sequences of characters look like they belong together. It won't necessarily tell you it's not gibberish, but it should catch out things like "ksjhglah etc".
See Markov text generators
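To give an idea of how this could work, here is a minimal sketch of a first-order character Markov model: it counts how often each letter follows another in some training text, then scores new input by the average log-probability of its letter transitions. All function names are mine, the add-one smoothing over 27 symbols (a-z plus space) is just one simple choice, and the training text is a stand-in; a usable model needs a much larger corpus and a threshold tuned on real samples.

```php
<?php
// Keep only lowercase letters and spaces so the model has 27 possible symbols.
function letters_only(string $text): string
{
    return (string) preg_replace('/[^a-z ]/', '', strtolower($text));
}

// Count how often each symbol follows each other symbol in the training text.
function train_model(string $corpus): array
{
    $clean = letters_only($corpus);
    $pairs = [];
    $totals = [];
    for ($i = 0; $i + 1 < strlen($clean); $i++) {
        $a = $clean[$i];
        $b = $clean[$i + 1];
        $pairs[$a][$b] = ($pairs[$a][$b] ?? 0) + 1;
        $totals[$a] = ($totals[$a] ?? 0) + 1;
    }
    return ['pairs' => $pairs, 'totals' => $totals];
}

// Average log-probability per transition; higher (closer to 0) looks more natural.
function average_log_probability(string $text, array $model): float
{
    $clean = letters_only($text);
    $steps = strlen($clean) - 1;
    if ($steps < 1) {
        return -INF; // nothing to score
    }
    $sum = 0.0;
    for ($i = 0; $i < $steps; $i++) {
        $count = $model['pairs'][$clean[$i]][$clean[$i + 1]] ?? 0;
        $total = $model['totals'][$clean[$i]] ?? 0;
        // Add-one smoothing so unseen transitions get a small, non-zero probability.
        $sum += log(($count + 1) / ($total + 27));
    }
    return $sum / $steps;
}

// Stand-in training text; use a large body of real prose in practice.
$model = train_model('the quick brown fox jumps over the lazy dog and then runs through the forest to the river');

echo average_log_probability('the dog runs to the forest', $model), "\n";
echo average_log_probability('asdlfkjefksjlfkjldskfjelkef', $model), "\n";
```

Input scoring well below what typical real sentences score against the same model would then be flagged as probable gibberish.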
The problem is you can't tell the user how many characters are allowed in the field because the escaped value has more characters than the unescaped one.
I see a few solutions, but none looks very good:
One whitelist for each field (too much work and doesn't quite solve the problem)
One blacklist for each field (same as above)
Use a field length that could hold the data even if all characters are escaped (bad)
Uncap the size for the database field (worse)
Save the data hex-unescaped and pass the responsibility entirely to output filtering (not very good)
Let the user guess the maximum size (worst)
Are there other options? Is there a "best practice" for this case?
Sample code:
$string = 'javascript:alert("hello!");';
echo strlen($string);
// outputs 27
$escaped_string = filter_var('javascript:alert("hello!");', FILTER_SANITIZE_ENCODED);
echo strlen($escaped_string);
// outputs 41
If the length of the database field is, say, 40, the escaped data will not fit.
Don't build your application around the database - build the database for the application!
Design how you want the interface to work for the user first, work out the longest acceptable field length, and use that.
In general, don't escape before storing in the database - store raw data in the database and format it for display.
If something is going to be output many times, then store the processed version.
Remember disk space is relatively cheap - don't waste effort trying to make your database compact.
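As a sketch of what "store raw, format for display" means for the length problem: validate and store the string exactly as the user typed it, and escape only when printing. The 40-character limit is the figure from the question; the constant name is mine, and the actual insert (ideally a parameterized query) is left out.

```php
<?php
const MAX_LENGTH = 40; // the column size from the question; the name is hypothetical

$input = 'javascript:alert("hello!");';

// Validate the raw value: this is what the user typed and what gets stored.
$fits = mb_strlen($input, 'UTF-8') <= MAX_LENGTH;
var_dump($fits); // bool(true), since the raw string is 27 characters

// ... store $input as-is, using a parameterized query ...

// Escape only at output time; the longer escaped form never touches the column.
$html = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
echo $html, "\n"; // javascript:alert(&quot;hello!&quot;);
```

The limit you show to the user then matches the limit you enforce, and the escaped length stops mattering.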
making some wild assumptions about the context here:
if the field can hold 32 characters, that is 32 unescaped characters
let the user enter 32 characters
escape/unescape is not the user's problem
why is this an issue?
if this is form data-entry it won't matter, and
if you are for some reason escaping the data and passing it back then unescape it before storage
without further context, it looks like you are fighting a problem that doesn't really exist, or that doesn't need to exist
This is an interesting problem.
I think any solution will be a problem if it assigns responsibility for the sanitization to the users. If they are responsible for guessing the maximum length, they may well give up and pick something else (and not understand why their input was invalid).
Here's my idea: make the database field 150% of the size of the input. The extra space serves as "padding" for the hex sanitization, and the maximum size shown to the user and enforced by the validator is the actual desired size. Thus, if you check the input length before sanitization and it is within that limit (about 67% of the field length), your sanitized data should be good to go. If the sanitized value overflows the remaining 33% that serves as the buffer, then the input probably should not be accepted.
The only trouble is that your database tables will be larger. If you want to avoid this, well, you could always escape only the SQL sensitive characters and handle everything else on output.
Edit: Given your example, I think you're escaping far too much. Either use a smaller range of sanitization with HTMLSpecialChars() on output, or make your database fields as much as 200% of their present size. That's just bloated if you ask me.
Why are you allowing users to type in escaped characters?
If you do need to allow explicitly escaped characters, then interpolate the escaped character before sanity-checking it
You should pretty much never do any significant work on any string if it is somehow still encoded. Decode it first, then do your work.
I find some people have a tendency to use escaping functions like addslashes() (or whatever it is in PHP) too early, or to decode stuff (like removing HTML entities) too late. Decode first, do your stuff, then apply any encoding you need to store/output/etc.