I was checking for an answer to only accept certain type of characters and ignoring some.
I found a link in stack itself, here
But, even though that prevents from entering the data in text box, anyone with a little knowledge of how to inspect the content and delete the pattern from client side will allow it to enter any detail it wants and send to database.
Is there any way I can prevent it for my website? I don't want certain characters like space, semicolon, etc.
Frontend in html & php
Backend in MySQL
It could be done with pattern attribute on clients side, but HTML5 textarea element does not support the pattern attribute.
The another way is to use JavaScript (as you mentioned)
var text = $('textarea').val();
var expres = /[^a-zA-Z0-9\!\#\#\$\%\^\*\|]+/;
if(expres.test(text)){
//... some action
}
If you would like to replace invalid characters, you can use jQuery function replace() which is similar to preg_replace() in PHP and it accepts regular expresions.
You could check validity on server side (in PHP) too (before inserting the values to db)
preg_replace("/[^A-Za-z0-9 ]/", '', $string);
First of all,
You can define a type for each column in your database, for example, if you set type as integer - you won't be able to insert anything else.
Let us assume you're looking to "filter" or manipulate strings in your database - I Would make a regex to replace any unwanted character from the string and the insert it into database.
Related
I have a dynamic PHP web app which gets input params in the url (no surprise here). However, bingbot sometimes requests etremely long URLs from the site. E.g. > 10000 characters long urls. One of the inputs is an UTF name and bingbot somehow submits sketchy input names, thousands of characters long like this: \xc2\x83\xc3\x86... (goes on for thousands of characters).
Obviously, it gets a 404, because there is no such name in the database (and therefore no such page), but it occurred to me whether it's worth it to check the input length before querying the db (e.g. a name cannot be more than 100 characters long) and return a 404 instantly if it's too long. Is it standard practice? Or it's not worth the trouble, because the db handles it?
I'm thinking of not putting extra load on the db unnecessarily. Is this long input submitted as is by the db client interface (two calls: first a prepare for sanitizing the input and then the actual query) or the php db client knows the column size and truncates the input string before sending it down the wire?
Not only what you're asking is more than legit, but I'd say it's something that you should be doing as part of the input filtering/validation. If you expect your input to be always shorter than 100 characters, everything that's longer should be filtered.
Also, it appears that you're getting UTF-8 strings: if you're not expecting them, you could simply filter out all characters that are not part of the standard ASCII set (even reduced, filtering all control characters away. For example $string = filter_var($input, FILTER_SANITIZE_FULL_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW).
This is not just a matter of DB performance, but also security!
PS: I hardly doubt that bot is actually Bing. Seems like a bot trying to hack your website.
Addendum: some suggestions about input validation
As I wrote above in some comments (and as others have written too), you should always validate every input. No matter what is that or where it comes from: if it comes from outside, it has to be validated.
The general idea is to validate your input accordingly to what you're expecting. With $input any input variable (anything coming from $_GET, $_POST, $_COOKIE, from external API's and from many $_SERVER variables as well - plus anything more that could be altered by a user, use your judgement and in doubt be overly cautious).
If you're requesting an integer or float number, then it's easy: just cast the input to (int) or (float)
$filtered = (int)$input;
$filtered = (float)$input;
If you're requesting a string, then it's more complicated. You should think about what kind of string you are requesting, and filter it accordingly. For example:
If you're expecting a string like a hexadecimal id (like some databases use), then filter all characters outside the 0-9A-Fa-f range: $filtered = preg_replace('/[^0-9A-Fa-f]/', '', $input);
If you're expecting an alphanumeric ID, filter it, removing all characters that are not part of that ASCII range. You can use the code posted above: $string = filter_var($input, FILTER_SANITIZE_FULL_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW);. This one removes all control characters too.
If you're expecting your input to be Unicode UTF-8, validate it. For example, see this function: https://stackoverflow.com/a/1523574/192024
In addition to this:
Always encode HTML tags. FILTER_SANITIZE_FULL_SPECIAL_CHARS will do that as well on filter_var. If you don't do that, you risk XSS (Cross-Site Scripting) attacks.
If you want to remove control characters and encode HTML entities but without removing the newline chracters (\n and \r), then you can use: $filtered = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/u', '', htmlspecialchars($input, ENT_COMPAT, 'UTF-8'));
And much more. Use your judgement always.
PS: My approach to input filtering is to prefer sanitization. That is, remove everything "dangerous" and accept the sanitized input as if that was what the user wrote. Other persons will instead argue that input should only be accepted or refused.
Personally, I prefer the "sanitize and use" approach for web applications, as your users still may want to see something more than an error web page; on desktop/mobile apps I go with the "accept or refuse" method instead.
However, that's just a matter of personal preference, backed only by what my guts tell me about UX. You're free to follow the approach you prefer.
There should be some sort of validation done on any data before it is used in a query. If you have a limit on the length of the name, then you could use that as part of the validation when checking the input. If it's over the limit, it can't be in there and then handle it accordingly. Whether it's a 404 or a page that displays an error message.
The load will go down if you are bypassing queries because a name is too long. Depending on how you are querying the database, LIKE or MATCH AGAINST and how your indexes are set up, will determine just how much load will go down.
I thought the proper way to "sanitize" incoming data from an HTML form before entering it into a mySQL database was to use real_escape_string on it in the PHP script, like this:
$newsStoryHeadline = $_POST['newsStoryHeadline'];
$newsStoryHeadline = $mysqli->real_escape_string($newsStoryHeadline);
$storyDate = $_POST['storyDate'];
$storyDate = $mysqli->real_escape_string($storyDate);
$storySource = $_POST['storySource'];
$storySource = $mysqli->real_escape_string($storySource);
// etc.
And once that's done you could just insert the data to the DB like this:
$mysqli->query("INSERT INTO NewsStoriesTable (Headline, Date, DateAdded, Source, StoryCopy) VALUES ('".$newsStoryHeadline."', '".$storyDate."', '".$dateAdded."', '".$storySource."', '".$storyText."')");
So I thought doing this would take care of cleaning up all the invisible "junk" characters that may be coming in with your submitted text.
However, I just pasted some text I copied from a web-page into my HTML form, clicked "submit" - which ran the above script and inserted that text into my DB - but when I read that text back from the DB, I discovered that this piece of text did still have junk characters in it, such as –.
And those junk characters of course caused the PHP script I wrote that retrieves the information from the DB to crash.
So what am I doing wrong?
Is using real_escape_string not the way to go here? Or should I be using it in conjunction with something else?
OR, is there something I should be doing (like more escaping) when reading reading data back out from the the mySQL database?
(I should mention that I'm an Objective-C developer, not a PHP/mySQL developer, but I've unfortunately been given this task to do some DB stuff - hence my question...)
thanks!
Your assumption is wrong. mysqli_real_escape_string’s only intention is to escape certain characters so that the resulting string can be safely used in a MySQL string literal. That’s it, nothing more, nothing less.
The result should be that exactly the passed data is retained, including ‘junk’. If you don’t want that ‘junk’ in your database, you need to detect, validate, or filter it before passing to to MySQL.
In your case, the ‘junk’ seems to be due to different character encodings: You input data seems to be encoded with UTF-8 while it’s later displayed using Windows-1250. In this scenario, the character – (U+2013) would be encoded with 0xE28093 in UTF-8 which would represent the three characters â, €, and “ in Windows-1250. Properly declaring the document’s encoding would probably fix this.
Sanitization is a tricky subject, because it never means the same thing depending on the context. :)
real_escape_string just makes sure your data can be included in a request (inside quotes, of course) without having the possibility to change the "meaning" of the request.
The manual page explains what the function really does: it escapes nul characters, line feeds, carriage returns, simple quotes, double quotes, and "Control-Z" (probably the SUBSTITUTE character). So it just inserts a backslash before those characters.
That's it. It "sanitizes" the string so it can be passed unchanged in a request. But it doesn't sanitize it under any other point of view: users can still pass for instance HTML markers, or "strange" characters. You need to make rules depending on what your output format is (most of the time HTML, but HTTP isn't restricted to HTML documents), and what you want to let your users do.
If your code can't handle some characters, or if they have a special meaning in the output format, or if they cause your output to appear "corrupted" in some way, you need to escape or remove them yourself.
You will probably be interested in htmlspecialchars. Control characters generally aren't a problem with HTML. If your output encoding is the same as your input encoding, they won't be displayed and thus won't be an issue for your users (well, maybe for the W3C validator). If you think it is, make your own function to check and remove them.
In php I have the following reqular expression:
$regexp = "/^([-a-z0-9.,!#'?_-\s])+$/i";
Im trying to validate my websites contact form (specifically the message field) to ensure no nasty code has been entered. The problem I am having is that certain normal punctuation and characters I need to allow, but I'm worried they could be used to insert malicious code.
For any character not obeying the expression above, I would like to replace it to make it safe. Two questions:
How do I do the replacement?
What should I replace the character with? For example I am not allowing parenthesis ( ). would it be best practice to replace like this "(" ")" or maybe \( \)?
EDIT
The data will be sent to an email address and saved to a database
Mmh why don't you just allow every character to be inserted in the contact form, converting them all with htmlentities as soon as they reach the php script after form submit? That way your users will be able to say what they want, and you won't have any problem with "malicious code" :)
And do not forget to use a proper database wrapper (PDO)
or at least escape when inserting into the database.
– knittl
EDIT: added Knittl's quote to stress it again :)
Use the filter extension. More specifically, use the filter_input() function with a sanitizing filter. For example:
$message = filter_input(INPUT_POST, 'message', FILTER_SANITIZE_STRING);
This will make sure that tags are stripped out of the message and that it is safer to handle.
However, it does not mean that you should treat it as 100% safe. You still need to take precautions when saving the message to the database (such as using the database driver's escape method, and removing unwanted/unneeded/suspicious stuff from the message), as well as making sure that it is safe to output to the client.
I am using codeigniter in an app. There is a form. In the textarea element, I wrote something including
%Features%
However, when I want to echo this by $this->input->post(key), I get something like
�atures%
The '%Fe' are vanished.
In main index.php file of CI, I tried var_dump($_POST) and I see the above word is fully ok. but when I am fetching it with the input library (xss filtering is on) I get the problem.
When the XSS filtering is off, it appears ok initially. however, if I store it in database and show next time, I see same problem (even the xss filtering is off).
%Fe happens to look like a URL-encoded sequence %FE, representing character 254. It's being munched into the Unicode "I have no idea what that sequence means" glyph, �.
It's clear that the "XSS filter" is being over-zealous when decoding the field on submission.
It's also very likely that a URL-decode is being run again at some point later in the process, when you output the result from the database. Check the database to make sure that the actual string is being represented properly.
First: Escape the variables before storing them into db. % has special meaning in SQL.
Second: % also has special meaning in URLs eg. %20 is %FE will map to some character which will be decoded by input()
PLATFORM:
PHP & mySQL
For my experimentation purposes, I have tried out few of the XSS injections myself on my own website. Consider this situation where I have my form textarea input. As this is a textarea, I am able to enter text and all sorts of (English) characters. Here are my observations:
A). If I apply only strip_tags and mysql_real_escape_string and do not use htmlentities on my input just before inserting the data into the database, the query is breaking and I am hit with an error that shows my table structure, due to the abnormal termination.
B). If I am applying strip_tags, mysql_real_escape_string and htmlentities on my input just before inserting the data into the database, the query is NOT breaking and I am able to successfully able to insert data from the textarea into my database.
So I do understand that htmentities must be used at all costs but unsure when exactly it should be used. With the above in mind, I would like to know:
When exactly htmlentities should be used? Should it be used just before inserting the data into DB or somehow get the data into DB and then apply htmlentities when I am trying to show the data from the DB?
If I follow the method described in point B) above (which I believe is the most obvious and efficient solution in my case), do I still need to apply htmlentities when I am trying to show the data from the DB? If so, why? If not, why not? I ask this because it's really confusing for me after I have gone through the post at: http://shiflett.org/blog/2005/dec/google-xss-example
Then there is this one more PHP function called: html_entity_decode. Can I use that to show my data from DB (after following my procedure as indicated in point B) as htmlentities was applied on my input? Which one should I prefer from: html_entity_decode and htmlentities and when?
PREVIEW PAGE:
I thought it might help to add some more specific details of a specific situation here. Consider that there is a 'Preview' page. Now when I submit the input from a textarea, the Preview page receives the input and shows it html and at the same time, a hidden input collects this input. When the submit button on the Preview button is hit, then the data from the hidden input is POST'ed to a new page and that page inserts the data contained in the hidden input, into the DB. If I do not apply htmlentities when the form is initially submitted (but apply only strip_tags and mysql_real_escape_string) and there's a malicious input in the textarea, the hidden input is broken and the last few characters of the hidden input visibly seen as " /> on the page, which is undesirable. So keeping this in mind, I need to do something to preserve the integrity of the hidden input properly on the Preview page and yet collect the data in the hidden input so that it does not break it. How do I go about this? Apologize for the delay in posting this info.
Thank you in advance.
Here's the general rule of thumb.
Escape variables at the last possible moment.
You want your variables to be clean representations of the data. That is, if you are trying to store the last name of someone named "O'Brien", then you definitely don't want these:
O'Brien
O\'Brien
.. because, well, that's not his name: there's no ampersands or slashes in it. When you take that variable and output it in a particular context (eg: insert into an SQL query, or print to a HTML page), that is when you modify it.
$name = "O'Brien";
$sql = "SELECT * FROM people "
. "WHERE lastname = '" . mysql_real_escape_string($name) . "'";
$html = "<div>Last Name: " . htmlentities($name, ENT_QUOTES) . "</div>";
You never want to have htmlentities-encoded strings stored in your database. What happens when you want to generate a CSV or PDF, or anything which isn't HTML?
Keep the data clean, and only escape for the specific context of the moment.
Only before you are printing value(no matter from DB or from $_GET/$_POST) into HTML. htmlentities have nothing to do with database.
B is overkill. You should mysql_real_escape_string before inserting to DB, and htmlentities before printing to HTML. You don't need to strip tags, after htmlentities tags will be displayed on screen as < b r / > e.t.c
Theoretically you may do htmlentities before inserting to DB, but this might make further data processing harder, if you would need original text.
3. See above
In essence, you should use mysql_real_escape_string prior to database insertion (to prevent SQL injection) and then htmlentities, etc. at the point of output.
You'll also want to apply sanity checking to all user input to ensure (for example) that numerical values are really numeric, etc. Functions such as is_int, is_float, etc. are useful at this point. (See the variable handling functions section of the PHP manual for more information on these functions and other similar ones.)
I've been through this before and learned two important things:
If you're getting values from $_POST/$_GET/$_REQUEST and plan to add to DB, use mysql_real_escape_string function to sanitize the values. Do not encode them with htmlentities.
Why not just encode them with htmlentities and put them in database? Well, here's the thing - the goal is to make data as meaningful and clean as possible and when you encode the data with htmlentities like Jeff's Dog becomes Jeff"s Dog ... that will cause the context of data to lose its meaning. And if you decide to implement REST servcies and you fetch that string from DB and put it in JSON - it'll come up like Jeff"s Dog which isn't pretty. You'd have to add another function to decode as well.
Suppose you want to search for "Jeff's Dog" using SQL "select * from table where field='Jeff\'s Dog'", you won't find it since "Jeff's Dog" does not match "Jeff"s Dog." Bad, eh?
To output alphanumeric strings (from CHAR type) to a webpage, use htmlentities - ALWAYS!