I have a PHP page that accepts input from a form post, but instead of directing that input to a database it is being used to retrieve a file from the file system. What is a good method for escaping a string destined for the file system rather than a database? Is mysql_real_escape_string() appropriate?
If you're using user-provided input to specify a filename or directory, you'll have to make sure that the provided filename/path isn't trying to break "out" of your site's playground.
e.g. having something like
readfile($_GET['filepath']);
will send out ANYTHING on your server that the attacker knows the path for. Even something like
readfile('/path/to/your/site/download/' . $_GET['filepath']);
accomplishes the same, if the user specifies enough '../../../' to get to whatever file they want.
mysql_real_escape_string() is NOT appropriate for this, as you're not doing a database operation. Use appropriate tools for appropriate jobs. In a goofy way, m_r_e_s() is a banana, and you need a giraffe. Something like
readfile('/path/to/your/site/download/' . basename($_GET['filepath']));
would be relatively safe, as basename() extracts only the filename portion of the user-provided path, so even if they pass in ../../../../../etc/passwd, basename() will return only passwd.
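A minimal sketch of the basename() approach, for illustration only. The helper name download_path() and the directory used in the usage note are made up here; the function only builds the path and leaves the readfile() call to the caller.

```php
<?php
// Hypothetical helper: build a path inside a fixed download directory.
function download_path(string $baseDir, string $userInput): ?string
{
    // basename() keeps only the last path component.
    $name = basename($userInput);

    // Reject empty names and the special entries "." and "..".
    if ($name === '' || $name === '.' || $name === '..') {
        return null;
    }

    return rtrim($baseDir, '/') . '/' . $name;
}
```

Something like download_path('/path/to/your/site/download', $_GET['filepath']) would then yield either a path directly inside that directory or null, and only a non-null result should be handed to readfile().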
You only ever need to escape characters that are otherwise interpreted by your target system. For databases you usually make sure to escape quotes, so you use mysql_real_escape_string or similar. If your target is HTML, you usually use htmlspecialchars to escape the HTML special characters (namely <, >, & and quotes). If your target is CSV, you basically only need to make sure line breaks and the CSV separator are escaped.
So depending on your target you can either reuse an existing escape function, define your own, or even go without one. If all you do is dump the input in a single file, then there is not much you need to take care of, as long as you specify the filename and that file is never used (or interpreted) by anything other than your application.
So think of what kind of special characters your target format requires for it to work, and simply escape those. You can usually ignore the rest.
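To illustrate "escape for the target", here is a small sketch that sends the same input to three different targets. The sample string and variable names are made up; the functions are standard PHP.

```php
<?php
$input = '<b>Tom & "Jerry"</b>';

// Target: HTML. The characters <, >, & and quotes are special.
$forHtml = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Target: a URL path segment or query value.
$forUrl = rawurlencode($input);

// Target: a JSON document (for example, data handed to JavaScript).
$forJson = json_encode($input);
```

Each result is only safe for its own target: the HTML-escaped string is not a valid URL component, and the URL-encoded string is not meant for an SQL query.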
edit:
If you want to use the input as the file path or file name, you can decide for yourself how permissive you want to be and which characters you want to support. A simple method would be to replace everything except Latin letters and numbers (and maybe some special characters like _ and -) with something else. For example:
preg_replace( '/[^A-Za-z0-9_-]/', '_', $text );
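Wrapped in a function, the whitelist could look like the sketch below. The name safe_filename() is made up for illustration.

```php
<?php
// Hypothetical helper built around the whitelist pattern shown above.
function safe_filename(string $text): string
{
    // Every character that is not a letter, digit, underscore or hyphen
    // becomes an underscore, including dots and slashes.
    return preg_replace('/[^A-Za-z0-9_-]/', '_', $text);
}
```

Because dots are replaced too, a file extension would have to be appended by your own code rather than taken from the user.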
I can't seem to find a reference. I am assuming the PHP function file_exists uses system calls on linux and that these are safe for any string that does not contain a \0 character, but I would like to be sure.
Does anyone have (preferably non-anecdotal) information regarding this? Is it vulnerable to injection if I don't check the strings first?
I guess you need to, because the user may enter something like:
../../../somewhere_else/some_file and access a file that they are not allowed to access.
I suggest that you generate the absolute path of the file independently in your PHP code and take only the file name from the user, via basename(),
or reject any input containing ../ (merely stripping it with str_replace is not enough, since an input like ....// turns into ../ after a single pass):
if (strpos($input, "../") !== false) { exit("Invalid file name"); }
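A short sketch of why removing ../ in a single pass is fragile, followed by a different technique that is not from the answers above: resolving the path with realpath() and checking that it still lies inside the base directory. The helper name resolve_inside() is made up for illustration.

```php
<?php
// A single str_replace() pass can be bypassed:
$bypass = str_replace('../', '', '....//etc/passwd'); // "../etc/passwd"

// Hypothetical helper: resolve the path and check it is still inside $baseDir.
function resolve_inside(string $baseDir, string $userPath): ?string
{
    $base = realpath($baseDir);
    $full = realpath($baseDir . DIRECTORY_SEPARATOR . $userPath);

    if ($base === false || $full === false) {
        return null; // the base directory or the target does not exist
    }

    // The resolved path must start with the base directory plus a separator.
    $prefix = rtrim($base, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;

    return strncmp($full, $prefix, strlen($prefix)) === 0 ? $full : null;
}
```

Since realpath() only resolves paths that exist, this sketch suits reading existing files; it returns null for anything missing or outside the base directory.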
It depends on what you're trying to protect against.
file_exists doesn't do any writing to disk, which means that the worst that can happen is that someone gains some information about your file system or the existence of files that you have.
In practice, however, if you later do something else with the file that was checked with file_exists, such as passing it to include, you may wish to perform more stringent checks.
I'm assuming that you may be passing arbitrary values, possibly sourced from user input, into this function.
If that is the case, it somewhat depends on why you actually need to use file_exists in the first place. In general, for any filesystem function that the user can pass values directly into, I'd try to filter out the string as much as possible. This is really just being pedantic and on the safe side, and may be unnecessary in practice.
So, for example, if you only ever need to check the existence of a file in a single directory, you should probably strip out directory delimiters of all sorts.
From personal experience, I've only ever passed user input into a file_exists call for mapping to a controller file, in which case, I'd just strip out any non-alphanumeric + underscore character.
UPDATE: having read your recently added comments: no, there aren't any special characters to worry about, as this isn't executed in a shell. Even \0 should be fine, at least on newer PHP versions (I believe older ones would cut the string before the \0 when sent to underlying filesystem calls).
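The controller-mapping case could be sketched as follows. The helper name controller_file() and the .php suffix are assumptions for illustration, and this version rejects names with unexpected characters instead of stripping them.

```php
<?php
// Hypothetical helper: map a user-supplied name to a controller file.
function controller_file(string $dir, string $name): ?string
{
    // Allow only letters, digits and underscores; the D modifier stops
    // "$" from matching before a trailing newline.
    if (!preg_match('/^[A-Za-z0-9_]+$/D', $name)) {
        return null;
    }

    $file = rtrim($dir, '/') . '/' . $name . '.php';

    return file_exists($file) ? $file : null;
}
```

With this whitelist there are no directory delimiters left in the name, so file_exists() only ever sees paths directly inside $dir.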
I thought the proper way to "sanitize" incoming data from an HTML form before entering it into a mySQL database was to use real_escape_string on it in the PHP script, like this:
$newsStoryHeadline = $_POST['newsStoryHeadline'];
$newsStoryHeadline = $mysqli->real_escape_string($newsStoryHeadline);
$storyDate = $_POST['storyDate'];
$storyDate = $mysqli->real_escape_string($storyDate);
$storySource = $_POST['storySource'];
$storySource = $mysqli->real_escape_string($storySource);
// etc.
And once that's done you could just insert the data to the DB like this:
$mysqli->query("INSERT INTO NewsStoriesTable (Headline, Date, DateAdded, Source, StoryCopy) VALUES ('".$newsStoryHeadline."', '".$storyDate."', '".$dateAdded."', '".$storySource."', '".$storyText."')");
So I thought doing this would take care of cleaning up all the invisible "junk" characters that may be coming in with your submitted text.
However, I just pasted some text I copied from a web-page into my HTML form, clicked "submit" - which ran the above script and inserted that text into my DB - but when I read that text back from the DB, I discovered that this piece of text did still have junk characters in it, such as â€“.
And those junk characters of course caused the PHP script I wrote that retrieves the information from the DB to crash.
So what am I doing wrong?
Is using real_escape_string not the way to go here? Or should I be using it in conjunction with something else?
OR, is there something I should be doing (like more escaping) when reading data back out from the mySQL database?
(I should mention that I'm an Objective-C developer, not a PHP/mySQL developer, but I've unfortunately been given this task to do some DB stuff - hence my question...)
thanks!
Your assumption is wrong. mysqli_real_escape_string’s only intention is to escape certain characters so that the resulting string can be safely used in a MySQL string literal. That’s it, nothing more, nothing less.
The result should be that exactly the passed data is retained, including ‘junk’. If you don’t want that ‘junk’ in your database, you need to detect, validate, or filter it before passing it to MySQL.
In your case, the ‘junk’ seems to be due to different character encodings: your input data seems to be encoded with UTF-8 while it’s later displayed using Windows-1250. In this scenario, the character – (U+2013) would be encoded with 0xE28093 in UTF-8 which would represent the three characters â, €, and “ in Windows-1250. Properly declaring the document’s encoding would probably fix this.
Sanitization is a tricky subject, because it means different things depending on the context. :)
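The byte sequence can be checked directly, and the encoding can be declared in two places. This is only a sketch: the helper name use_utf8() is made up, and choosing utf8mb4 for the connection is an assumption about your MySQL setup.

```php
<?php
// U+2013 (EN DASH) as a UTF-8 string; the "\u{...}" syntax needs PHP 7 or later.
$dash  = "\u{2013}";
$bytes = strtoupper(bin2hex($dash)); // "E28093"

// Hypothetical setup helper: declare UTF-8 for the page and the connection.
function use_utf8(mysqli $mysqli): void
{
    // Tell the browser how the page is encoded.
    header('Content-Type: text/html; charset=utf-8');
    // Tell MySQL which encoding the client sends and expects.
    $mysqli->set_charset('utf8mb4');
}
```

use_utf8() would be called once, right after connecting and before any output is sent.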
real_escape_string just makes sure your data can be included in a request (inside quotes, of course) without having the possibility to change the "meaning" of the request.
The manual page explains what the function really does: it escapes NUL characters, line feeds, carriage returns, backslashes, single quotes, double quotes, and "Control-Z" (probably the SUBSTITUTE character). So it just inserts a backslash before those characters.
That's it. It "sanitizes" the string so it can be passed unchanged in a request. But it doesn't sanitize it under any other point of view: users can still pass for instance HTML markers, or "strange" characters. You need to make rules depending on what your output format is (most of the time HTML, but HTTP isn't restricted to HTML documents), and what you want to let your users do.
If your code can't handle some characters, or if they have a special meaning in the output format, or if they cause your output to appear "corrupted" in some way, you need to escape or remove them yourself.
You will probably be interested in htmlspecialchars. Control characters generally aren't a problem with HTML. If your output encoding is the same as your input encoding, they won't be displayed and thus won't be an issue for your users (well, maybe for the W3C validator). If you think it is, make your own function to check and remove them.
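If you do decide to remove control characters yourself, a sketch could look like this. The function name is made up, and keeping tab, line feed and carriage return is a choice made for this example.

```php
<?php
// Hypothetical helper: drop ASCII control characters except tab, LF and CR.
function strip_control_chars(string $text): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $text);
}
```

This is a filter applied to the data itself, so it is independent of the SQL escaping and of htmlspecialchars on output.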
I have a PHP script that stores my code snippets.
To insert, I use:
$snippet_code = mysqli_real_escape_string($conn,trim($_POST['snippet_code']));
To display, I use the following which is wrapped in a pre tag:
$snippet_code = htmlentities($row['SnippetText']);
I notice that sometimes I get a lot of escape characters like \\\\ when the snippet is displayed on the page. The escape characters are present wherever single or double quotes appear in the code. The problem seems to be more severe in non-English language browsers.
How can I properly do this? How can I properly store and display code on a page?
Assuming you mean slash escape sequences like \", and not HTML escape sequences like &amp;, try this:
$snippet_code = htmlentities(stripslashes($row['SnippetText']));
If it is actually HTML escapes causing you trouble, just omit the htmlentities call.
If you are getting ' converted to \', your server is probably configured with a legacy option called Magic Quotes. You can read about it in the PHP manual. My advice is to disable them if possible.
Also, check your database. It's possible that your current data is corrupted. If so, you can write a small script that uses stripslashes() to fix it.
From your comments, it seems that you are in fact talking about slashes found before quotes.
It's not clear from the limited information you've given us why non-English browsers would show more of these.
However, it is likely that these slashes should not be present in the first place. Perhaps you are running mysql_real_escape_string several times, instead of just once... but, again, nothing you've shown us indicates that.
Either way, you should fix the data in the database and not just hack around the issue on display.
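To see how stray backslashes can pile up, here is a small sketch of double escaping. addslashes() only stands in for the escaping step so the example runs without a database connection; the sample snippet is made up.

```php
<?php
$original = "echo 'hi';";

// Escaped once, as intended before building the query.
$once = addslashes($original);   // echo \'hi\';

// Escaped a second time, e.g. by Magic Quotes plus a manual escape call.
$twice = addslashes($once);      // echo \\\'hi\\\';

// The database undoes only one level, so the stored value still has slashes.
$stored = stripslashes($twice);  // same as $once
```

A one-off repair script would apply stripslashes() to the affected rows until the stored value matches what was originally typed.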
The csv file was created correctly but the name and address fields contain every piece of punctuation there is available. So when you try to import into mysql you get parsing errors. For example the name field could look like this, "john ""," doe". I have no control over the data I receive so I'm unable to stop people from inputting garbage data. From the example above you can see that if you consider the outside quotes to be the enclosing quotes then it is right but of course mysql, excel, libreoffice, and etc see a whole new field. Is there a way to fix this problem? Some fields I found even have a backslash before the last enclosing quote. I'm at a loss as I have 17 million records to import.
I have windows os and linux so whatever solution you can think of please let me know.
This may not be a usable answer but someone needs to say it. You shouldn't have to do this. CSV is a file format with an expected data encoding. If someone is supplying you a CSV file then it should be delimited and escaped properly, otherwise it's a corrupted file and you should reject it. Make the supplier re-export the file properly from whatever data store it was exported from.
If you asked someone to send you a JPG and they sent what was a proper JPG file with every 5th byte omitted or junk bytes inserted, you wouldn't accept that and say "oh, I'll reconstruct it for you".
You don't say if you have control over the creation of the CSV file. I am assuming you do, as if not, the CSV file is corrupt and cannot be recovered without human intervention, or some very clever algorithms to "guess" the correct delimiters vs the user entered ones.
Convert user entered tabs (assuming there are some) to spaces and then export the data using TABS separator.
If the above is not possible, you need to implement an ESC sequence to ensure that user entered data is not treated as a delimiter.
Your title asks: What is an easy way to clean an unparsable csv file
If it is unparseable, that means that you can't correctly break it up into fields. So you can't clean it.
Your first sentence states: The csv file was created correctly but the name and address fields contain every piece of punctuation there is available.
If the csv file was created correctly, then you can split it into fields correctly. So you can clean it.
Only punctuation? You are lucky. Unvalidated text fields in databases commonly contain nasties like tab, carriage return, line feed, and even Ctrl-Z.
Who says it's "unparsable"? On what grounds? What is their definition of "parsable"?
Who says it was "created correctly"? On what grounds? What is their definition of "correct"?
Could you perhaps show us the relevant parts of say 5 or so lines that are causing you grief? Edit your question and format the examples as code, to make them easier to read. Make it obvious where previous/next fields stop/start e.g.
...,"john ""," doe",...
By the way, the above is NOT "right" under any interpretation; it can't be right, with an ODD number of quote characters none of which is escaped.
My definition of correct: Here is how to emit a CSV field that can be parsed no matter what is in the database [caveat: Python csv module barfs on `\x00']:
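# Python; `field` is the raw string value taken from the database.
if '"' in field:
    # Double every embedded quote, then wrap the whole field in quotes.
    output = '"' + field.replace('"', '""') + '"'
elif any(ch in field for ch in ',\n\r'):
    # Separators and line breaks only need the enclosing quotes.
    output = '"' + field + '"'
else:
    output = field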
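For comparison, PHP's built-in fputcsv() and fgetcsv() apply the same quoting rules. A sketch with made-up sample data, assuming PHP 8 (the empty escape argument needs 7.4 or later):

```php
<?php
// Write one row to an in-memory stream and read it back.
$row = ['42', 'john "," doe', "line one\nline two"];

$stream = fopen('php://memory', 'w+');
// An empty escape string means quotes are escaped only by doubling them.
fputcsv($stream, $row, ',', '"', '');

rewind($stream);
$encoded = stream_get_contents($stream);

rewind($stream);
$decoded = fgetcsv($stream, null, ',', '"', '');
fclose($stream);
```

After the round trip, $decoded holds the same three strings as $row, including the embedded quotes, comma and line break.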
That's a really tough issue. I don't know of any real way to solve it, but maybe you could try splitting on ",", cleaning up the items in the resulting array (unicorns :) ) and then re-joining the row?
MySQL import has many parameters, including the escape character. Given the example, I think the quotes are escaped by putting a quote in front. So an import with ESCAPED BY '"' would work.
First of all - find all kinds of mistakes. And then just replace them with empty strings. Just do it! If you need this corrupted data - only you can recover it.
I am creating a forum software using php and mysql backend, and want to know what is the most secure way to escape user input for forum posts.
I know about htmlentities() and strip_tags() and htmlspecialchars() and mysql_real_escape_string(), and even javascript's escape() but I don't know which to use and where.
What would be the safest way to process these three different types of input (by process, I mean get, save in a database, and display):
A title of a post (which will also be the basis of the URL permalink).
The content of a forum post limited to basic text input.
The content of a forum post which allows html.
I would appreciate an answer that tells me how many of these escape functions I need to use in combination and why.
Thanks!
When generating HTML output (like you're doing to get data into the form's fields when someone is trying to edit a post, or if you need to re-display the form because the user forgot one field, for instance), you'd probably use htmlspecialchars() : it will escape <, >, ", ', and & -- depending on the options you give it.
strip_tags will remove tags if user has entered some -- and you generally don't want something the user typed to just disappear ;-)
At least, not for the "content" field :-)
Once you've got what the user did input in the form (ie, when the form has been submitted), you need to escape it before sending it to the DB.
That's where functions like mysqli_real_escape_string become useful : they escape data for SQL
You might also want to take a look at prepared statements, which might help you a bit ;-)
with mysqli - and with PDO
You should not use anything like addslashes : the escaping it does doesn't depend on the Database engine ; it is better/safer to use a function that fits the engine (MySQL, PostgreSQL, ...) you are working with : it'll know precisely what to escape, and how.
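As a rough sketch of the prepared-statement route with PDO: the helper name save_post() and the posts table with its two columns are made up for illustration.

```php
<?php
// Hypothetical helper: insert a post with bound parameters
// instead of escaping and concatenating strings by hand.
function save_post(PDO $pdo, string $title, string $content): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO posts (title, content) VALUES (:title, :content)'
    );
    // The values travel separately from the SQL text.
    $stmt->execute([':title' => $title, ':content' => $content]);
}
```

The data is stored exactly as typed, quotes included, so escaping for HTML still has to happen when the post is displayed.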
Finally, to display the data inside a page :
for fields that must not contain HTML, you should use htmlspecialchars() : if the user did input HTML tags, those will be displayed as-is, and not injected as HTML.
for fields that can contain HTML... This is a bit trickier : you will probably only want to allow a few tags, and strip_tags (which can do that) is not really up to the task (it leaves the attributes of the allowed tags in place)
You might want to take a look at a tool called HTML Purifier : it will allow you to specify which tags and attributes should be allowed -- and it generates valid HTML, which is always nice ^^
This might take some time to compute, and you probably don't want to re-generate that HTML each time it has to be displayed ; so you can think about storing it in the database (either only keeping that clean HTML, or keeping both it and the not-clean one, in two separate fields -- might be useful to allow people editing their posts ? )
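A quick sketch of the strip_tags limitation, using a made-up input string:

```php
<?php
$dirty = '<b onmouseover="alert(1)">bold</b> <script>evil()</script>';

// Allow only <b>; the second argument lists the permitted tags.
$stripped = strip_tags($dirty, '<b>');

// The <script> tags are gone, but the onmouseover attribute on <b> is kept.
$stillHasHandler = strpos($stripped, 'onmouseover') !== false; // true
```

So the allowed tag can still carry an event handler, which is why an attribute-aware filter is needed for fields that accept HTML.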
Those are only a few pointers... hope they help you :-)
Don't hesitate to ask if you have more precise questions !
mysql_real_escape_string() escapes everything you need to put in a mysql database. But you should use prepared statements (in mysqli) instead, because they're cleaner and do any escaping automatically.
Anything else can be done with htmlspecialchars() to neutralize HTML in the input and urlencode() to put things in a format for URLs.
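For a title that also becomes the permalink, one option is to derive a separate slug rather than URL-encode the raw title. A sketch with a made-up helper name, handling ASCII letters and digits only:

```php
<?php
// Hypothetical helper: turn a post title into a URL-safe slug.
function slugify(string $title): string
{
    $slug = strtolower($title);
    // Collapse every run of characters other than a-z and 0-9 into one hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);

    return trim($slug, '-');
}
```

The original title is still stored as typed and passed through htmlspecialchars() when displayed; only the slug goes into the URL. Titles without any ASCII letters or digits would produce an empty slug, so a fallback such as the post ID would be needed.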
There are two completely different types of attack you have to defend against:
SQL injection: input that tries to manipulate your DB. mysql_real_escape_string() and addslashes() are meant to defend against this. The former is better, but parameterized queries are better still
Cross-Site scripting (XSS): input that, when displayed on your page, tries to execute JavaScript in a visitor's browser to do all kinds of things (like steal the user's account data). htmlspecialchars() is the definite way to defend against this.
Allowing "some HTML" while avoiding XSS attacks is very, very hard. This is because there are endless possibilities of smuggling JavaScript into HTML. If you decided to do this, the safe way is to use BBCode or Markdown, i.e. a limited set of non-HTML markup that you then convert to HTML, while removing all real HTML with htmlspecialchars(). Even then you have to be careful not to allow javascript: URLs in links. Actually allowing users to input HTML is something you should only do if it's absolutely crucial for your site. And then you should spend a lot of time making sure you understand HTML and JavaScript and CSS completely.
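A very small sketch of the "escape everything, then convert limited markup" idea. The function name and the supported tag set ([b], [i], [url=...]) are made up for illustration, and this is nowhere near a complete BBCode parser.

```php
<?php
// Hypothetical converter for a tiny BBCode subset: [b], [i] and [url=...].
function bbcode_to_html(string $text): string
{
    // Escape first, so any real HTML in the post becomes inert text.
    $html = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    $html = preg_replace('#\[b\](.*?)\[/b\]#s', '<strong>$1</strong>', $html);
    $html = preg_replace('#\[i\](.*?)\[/i\]#s', '<em>$1</em>', $html);

    // Only http:// and https:// links are converted;
    // javascript: URLs do not match and stay as plain text.
    $html = preg_replace(
        '#\[url=(https?://[^\]\s]+)\](.*?)\[/url\]#si',
        '<a href="$1" rel="nofollow">$2</a>',
        $html
    );

    // Keep the author's line breaks.
    return nl2br($html);
}
```

The order matters: because htmlspecialchars() runs before the tags are converted, the only HTML in the output is what this function itself generates.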
The answer to this post is a good answer
Basically, using the pdo interface to parameterize your queries is much safer and less error prone than escaping your inputs manually.
I have a tendency to escape all characters that would be problematic in page display, Javascript and SQL all at the same time. It leaves it readable on the web and in HTML eMail and at the same time removes any problems with the code.
A vb.NET Line Of Code Would Be:
SafeComment = Replace( _
Replace(Replace(Replace( _
Replace(Replace(Replace( _
Replace(Replace(Replace( _
Replace(Replace(Replace( _
HttpUtility.HtmlEncode(Trim(strInput)), _
":", "&#58;"), "-", "&#45;"), "|", "&#124;"), _
"`", "&#96;"), "(", "&#40;"), ")", "&#41;"), _
"%", "&#37;"), "^", "&#94;"), """", "&#34;"), _
"/", "&#47;"), "*", "&#42;"), "\", "&#92;"), _
"'", "&#39;")
First of all, general advice: don't escape variables literally when inserting in the database. There are plenty of solutions that let you use prepared statements with variable binding. The reason to not do this explicitly is because it is only a matter of time then before you forget it just once.
If you're inserting plain text in the database, don't try to clean it on insert, but instead clean it on display. That is to say, use htmlentities to encode it as HTML (and pass the correct charset argument). You want to encode on display because then you're no longer trusting that the database contents are correct, which isn't necessarily a given.
If you're dealing with rich text (html), things get more complicated. Removing the "evil" bits from HTML without destroying the message is a difficult problem. Realistically speaking, you'll have to resort to a standardized solution, like HTMLPurifier. However, this is generally too slow to run on every page view, so you'll be forced to do this when writing to the database. You'll also have to ensure that the user can see their "cleaned up" html and correct the cleaned up version.
Definitely try to avoid "rolling your own" filter or encoding solution at any step. These problems are notoriously tricky, and you run a large risk of overlooking some minor detail that has big security implications.
I second Joeri, do not roll your own, go here to see some of the many possible XSS attacks
http://ha.ckers.org/xss.html
htmlentities() -> turns text into html, converting characters to entities. If using UTF-8 encoding then use htmlspecialchars() instead as the other entities are not needed. This is the best defence against XSS. I use it on every variable I output regardless of type or origin unless I intend it to be html. There is only a tiny performance cost and it is easier than trying to work out what needs escaping and what doesn't.
strip_tags() - turns html into text by removing all html tags. Use this to ensure that there is nothing nasty in your input as an adjunct to escaping your output.
mysql_real_escape_string() - escapes a string for mysql and is your defence against SQL injections from little Bobby tables (better to use mysqli and prepare/bind as escaping is then done for you and you can avoid lots of messy string concatenations)
The advice given above about avoiding HTML input unless it is essential and opting for BBCode or similar (make your own up if needs be) is very sound indeed.