I have been searching everywhere to try and find a solution to this. I have recently been running scans on our websites to find any vulnerabilities to XSS and SQL Injection. Some items have been brought to my attention.
All user-inputted data is now validated and sanitized using filter_var().
My issue now is with XSS and with people manipulating the URL. The simple one which seems to be everywhere is:
http://www.domainname.com/script.php/"><script>alert('xss');</script>
This then changes some of the $_SERVER variables and causes all of my relative paths to CSS, links, images, etc. to be invalid, and the page doesn't load correctly.
I clean any variables that are used within the script, but I am not sure how I get around removing this unwanted data in the URL.
Thanks in advance.
Addition:
This then causes a simple link in a template file:
<a href="anotherpage.php">Link</a>
to actually link to:
"http://www.domainname.com/script.php/"><script>alert('xss');</script>/anotherpage.php
This sounds like you made a big mistake with your website, and you should rethink how you inject link information from the input into your output.
Filtering input alone does not help here; you need to filter the output as well.
Often it's easier, when your application receives a request that does not match the set of allowed requests, to simply return a 404 error.
I am not sure how I get around removing this unwanted data in the URL.
Actually, the request has already been sent, so the URL is set. You can't "change" it; it's just the information about what was requested.
It's now up to you to deal with it, not to blindly pass it along any longer, e.g. into your output (where it then breaks your links).
Edit: You now wrote more specifically what you're concerned about. I would agree with dqhendricks here: who cares?
If you really feel uncomfortable with the fact that a user is just using her browser and entering whatever URL she likes, well, the technically correct response is:
400 Bad Request (ref)
And return a page with no URIs, only fully-qualified (absolute) URIs, or a redefinition of the base URI; otherwise the browser will take the URI entered into its address bar as the base URI. See RFC 3986, Uniform Resource Identifier (URI): Generic Syntax, Section 5: Reference Resolution.
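A minimal sketch of that base-URI redefinition (the host is just the example domain from the question; in a real application it would come from configuration):

<?php
// Emit a <base> element so relative URIs resolve against a known base
// rather than against whatever was typed into the address bar.
// The host below is just the example domain from the question.
$base = 'http://www.domainname.com/';
?>
<base href="<?= htmlspecialchars($base, ENT_QUOTES, 'UTF-8') ?>">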
First, if someone adds that crap to their URL, who cares if the page doesn't load images correctly? Also, if the request isn't valid, why would it load any page at all? And why are you using $_SERVER vars to build paths anyway?
Second, you should also be escaping any user-submitted input with the appropriate escaping method for your particular database to avoid SQL injection; filter_var() generally will not help with that.
Third, XSS is simple to protect against: any user-submitted data that is to be displayed on any page needs to be escaped with htmlspecialchars(). This is easier to ensure if you use a view class that you can build this escaping into.
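For instance, a minimal sketch of a view helper with the escaping built in (the class and method names are made up):

<?php
// Tiny view helper that centralizes HTML escaping, as suggested above.
// Class and method names are illustrative.
class View
{
    // Escape any user-submitted value before it hits the page.
    public function e($value)
    {
        return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
    }
}

$view = new View();
echo '<p>' . $view->e($_GET['comment'] ?? '') . '</p>';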
To your concern about XSS: the altered URL won't get into your page unless you blindly use the related $_SERVER variables. The fact that the relative links appear to include the injected script is just how the browser resolves relative URLs against the corrupted path; the only risk is broken relative links. Since you are not blindly using the $_SERVER variables, you don't have to worry.
To your concern about your relative paths breaking: Don't use relative paths. Reference all your resources with at least a root-of-domain path (starting with a slash) and this sort of URL corruption will not break your site in the way you described.
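For example, root-relative references resolve against the domain root, so extra path segments appended after script.php cannot break them (the file names below are made up):

<?php /* Root-relative asset references; the file names are illustrative. */ ?>
<link rel="stylesheet" href="/css/style.css">
<img src="/images/logo.png" alt="Logo">
<a href="/anotherpage.php">Link</a>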
Related
I am concerned about the safety of fetching content from unknown url in PHP.
We will basically use cURL to fetch html content from user provided url and look for Open Graph meta tags, to show the links as content cards.
Because the url is provided by the user, I am worried about the possibility of getting malicious code in the process.
I have another question: does curl_exec actually download the full file to the server? If yes then is it possible that viruses or malware be downloaded when using curl?
Using cURL is similar to using fopen() and fread() to fetch content from a file.
Safe or not, depends on what you're doing with the fetched content.
From your description, your server works as some kind of intermediary that extracts specific subcontent from a fetched HTML content.
Even if the fetched content contains malicious code, your server never executes it, so no harm will come to your server.
Additionally, because your server only extracts specific subcontent (Open Graph meta tags, as you say),
everything else that is not what you're looking for in the fetched content is ignored,
which means your users are automatically protected.
Thus, in my opinion, there is no need to worry.
Of course, this relies on the assumption that the content extraction process is sound.
Someone should take a look at it and confirm it.
does curl_exec actually download the full file to the server?
It depends on what you mean by "full file".
If you mean "the entire HTML content", then yes.
If you mean "including all the CSS and JS files that the fetched HTML content may refer to", then no.
is it possible that viruses or malware be downloaded when using curl?
The answer is yes.
The fetched HTML content may contain malicious code; however, if you don't execute it, no harm will come to you.
Again, I'm assuming that your content extraction process is sound.
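As an illustration of what a sound extraction process can look like, here is a minimal sketch (the URL is hypothetical and error handling is kept to a minimum) that parses the fetched HTML with DOMDocument and keeps only the Open Graph meta tags:

<?php
// Fetch a page and extract only the Open Graph meta tags.
// The URL is hypothetical; timeouts and limits are arbitrary choices.
$url = 'http://www.example.com/some-article';

$ch = curl_init($url);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_MAXREDIRS      => 3,
    CURLOPT_TIMEOUT        => 10,
]);
$html = curl_exec($ch);
curl_close($ch);

$ogTags = [];
if ($html !== false) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate malformed HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $property = $meta->getAttribute('property');
        if (strpos($property, 'og:') === 0) {
            // Escape before the value is ever echoed back into a page.
            $ogTags[$property] = htmlspecialchars(
                $meta->getAttribute('content'), ENT_QUOTES, 'UTF-8'
            );
        }
    }
}

print_r($ogTags); // e.g. ['og:title' => ..., 'og:image' => ...]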
The short answer is that file_get_contents() is safe for retrieving data, and so is cURL. It is up to you what you do with that data.
A few guidelines:
1. Never run eval() on that data.
2. Don't save it to the database without filtering.
3. You don't even need file_get_contents() or cURL for this.
Use get_meta_tags():
array get_meta_tags ( string $filename [, bool $use_include_path = false ] )
// Example
$tags = get_meta_tags('http://www.example.com/');
You will have all meta tags parsed and filtered into an array. (Note that get_meta_tags() only reads <meta> tags with a name attribute, so Open Graph tags, which use the property attribute, may not show up.)
You can use httpclient.class instead of file_get_contents() or cURL, because it connects to the page through a socket. After downloading the data you can extract the meta data using preg_match().
Expanding on the answer made by Ray Radin.
Tips on precautionary measures
He is correct that if you use a sound process to search the fetched resource, there should be no problem in fetching whatever URL is provided. Some examples are:
Don't store the file in a public-facing directory on your web server; that exposes you to it being executed.
Don't store it in a database; this might lead to a second-order SQL injection attack.
In general, don't store anything from the resource you are requesting; if you have to, use a specific whitelist of what you are searching for.
Check the header information
Even though there is no foolproof way of validating what you are requesting with a specific URL, there are ways you can make your life easier and prevent some potential issues.
For example, a URL might point to a large binary, a large image file or something similar.
Make a HEAD request first to get the header information, then look at the Content-Type and Content-Length headers to see if the content is a plain-text HTML file.
You should, however, not trust these headers, since they can be spoofed. Checking them will nevertheless make sure that even non-malicious content won't crash your script; requesting image files is presumably something you don't want users to do.
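A minimal sketch of that HEAD check with cURL (the URL and the size limit are arbitrary assumptions):

<?php
// Issue a HEAD request and inspect the headers before fetching the body.
// The URL and the 2 MB limit are arbitrary choices for illustration.
$url = 'http://www.example.com/some-article';

$ch = curl_init($url);
curl_setopt_array($ch, [
    CURLOPT_NOBODY         => true,   // HEAD request: headers only
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_TIMEOUT        => 5,
]);
curl_exec($ch);

$contentType   = (string) curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
$contentLength = (float)  curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
curl_close($ch);

$looksLikeHtml = stripos($contentType, 'text/html') === 0;
$smallEnough   = $contentLength <= 0 || $contentLength < 2 * 1024 * 1024; // unknown or < 2 MB

if ($looksLikeHtml && $smallEnough) {
    // Reasonable to proceed with the real GET request here.
}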
Guzzle
I recommend using Guzzle to do your requests since, in my opinion, it provides some functionality that should make this easier.
It is safe, but you will need to do a proper data check before using it, as you should with any data input anyway.
I'm setting up a website where visitors will be greeted by a splash screen where they will choose a color scheme for the actual website; based on their selection, the actual website will load with a different stylesheet. I gather this can be done by concatenating a flag to the URL, then reading the flagged URL on the next page to determine the stylesheet to be loaded (for example, to load the dark theme, the URL would become http://www.mywebsite.com/index-dark; clicking the light theme link would make the URL http://www.mywebsite.com/index-light). Problem is, I don't know how to add a flag to a URL, or how to read this flag on a different page. I've tried Googling the issue, but have found little practical information. How can this be done?
EDIT: I'd like to avoid using two separate pages, as I'll have multiple themes; that would mean copying basically every HTML page in my root multiple times, taking up space. I like the idea of a concealed $_GET variable, though.
Without more information, I can only give some general advice.
So I'm going to assume that you are building the page in PHP. You could have two different URLs and use mod_rewrite to convert /index-dark to /index?style=dark, but that's crappy.
What you probably want is to use a cookie or a session. Basically you check a cookie, or session value, for the theme setting and then pick the appropriate CSS file when you generate the page.
This has several advantages:
Doesn't require using URL rewriting, an error-prone endeavour at the best of times
Allows for a persistent setting (if you use a cookie) and doesn't involve complicated URLs.
Allows for adding more themes without changing mountains of code; just add the setting to the theme selector and the new CSS file.
GET variables are generally only useful for specific data sent with that request, a bit like POST variables are mostly for forms and submitted data. If you want persistent settings, then a session/cookie is the best option.
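A minimal sketch of that session/cookie approach (the parameter, cookie and file names are placeholders):

<?php
// Remember the chosen theme in the session (and a cookie for persistence)
// and pick the stylesheet when rendering. Names are placeholders.
session_start();

$allowedThemes = ['light', 'dark'];

// The splash-screen links could point to e.g. index.php?theme=dark
if (isset($_GET['theme']) && in_array($_GET['theme'], $allowedThemes, true)) {
    $_SESSION['theme'] = $_GET['theme'];
    setcookie('theme', $_GET['theme'], time() + 30 * 24 * 3600, '/');
}

$theme = $_SESSION['theme'] ?? ($_COOKIE['theme'] ?? 'light');
if (!in_array($theme, $allowedThemes, true)) {
    $theme = 'light';
}
?>
<link rel="stylesheet" href="/css/theme-<?= htmlspecialchars($theme) ?>.css">

Adding a new theme then only means adding its name to $allowedThemes and dropping in the matching CSS file.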
The "flags" you're mentioning are probably actually $_GET variables that have been disguised using mod_rewrite. What you can do is edit your .htaccess file to add in rewrite rules that change, say, www.mywebsite.com/index.php?style=index-dark to www.mywebsite.com/index-dark (unfortunately I don't have experience in how exactly to do this; I just know that it can be done) and have your PHP catch $_GET['style'].
I am thinking of secure ways to serve HTML and JSON to JavaScript. Currently I am just outputting the JSON like:
ajax.php?type=article&id=15
{
"name": "something",
"content": "some content"
}
but I do realize this is a security risk -- because the articles are created by users. So someone could insert script tags (just an example) into the content and link to his article directly in the AJAX API. Thus, I am now wondering what's the best way to prevent such issues. One way would be to encode all non-alphanumeric characters in the input, and then decode them in JavaScript (and encode again when they are put somewhere).
Another option could be to send some headers that force the browser to never render the response of the AJAX API requests (Content-Type and X-Content-Type-Options).
If you set the Content-Type to application/json then no browser will execute JavaScript on that page. This is part of RFC 4627, and Google uses this to protect themselves. Other application/* content types follow similar rules.
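A minimal sketch of sending that content type (plus the X-Content-Type-Options header mentioned in the question; the payload is just the example from above):

<?php
// Serve JSON with an explicit content type and nosniff, and let
// json_encode() escape characters that are risky inside HTML.
$article = [
    'name'    => 'something',
    'content' => 'some content',
];

header('Content-Type: application/json; charset=utf-8');
header('X-Content-Type-Options: nosniff');

echo json_encode($article, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);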
You still have to worry about DOM-based XSS; however, that would be a problem with your JavaScript, not really with the content of the JSON. Another, more exotic, security concern with JSON is information leakage, like this vulnerability in Gmail.
Make sure to always test your code. There is the free Sitewatch XSS scanner, or the open-source Skipfish, and finally you could test this manually with a simple <script>alert(/xss/)</script>.
Instead of worrying about how you could encode the malicious code when you return it, you should probably take care that it does not even get into your database. A quick google search about preventing cross-site scripting and input validation might help you here. Cheers
If the user has to be logged in to view the web page then secure the ajax.php with the same authorization mechanism. Then a client that's not logged in cannot access ajax.php directly to retrieve the data.
I don't think your question is about validating user input, as others pointed out. You don't want to provide your JSON api to other people... right?
If this is the case then there isn't much you can do... in fact, even if you were serving HTML instead of JSON, people would still be doing HTML scraping to get what they wanted from your site (this is how Search Engine spiders work).
A good way to prevent scraping is to allow only a specific number of downloads from an IP address. This way, if someone is requesting http://yoursite.com/somejson.json more than 100 times a day, you probably know it's a scraper and not someone visiting your page 100 times in one day.
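A minimal sketch of such a per-IP limit using APCu as the counter store (the limit and key format are arbitrary assumptions; a database table would work just as well):

<?php
// Count requests per IP per day and reject clients over a daily limit.
// The limit and key format are arbitrary; swap APCu for a DB if needed.
$limit = 100;
$ip    = $_SERVER['REMOTE_ADDR'] ?? 'unknown';
$key   = 'dl:' . $ip . ':' . date('Y-m-d');

apcu_add($key, 0, 86400);        // create the counter with a one-day TTL
$count = apcu_inc($key);

if ($count > $limit) {
    http_response_code(429);     // Too Many Requests
    exit('Rate limit exceeded');
}

// ...serve somejson.json as usual...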
Insertion of script tags (or SQL) is only a problem if you fail to ensure it isn't at the point that it could be a problem.
A <script> tag in the middle of a comment that somebody submits will not hurt your server and it won't hurt your database. What it would hurt, if you fail to take appropriate measures, would be a page that includes the comment when you subsequently serve it up and it reaches a client browser. In order to prevent that from happening, your code that prepares the page must make sure that user-supplied content is always scrubbed before it is exposed to an unaware interpreter. In this case, that unaware interpreter is a client web browser. In fact, your client web browser really involves two unaware interpreters: the HTML parser & layout engine and the Javascript interpreter.
Another important example of an unaware interpreter is your database server. Note that a <script> tag is (almost certainly) harmless to your database, because "<script>" doesn't mean anything in SQL. It's other sorts of input that cause problems for SQL, like quotes in strings (which are harmless to your HTML pages!).
Stackoverflow would be pretty lame if I couldn't put <script> tags in my answers, as I'm doing now. Same goes for examples of SQL Injection attacks. Recently somebody linked a page from some prominent US bank, where a big <textarea> was footnoted by a warning not to include the characters "<" or ">" in whatever you typed. Predictably, the bank was ridiculed over hundreds of Reddit comments, and rightly so.
Exactly how you "scrub" user-supplied content depends on the unaware interpreter to which you're delivering it. If it's going to be dropped in the middle of HTML markup, then you have to make sure that the "<", ">", and "&" characters are all encoded as HTML entities. (You might want to do quote characters too, if the content might end up in an HTML element attribute value.) If the content is to be dropped into Javascript, however, you may not need to worry about HTML escaping, but you do need to worry about quotes, and possibly Unicode characters outside the 7-bit range.
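A minimal sketch of scrubbing the same user-supplied value for each of those interpreters (the variable names are placeholders):

<?php
// The same user-supplied value, escaped differently per target interpreter.
$userComment = '<script>alert("xss")</script> & "quotes"';

// 1. Dropped into HTML markup or an attribute value:
echo '<div class="comment">'
   . htmlspecialchars($userComment, ENT_QUOTES, 'UTF-8')
   . '</div>';

// 2. Dropped into JavaScript: let json_encode() handle quoting and escaping.
echo '<script>var comment = '
   . json_encode($userComment, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT)
   . ';</script>';

// 3. Sent to the database: prefer a parameterized query over manual escaping.
// $stmt = $pdo->prepare('INSERT INTO comments (body) VALUES (?)');
// $stmt->execute([$userComment]);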
For outputting safe html from php, I recommend http://htmlpurifier.org/
For example, I'd like to have my registration, about and contact pages resolve to different content, but via hash tags:
Three links, one each to the registration, contact and about pages:
www.site.com/index.php#about
www.site.com/index.php#registration
www.site.com/index.php#contact
Is there a way using Javascript or PHP to resolve these pages to the separated content?
The hash is not sent to the server, so you can only do it in Javascript.
Check the value of location.hash.
There's no server-side way to do it. You could work with AJAX, but this will break the site for non-javascript users. The best way would probably be to have server-side content URLs (index.php?page=<page_id>) and rewrite these locally with JavaScript (to #<page_id>) and handle the content loading with AJAX then. That way you can have your hash-URLs for JS-enabled devices and everybody else can still use the site.
It does, however, require a bit of redundancy, because you need to provide the same content twice: once for inclusion via AJAX and once with the proper layout and everything via PHP.
If you just want hash URLs for aesthetic reasons, but don't want to rely on JS, you're out of luck. The semantics of URLs are against you: fragment IDs shouldn't really affect the content the URL is referring to, merely the fragment within that content. AJAX URLs are changing those semantics, but there's no good reason to do that if you don't have to.
I suppose you probably have a good reason, but can I ask why you would do this? It breaks the widely understood standard of how hashes in URLs are supposed to work, and it's just begging for interoperability trouble with other clients down the road.
You can try to use PHP's global $_SERVER variables to grab the requested URL, but as noted above the hash fragment is never sent to the server, so it cannot be parsed out server-side.
What would be the safest way to include pages with $_GET without putting allowed pages in an array / using a switch, etc.? I have many pages, so no thank you.
$content = addslashes($_GET['content']);
if (file_exists(PAGE_PATH."$content.html")) {
include(PAGE_PATH."$content.html");
}
How safe is that?
Thanks.
This is very bad practice. You should setup a controller to handle dispatching to the code that needs to be executed or retrieved rather than trying to directly include it from a variable supplied by a user. You shouldn't trust user input when including files, ever. You have nothing to prevent them from including things you do not want included.
You'll sleep safer if you check the input for a valid pattern, e.g. suppose you know the included files never live in a subdirectory and are always alphanumeric:
// Only accept purely alphanumeric page names, then serve the static file.
if (preg_match('/^[a-z0-9]+$/', $_GET['page']))
{
    $file = PAGE_PATH . $_GET['page'] . ".html";
    if (file_exists($file))
    {
        readfile($file);
    }
}
I've used readfile, as if the .html files are just static, there's no need to use include.
The possible flaw with your approach is that you can engineer a path to any HTML file in the system, and have it executed as PHP. If you could find some way to get an HTML file of your own devising on the filesystem, you can execute it through your script.
Match it against a regex that only accepts "a-zA-Z-".
Edit: I don't think that blocking specific patterns is a good idea. I'd rather use, like I said, a regex that only accepts characters that we know won't cause exploits.
Assuming the "allowed" set of pages all exist under PAGE_PATH, then I might suggest something like the following:
Trim page name
Reject page names which start with a slash (could be an attempt at an absolute path)
Reject page names which contain .. (could be an attempt at path traversal)
Explicitly prefix PAGE_PATH and include the (hopefully) safe path
If your page names all follow some consistent rules, e.g. alphanumeric characters, then you could in theory use a regular expression to validate, rejecting "bad" page names.
There's some more discussion of these issues on the PHP web site.
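Putting those steps together, a minimal sketch might look like this (PAGE_PATH is assumed to be defined elsewhere, as in the question; the rest is illustrative):

<?php
// Validate the requested page name before touching the filesystem.
// PAGE_PATH is assumed to be defined elsewhere, as in the question.
$page = trim($_GET['content'] ?? '');

$badRequest =
    $page === '' ||
    $page[0] === '/' ||                       // reject absolute paths
    strpos($page, '..') !== false ||          // reject path traversal
    !preg_match('/^[a-zA-Z0-9_-]+$/', $page); // optional character whitelist

if ($badRequest) {
    http_response_code(404);
    exit;
}

$file = PAGE_PATH . $page . '.html';
if (file_exists($file)) {
    readfile($file);   // static HTML, so no need to include() it
} else {
    http_response_code(404);
}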
It looks generally safe, in that you are checking that the page actually exists before displaying it. However, you may wish to create a blacklist of pages that people should not be able to view without valid $_SESSION credentials. This can be done either with an array/switch, or you can simply keep all special pages in a certain directory and check for that.
You could scan the directory containing all HTML templates first and cache all template names in an array that you can validate the GET parameter against.
But even if you cache this array it still creates some kind of overhead.
Don't. You'll never anticipate all possible attacks and you'll get hacked.
If you want to keep your code free of arrays and such, use a database with two columns, ID and path. Request the page by numeric ID, and ignore all requests for IDs that are not purely numeric or are not in your range of valid IDs. If you're concerned about SEO, you can add arbitrary page names after the numeric ID in your links, just like Stack Overflow does.
The database need not be heavy-duty. You can use SQLite, for example.
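A minimal sketch of that lookup with SQLite via PDO (the database path, table and column names are assumptions):

<?php
// Map a numeric ID to a page path stored in SQLite; the user never
// supplies a path. DB path, table and column names are assumptions.
$pdo = new PDO('sqlite:' . __DIR__ . '/pages.sqlite');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$id = $_GET['id'] ?? '';
if (!ctype_digit($id)) {
    http_response_code(404);   // not purely numeric
    exit;
}

$stmt = $pdo->prepare('SELECT path FROM pages WHERE id = ?');
$stmt->execute([(int) $id]);
$path = $stmt->fetchColumn();

if ($path === false) {
    http_response_code(404);   // unknown ID
    exit;
}

readfile(PAGE_PATH . $path);   // the path comes from our table, not the user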
The safest method involves cleaning up the request a bit.
Strip out any ../
Strip out ^\/
From there, check that the file they're requesting exists and can be read. Then just include it.
You should use at least something like that to prevent XSS attacks.
$content = htmlentities($_GET['page'], ENT_QUOTES, 'UTF-8');
And addslashes() won't protect you from SQL injection.