How to prevent html or javascript injection with server-side php

Hoping this isn't a duplicate, I couldn't find an original question on the topic. If you have an area for users to input data, how do you store and retrieve the data without them inserting javascript or html?
As an example, say a user is making a forum post. They decide to write an HTML list or a JavaScript function that runs when the post is viewed. How do you mitigate this when you receive their input on the server side? Specifically, a PHP server side.
Remove parts of their string data based on patterns?
Use an html tag around their entry like ?
Thanks

All you have to do, going for the bare minimum, is replace < with &lt;.

I use HTML Purifier to strip out the bits I don't want and leave in the bits I do. The default rules are pretty good, but it offers enormous flexibility if you need it.

You have to remove or translate the offending parts of their post. You can do it once as the post is coming in, and save the translated post in the database, or you can do it every time you display the post, and store the raw post in the database. Both approaches have their good and bad points.
As to how to strip the bad stuff, using simple matching to replace all < and > with &lt; and &gt; goes a long way -- but there's plenty more to do besides that.
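A minimal sketch of the "translate on display" approach, using PHP's built-in htmlspecialchars rather than hand-rolled replacement. The helper name escape_html is made up for illustration, and UTF-8 input is assumed.

```php
<?php
// Escape user text for an HTML body or a quoted attribute value.
// escape_html() is a hypothetical helper name, not a PHP built-in.
function escape_html(string $text): string
{
    // ENT_QUOTES converts both " and '; ENT_SUBSTITUTE replaces invalid
    // byte sequences instead of returning an empty string.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$post = '<script>alert("hi")</script>';
echo escape_html($post), "\n";
// prints: &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```

This stores the raw post and escapes it every time it is displayed, which is one of the two options described above.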

There are lots of tutorials out there on preventing code injection. Microsoft's, found here, is pretty comprehensive.
For HTML injection, depending on how thorough you want to be, you can usually just run the input through a string parser that checks for < and > and removes them, with no exceptions.

Related

How dangerous is it to output certain content without escaping it first

Following on from a question I asked about escaping content when building a custom CMS, I wanted to find out how dangerous not escaping content from the DB can be. Assume the data has been filtered/validated prior to insertion in the DB.
I know it's a best practice to escape output but I'm just not sure how easy or even possible it is for someone to 'inject' a value into page content that is to be displayed.
For example let's assume this content with HTML markup is displayed using a simple echo statement:
<p>hello</p>
Admittedly it won't win any awards as far as content writing goes ;)
My question is can someone alter that for evil purposes assuming filtered/validated prior to db insertion?

Always escape for the appropriate context; it doesn't matter if it's JSON or XML/HTML or CSV or SQL (although you should be using placeholders for SQL and a library for JSON), etc.
Why? Because it's consistent. And being consistent is also a form of being lazy: you don't need to ponder if the data is "safe for HTML" because it shouldn't matter. And being lazy (in a good way) is a valuable programming trait. (In this case it's also being lazy about avoiding having to fix "bugs" due to changes in the future.)
Don't omit escaping "because it will never contain data that needs to be escaped" .. because, one day, over a course of a number of situations, that assumption will be wrong.
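To make "escape for the appropriate context" concrete, here is a small sketch that prepares the same value for three different output contexts. The variable names are illustrative only; the functions and flags are standard PHP.

```php
<?php
$value = '</script><b>Tom & "Jerry"</b>';

// HTML body or quoted attribute value: entity-encode.
$forHtml = htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// Inside an inline <script> block: JSON-encode, hex-escaping the
// characters that could otherwise close the script element.
$forJs = json_encode($value, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);

// URL query component: percent-encode.
$forUrl = rawurlencode($value);

echo $forHtml, "\n", $forJs, "\n", $forUrl, "\n";
```

The point is that no single function is "the" escaping function; each output context has its own.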

If you do not escape your HTML output, one could simply insert scripts into the HTML code of your page - running in the browser of every client that visits your page. It is called Cross-site scripting (XSS).
For example:
<p>hello</p><script>alert('I could run any other Javascript code here!');</script>
In the place of the alert(), you can use basically anything: access cookies, manipulate the DOM, communicate with other servers, et cetera.
Well, this is a very easy way of inserting scripts, and strip_tags can protect against this one. But there are hundreds of more sophisticated tricks, that strip_tags simply won't protect against.
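A short sketch of one such gap, assuming you allow the p tag through strip_tags: the allowed tag keeps all of its attributes, including event handlers, and the text inside the removed script element stays in the output.

```php
<?php
$input = '<p onclick="alert(1)">hello</p><script>alert(2)</script>';

// Allow only <p>. The <script> tags are removed, but their text stays,
// and the onclick attribute on the allowed <p> is left untouched.
$stripped = strip_tags($input, '<p>');

echo $stripped, "\n";
// prints: <p onclick="alert(1)">hello</p>alert(2)
```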
If you really want to store and output HTML, HTMLPurifier could be your solution:
Hackers have a huge arsenal of XSS vectors hidden within the depths of
the HTML specification. HTML Purifier is effective because it
decomposes the whole document into tokens and removing non-whitelisted
elements, checking the well-formedness and nesting of tags, and
validating all attributes according to their RFCs. HTML Purifier's
comprehensive algorithms are complemented by a breadth of knowledge,
ensuring that richly formatted documents pass through unstripped.

It could also be a problem in combination with other vulnerabilities, e.g. SQL injection. Someone could then bypass the filtering/validation done before insertion into the DB and get the page to display whatever they want.

If you are pulling the word hello from the database and displaying it, nothing will happen. If the content contains <script> tags, though, then it is dangerous, because a user's cookies can be stolen and used to hijack their session.

Write links in a natural and optimized way using JavaScript and/or PHP

The admin users of a module that I'm developing want a feature that automatically turns URLs into links in the textarea(s) they fill in.
For example, if they write:
Please visit our page http://page.com
They want http://page.com to be converted automatically into a link:
http://page.com
I want to do this in the best possible way in terms of usability and performance.
I can't change the type of field (textarea), but I can make modifications with PHP and JavaScript, which is always enabled (no frameworks).
The users frequently edit the fields, and the links only matter when they "publish" the forms, because the content of those textarea(s) is displayed inside an HTML table.
A textarea input could have more than one link.
I appreciate your opinions and points of view to resolve this common situation.

In my opinion, you should handle this situation:
using PHP,
after reading the textarea contents from DB where it was stored,
before sending the HTML output
I don't know the details of your application context and its users, but when you output any user input as HTML, you must take care of security issues such as XSS attacks.
If $textarea_contents is the variable where the textarea contents are (read from the DB), I would apply the htmlspecialchars function first:
$output = htmlspecialchars( $textarea_contents );
After this, you can parse the output string or use a regular expression to transform the URLs into anchor elements. Your choice depends on the level of precision you want. A couple of choices are:
http://code.iamcal.com/php/lib_autolink/lib_autolink.phps
http://jmrware.com/articles/2010/linkifyurl/linkify.html
And it is good to know this recommended reading about the complex problem of linkifying strings (from the creator of Stack Overflow website):
http://www.codinghorror.com/blog/2008/10/the-problem-with-urls.html
Good luck!
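The order described above (escape first, then linkify) can be sketched as follows. The function name linkify and the deliberately simple URL pattern are assumptions for illustration; the libraries linked above handle many more edge cases, such as trailing punctuation.

```php
<?php
// Hypothetical helper: escape user text, then wrap http(s) URLs in anchors.
function linkify(string $text): string
{
    // Step 1: neutralise any HTML the user typed.
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Step 2: link anything that looks like an http(s) URL. The match
    // stops at whitespace; a quote typed by the user is already &quot;
    // at this point, so it cannot break out of the href attribute.
    $linked = preg_replace('~\bhttps?://[^\s<]+~i', '<a href="$0">$0</a>', $escaped);

    // preg_replace returns null on a regex error; fall back to the
    // escaped text in that case.
    return $linked ?? $escaped;
}

echo linkify('Please visit our page http://page.com'), "\n";
// prints: Please visit our page <a href="http://page.com">http://page.com</a>
```

Doing it the other way round (linkify, then escape) would escape the anchors you just generated.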

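// Wrap each matched URL in an anchor; the dot between domain labels is escaped so it only matches a literal "."
$code = preg_replace('/((https?|ftp):\/\/(?:[A-Z0-9-]+\.)+[A-Z]{2,6}([\/?].+)?)/i','<a href="$1">$1</a>',$code);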
(Regex Source)

This regex is better, since it handles the parameters passed in the URL, stops where the URL ends, and doesn't capture spaces or the words that follow.
(https?|ftp)://([-A-Z0-9.]+)(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[A-Z0-9+&@#/%=~_|!:,.;]*)?
Any other suggestions for handling this situation? Use JavaScript or PHP? Any ideas?

TinyMCE, PHP and MySQL: security and escaping questions

I'm implementing TinyMCE for a client so they can edit front-end content via a simple, familiar interface in their site's admin panel.
I have never used TinyMCE before but notice that you are able to insert whatever markup you want and it will be happily saved off to the MySQL database, assuming you don't escape the contents of the TinyMCE before running it through your query.
You can even insert single quotes and have it break your SQL query entirely.
But of course, when I do escape the contents, benign presentational stuff like paragraph tags get converted to HTML entities and so the whole point of the WYSIWYG editor is defeated, because the entities are spat back out when it comes to displaying the stored content on the front-end.
So is there a way I can "selectively escape" content from TinyMCE, to keep the innocent tags like P and BR but get rid of dangerous ones like SCRIPT, IFRAME, etc.? I really don't want to have to manually encode and decode them using str_replace() or whatever, but I'd rather not give my client a gaping security hole either.
Thanks.

Have you tried HTML Purifier? It works wonders. Its caveats: it's big and slow, but it's the best you can have.
http://htmlpurifier.org

Sorry dude, I'd say this is a question for the authors of TinyMCE, so I suggest you ask at: http://tinymce.moxiecode.com/enterprise/support.php ... I'm sure they'll be only too happy to answer (for a small fee), and I suspect this may even be one of their FAQs.
It's just that I'd guess you'd be very lucky to hit another TinyMCE user (let alone an authoritative one) on Stack Overflow, a "general programming forum"... although I notice there are currently 837 questions tagged "tinymce" on this forum; have you tried searching through them? Maybe there's a pointer in one of those?
Cheers. Keith.
EDIT: Yep, Making user-made HTML templates safe is more or less the same question posed in different words, and it has (what looks to ignorant me) a couple of answers which posit practical solutions. I just searched stack overflow for "Tiny MCE html security".

That's like complaining that you can write naughty words in Microsoft Word, and that Word should filter them for you. Or complain to GM that they build cars that then get used as escape vehicles in bank robberies. TinyMCE's job is to be an online editor, not to be the content police.
If you need to ban certain tags, then remove them when the document's submitted by using strip_tags(). Or better yet, HTMLpurifier for a more bullet-proof sanitization. If embedded quotes are breaking your SQL, then why weren't you passing the submitted document through mysql_real_escape_string() or using PDO prepared queries first? MCE has no idea what the server-side handling is going to be, nor should it care at all. It's up to you to decide how to handle the data, because only you know what its ultimate purpose is going to be.
In any case, remember that all those editors work on the client side. You can make TinyMCE as bulletproof and as strict an editor as you want, but it's still running on the client. Nothing says a malicious user can't bypass it entirely and submit all the embedded quotes and bad tags they want. The ultimate responsibility for cleaning the data HAS to fall on your code running on the server, as it's the last line of defense, and the only one that can ensure the database remains pristine. Anything else is lipstick on a pig.
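For the SQL side, a minimal sketch of the PDO prepared-statement route mentioned above. The table name posts, its columns, and the function names are assumptions made up for this example.

```php
<?php
// Hypothetical helpers; assumes a table posts(id, body) already exists.
function save_post(PDO $pdo, string $body): int
{
    // The submitted markup travels as a bound parameter, so embedded
    // quotes cannot change the structure of the query.
    $stmt = $pdo->prepare('INSERT INTO posts (body) VALUES (:body)');
    $stmt->execute([':body' => $body]);
    return (int) $pdo->lastInsertId();
}

function load_post(PDO $pdo, int $id): ?string
{
    $stmt = $pdo->prepare('SELECT body FROM posts WHERE id = :id');
    $stmt->execute([':id' => $id]);
    $body = $stmt->fetchColumn();
    // fetchColumn() returns false when no row matches.
    return $body === false ? null : $body;
}
```

Note that this only protects the database query. What comes back out is still whatever the user submitted, so it still has to be sanitized (for example with HTML Purifier) before it is shown as HTML.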

How to protect yourself from XSS when you allow people to post RAW embed codes?

Tumblr and other blogging websites allow people to post embed codes for videos from YouTube and all the other video networks.
But how do they keep only the Flash object code and remove any other HTML or scripts? They even have automated checks that inform you when something is not a valid video code.
Is this done using regular expressions? And is there a PHP class to do that?
Thanks

Generally speaking, using regex is not a good way to deal with HTML : HTML is not regular enough for regular expressions : there are too many variations permitted in the standards... And browsers even accept HTML that's not valid !
In PHP, as your question is tagged as php, a great solution that exists to filter user input is the HTMLPurifier tool.
A couple of interesting things are :
It allows you to specify which specific tags are allowed
For each tag, you can define which specific attributes are allowed
Basically, the idea is to only keep what you specify (white-list), instead of trying to remove bad stuff using a black-list (which will never be quite complete).
And if you only specify a list of tags and attributes that can do no harm, only those will be kept -- and the risks of injections are lowered a lot.
Quoting HTMLPurifier's home page :
HTML Purifier is a standards-compliant
HTML filter library written in PHP.
HTML Purifier will not only remove
all malicious code (better known as
XSS) with a thoroughly audited,
secure yet permissive whitelist, it
will also make sure your documents are
standards compliant, something only
achievable with a comprehensive
knowledge of W3C's specifications.
Yes, another great thing is that the code you get as output is valid.
Of course, this will only allow you to clean / filter / purify the HTML input ; it will not allow you to validate that the URL used by the user is both :
correct ; i.e. points to a real content
"OK" as defined by your website ; i.e. for example no nudity, ...
About the second point, there's not much one can do about it : the best solution will be to either :
Have a moderator accept / reject the contents before they're put online
Give the website's users a way to flag some content as inappropriate, so a moderator takes actions.
Basically, to check the content itself of the video, there is not much choice but have a human being say "ok" or "not ok".
About the first point, though, there's hope : some services that host content have APIs that you might want / be able to use.
For instance, Youtube provides an API -- see Developer's Guide: PHP.
In your case, the Retrieving a specific video entry section looks promising : if you send an HTTP request to an URL that looks like this :
http://gdata.youtube.com/feeds/api/videos/videoID
(Replacing "videoID" by the ID of the video, of course)
You'll get an Atom feed if the video is valid, and "Invalid id" if it's not.
This might help you validate at least some URL to contents -- even if you'll have to develop some specific code for each possible content-hosting service that your users like...
Now, to extract the identifier of the video from your HTML string... If you're thinking about using regex, you are wrong ;-)
The best solution to extract a portion of data from an HTML string is generally to :
Load the HTML using a DOM parser ; DOMDocument::loadHTML is generally pretty helpful, here
Go through the document using DOM methods ; either, depending on your situation :
DOMDocument::getElementsByTagName, if you need to iterate over all elements that have a specific tag name ; might be great to iterate over all <object> or <embed> tags, for instance
Or, if you need something more complex, you could do an XPath query, using the DOMXPath class and its DOMXPath::query method.
And using DOM will also allow you to modify the HTML document using a standard API -- which might help, in case you want to add some message next to the video, or any other thing like that.
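The DOM steps above can be sketched like this. The function name extract_embed_sources is made up for the example, and it only looks at the src attribute of embed tags; a real filter would also need to inspect object and param elements.

```php
<?php
// Hypothetical helper: collect the src of every <embed> in an HTML fragment.
function extract_embed_sources(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();

    // User-supplied HTML is often malformed; collect parser warnings
    // internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('embed') as $embed) {
        if ($embed->hasAttribute('src')) {
            $sources[] = $embed->getAttribute('src');
        }
    }
    return $sources;
}
```

Each returned URL can then be checked against the hosting service before anything is put online.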

Take a look at htmlpurifier to start.
http://htmlpurifier.org/

I have implemented an algorithm for this for the company I work for. It works just fine, BUT it was quite complicated to implement.
I would definitely check out HTMLPurifier to see if that works in an easy way for you. If you insist on doing it the old-school way like I did, these are the basic steps:
1. First off: get friendly with stripos().
2. You have to write a recursive function to identify the start and stop tags for the widget, covering all combinations of <embed></embed> or <embed/> (self-closing) or <object></object> ... or <object><params>...<embed/></object>.
3. After this, you have to parse out all attributes and params.
4. Now, all <object> tags should have <param> tags as child elements. You have to parse all of these to get the data you need for finally generating a new embed or object tag. The params and attributes that hold the width, height and data source are especially important.
5. Now, you don't know if the attributes are enclosed in single or double quotes, so your code has to be lenient here. Also, you don't know if the code is valid or well formed, so it should be able to handle nested embed/object tags, embed tags that are not closed correctly, and so on. As it is user-generated content, you can't trust the input. You will see that there are lots of combinations.
6. If you manage to parse the embedded element with all its attributes (or the object element and its child params), the whitelisting of domains is easy...
My code ended up being about 800 lines, which is quite large, and it was filled with recursive methods for finding the correct start and end tags etc. My algorithm also removed all the SEO text that is often included in cut-and-paste embed code, like links back to the site holding the widget.
It's a good exercise, but if I were you... don't start walking this road.
Recommendation: try to find something ready-made and open source!

This will never be safe. Browsers have those funny little functionalities that help people display content of their pages even if html is messy. There are endless opportunities to get something through :)
check here to see the tip of the iceberg
What you need to do is use a single input for just the link and additional inputs for width and height, and filter those. THEN generate the object tag yourself.
This might be safe.
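A rough sketch of that idea: take the link, width and height as separate inputs, validate each one, and build the tag on the server. The function name build_embed, the size limits, and the host allowlist are all assumptions for illustration.

```php
<?php
// Hypothetical helper: returns the generated tag, or null if any input fails.
function build_embed(string $url, $width, $height, array $allowedHosts): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // not an absolute URL (this also rejects javascript:...)
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return null;
    }
    if (!in_array(strtolower($parts['host']), $allowedHosts, true)) {
        return null;
    }

    // Accept only integers within an arbitrary, assumed size range.
    $range = ['options' => ['min_range' => 100, 'max_range' => 1280]];
    $w = filter_var($width, FILTER_VALIDATE_INT, $range);
    $h = filter_var($height, FILTER_VALIDATE_INT, $range);
    if ($w === false || $h === false) {
        return null;
    }

    // The markup is ours; the only user data is the escaped URL and two ints.
    return sprintf(
        '<object data="%s" width="%d" height="%d"></object>',
        htmlspecialchars($url, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8'),
        $w,
        $h
    );
}
```

Because the user never supplies markup, there is nothing to parse or clean, only three simple values to check.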

http://php.net/manual/en/function.strip-tags.php
and allow certain tags?

The most simple and elegant solution: Allowing HTML and Preventing XSS # shiflett.org.
Using all sorts of "HTML purifier" is more than pointless. Sorry but I don't get people who like to use these bloated libraries when a much simpler solution is in hand.

If you're looking to make your site "safe" from vulnerabilities, a white list approach is the (only) way to go. I would recommend safely escaping all user generated content, and white listing only markup you know is safe and works on your site. This means not only <B> tags, but also the flash embeddings.
For example, if you want to allow any youtube to be embedded, write a validation RegEx that looks for the embed code they generate. Refuse to accept any others (or simply display it as escaped markup). This is testable. Forget all this parsing nonsense.
If you also want to add vimeo videos, then look at the embed code they provide and accept that as well.
Ugh? I know this seems like a pain, but in reality it's much easier to write than some algorithm that tries to detect "bad" content in some sort of generic fashion.
After getting the simple version of the algorithm working, you could go back and make it nicer. You could "provisionally" accept content with URLs, scripts, etc. that don't pass your white list, and have an admin process to add approved regexes to your output escaping routine. This way legitimate users aren't left out in the cold, but you don't open yourself up to attacks of this nature.
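A sketch of the white-list-by-regex idea. The exact embed markup below is an assumption about what a provider hands out; in practice you would copy the pattern from the embed code the site actually generates. The constant and function names are made up.

```php
<?php
// Assumed shape of a provider's embed snippet; adjust to the real one.
const YOUTUBE_EMBED_PATTERN =
    '~\A<iframe width="(\d{2,4})" height="(\d{2,4})" '
    . 'src="https://www\.youtube\.com/embed/([A-Za-z0-9_-]{11})" '
    . 'frameborder="0" allowfullscreen></iframe>\z~';

// Hypothetical helper: emit the embed untouched only if it matches the
// white list exactly; otherwise display it as escaped markup.
function render_user_embed(string $input): string
{
    $input = trim($input);
    if (preg_match(YOUTUBE_EMBED_PATTERN, $input) === 1) {
        return $input;
    }
    return htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}
```

Adding Vimeo or another provider then means adding one more pattern, and each pattern can be unit tested on its own.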

How to safely allow embed content?

I run a website (sorta like a social network) that I wrote myself. I allow the members to send comments to each other. I take the comment and then call this line before saving it in the DB:
$com = htmlentities($com);
When I want to display it; I call this piece of code..
$com = html_entity_decode($com);
This works out well most of the time. It allows the users to copy/paste youtube/imeem embed code and send each other videos and songs. It also allows them to upload images to photobucket and copy/paste the embed code to send picture comments.
The problem I have is that some people are basically putting in javascript code there as well that tends to do nasty stuff such as open up alert boxes, change location of webpage and things like that.. I am trying to find a good solution to solving this problem once and for all.. How do other sites allow this kind of functionality?
Thanks for your feedback

First: htmlentities or just htmlspecialchars should be used for escaping strings that you embed into HTML. You shouldn't use it for escaping strings when you insert them into a SQL query - use mysql_real_escape_string (for MySQL) or better yet, use prepared statements, which have bound parameters. Make sure that magic_quotes are turned off or otherwise disabled when you manually escape strings.
Second: You don't unescape strings when you pull them out again. E.g. there is no mysql_real_unescape_string. And you shouldn't use stripslashes either - if you find that you need to, then you probably have magic_quotes turned on - turn them off instead, and fix the data in the database before proceeding.
Third: What you're doing with html_entity_decode completely nullifies the intended use of htmlentities. Right now, you have absolutely no protection against a malicious user injecting code into your site (you're vulnerable to cross-site scripting, aka XSS). Strings that you embed into an HTML context should be escaped with htmlspecialchars (or htmlentities). If you absolutely have to embed HTML into your page, you have to run it through a cleaning solution first. strip_tags does this - in theory - but in practice it's very inadequate. The best solution I currently know of is HtmlPurifier. However, whatever you do, it is always a risk to let random users embed code into your site. If at all possible, try to design your application such that it isn't needed.
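The third point can be seen in a few lines. This sketch reproduces the encode-on-save, decode-on-display round trip from the question and then shows escaping once at output time instead; the variable names are illustrative.

```php
<?php
$comment = '<script>alert("xss")</script>';

// What the question does: encode on save, decode on display.
$stored   = htmlentities($comment, ENT_QUOTES, 'UTF-8');
$rendered = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');
var_dump($rendered === $comment); // bool(true): the script tag is back

// Safer pattern: store the raw text and escape once, at output time.
$safe = htmlspecialchars($comment, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
echo $safe, "\n";
// prints: &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```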

I so hope you are scrubbing the data before you send it to the database. It sounds like you are a prime target for a SQL injection attack. I know this is not your question, but it is something that you need to be aware of.

Yes, this is a problem. A lot of sites solve it by only allowing their own custom markup in user fields.
But if you really want to allow HTML, you'll need to scrub out all "script" tags. I believe there are libraries available that do this. That closes the most obvious route to JS execution in user-entered code, though event-handler attributes such as onclick and javascript: URLs need the same treatment.

This is how Stack Overflow does it, I think, over at RefactorMyCode.

You may want to consider Zend Filter, it offers a lot more than strip_tags and you do not have to include the entire Zend Framework to use it.
