I'm allowing users to embed content from YouTube, Vimeo, Scribd, Flickr, SlideShare, etc., and therefore I'm allowing them to paste the embed code in a textbox.
I'm having a hard time figuring out how to:
(a) validate that it's indeed a correctly formed embed code and
(b) check that it's not malicious code that the user is trying to get my system to display.
This is a php website.
I've used htmlpurifier in the past. There are some others, but this one worked the best for me. You can whitelist all allowed code constructs and make the HTML code standards-compliant. It's a good first line of defense against XSS attacks.
The library is quite big and can slow down your code if you don't install it correctly, so read the install docs carefully.
We will be implementing a system where we ask the user to specify the direct URL and we go and subsequently fetch appropriate data from that page.
Related
I am working on a PHP site that allows users to post a listing for their business related to the site's theme. This includes a single link URL, some text, and an optional URL for an image file.
Example:
<img src="http://www.somesite.com" width="40" />
ABC Business
<p>
Some text about how great abc business is...
</p>
The HTML in the text is filtered using the class from htmlpurifier.org and the content is checked for bad words, so I feel pretty good about that part.
The image file URL is always placed inside a <img src="" /> tag with a fixed width and validated to be an actual HTTP URL, so that should be Ok.
The dangerous part is the link.
Question:
How can I be sure that the link does not point to some SPAM, unsafe, or porn site (using code)?
I can check headers for 404, etc., but is there a quick and easy way to validate a site's content from a link?
EDIT:
I am using a CAPTCHA and do require registration before posting is allowed.
It's going to be very hard to try and determine this yourself by scraping the site URLs in question. You'll probably want to rely on some third-party API which can check for you.
http://code.google.com/apis/safebrowsing/
Check out that API: you can send it a URL and it will tell you what it thinks. This one mainly checks for malware and phishing, not so much porn and spam. There are others that do the same thing; just search around on Google.
is there a quick and easy way to validate a sites content from a link.
No. There is no global white/blacklist of URLs which you can use to somehow filter out "bad" sites, especially since your definition of a "bad" site is so unspecific.
Even if you could look at a URL and tell whether the page it points to has bad content, it's trivially easy to disguise a URL these days.
If you really need to prevent this, you should moderate your content. Any automated solution is going to be imperfect and you're going to wind up manually moderating anyways.
Manual moderation, perhaps. I can't think of any way to automate this other than using some sort of blacklist, but even then that is not always reliable as newer sites might not be on the list.
Additionally, you could try using cURL and downloading the index page and looking for certain keywords that would raise a red flag, and then perhaps hold those for manual validation.
I would suggest having a list of these keywords in array (porn, sex, etc). If the index page that you downloaded with cURL has any of those keywords, reject or flag for moderation.
This is not reliable nor is it the most optimized way of approving links.
Ultimately, you should have manual moderation regardless, but if you wish to automate it, this is a possible route for you to take.
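The keyword check described above could be sketched roughly like this. The keyword list, the function names and the cURL options are all assumptions made for illustration, and a hit should only push the listing into a moderation queue, not reject it outright.

```php
<?php
// Hypothetical keyword list; tune it for your own site.
const FLAG_KEYWORDS = ['porn', 'sex', 'casino'];

/**
 * Return the flagged keywords found in a page's text.
 */
function findFlaggedKeywords(string $html, array $keywords = FLAG_KEYWORDS): array
{
    // Strip markup so that tag and attribute names are not matched.
    $text = strtolower(strip_tags($html));
    $hits = [];
    foreach ($keywords as $keyword) {
        // \b avoids matching inside longer words such as "Essex".
        if (preg_match('/\b' . preg_quote($keyword, '/') . '\b/', $text)) {
            $hits[] = $keyword;
        }
    }
    return $hits;
}

/**
 * Download a page with cURL (requires the cURL extension).
 * Returns null on failure.
 */
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 3,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}
```

You would call fetchPage() on the submitted link and hold the listing for manual review whenever findFlaggedKeywords() returns a non-empty array.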
You can create a little monitoring system that transfers the content created by users to an approval queue that only administrators can access, so they can approve the content that should be displayed on the site.
I have a php application in which we allow every user to have a "public page" which shows their linked video. We are having an input textbox where they can specify the embed video's html code. The problem we're running into is that if we take that input and directly display it on the page as it is, all sorts of scripts can be inserted here leading into a very insecure system.
We want to allow embed code from all sites, but since they differ in how they're structured, it becomes difficult to keep tabs on how each one is structured.
What are the approaches folks have taken to tackle this scenario? Are there third-party scripts that do this for you?
Consider using some sort of pseudo-template which takes advantage of oEmbed. oEmbed is a safe way to link to a video (as the content authority, you're not allowing direct embed, but rather references to embeddable content).
For example, you might write a parser that searches for something like:
[embed]http://oembed.link/goes/here[/embed]
You could then use one of the many PHP oEmbed libraries to request the resource from the provided link and replace the pseudo-embed code with the real embed code.
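A minimal sketch of such a parser follows. The function name expandEmbeds() and the $resolver callback are made up for illustration; the callback stands in for whichever oEmbed library you pick and is assumed to return trusted HTML, or null when the URL is not supported.

```php
<?php
/**
 * Replace every [embed]URL[/embed] marker with whatever $resolver
 * returns for that URL. $resolver is a placeholder for a call into
 * an oEmbed library.
 */
function expandEmbeds(string $text, callable $resolver): string
{
    // Escape everything first so no user-supplied markup survives.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    $result = preg_replace_callback(
        '~\[embed\](https?://[^\s\[\]]+)\[/embed\]~i',
        function (array $m) use ($resolver): string {
            // Undo the escaping on the URL before handing it to the resolver.
            $url  = htmlspecialchars_decode($m[1], ENT_QUOTES);
            $html = $resolver($url);
            // Unknown URL: leave the (escaped) marker as plain text.
            return $html ?? $m[0];
        },
        $safe
    );

    return $result ?? $safe;
}
```

The point of the design is that the user only ever supplies a URL; the markup that ends up on the page comes from the provider through the resolver, never from the textbox.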
Hope this helps.
I would have the users input the URL to the video. From there you can insert the proper code yourself. It's easier for them and safer for you.
If you encounter an unknown URL, just log it, and add the code needed to support it.
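Something along these lines, as a rough sketch. The URL shapes, the 11-character YouTube ID check and the iframe markup are assumptions based on what the providers hand out at the time of writing; embedFromUrl() is a hypothetical name.

```php
<?php
/**
 * Build embed markup for a known provider from a plain page URL.
 * Returns null for URLs we do not recognise, so the caller can log them.
 */
function embedFromUrl(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return null;
    }
    // Treat "www.youtube.com" and "youtube.com" the same way.
    $host = preg_replace('/^www\./', '', strtolower($parts['host']));

    if ($host === 'youtube.com') {
        parse_str($parts['query'] ?? '', $query);
        $id = $query['v'] ?? '';
        // Assumption: video IDs are 11 URL-safe characters.
        if (is_string($id) && preg_match('/^[A-Za-z0-9_-]{11}$/', $id)) {
            return '<iframe width="560" height="315" src="https://www.youtube.com/embed/'
                . $id . '" frameborder="0" allowfullscreen></iframe>';
        }
    }

    if ($host === 'vimeo.com' && preg_match('~^/(\d+)$~', $parts['path'] ?? '', $m)) {
        return '<iframe width="560" height="315" src="https://player.vimeo.com/video/'
            . $m[1] . '" frameborder="0" allowfullscreen></iframe>';
    }

    return null; // unknown provider: log it and add support later
}
```

Only the validated ID ever reaches the output, so nothing the user typed is echoed back verbatim.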
The best approach would be to have a whitelist of allowed tags and remove everything else. It would also be necessary to filter all the attributes of those tags to remove the "onsomething" attributes.
In order to do proper parsing, you need to use an XML parser. XMLReader and XMLWriter would work nicely for that. You read the data from XMLReader and, if the tag is in the whitelist, you write it to the XMLWriter. At the end of the process, you have your parsed data in the XMLWriter.
A code example of this would be this script. It has the tags test and video in the whitelist. If you give it the following input :
<z><test attr="test"></test><img />random text<video onclick="evilJavascript"><test></test></video></z>
It will output this :
<div><test attr="test"></test>random text<video><test></test></video></div>
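The script referred to above is not reproduced here, so what follows is only a reconstruction of the idea under my own assumptions: filterMarkup() is a hypothetical name, the input must be well-formed XML, and any attribute whose name starts with "on" is dropped (which is deliberately over-cautious).

```php
<?php
/**
 * Copy only whitelisted elements from $xml into a new fragment wrapped
 * in <div>. Text nodes are kept; attributes starting with "on" are dropped.
 */
function filterMarkup(string $xml, array $allowedTags): string
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->startElement('div');

    $open = []; // one flag per open input element: was it written out?
    while (@$reader->read()) {
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                // Read these before visiting attributes, which moves the cursor.
                $empty = $reader->isEmptyElement;
                $keep  = in_array($reader->name, $allowedTags, true);
                if ($keep) {
                    $writer->startElement($reader->name);
                    while ($reader->moveToNextAttribute()) {
                        if (stripos($reader->name, 'on') !== 0) {
                            $writer->writeAttribute($reader->name, $reader->value);
                        }
                    }
                }
                if ($empty) {
                    // Self-closing tags never produce an END_ELEMENT node.
                    if ($keep) {
                        $writer->endElement();
                    }
                } else {
                    $open[] = $keep;
                }
                break;
            case XMLReader::END_ELEMENT:
                if (array_pop($open)) {
                    $writer->fullEndElement();
                }
                break;
            case XMLReader::TEXT:
                $writer->text($reader->value);
                break;
        }
    }
    $writer->endElement(); // closes the wrapping <div>
    return $writer->outputMemory();
}
```

Called with ['test', 'video'] as the whitelist, this is meant to turn the sample input above into the sample output. Note that it does nothing about dangerous attribute values (for example javascript: URLs), so attributes would need their own whitelist in real use.<img />random text<video onclick="evilJavascript"><test></test></video></z>';
assert(filterMarkup($in, ['test', 'video']) === '<div><test attr="test"></test>random text<video><test></test></video></div>');
</test>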
Tumblr and other blogging websites allow people to post embed codes of videos from YouTube and all the other video networks.
But how do they filter only the Flash object code and remove any other HTML or scripts? They even have an automated check that informs you when something is not a valid video code.
Is this done using regular expressions? And is there a PHP class to do that?
Thanks
Generally speaking, using regex is not a good way to deal with HTML : HTML is not regular enough for regular expressions : there are too many variations permitted in the standards... And browsers even accept HTML that's not valid !
In PHP, as your question is tagged as php, a great solution that exists to filter user input is the HTMLPurifier tool.
A couple of interesting things are :
It allows you to specify which specific tags are allowed
For each tag, you can define which specific attributes are allowed
Basically, the idea is to only keep what you specify (white-list), instead of trying to remove bad stuff using a black-list (which will never be quite complete).
And if you only specify a list of tags and attributes that can do no harm, only those will be kept -- and the risks of injections are lowered a lot.
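As a rough illustration of that white-list idea, here is a sketch using HTML Purifier's configuration API. It assumes the library is already installed and autoloaded (for example through Composer); the function name purifyUserHtml() and the particular tags and attributes are arbitrary choices for the example, not a recommendation.

```php
<?php
// Assumes HTML Purifier is installed and autoloaded, e.g. through Composer.

/**
 * Purify user HTML against a small whitelist of tags and attributes.
 * The whitelist below is only an illustration; choose your own.
 */
function purifyUserHtml(string $dirtyHtml): string
{
    $config = HTMLPurifier_Config::createDefault();
    // Only these tags survive; allowed attributes are listed in brackets per tag.
    $config->set('HTML.Allowed', 'p,b,i,em,strong,a[href],img[src|alt|width]');
    // Restrict link and image URLs to http and https.
    $config->set('URI.AllowedSchemes', ['http' => true, 'https' => true]);

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirtyHtml);
}
```

Anything not named in HTML.Allowed, including script tags and on* attributes, is removed from the output.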
Quoting HTMLPurifier's home page :
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
Yes, another great thing is that the code you get as output is valid.
Of course, this will only allow you to clean / filter / purify the HTML input ; it will not allow you to validate that the URL used by the user is both :
correct ; i.e. points to a real content
"OK" as defined by your website ; i.e. for example no nudity, ...
About the second point, there's not much one can do about it : the best solution will be to either :
Have a moderator accept / reject the contents before they're put online
Give the website's users a way to flag some content as inappropriate, so a moderator takes actions.
Basically, to check the content itself of the video, there is not much choice but have a human being say "ok" or "not ok".
About the first point, though, there's hope : some services that host content have APIs that you might want / be able to use.
For instance, Youtube provides an API -- see Developer's Guide: PHP.
In your case, the Retrieving a specific video entry section looks promising : if you send an HTTP request to an URL that looks like this :
http://gdata.youtube.com/feeds/api/videos/videoID
(Replacing "videoID" by the ID of the video, of course)
You'll get some ATOM feed if the video is valid ; and "Invalid id" if it's not
This might help you validate at least some URL to contents -- even if you'll have to develop some specific code for each possible content-hosting service that your users like...
Now, to extract the identifier of the video from your HTML string... If you're thinking about using regex, you are wrong ;-)
The best solution to extract a portion of data from an HTML string is generally to :
Load the HTML using a DOM parser ; DOMDocument::loadHTML is generally pretty helpful, here
Go through the document using DOM methods ; either, depending on your situation :
DOMDocument::getElementsByTagName, if you need to iterate over all elements that have a specific tag name ; might be great to iterate over all <object> or <embed> tags, for instance
Or, if you need something more complex, you could do an XPath query, using the DOMXPath class and its DOMXPath::query method.
And using DOM will also allow you to modify the HTML document using a standard API -- which might help, in case you want to add some message next to the video, or any other thing like that.
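For instance, a small sketch of the DOM approach could look like this. The function name extractEmbedSources() is hypothetical, and which elements and attributes you look at (here src on embed and iframe, data on object) is an assumption you would adapt to the embed codes you actually receive.

```php
<?php
/**
 * Collect the src (or data) URLs of every <embed>, <iframe> and <object>
 * element found in a pasted HTML fragment.
 */
function extractEmbedSources(string $html): array
{
    $doc = new DOMDocument();
    // Pasted markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//embed[@src] | //iframe[@src] | //object[@data]') as $node) {
        $sources[] = $node->hasAttribute('src')
            ? $node->getAttribute('src')
            : $node->getAttribute('data');
    }
    return $sources;
}
```

Each returned URL can then be run through parse_url() to check the host and pull out the video identifier before you query the hosting service's API.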
Take a look at htmlpurifier to start.
http://htmlpurifier.org/
I have implemented an algorithm for this for the company I work for. It works just fine. BUT, it was quite complicated to implement.
I would definitely check out HTMLPurifier to see if that works in an easy way for you. If you insist on doing it the old-school way like I did, these are the basic steps:
1. First off: get friendly with stripos().
2. You have to write a recursive function to identify the start and stop tags for the widget. That includes all combinations of <embed></embed> or <embed/> (self-closing) or <object></object> ... or <object><param>...<embed/></object>.
3. After this, you have to parse out all attributes and params.
4. Now, all <object> tags should have <param> tags as child elements. You have to parse all of these to get all the data you need for finally generating a new embed or object tag. The params and attributes that hold the width, height and data source are especially important.
5. Now, you don't know if the attributes are enclosed in single or double quotes, so your code has to be lenient in this respect. Also, you don't know if the code is valid or well formed. So it should be able to handle nested embed/object tags, embed tags that are not closed correctly, and so on. As it is user-generated content, you can't really know or trust the input. You will see that there are lots of combinations.
6. If you manage to parse the embedded element with all its attributes (or the object element and its child params), the whitelisting of domains is easy...
My code ended up being about 800 lines, which is quite large, and it was filled with recursive methods for finding the correct start and end tags, etc. My algorithm also removed all the SEO text that is often included in the cut-and-paste embed code, like links back to the site holding the widget.
It's a good exercise, but if I were you... don't start walking down this road.
Recommendation: try to find something ready-made and open source!
This will never be safe. Browsers have those funny little functionalities that help people display content of their pages even if html is messy. There are endless opportunities to get something through :)
Check here to see the tip of the iceberg.
What you need to do is use a single input for just a link and additional inputs for width and height, and filter those. THEN generate the object tag yourself.
This might be safe.
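A sketch of that idea, under a few assumptions of mine: the function name buildObjectTag(), the allowed size range and the Flash-style object markup are illustrative only, and the list of allowed hosts is something you would maintain yourself.

```php
<?php
/**
 * Build an <object> tag from three separately validated inputs.
 * Returns null if any input fails validation.
 */
function buildObjectTag(string $url, $width, $height, array $allowedHosts): ?string
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    $host   = parse_url($url, PHP_URL_HOST);
    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return null;
    }
    if (!is_string($host) || !in_array(strtolower($host), $allowedHosts, true)) {
        return null;
    }

    // Width and height must be plain integers inside a sane range.
    $range = ['options' => ['min_range' => 50, 'max_range' => 1920]];
    $w = filter_var($width, FILTER_VALIDATE_INT, $range);
    $h = filter_var($height, FILTER_VALIDATE_INT, $range);
    if ($w === false || $h === false) {
        return null;
    }

    // Escape the URL so it cannot break out of the attribute value.
    $src = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    return sprintf(
        '<object type="application/x-shockwave-flash" data="%s" width="%d" height="%d">'
        . '<param name="movie" value="%s" /></object>',
        $src, $w, $h, $src
    );
}
```

Because every piece of the tag is generated by your own code, the user never gets a chance to slip extra attributes or tags into the page.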
http://php.net/manual/en/function.strip-tags.php
and allow certain tags?
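One caveat worth seeing in code: strip_tags() removes tags that are not in the allowed list, but it leaves the attributes of allowed tags untouched, so event handlers survive. A small demonstration (the input string is made up):

```php
<?php
$input = '<p onclick="evil()">Hello <b>world</b><script>alert(1)</script></p>';

// Keep only <p> and <b>; every other tag is removed.
$clean = strip_tags($input, '<p><b>');

// The <script> tags are gone, but onclick="evil()" is still on the <p>.
echo $clean, PHP_EOL;
```

So strip_tags() with allowed tags is not enough on its own when the output is rendered as HTML.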
The most simple and elegant solution: Allowing HTML and Preventing XSS # shiflett.org.
Using all sorts of "HTML purifier" is more than pointless. Sorry but I don't get people who like to use these bloated libraries when a much simpler solution is in hand.
If you're looking to make your site "safe" from vulnerabilities, a white list approach is the (only) way to go. I would recommend safely escaping all user generated content, and white listing only markup you know is safe and works on your site. This means not only <B> tags, but also the flash embeddings.
For example, if you want to allow any youtube to be embedded, write a validation RegEx that looks for the embed code they generate. Refuse to accept any others (or simply display it as escaped markup). This is testable. Forget all this parsing nonsense.
If you also want to add vimeo videos, then look at the embed code they provide and accept that as well.
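As a sketch of what that might look like: the two patterns below are hypothetical and modelled on iframe-style embed codes, so you would need to adjust them to match exactly what each provider hands out; the function names are made up as well.

```php
<?php
// Hypothetical patterns; adjust them to the providers' current embed codes.
const EMBED_PATTERNS = [
    'youtube' => '~^<iframe width="\d{2,4}" height="\d{2,4}" '
        . 'src="https://www\.youtube\.com/embed/[A-Za-z0-9_-]{11}" '
        . 'frameborder="0" allowfullscreen></iframe>\z~',
    'vimeo' => '~^<iframe src="https://player\.vimeo\.com/video/\d+" '
        . 'width="\d{2,4}" height="\d{2,4}" '
        . 'frameborder="0" allowfullscreen></iframe>\z~',
];

/**
 * Return the provider whose pattern matches the pasted code, or null.
 */
function matchEmbedProvider(string $code): ?string
{
    $code = trim($code);
    foreach (EMBED_PATTERNS as $provider => $pattern) {
        if (preg_match($pattern, $code) === 1) {
            return $provider;
        }
    }
    return null;
}

/**
 * Output recognised embed codes as-is and everything else as escaped text.
 */
function renderEmbed(string $code): string
{
    return matchEmbedProvider($code) !== null
        ? trim($code)
        : htmlspecialchars($code, ENT_QUOTES, 'UTF-8');
}
```

The patterns are anchored at both ends, so a valid embed code with anything extra tacked on is refused rather than partially accepted.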
Ugh? I know this seems like a pain, but in reality it's much easier to write than some algorithm that tries to detect "bad" content in some sort of generic fashion.
After getting the simple version of the algorithm working, you could go back and make it nicer. You could "provisionally" accept content with URLs, scripts, etc. that don't pass your white list, and have an admin process to add approved regexes to your output escaping routine. This way legitimate users aren't left out in the cold, but you don't open yourself up to attacks of this nature.
Is there any way to disable or encrypt "View Source" for my site so that I can secure my code?
Fero,
Your question doesn't make much sense. The "View Source" is showing the HTML source—if you encrypt that, the user (and the browser) won't be able to read your content anymore.
If you want to protect your PHP source, then there are tools like Zend Guard. It would encrypt your source code and make it hard to reverse engineer.
If you want to protect your JavaScript, you can minify it with, for example, YUI Compressor. It won't prevent the user from using your code since, like the user, the browser needs to be able to read the code somehow, but at least it would make the task more difficult.
If you are more worried about user privacy, you should use SSL to make sure the sensitive information is encrypted when on the wire.
Finally, it is technically possible to encrypt the content of a page and use JavaScript to decrypt it, but since this relies on JavaScript, an experienced user could defeat this in a couple of minutes. Plus all these problems would appear:
Search engines won't be able to index your pages...
Users with JavaScript disabled would see the encrypted page
It could perform really poorly depending on the amount of content you have
So I don't advise you to use this solution.
You can't really disable that because eventually the browser will still need to read and parse the source in order to output.
If there is something SO important in your source code, I recommend you hide it on server side.
Even if you encrypt or obfuscate your HTML source, eventually we still can eval and view it. Using Firebug for instance, we can see source code no matter what.
If you are selling PHP software, you can consider Software as a Service (SaaS).
So you want to encrypt your HTML source. You can encrypt it using some javascript tool, but beware that if the user is smart enough, he will always be able to decrypt it doing the same thing that the browser should do: run the javascript and see the generated HTML.
EDIT: See this HTML scrambler as an example on how to encrypt it:
http://www.voormedia.com/en/tools/html-obfuscate-scrambler.php
EDIT2: And .. see this one for how to decrypt it :)
http://www.gooby.ca/decrypt/
Short answer is no. HTML is an open text format; whatever you do, if the page renders, people will be able to see your source code. You can use JavaScript to disable the right click, which will work in some browsers, but anyone wanting to use your code will know how to get around this. You can also store the HTML encoded and have JavaScript emit it, but this will have a bad impact on development, accessibility, and load speed. After all that, anyone with Firebug installed will still be able to see your HTML code.
There is also very rarely a lot of value in your HTML; your real IP is in your server code, which stays safe and sound on your server.
This is fundamentally impossible. As (almost) everybody has said, the web browser of your user needs to be able to read your html and Javascript, and browsers exist to serve their users -- not you.
What this means is that no matter what you do there is eventually going to be something on a user's machine that looks like:
<html>
<body>
<div id="my secret page layout trick"> ...
</div>
</body>
</html>
because otherwise there is nothing to show the user. If that exists on the client-side, then you have lost control of it. Even if you managed to convince every browser-maker on the planet to not make that available through a "view source" option -- which is, you know, unlikely -- the text will still exist on that user's machine, and somebody will figure out how to get to it. And that will never happen, browsers will always exist to serve their users before all others. (Hopefully)
The same thing is true for all of your Javascript. Let me say it again: nothing that you send to a user is secure or secret from that user. The encryption via Javascript hack is stupid and cannot work in any meaningful sense.
(Well, actually, Flash and Silverlight ship binaries, but I don't think that they're encrypted. So they are at the least irritating to get data out of.)
As others have said, the only way to keep something secret from your users is to not give it to them: put the logic in your server and make sure that it is never sent. For example, all of the code that you write in PHP (or Python/Ruby/Perl/Java/C...) should never be seen by your users. This is e.g. why Google still has a business. What they give you is fundamentally uninteresting compared to what they never send to you. And, because they realize this, they try to make most things that they send you as open and useful as possible. Because it's the infrastructure -- the terabyte-huge maps database and pathfinding software, as opposed to the snazzy map that you can click and drag -- that you are trading your privacy for.
Another example: I'm not sure if you remember how many tricks people employed in the early days of the web to try and keep people from saving images to disk. When was the last time you ran across one of those? Know why? Because once data is on your user's machine, she controls it. Not you.
So, in short: if you want to keep something secret from your user, don't give it to her.
You can't. The browser needs the source to render the page. If the user wishes, the user may have the browser show the source. Firefox can also show you the DOM of the page. You can obfuscate the source, but not encrypt it or lock the user out.
Also, why would you want this? It seems like a lame-ass thing to do :P
I don't think there is a way to do this. If you encrypt it, how will the browser understand the HTML?
No. The browsers offer no ability for the HTML/javascript to disable that feature (thankfully). Plus even if you could the HTML is still transmitted in plain text ready for a HTTP sniffer to read.
Best you could do would be to somehow obscure the HTML/javascript to make it hard to read. But then debuggers like Firebug and IE 8's debugger will reconstruct it from the DOM, making it easy to read.
You can, in fact, disable the right click function. It is useless to do so, however, as most browsers now have built-in inspector tools which show the source anyway. Not to mention that other workarounds (such as saving the page, then opening the source, or simply using hotkeys) exist for viewing the HTML source. Tutorials for disabling the right click function abound across the web, so a quick Google search will point you in the right direction if you feel an overwhelming urge to waste your time.
There is no foolproof way.
But you can fool many people with a simple hack using the methods below:
"window.history.pushState()" and
adding oncontextmenu="return false" as an attribute on the body tag
Detail here - http://freelancer.usercv.com/blog/28/hide-website-source-code-in-view-source-using-stupid-one-line-chinese-hack-code
You can also use “javascript obfuscation” to further complicate things, but it won’t hide it completely.
“Inspect Element” can reveal everything beyond view-source.
Yes, you can have your whole website being rendered dynamically via javascript which would be encrypted/packed/obfuscated like there is no tomorrow.
I currently run several Wordpress MU installations.
My users are asking for the ability to post video (not just Youtube, but from our own Flash Media Server).
By default, Wordpress strips out <embed> tags.
Now, I would never allow users to include PHP or JavaScript in their posts, but do I have to worry about Flash vulnerabilities?
How dangerous is the embed tag and should I worry about giving them the ability?
Thanks
Generally speaking, Flash has come a long way in terms of preventing exploits like key trapping, etc.
The safest thing you could do would be to obfuscate the embedding code and have them only supply a SWF URL, that way they couldn't pull anything fancy in the embed object like allowing cross scripting, etc...
In particular, you want to watch out for things like potential hackers trying to call JS functions from your blog JS files by using AS3's ExternalInterface.call() function... that would definitely be bad. However I think you can use embed techniques to turn this off.
Make sure you set allowScriptAccess="never" in the object/embed tag to deny scripting powers to third party SWFs.
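Combined with the earlier suggestion of having users supply only a SWF URL, that could look roughly like the sketch below. The function name flashEmbedTag() and the default dimensions are assumptions for illustration.

```php
<?php
/**
 * Generate the embed tag ourselves from a user-supplied SWF URL,
 * always with scripting disabled for the embedded movie.
 */
function flashEmbedTag(string $swfUrl, int $width = 425, int $height = 344): string
{
    return sprintf(
        '<embed src="%s" type="application/x-shockwave-flash" '
        . 'width="%d" height="%d" allowScriptAccess="never"></embed>',
        // Escape the URL so it cannot break out of the attribute value.
        htmlspecialchars($swfUrl, ENT_QUOTES, 'UTF-8'),
        $width,
        $height
    );
}
```

Since the user never writes the tag, they cannot change allowScriptAccess or add attributes of their own.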
I would suggest that Flash is only as secure as the content it is presenting; and that including a Youtube video is no more or less dangerous than going to visit the same video on Youtube's website.
Flash is pretty secure. A lot of websites big and small have been using it for 10 years now. Of course exploits are found, as in every piece of software. No web system is 100% secure. A lot of people are using Flash and a lot of developers are working to make it secure. If you have really sensitive information, don't put it on the web in the first place. The security depends more on the developer who writes a piece of code than on the type of code (ActionScript, JavaScript, PHP or Java). Languages permit errors and developers sometimes make errors.
My recommendation is to use it if you need it.