I have problem with PHP and JavaScript/CSS.
I have database with table. The table has a descriptions of articles. I want to echo the descriptions of the articles from database. Unfortunately many of them has a JavaScript or CSS included ( Then some article text), so when I use echo, it shows all of that code (and after that text). Is there any way to not show the JavaScript/CSS part and show only the text? For example with str_replace and regular expression? If yes, can somebody write me how it should look like?
Thanks for help and let me know if u need more info (code etc.)
Use HTMLPurifier - it will remove the scripts, css and any harmfull content from your articles. Since it is a CPU-intensive operations, it's better to run article trough HTMLPurifer before saving in the database, then to run it each time you are showing the article.
If you're trying to remove tags from a user's post, you can call strip_tags. This will get rid of css links, script tags, etc. It will not get rid of the style attribute, but if you get rid of div, span, p, etc. that won't matter -- there will be no tag for it to reside on.
As has been stated by others, it is generally best to sanitize your input (data from user before it goes into the DB), than it is to sanitize your output.
If you're trying to simply hide the JS and CSS from users, you can use Packer to obfusicate Javascript from less-savvy users, use Packer and use base 62 encoding. The JS will still work but will look like jiberish. Be aware that more knowledgeable users can attempt to unobfusicate the code, so any critical security risks in the JS still exists. Don't think any JS that accesses your databases directly will be safe; instead remove database access from the Javascript for security. If the JS is just to do fancy things like move elements around the page it's probably fine to just obfuscate it.
Only consider this if YOU have complete control and awareness of all JS included with the articles. If this is something your anonmous or otherwise not 120% trusted users can upload, you need to kill that functionality and use HTML Purifier to remove any JS they might add. It is not safe to output user entered JS, for you or your users.
For the CSS, I'm not sure why you want to hide it, and CSS can't be obfuscated quite like JS can; the styles will still be in plain English, best you can do is butcher the class/id names and whitespace; outputting CSS that YOU generated isn't a real security risk though, and even if people reverse engineer it I wouldn't be that afraid.
Again, if this is something anonymous/non trusted users can ADD to your site on their own, you don't want this at all, so remove the ability to upload CSS with an article using the HTML Purifier Darhazer mentioned.
You can try the following regex to remove the script and css:
"<script[\d\D]*?>[\d\D]*?</script>"
"<style[\d\D]*?>[\d\D]*?</style>"
It should help, but it cannot remove all the scripts. Like onclick="javascript:alert(1)".
Related
I know similar questions have been asked but I am struggling to work out how to do it.
I am building a CMS, rather primitive right now, but it's as a learning exercise; in a production site, I would use an existing solution for sure.
I would like to take user input, which can be styled in a WYSIWYG editor. I would also like them to be able to insert images inline.
I understand I can store HTML in the database but how can I safely re-render this. I know there is no problem with the HTML being stored but it is my understanding that XSS become an issue if I were to just simply dump the user-generated code onto a layout template.
So the question put simply, is how can I store and safely rerender user content in cms? I am using Laravel and PHP. I also have a little knowledge of javascript if its required.
For a CMS where you want to allow some tags but not others, then you want something like HTML Purifier. This will take HTML and run it against a whitelist and regenerate HTML that is safe to display back to the user.
A good and cheap way to avoid cross-site scripting is to get your php program to entitize everything from your users' input before storing it in the database. That is, you want to take this entry from a user
Hi there sucker! I just hacked your site.
<script>alert('You have been pwned!')</script>
and convert it to this before putting it into your database.
Hi there sucker! I just hacked your site.
<script>alert('You have been pwned!')</script>
When you pass < to a browser, it renders it as <, but it doesn't do anything else with it.
The htmlentities() function can do this for you. And, php's htmlspecialchars_decode() can reverse it if you need to. But you shouldn't reverse the operation unless you absolutely must do so, for example to load the document into an embedded editor for changes.
You can also choose to entitize user-furnished text after you retrieve it from your database and before you display it. If you get to the point where several people work on your code, you may want to do both for safety.
You can also render user-provided input inside <pre>content</pre> tags, which tells the brower to just render the text and do nothing else with it.
(Use right-click Inspect on this very page to see how Stack Overflow handles my malicious example.)
I have a database of emails I've collected from our gmail account which I'm trying to render out to an internal page.
This is working, however the occasional email comes in that causes problems because of missing/not closed tags. There might be some CSS thrown in there that I don't want rendered on the whole page.
I could use iFrames, but they seem outdated, and just not the right approach.
What would the suggested method be to render blocks of HTML from the database, but without them effecting the rest of the page?
Firstly, you need to load and interpret that HTML into something not broken. To do this, you use a DOM parser. http://php.net/manual/en/domdocument.loadhtml.php
I could use iFrames, but they seem outdated, and just not the right approach.
No, this is wrong. Until we have good shadow DOM support (and likely even after), iframes are the right way to isolate something in its own context. Make sure you use the sandboxproperty.
Note that you could do this without iframes, but it's going to be a lot more work.
But curious how something like 'Google' can render it without affecting the whole page.
Google doesn't just accept anything that comes through that e-mail, and neither should you, even if you use the iframe method. You need to make a whitelist of what elements you will support, and filter out everything outside of that. Next, you need to figure out what CSS properties you will support. Finally, you need to transform that whole DOM document into something useful and output it as HTML. Check out HTML Purifier for your whitelisting.
None of this is an easy task. You're stepping into an awful lot of hassle. There is no real standard for HTML e-mail. Each provider and mail client has a different set of what they support, and with varying results.
For the CSS part, you may add for example an id for the div u have email content on then add the id as father to all the css selectors! like this
And you may check the tags to be closed correctly which both of these ideas may take some processing power but not much, And you can prevent recalculation by preproccessing and storing the result.
iFrames I wouldn't say are dated and I would consider them a valid output in your case.
But curious how something like 'Google' can render it without affecting the whole page.
If you inspect, they use iFrames.
100% agree with #brad with running it through a parser, then output it to an iFrame.
I am working on an applet that allows the user to input a URL to a news article or other webpage (in Japanese) and view the contents of that page within an iFrame in my page. The idea is that once the content is loaded into the page, the user can highlight words using their cursor, which stores the selected text in an array (for translating/adding to a personal dictionary of terms) and surrounds the text in a red box (div) according to a stylesheet defined on my domain. To do this, I use cURL to retrieve the HTML of the external page and dump it into the source of the iFrame.
However, I keep running into major formatting problems with the retrieved HTML. The big problem is preserving style sheets, and to fix this, I've used DOMDocument to add tags to the section of the retrieved HTML. This works for some pages/URLs, but there are still lots of style problems with the output HTML for many others. For example, div layers crash into each other, alignments are off, and backgrounds are missing. This is made a bit more problematic as I need to embed the output HTML into a new in order to make the onClick javascript function for passing text selections in the embedded content to work, which means the resulting source ends up looking like this:
<div onclick="parent.selectionFunction()" id ="studyContentn">
<!-- HTML of output from cURL, including doctype declarations and <html>,<head> tags -->
</div>
It seems like for the most part a lot of the formatting issues I keep running into are largely arbitrary. I've tried using php Tidy to clean output from HTML, but that also only works for some pages but not many others. I've got a slight suspicion it may have to do with CDATA declarations that get parsed oddly when working with DOMDocument, but I am not certain.
Is there a way I can guarantee that HTML output from cURL will be rendered correctly and faithfully in all instances? Or is there perhaps a better way of going about doing this? I've tried a bunch of different ways of approaching this issue, and each gets closer to a solution but brings its own new problems as well.
Thanks -- let me know if I can clarify anything.
If I understand correctly you are trying to pull the html of a complete web page and display it under your domain, in your html. This is always going to be tricky, a lot of java script will break, relative url's will be wrong and as you mentioned, styles as well. Your probably also changing the dimensions that the page is displayed in. These can all be worked around but your going to be fighting an uphill battle with each new site, or if a current site change design
I'd probably take a different approach to the problem. You might want to write a browser plugin as the interface to the external web site instead. Then your applet can sit on top of the functional and tested (hopefully) site. Then you can focus on what you need to do for your applet rather than a never ending list of fiddly html issues.
I am trying to do a similar thing. It is very difficult to conserve the formatting, and the JS scripts in webpage complicated the thing. I finally gave up the complete the idea of completely displaying the original format, but do it with a workaround:
Select only headers, links, lists, paragraph which you are interested at.
Add the domain path of your ownsite to links.
You may wrap the headers, links etc. items by your own class.
Display it
in your case you want to select text and store it, which is another topic. What I did is to parse the HTMl in two levels, and then it is easy to do the selection. Keep in mind IE and Firefox/Chrome needs to be dealt with separately.
I'm building an PHP email mailbox script.
How would I make html emails display cleanly as they do in gmail/hotmail.
If I just echo it out it affects the whole page layout.
I could use iframes but surely that isn't the best solution.
If you are looking for the 'best solution' get on board with another open source email library that is doing the same thing you are. Maintaining an email renderer on your own that is safe against script injection and other hacks will simply be too much work for one person.
One example: https://github.com/afterlogic/webmail-lite
Another: http://trac.roundcube.net/
You get the benefit of other developers who use the library maintaining the code base, so if something is broken, all you have to do is pull the latest update (hopefully) and you get the fix. If you find something that needs improving, you can fix it or build it, and make the code better for everyone. I'm really just pitching open source libraries here, however in any commercial context, building your own email renderer without a big team, is a bad idea.
As Marc B stated, I believe an IFrame would be your best bet... but please realize that if you just dump any email HTML code you risk exposing yourself to viruses, Trojans, and malicious HTML/JavaScript code - Your opening Pandora's box on your computer unless you find a good way to sandbox/strip that HTML.
Here's a simple Regex to clean JavaScript at least :
"(?s)<script.*?(/>|</script>)"
Consider the use of some HTML Tidy library (i.e.: PHP.Tidy).
You can pass the text through the library to get well formatted html.
A good practice would be to define a CSS standard behaviour for most tags in the div you're using.
Create a DIV container that you assign width (and height if needed) to, and make sure you add an overflow property to match your design. This should keep your email HTML from interfering with your layout.
UPDATE
A DIV container still assures you that you can constrain the size of the display box and with appropriate CSS acts similar to an iframe without all the baggage.
If you are worried about the code in the email, strip_tags would seem a better solution than the regex. You can define a list of tags to leave alone and still be confident of stripping the rest.
Tumblr and other blogging websites allows people to post embeded codes of videos from youtube and all video networks.
but how they filter only the flash object code and remove any other html or scripts? and even they have an automated code that informes you this is not a valid video code.
Is this done using REGEX expressions? And Is there a PHP class to do that?
Thanks
Generally speaking, using regex is not a good way to deal with HTML : HTML is not regular enough for regular expressions : there are too many variations permitted in the standards... And browsers even accept HTML that's not valid !
In PHP, as your question is tagged as php, a great solution that exists to filter user input is the HTMLPurifier tool.
A couple of interesting things are :
It allows you specify which specific tags are allowed
For each tag, you can define which specific attributes are allowed
Basically, the idea is to only keep what you specify (white-list), instead of trying to remove bad stuff using a black-list (which will never be quite complete).
And if you only specify a list of tags and attributes that can do no harm, only those will be kept -- and the risks of injections are lowered a lot.
Quoting HTMLPurifier's home page :
HTML Purifier is a standards-compliant
HTML filter library written in PHP.
HTML Purifier will not only remove
all malicious code (better known as
XSS) with a thoroughly audited,
secure yet permissive whitelist, it
will also make sure your documents are
standards compliant, something only
achievable with a comprehensive
knowledge of W3C's specifications.
Yes, another great thing is that the code you get as output is valid.
Of course, this will only allow you to clean / filter / purify the HTML input ; it will not allow you to validate that the URL used by the user is both :
correct ; i.e. points to a real content
"OK" as defined by your website ; i.e. for example no nudity, ...
About the second point, there's not much one can do about it : the best solution will be to either :
Have a moderator accept / reject the contents before they're put online
Give the website's users a way to flag some content as inappropriate, so a moderator takes actions.
Basically, to check the content itself of the video, there is not much choice but have a human being say "ok" or "not ok".
About the first point, though, there's hope : some services that host content have APIs that you might want / be able to use.
For instance, Youtube provides an API -- see Developer's Guide: PHP.
In your case, the Retrieving a specific video entry section looks promising : if you send an HTTP request to an URL that looks like this :
http://gdata.youtube.com/feeds/api/videos/videoID
(Replacing "videoID" by the ID of the video, of course)
You'll get some ATOM feed if the video is valid ; and "Invalid id" if it's not
This might help you validate at least some URL to contents -- even if you'll have to develop some specific code for each possible content-hosting service that your users like...
Now, to extract the identifier of the video from your HTML string... If you're thinking about using regex, you are wrong ;-)
The best solution to extract a portion of data from an HTML string is generally to :
Load the HTML using a DOM parser ; DOMDocument::loadHTML is generally pretty helpful, here
Go though the document using DOM methods ; either, depending on your situation :
DOMDocument::getElementsByTagName, if you need to iterate over all elements that have a specific tag name ; might be great to iterate over all <object> or <embed> tags, for instance
Or, if you need something more complex, you could do an XPath query, using the DOMXPath class and its DOMXPath::query method.
And using DOM will also allow you to modify the HTML document using a standard API -- which might help, in case you want to add some message next to the video, or any other thing like that.
Take a look at htmlpurifier to start.
http://htmlpurifier.org/
I have implemented an algorithm for this for the company i work for. It works just fine. BUT, it was quite complicated to implement.
I would definitely check out HTMLPurifier to see if that works in an easy way for you. If you insist on doing it the old-school-way like I did, this is the basic steps:
1.
First of ==> get friendly with stripos()
2.
You have to make an recursive function to identify the start and stop tags for the widget, that includes all combinations of <embed></embed> or <embed/> (selfclosing) or <object></object> ... or <object><params>...<embed/></object>
3.
After this, you have to parse out all attributes and params.
4.
Now, all <object> tags should have <param> tags as child elements. You have to parse all of these to get all the data you need for finally generating a new embed or object tag. Escpecially the params and attributes that holds with, height, data source are important.
5.
Now, you don't know if the attributes are enclosed by single or double-quotes, so your code has to be lenient in this way. Also, you dont know if the code is valid or well formed. So, It should be able to handle nested embed/object tags, embed tags that are not enclosed correctly etc etc... As it is user generatede content, you can't really know and trust the input. You will see that there are lots of combinations.
6.
If you manage to parse the embeded element with all its attributes (or object element and its child params), the whitelisting of domains is easy...
My code ended up to be about 800 lines of code, which is quite large, and it was filled with recursive methods, finding correct stop and end tags etc. My alghorithm also removed all the SEO-text that often are included in the cut&paste embed-code, like links back to the site holding the widget.
Its a good excercise, but If i where you... Don't start walking this road.
Recommendation: Try find something ready made, open source!
This will never be safe. Browsers have those funny little functionalities that help people display content of their pages even if html is messy. There are endless opportunities to get something through :)
check here to see the tip of the iceberg
What You need to do is use a single input for just a link and aditional inputs for width and height and filter those. THEN generate the object tag Yourself.
This might be safe.
http://php.net/manual/en/function.strip-tags.php
and allow certain tags?
The most simple and elegant solution: Allowing HTML and Preventing XSS # shiflett.org.
Using all sorts of "HTML purifier" is more than pointless. Sorry but I don't get people who like to use these bloated libraries when a much simpler solution is in hand.
If you're looking make your site "safe" from vulnerabilities, a white list approach is the (only) way to go. I would recommend safely escaping all user generated content, and white listing only markup you know is safe and works on your site. This means not only <B> tags, but also the flash embeddings.
For example, if you want to allow any youtube to be embedded, write a validation RegEx that looks for the embed code they generate. Refuse to accept any others (or simply display it as escaped markup). This is testable. Forget all this parsing nonsense.
If you also want to add vimeo videos, then look at the embed code they provide and accept that as well.
Ugh? I know this seems like a pain, but in reality it's much easier to write than some algorithm that tries to detect "bad" content in some sort of generic fashion.
After getting the simple version of the algorithm working, you could go back and make it nicer. You could "provisionally" accept content with URLs, scripts, etc. that don't pass your white list, and have an admin process to add approved regexes to your output escaping routine. This way legitimate users aren't left out in the cold, but you don't open your self up to attacks of this nature.