html_entity_decode Terminate? - php

I'm using html_entity_decode($row['Content']) to display some JSON data that contains HTML in a PHP document. Problem is that some of the data being returned has open HTML tags such as <strong> which then carry on to the content displayed after.
Is there some way to terminate the HTML?

If you ever accept raw HTML from an outside source to embed into your site, you should always, always, reformat and whitelist it. You have no idea what that 3rd party HTML may contain, and you have no guarantee that it's valid; yet on your site you presumably want guaranteed valid HTML with certain limits on its content (or do you really want to enable the embedding of arbitrary <script> tags...?!).
That means you want to:
parse the HTML and extract whatever structural information is in it
filter that structure to allow only approved elements and then
produce your own HTML from that which you can guarantee is syntactically valid.
Supposedly the best PHP library for this is HTML Purifier. Without a library, you would use a lenient HTML parser such as DOMDocument to inspect and filter the content, and then the built-in DOMDocument::saveXML to produce the new sanitised HTML.
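
For the narrower problem in the question (an unclosed <strong> leaking into the rest of the page), here is a minimal sketch of the parse-and-reserialise idea. The helper name closeOpenTags is hypothetical, and serialising with saveHTML and wrapping the input in <html><body> are my assumptions rather than anything the answer prescribes. It only balances the markup; it does not filter anything.

```php
<?php
// Hypothetical helper (not from the original answer): parse a possibly
// unbalanced HTML fragment and serialise it again with every tag closed.
function closeOpenTags(string $fragment): string
{
    $doc = new DOMDocument();

    // Malformed input makes libxml emit warnings; collect them silently instead.
    $previous = libxml_use_internal_errors(true);
    // Wrapping in <html><body> keeps the parser from adding its own <p> wrapper;
    // the <meta> tag asks libxml to read the input as UTF-8.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
        . $fragment . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialise only the children of <body>, i.e. the fragment itself.
    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

echo closeOpenTags('Some <strong>bold text');
// prints: Some <strong>bold text</strong>
```

In the question's setup this would be called as closeOpenTags(html_entity_decode($row['Content'])). Since nothing is filtered here, a <script> in the data would still pass straight through.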

Related

Using markdown properly in a PHP application

I work on a web application that uses Markdown as its syntax. The only issue I am facing is how to validate the user input on the server side so that it is actually Markdown and not some XSS attack that could be injected using a POST request or by disabling JavaScript.
I know StackOverflow does this but how do they do it and allow certain HTML tags including images that are prone to XSS attacks? Any open source package that can help (examples appreciated).
Because I heard that StackOverflow uses it, I will be trying out Pagedown as a client-side validator.
You need to invest about one to two weeks of proper coding to finish a tag-soup parser/handler that can sanitize the incoming HTML (via Markdown).
I highly suggest a three-pass validation and processing scheme:
Mixed mode: whitelist incoming HTML tags that are part of the Markdown document.
Markdown parser: transform Markdown into HTML.
HTML mode: whitelist HTML tags that are in the HTML document.
You can then output. Store both the Markdown source and the "baked" HTML data so you don't need to do this for every display operation.
Markdown allows arbitrary HTML to be included in it. Since this includes <script> elements, you can have valid Markdown that is also an XSS attack.
Run the incoming data through a Markdown parser to get HTML, then treat it like any other user submitted HTML (pass it through an HTML parser that applies a whitelist to the elements and attributes).

Cleaning up HTML formatted content for display within Flash?

I want to display HTML formatted content from various sources inside a Flash Flex application. Flash supports HTML formatting in its text fields, however it is very limited compared to a web browser. Are there any scripts out there that will convert common HTML formatted text into a format that Flash can handle? My particular use cases are:
Displaying HTML formatted emails inside Flash
Displaying RTF files inside Flash (after running an RTF2HTML conversion on the server)
Displaying random HTML content copied and pasted from other sources into Flash
I'm open to code that runs either on the client or the server, but server is probably preferable.
The subset of html tags supported is quite poor and has not changed in forever:
<a>, <b>, <br>, <font>, <img>, <i>, <li>, <p>, <textformat>, <u>
This means that regardless of conversion quality, HTML cannot be rendered as fully intended; you could also be giving up a significant portion of CSS styling if you replace unsupported tags with more basic ones.
That being said, http://simplehtmldom.sourceforge.net/ (PHP) would work with some tweaks, and it's competent enough to cope with invalid markup as well (since you're processing content from various sources, I'd say this feature alone would save a lot of pain in the long run). Then replace
<h1>,...,<h6> => <b>
<strong> => <b>
<em> => <i>
and convert the rest to plain-text paragraphs; you'd be surprised at how readable it would still be. You could get a bit fancier too, like so:
<h1> => <b class="header1">
and add some css as appropriate (although flash css support is pretty limited too)
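
A rough sketch of the tag replacement described above. It uses PHP's built-in DOMDocument instead of Simple HTML DOM (a swap on my part so the example has no dependencies); the function name mapTagsForFlash and the FLASH_TAG_MAP constant are made up for illustration. It only renames tags; flattening everything else into paragraphs is left out.

```php
<?php
// Hypothetical mapping from tags Flash text fields ignore to ones they render.
const FLASH_TAG_MAP = [
    'h1' => 'b', 'h2' => 'b', 'h3' => 'b', 'h4' => 'b', 'h5' => 'b', 'h6' => 'b',
    'strong' => 'b',
    'em' => 'i',
];

function mapTagsForFlash(string $fragment): string
{
    $doc = new DOMDocument();
    // Input from "various sources" is rarely valid; silence parser warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<html><body>' . $fragment . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach (FLASH_TAG_MAP as $from => $to) {
        // getElementsByTagName returns a live list, so copy it before changing the tree.
        foreach (iterator_to_array($doc->getElementsByTagName($from)) as $old) {
            $new = $doc->createElement($to);
            // Move the children across; attributes of the old element are dropped.
            while ($old->firstChild !== null) {
                $new->appendChild($old->firstChild);
            }
            $old->parentNode->replaceChild($new, $old);
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    $html = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $html .= $doc->saveHTML($child);
        }
    }
    return $html;
}

echo mapTagsForFlash('<h1>Title</h1><p><strong>Hi</strong> <em>there</em></p>');
```

The "fancy" variant would add a class to the new element, e.g. $new->setAttribute('class', 'header1') when $from is h1.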
I've been saving this one for dessert - you'll either love it or hate it, but it would do the trick. Assuming your app is deployed in-browser (if not and I misread you, save me the embarrassment and stop reading right here), you could use an iframe to display your HTML, seriously.
JS<->AS communication is fairly straightforward, and you could have the iframe positioned over a predetermined area of your app, giving the illusion that it's part of it; just remember to set the window mode (wmode) on the Flash object/embed correctly so it does not render on top of other page elements, then increase the iframe z-index.
I would not be surprised if this is seen as an "ugly" approach, but it's beautiful on the inside - you'll end up with verbatim html and real css support. As for user interactions, you could even intercept link clicks etc. in the iframe and request an action from the movieclip.
You can use HTMLPurifier and specify a whitelist of tags that you want to support.
AS3 HTML Parser Library is not quite what I'm looking for, since it does not convert the HTML but instead renders it within Flash, meaning that it won't be editable. But it may be useful in cases where I only want to display text and not edit it.
Another option is to look at several sample HTML that I'd like to be able to display, and then write regex to convert them to the format Flash/TLF expects. But I feel like that may be a huge endeavor, due to the wide range of HTML out there.

White list of HTML tags I should allow from user generated content?

All,
I am building a small site using PHP. In the site, I receive user-generated text content. I want to allow some safe HTML tags (e.g., formatting) as well as MathML. How do I go about compiling a white list for a strip_tags() function? Is there a well-accepted white list I can use?
The standard strip_tags function is not enough for security, since it doesn't validate attributes at all. Use a more complete library explicitly for the purpose of completely sanitizing HTML like HTML Purifier.
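
A quick illustration of the attribute problem (the input string is made up): even with only <b> in the allow-list, the event handler on the allowed tag passes through untouched.

```php
<?php
// Made-up user input: an allowed tag carrying an event handler, plus a script tag.
$input = '<b onmouseover="alert(1)">hover me</b><script>alert(2)</script>';

// Allow only <b>. strip_tags() removes the <script> tags themselves,
// but it never inspects attributes on the tags it keeps.
$result = strip_tags($input, '<b>');

echo $result;
// prints: <b onmouseover="alert(1)">hover me</b>alert(2)
```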
If your aim is to not allow javascript through, then your whitelist of tags is going to be pretty close to the empty set.
Remember that pretty much all tags can have event attributes that contain javascript code to be executed when the specified event occurs.
If you don't want to go down the HTMLPurifier kind of route, consider a different language, such as markdown (that this site uses) or some other wiki-like markup language; however, be sure to disable any use of passthrough HTML that may be allowed.

PHP: How can I disallow HTML content in user-generated content?

I run a niche social network site. I would like to disallow HTML content, such as embedded videos, in user-posted messages. What options are there in PHP to clean this up before I insert it into the DB?
There are three basic solutions:
Strip all HTML tags from the post. In PHP you can do this using the strip_tags() function.
Encode all the characters, so that if a user types <b>hello</b> it shows up as &lt;b&gt;hello&lt;/b&gt; in the HTML source, or <b>hello</b> on the page itself. In PHP this is the htmlspecialchars() function. (Note: in this situation you would generally store the content in the database as-is, and use htmlspecialchars wherever you output the content.)
Use an HTML sanitizer such as HTML Purifier. This allows users to use certain HTML formatting such as bold/italic, but blocks malicious JavaScript and any other tags you wish (e.g. <object> in your case). You may or may not wish to do this before storing in the database, but you must always do it before output in either case.
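
A small sketch of the first two options side by side; the sample post and the flags passed to htmlspecialchars() are my own choices, not part of the answer.

```php
<?php
// Made-up user post containing formatting and a script.
$post = '<b>hello</b> <script>alert("x")</script>';

// Option 1: remove the tags entirely (the text inside them stays).
$stripped = strip_tags($post);

// Option 2: store the post as-is and escape it at output time,
// so the browser displays the markup as literal text.
$escaped = htmlspecialchars($post, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo $stripped, "\n"; // hello alert("x")
echo $escaped, "\n";  // &lt;b&gt;hello&lt;/b&gt; &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;
```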
You could use the strip_tags() function.

Alternative of html purifier

I want to accept HTML input from users and post it on my site, and I also want to make sure that it doesn't break my site template because of dirty HTML code.
I was using HTML Purifier in the past, but HTML Purifier is not working on one of my servers, so I am searching for the best alternative,
one which is written purely in PHP
and which can fix dirty HTML code such as
</div>, which is dirty because the div is closed without being opened.
Simple solution without third-party libraries: create a DOMDocument and call loadHTML on it with your input. Surround the input with <html> and <body> tags if you are only parsing a little snippet. You'll probably want to suppress warnings too, as you'll get them spat out for common bad HTML.
Then simply walk over the resulting document tree, removing any elements and attributes you've not included in a known-good list. You should also check allowed URL attributes to ensure they use known-good schemes like http:, and not potentially troublesome schemes like javascript:. If you want to go the extra mile you can check that only allowed combinations of elements are nested inside each other (this is easier the smaller number of elements you're allowing).
Finally, serialise the snippet's node again using saveHTML. Because you're creating new markup from a DOM, not maintaining the original—potentially malformed—markup, that's a whole class of odd-markup injection techniques you're blocking.
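
The walk described above could look roughly like this. Everything named here (ALLOWED_TAGS, ALLOWED_SCHEMES, DROP_WITH_CONTENT, sanitizeHtml and the helpers) is hypothetical, the allow-list is deliberately tiny, and the nesting check is left out. Treat it as a sketch to adapt, not a vetted sanitizer.

```php
<?php
// Hypothetical allow-list: tag name => attributes permitted on that tag.
const ALLOWED_TAGS = [
    'a' => ['href'], 'b' => [], 'i' => [], 'em' => [], 'strong' => [],
    'p' => [], 'br' => [], 'ul' => [], 'ol' => [], 'li' => [],
];
const ALLOWED_SCHEMES = ['http', 'https', 'mailto'];
// Elements removed together with everything inside them.
const DROP_WITH_CONTENT = ['script', 'style', 'iframe', 'object', 'embed'];

function hasAllowedScheme(string $url): bool
{
    // Drop whitespace/control characters so "java\tscript:" cannot slip through.
    $url = preg_replace('/[\x00-\x20]+/', '', $url);
    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $url, $m)) {
        return in_array(strtolower($m[1]), ALLOWED_SCHEMES, true);
    }
    return true; // no scheme: treat as a relative URL
}

function sanitizeNode(DOMNode $node): void
{
    // Copy the child list first, because the loop removes and moves nodes.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMText) {
            continue; // text is kept; it is escaped again on output
        }
        if (!($child instanceof DOMElement)) {
            $node->removeChild($child); // comments, processing instructions
            continue;
        }
        $tag = strtolower($child->tagName);
        if (in_array($tag, DROP_WITH_CONTENT, true)) {
            $node->removeChild($child);
            continue;
        }
        sanitizeNode($child); // clean descendants before deciding about this element
        if (!array_key_exists($tag, ALLOWED_TAGS)) {
            // Unknown element: keep its cleaned children, drop the element itself.
            while ($child->firstChild !== null) {
                $node->insertBefore($child->firstChild, $child);
            }
            $node->removeChild($child);
            continue;
        }
        foreach (iterator_to_array($child->attributes) as $attr) {
            $name = strtolower($attr->nodeName);
            $allowed = in_array($name, ALLOWED_TAGS[$tag], true);
            if (!$allowed || ($name === 'href' && !hasAllowedScheme($attr->nodeValue))) {
                $child->removeAttribute($attr->nodeName);
            }
        }
    }
}

function sanitizeHtml(string $fragment): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // bad HTML triggers warnings
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
        . $fragment . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    sanitizeNode($body);

    // Serialise fresh markup from the DOM instead of reusing the original string.
    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

echo sanitizeHtml('</div><p onclick="x()">Hi <a href="javascript:alert(1)">link</a> <strong>bold');
```

Here the stray </div> is discarded by the parser, the onclick and javascript: href are removed, and the open <strong> and <p> come back closed.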
You can try PHP Tidy, which is the Tidy library in PHP.
I believe Tidy will help close your tags, but it isn't as comprehensive as HTML Purifier, which can remove valid but unwanted tags or attributes (e.g. JavaScript onclick events, that kind of thing).
Be aware that Tidy requires libtidy to be installed on your server, so it's not just straight PHP.
I know Pádraic Brady has been working on an alternative to HTML Purifier for Zend Framework, though I think it's just experimental code at this time:
http://framework.zend.com/wiki/pages/viewpage.action?pageId=25002168
http://github.com/padraic/wibble
Do also consider HTMLawed at https://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/
From that page:
"use to filter, secure & sanitize HTML in blog comments or forum posts, generate XML-compatible feed items from web-page excerpts, convert HTML to XHTML, pretty-print HTML, scrape web-pages, reduce spam, remove XSS code, etc."
Note that Tidy (HTML Tidy) is NOT an anti-XSS solution. It is a clean-and-repair utility which allows you to clean HTML, XHTML, and XML markup.
HTMLawed is a single 55 KB PHP file, whilst HTML Purifier is a 3 MB folder.
