HTML escape only some characters and combination of characters?

HTML escape only some characters and combination of characters? - php

I'm managing a blog where a select few people can submit their own articles and entries. I want them to be able to embed video via HTML (and bold, italicize, etc text at their choosing). How do I do this while maintaining site security?
If I don't HTML escape the actual article space, an open comment will ruin my site. Is there a way to selectively escape some combination of characters?
edit; hopefully without writing my own parser. I just want simple things like <b>, <i>, etc tags unescaped, as well as video and link embedding.

I use what SO uses. it is opensource and has parsers for many languages.
The name is WMD and the question "Where's the WMD editor open source project?" has some QA material outlining this editor.
The question "running showdown.js serverside to conver Markdown to HTML (in PHP)" has some QA material outlining some Markdown libraries in PHP.

You can safely HTML escape everything. URL's for your videos will be unaffected by whatever escaping you want to do.

The simplest way to do this that most sites (such as SO) use is to introduce your own special markup, which is then translated into the features that you want.
For example, SO uses asterisks (*) to italicize and (**) bold (Edit: next to the HTML tags <b></b> itself, see source of this answer).
Other sites use [b] and [i] tags. You could have a [video=http://myvideo.com] tag, which your PHP then translates into the appropriate HTML entity.

Related

Use WYSIWYG Editor with PHP escape Method

I am building a small/test CMS using Php and Mysql.
Everything is working amazingly on the adding, editing, deleting and displaying level, but after finishing my code, I wanted to add a WYSIWYG editor in the Admin back end.
My problem is that I am using escape method to hopefully make my form a bit more secure and try to escape injections, therefore when adding a styled text, image or any other HTML code in my Editor I am getting them printed as line codes on my page(Which is completely right to avoid attacks).
MY ESCAPE METHOD:
function e($text) {
return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');}
Is there any way to work around my escape method (which is think it should not be done because if I can do it every attacker could).
Or should I change my escape method to another method?

If I understand you correctly you are going to allow your users to put some formatting into the text they are going to create. For this you are going to add some WYSISWYG editor. But the question is how to distinguish the formatting and special characters which are allowed from what is not allowed. You need to clean up the text and leave only valid allowed formatting (HTML tags) and remove all malicious JavaScript or HTML.
This is not an easy task like it might sound at the first moment. I can see several approaches here.
Easiest solution to use strip_tags and specify what tags are allowed.
But please keep in mind that strip_tags is not perfect. Let me quote the manual here.
Because strip_tags() does not actually validate the HTML, partial or
broken tags can result in the removal of more text/data than expected.
This function does not modify any attributes on the tags that you
allow using allowable_tags, including the style and onmouseover
attributes that a mischievous user may abuse when posting text that
will be shown to other users.
This is a known issue. And libraries exist which do a better cleanup of HTML and JS to prevent breaks.
A bit more complicated solution would be to use some advanced library to cleanup the HTML code. For example this might be HTML Purifier
Quote from the documentation
HTML Purifier will not only remove all malicious code (better known as
XSS) with a thoroughly audited, secure yet permissive whitelist, it
will also make sure your documents are standards compliant, something
only achievable with a comprehensive knowledge of W3C's
specifications.
The other libraries exist which solve the same task. You can check for example this article where libraries are compared. And finally you might choose the best one.
Completely different approach is to avoid users from writing HTML tags. Ask them to write some other markup instead like this is done on StackOverflow or Basecamp or GitHub. Markdown might be a good approach.
Using simple markup for text allows you to complete avoid issues with broken HTML and JavaScript cause you can escape everything and build HTML markup on your own.
The editor might look like the one I'm using to write this message :)

You can use strip_tags() to remove the unwanted tags. Read about it on this manual:
http://php.net/manual/en/function.strip-tags.php
Example 1 (Based on the manual)
<?php
$text = '<p>Test paragraph, With link.</p>';
# Output: Test paragraph, With link. (Tags are stripped)
echo strip_tags($text);
echo "\n";
# Allow <p> and <a>
#Output: <p>Test paragraph, With link.</p>
echo strip_tags($text, '<p><a>');
?>
I hope this will help you!

Cleaning up HTML formatted content for display within Flash?

I want to display HTML formatted content from various sources inside a Flash Flex application. Flash supports HTML formatting in its text fields, however it is very limited compared to a web browser. Are there any scripts out there that will convert common HTML formatted text into a format that Flash can handle? My particular use cases are:
Displaying HTML formatted emails inside Flash
Displaying RTF files inside Flash (after running an RTF2HTML conversion on the server)
Displaying random HTML content copied and pasted from other sources into Flash
I'm open to code that runs either on the client or the server, but server is probably preferable.

The subset of html tags supported is quite poor and has not changed in forever:
<a>, <b>, <br>, <font>, <img>, <i>, <li>, <p>, <textformat>, <u>
This means is that regardless of conversion quality, html cannot be rendered as fully intended; you could also be giving up a significant portion of css styling if you replace unsupported tags with more basic ones.
That being said, http://simplehtmldom.sourceforge.net/ (PHP) would work with some tweaks and it's competent enough to cope with invalid markup as well (seeing how you're after processing content from various sources, I'd say this feature alone would save a lot of pain in the long run) - than replace
<h1>,...,<h6> => <b>
<strong> => <b>
<em> => <i>
and plaintext the rest of it into paragraphs you'd be surprised at how readable it would still be. You could be a bit fancy too like so:
<h1> => <b class="header1">
and add some css as appropriate (although flash css support is pretty limited too)
I've been saving this one for desert - you'll either love it or hate it but it would do the trick. Assuming your app is deployed in-browser (if not and I misread you, save me the embarrassment and stop reading right here) you could use an iframe to display your html, seriously.
JS<->AS communication is fairly straightforward and you could have it positioned over a predetermined area of your app, giving the illusion that it's part of it; just remember to set windowmode on the flash object/embed correctly so it does not render on top of other page elements, then increase the iframe z-index.
I would not be surprised if this is seen as an "ugly" approach, but it's beautiful on the inside - you'll end up with verbatim html and real css support. As for user interactions, you could even intercept link clicks etc. in the iframe and request an action from the movieclip.

You can use HTMLPurifier and specify a whitelist of tags that you want to support.

AS3 HTML Parser Library is not quite what I'm looking for, since it does not convert the HTML but instead renders it within Flash, meaning that it wont be editable. But it may be useful in some cases that I only want to display and not edit text.
Another option is to look at several sample HTML that I'd like to be able to display, and then write regex to convert them to the format Flash/TLF expects. But I feel like that may be a huge endeavor, due to the wide range of HTML out there.

WYSIWYG editor, markdown or bbcode or wiki or what

I'm finding some editors that say they use bbcode or markdown or wiki, etc.
Can someone explain what this is all about?

It's a simplified set of mark-up language tags to help with formatting of content in situations where you do not trust the data being entered, it will prevent users having to enter direct html that may otherwise break a page or permit exploits from working their way into the page's output. These tags will be processed and formatted into HTML by the software.
Typing this answer into Stackoverflow required me to use a form of markup/markdown language :)
eg
[b]Bold[/b]
[url=http://www.google.com]Link Text[/url]
[img]http://www.domain.com/image.jpg[/img]
For wikipedia entries on each type of markup:
http://en.wikipedia.org/wiki/BBCode
http://en.wikipedia.org/wiki/Help:Wiki_markup
http://en.wikipedia.org/wiki/Markdown

BBCode, Markdown and wiki are markup languages used for writing the code on web pages that support user-entered content.
This is quite often to include one or more of these markups instead of enabling the whole WYSIWYG editor, especially when usually users enter text without any markup or with limited features.
Googling does not hurt.

How to parse HTML for minification in PHP?

I'm looking to write an algorithm to compress HTML output for a CMS I'm writing in PHP, written with the CodeIgniter framework.
I was thinking of trying to remove whitespace between any angle brackets, except the <script>, <pre>, and <style> elements, and simply ignoring those elements for simplicity. I should clarify that this is whitespace between consecutive tags, with no text between them.
How should I go about parsing the HTML to find the whitespace I want to remove?
Edit:
To start off, I want to remove all tab characters that are not in <pre> tags. This can be done with regex, I'm sure, but what are the alternatives?

Don't. Whitespace is negligible. Better to be using output compression, with zlib or here for example

Is there something wrong with the existing HTML minification solutions?
Minify does HTML (as well as CSS and JS).
(That second link goes to the source code, which comments the steps it takes - should be a good leg up if you did want to create your own - it's BSD licensed.)
Also, as Pete says, you'll benefit much more by using gzip compression for your HTML (and CSS/JS/etc), and wont get tripped up by problems such as Gordon mentioned in his comment.

Is there a way to decode html e-mails?

I am writing support software and I figured for highlighting stuff it would be great to have HTML support.
Looking at Outlooks "HTML" I want to crawl up into the fetal position and cry!
Is there a php class to unscramble HTML emails to support basic HTML? I don't want to display the E-Mails in a frame because I want to work with the data and analyse it. I also don't want to support stupid things like changing font since its a webapp I want my webapp to say what the font is and not have some hippie who sends the support team e-mails in comic sans and yellow color. I want to support bold, italic, underlined, streched out and lists (http://dl.getdropbox.com/u/5910/Jing/2009-02-23_2100.png).
I also don't quite know the difference between rich-text and html since I always thought rich-text only allowed the functions I wanted but I seem to be able to do everything in rich-text which I can do in Html.
Also I should add I am using the Zend Framework because of the fabulous Zend_Mail

You can pipe it through htmltidy and then further filter it with something like HtmlPurifier, but of course you may strip out something that is essential to understanding the contents. That's the problem with a visual format, like html.

You can use PHP's strip_tags() function, and it's optional "allowable_tags" parameter. This will allow you to strip out all the tags that are not <em> <b> <strong> <u> etc.
About RTF vs. HTML, my understanding is that when Outlook and Exchange communicate with non-RTF compliant systems they convert RTF to HTML. I'm not sure this is always true, or how consistent that function is, but that might explain why messages sent RTF appear to be HTML.

I'm pretty sure you'll have to write your own class... there is no real class like that in the PHP documents I've seen..

Or you could use the plain-text variant attached to the e-mail. If there is no plain-text variant you could use a stripped version of the html. I think using these steps you would have a nice result:
Remove newlines
Turn </p> and <br/> into newline
Strip all html tags

Pulling out the HTML from an Outlook mail may seem scary at first, but it's only HTML tags - just a whole lot of them!
So if you just locate to a "<" and then find the next ">" you have a tag. If it is not something you want to have, like "</strong>" just throw it away and repeat Simple as that.
(I have done exactly this in a spelling and grammar checker which not only pulls out plain text from Outlook and checks it - it can then push all the user's changes back into the HTML without destroying any tags. The latter was not easy, though! ;-)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.