I am writing support software and I figured for highlighting stuff it would be great to have HTML support.
Looking at Outlook's "HTML" makes me want to curl up into the fetal position and cry!
Is there a PHP class to unscramble HTML e-mails down to basic HTML? I don't want to display the e-mails in a frame, because I want to work with the data and analyse it. I also don't want to support stupid things like changing the font: since it's a web app, I want my web app to say what the font is, and not some hippie who sends the support team e-mails in Comic Sans and yellow. I want to support bold, italic, underlined, stretched out and lists (http://dl.getdropbox.com/u/5910/Jing/2009-02-23_2100.png).
I also don't quite know the difference between rich text and HTML. I always thought rich text only allowed the functions I wanted, but I seem to be able to do everything in rich text that I can do in HTML.
Also I should add I am using the Zend Framework because of the fabulous Zend_Mail
You can pipe it through htmltidy and then further filter it with something like HtmlPurifier, but of course you may strip out something that is essential to understanding the contents. That's the problem with a visual format, like html.
You can use PHP's strip_tags() function and its optional "allowable_tags" parameter. This lets you strip out every tag except the ones you list, such as <em> <b> <strong> <u> etc.
About RTF vs. HTML, my understanding is that when Outlook and Exchange communicate with non-RTF compliant systems they convert RTF to HTML. I'm not sure this is always true, or how consistent that function is, but that might explain why messages sent RTF appear to be HTML.
I'm pretty sure you'll have to write your own class. There is no ready-made class like that in the PHP documentation I've seen.
Or you could use the plain-text variant attached to the e-mail. If there is no plain-text variant you could use a stripped version of the html. I think using these steps you would have a nice result:
Remove newlines
Turn </p> and <br/> into newline
Strip all html tags
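Those three steps could be sketched roughly like this. It is written in Python for illustration; in PHP the same steps map onto str_replace(), preg_replace() and strip_tags(). The function name html_to_text is made up for this example, and the sketch assumes the input has no <style> or <script> blocks, whose contents would otherwise survive as text.

```python
import html
import re


def html_to_text(markup: str) -> str:
    """Rough plain-text fallback following the three steps above."""
    # Step 1: remove newlines that are only source formatting.
    text = markup.replace("\r", "").replace("\n", " ")
    # Step 2: turn paragraph ends and line breaks into real newlines.
    text = re.sub(r"(?i)</p\s*>|<br\s*/?>", "\n", text)
    # Step 3: strip every remaining tag.
    text = re.sub(r"<[^>]*>", "", text)
    # Extra tidy-up (not in the original steps): decode entities such as
    # &amp; and collapse runs of whitespace on each line.
    text = html.unescape(text)
    lines = [" ".join(line.split()) for line in text.split("\n")]
    return "\n".join(lines).strip()


print(html_to_text("<p>Hello <b>world</b></p>\n<p>Second<br/>line &amp; more</p>"))
```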
Pulling out the HTML from an Outlook mail may seem scary at first, but it's only HTML tags - just a whole lot of them!
So if you just locate a "<" and then find the next ">", you have a tag. If it is not something you want to keep, like "</strong>", just throw it away and repeat. Simple as that.
(I have done exactly this in a spelling and grammar checker which not only pulls out plain text from Outlook and checks it - it can then push all the user's changes back into the HTML without destroying any tags. The latter was not easy, though! ;-)
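A minimal sketch of that "<" to ">" scan, in Python for illustration. ALLOWED and keep_allowed_tags are names invented for this example, and re-emitting kept tags without their attributes is my own addition, not something the answer above prescribes. It assumes no ">" appears inside attribute values or comments, which Outlook's conditional comments can violate.

```python
ALLOWED = {"b", "strong", "i", "em", "u", "ul", "ol", "li"}


def keep_allowed_tags(markup: str, allowed=ALLOWED) -> str:
    out = []
    pos = 0
    while True:
        start = markup.find("<", pos)
        if start == -1:
            out.append(markup[pos:])          # no more tags: keep the tail
            break
        end = markup.find(">", start)
        out.append(markup[pos:start])         # text before the tag
        if end == -1:
            break                             # unterminated tag: drop the rest
        inner = markup[start + 1:end].strip()
        closing = inner.startswith("/")
        parts = inner.lstrip("/").split()
        name = parts[0].rstrip("/").lower() if parts else ""
        if name in allowed:
            # Re-emit a bare tag so attributes such as style= are dropped.
            out.append(f"</{name}>" if closing else f"<{name}>")
        pos = end + 1
    return "".join(out)


sample = ('<p class="MsoNormal"><b style="color:yellow">Hi</b> '
          '<font face="Comic Sans MS">there</font><o:p></o:p></p>')
print(keep_allowed_tags(sample))
```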
Related
I am building a small test CMS using PHP and MySQL.
Everything is working amazingly on the adding, editing, deleting and displaying level, but after finishing my code, I wanted to add a WYSIWYG editor in the Admin back end.
My problem is that I am using an escape method to make my form a bit more secure and to guard against injection. So when I add styled text, an image or any other HTML in my editor, it gets printed as literal code on my page (which is exactly right for avoiding attacks).
MY ESCAPE METHOD:
function e($text) {
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}
Is there any way to work around my escape method (which I think should not be done, because if I can do it, every attacker could)?
Or should I change my escape method to another method?
If I understand you correctly, you are going to allow your users to put some formatting into the text they create, and for this you are going to add a WYSIWYG editor. The question is how to distinguish the formatting and special characters that are allowed from what is not. You need to clean up the text, keeping only the allowed formatting (HTML tags) and removing all malicious JavaScript or HTML.
This is not as easy as it might sound at first. I can see several approaches here.
The easiest solution is to use strip_tags and specify which tags are allowed.
But please keep in mind that strip_tags is not perfect. Let me quote the manual here.
Because strip_tags() does not actually validate the HTML, partial or
broken tags can result in the removal of more text/data than expected.
This function does not modify any attributes on the tags that you
allow using allowable_tags, including the style and onmouseover
attributes that a mischievous user may abuse when posting text that
will be shown to other users.
This is a known issue, and libraries exist that do a better job of cleaning up HTML and JS to prevent breakage.
A slightly more complicated solution would be to use a dedicated library to clean up the HTML code, for example HTML Purifier.
Quote from the documentation
HTML Purifier will not only remove all malicious code (better known as
XSS) with a thoroughly audited, secure yet permissive whitelist, it
will also make sure your documents are standards compliant, something
only achievable with a comprehensive knowledge of W3C's
specifications.
Other libraries exist that solve the same task. You can check, for example, this article where libraries are compared, and then choose the best one.
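To show what whitelist-based cleanup does differently from strip_tags, here is a small sketch in Python using the standard html.parser module. It is not HTML Purifier's API or algorithm, and the names WhitelistSanitizer and sanitize are invented for this example. It re-emits allowed tags without any attributes, escapes all text, and drops the contents of <script> and <style>. Treat it as an illustration only, not as a replacement for an audited library.

```python
from html import escape
from html.parser import HTMLParser


class WhitelistSanitizer(HTMLParser):
    ALLOWED = {"b", "strong", "i", "em", "u", "p", "ul", "ol", "li"}
    DROP_CONTENT = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.DROP_CONTENT:
            self.skip_depth += 1
        elif tag in self.ALLOWED and not self.skip_depth:
            self.parts.append(f"<{tag}>")  # attributes deliberately discarded

    def handle_endtag(self, tag):
        if tag in self.DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.ALLOWED and not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(escape(data))  # text is always re-escaped


def sanitize(markup: str) -> str:
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


dirty = ('<p style="font-family:Comic Sans MS">'
         '<b onmouseover="steal()">Hi</b><script>alert(1)</script> there</p>')
print(sanitize(dirty))
```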
A completely different approach is to keep users from writing HTML tags at all. Ask them to write some other markup instead, as is done on StackOverflow, Basecamp or GitHub. Markdown might be a good choice.
Using simple markup for text allows you to completely avoid issues with broken HTML and JavaScript, because you can escape everything and build the HTML markup on your own.
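The "escape everything, then build the HTML yourself" idea can be sketched like this, in Python for illustration. The toy syntax (**bold**, *italic*, blank line between paragraphs) and the name render_simple_markup are assumptions for this example. It is nowhere near a real Markdown parser, for which you would use an existing library.

```python
import html
import re


def render_simple_markup(source: str) -> str:
    # Escape first, so any HTML the user typed becomes inert text.
    safe = html.escape(source, quote=True)
    # Then add the only tags we generate ourselves.
    safe = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", safe)
    safe = re.sub(r"\*(.+?)\*", r"<em>\1</em>", safe)
    # Blank lines separate paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", safe) if p.strip()]
    return "\n".join(f"<p>{p}</p>" for p in paragraphs)


print(render_simple_markup("Hello **world**\n\n<script>alert(1)</script> *hi*"))
```

Because the escaping happens before any tags are generated, the only markup that can reach the page is the markup this function writes itself.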
The editor might look like the one I'm using to write this message :)
You can use strip_tags() to remove the unwanted tags. Read about it in the manual:
http://php.net/manual/en/function.strip-tags.php
Example 1 (Based on the manual)
<?php
$text = '<p>Test paragraph, With link.</p>';

# Output: Test paragraph, With link. (Tags are stripped)
echo strip_tags($text);
echo "\n";

# Allow <p> and <a>
# Output: <p>Test paragraph, With link.</p>
echo strip_tags($text, '<p><a>');
?>
I hope this will help you!
I'm building email functionality into an application. Almost all email clients mark previous messages in the reply text with a vertical line (perhaps just the "|" character?) along the entire left-hand side of the message.
Does anyone know a function/utility (preferably in Python, though I can adapt from anything) that would format HTML and text content in this way? It sounds like a pretty easy problem, but it's actually quite complex.
For plain text you can use the textwrap module. Just specify '| ' as the initial_indent and subsequent_indent arguments.
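A minimal sketch of that textwrap approach. The function name quote_reply and the default width of 72 are choices made for this example. Wrapping line by line and keeping blank lines as a bare "|" is also my assumption about the desired layout.

```python
import textwrap


def quote_reply(body: str, width: int = 72, prefix: str = "| ") -> str:
    wrapper = textwrap.TextWrapper(
        width=width,
        initial_indent=prefix,
        subsequent_indent=prefix,
    )
    quoted = []
    for line in body.splitlines():
        # wrap() returns [] for blank lines, so keep them as a bare prefix.
        quoted.extend(wrapper.wrap(line) or [prefix.rstrip()])
    return "\n".join(quoted)


print(quote_reply("Hello there\n\nSecond paragraph", width=20))
```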
Doing it with HTML requires a different approach, since HTML is normally formatted by the browser. I guess it should be possible using CSS, but I've never done it myself, so I don't know the details.
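One CSS-based way to get that look, sketched in Python, is to wrap the quoted fragment in a <blockquote> with a left border. The function name quote_html and the exact style values are assumptions for this example, and the fragment passed in is assumed to be sanitized already.

```python
def quote_html(body_html: str) -> str:
    # A left border on the blockquote draws the vertical "reply" line.
    style = "margin:0 0 0 .8ex; border-left:1px solid #ccc; padding-left:1ex"
    return f'<blockquote style="{style}">{body_html}</blockquote>'


print(quote_html("<p>Original message</p>"))
```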
I want to display HTML formatted content from various sources inside a Flash Flex application. Flash supports HTML formatting in its text fields, however it is very limited compared to a web browser. Are there any scripts out there that will convert common HTML formatted text into a format that Flash can handle? My particular use cases are:
Displaying HTML formatted emails inside Flash
Displaying RTF files inside Flash (after running an RTF2HTML conversion on the server)
Displaying random HTML content copied and pasted from other sources into Flash
I'm open to code that runs either on the client or the server, but server is probably preferable.
The subset of html tags supported is quite poor and has not changed in forever:
<a>, <b>, <br>, <font>, <img>, <i>, <li>, <p>, <textformat>, <u>
This means that, regardless of conversion quality, HTML cannot be rendered fully as intended; you could also be giving up a significant portion of CSS styling if you replace unsupported tags with more basic ones.
That being said, http://simplehtmldom.sourceforge.net/ (PHP) would work with some tweaks, and it's competent enough to cope with invalid markup as well (seeing how you're after processing content from various sources, I'd say this feature alone would save a lot of pain in the long run). Then replace
<h1>,...,<h6> => <b>
<strong> => <b>
<em> => <i>
and turn the rest of it into plain-text paragraphs; you'd be surprised at how readable it would still be. You could be a bit fancy too, like so:
<h1> => <b class="header1">
and add some css as appropriate (although flash css support is pretty limited too)
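The tag replacement described above could be sketched like this, in Python with a regular expression rather than the PHP DOM library mentioned. TAG_MAP, FLASH_TAGS and to_flash_html are names made up for this example, and a regex will not cope with badly broken markup the way a tolerant parser would.

```python
import re

# Replacements suggested above: headings and <strong> become <b>, <em> becomes <i>.
TAG_MAP = {"strong": "b", "em": "i", **{f"h{n}": "b" for n in range(1, 7)}}
# The tag subset listed above as supported by Flash text fields.
FLASH_TAGS = {"a", "b", "br", "font", "img", "i", "li", "p", "textformat", "u"}

TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")


def to_flash_html(markup: str) -> str:
    def convert(match):
        slash, name, rest = match.groups()
        name = name.lower()
        if name in TAG_MAP:
            # Mapped tags are re-emitted without their original attributes.
            return f"<{slash}{TAG_MAP[name]}>"
        if name in FLASH_TAGS:
            return f"<{slash}{name}{rest}>"
        return ""  # unsupported tag: drop the tag, keep its text content

    return TAG_RE.sub(convert, markup)


source = ('<h1 class="t">Title</h1><div><strong>Bold</strong> '
          '<em>it</em> <a href="x">l</a></div>')
print(to_flash_html(source))
```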
I've been saving this one for dessert - you'll either love it or hate it, but it would do the trick. Assuming your app is deployed in-browser (if not and I misread you, save me the embarrassment and stop reading right here), you could use an iframe to display your HTML. Seriously.
JS<->AS communication is fairly straightforward and you could have it positioned over a predetermined area of your app, giving the illusion that it's part of it; just remember to set windowmode on the flash object/embed correctly so it does not render on top of other page elements, then increase the iframe z-index.
I would not be surprised if this is seen as an "ugly" approach, but it's beautiful on the inside - you'll end up with verbatim html and real css support. As for user interactions, you could even intercept link clicks etc. in the iframe and request an action from the movieclip.
You can use HTMLPurifier and specify a whitelist of tags that you want to support.
AS3 HTML Parser Library is not quite what I'm looking for, since it does not convert the HTML but instead renders it within Flash, meaning that it won't be editable. But it may be useful in cases where I only want to display and not edit text.
Another option is to look at several samples of HTML that I'd like to be able to display, and then write regexes to convert them to the format Flash/TLF expects. But I feel like that may be a huge endeavor, due to the wide range of HTML out there.
I'm looking to write an algorithm to compress HTML output for a CMS I'm writing in PHP, written with the CodeIgniter framework.
I was thinking of trying to remove whitespace between any angle brackets, except the <script>, <pre>, and <style> elements, and simply ignoring those elements for simplicity. I should clarify that this is whitespace between consecutive tags, with no text between them.
How should I go about parsing the HTML to find the whitespace I want to remove?
Edit:
To start off, I want to remove all tab characters that are not in <pre> tags. This can be done with regex, I'm sure, but what are the alternatives?
Don't. Whitespace is negligible. You're better off using output compression, with zlib for example.
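A quick way to check that claim on your own pages, sketched in Python with the standard gzip module. The sample markup below is synthetic and only there to make the script runnable, so measure real output from your CMS before drawing conclusions.

```python
import gzip


def sizes(markup: str) -> tuple[int, int]:
    """Return (raw byte count, gzip-compressed byte count)."""
    raw = markup.encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Synthetic, heavily indented markup (an assumption for this demo).
indented = "<ul>\n" + "".join(f"\t\t<li>Item {n}</li>\n" for n in range(200)) + "</ul>\n"
stripped = indented.replace("\t", "").replace("\n", "")

raw_a, gz_a = sizes(indented)
raw_b, gz_b = sizes(stripped)
print("indented:", raw_a, "bytes raw,", gz_a, "bytes gzipped")
print("stripped:", raw_b, "bytes raw,", gz_b, "bytes gzipped")
```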
Is there something wrong with the existing HTML minification solutions?
Minify does HTML (as well as CSS and JS).
(That second link goes to the source code, which comments the steps it takes - should be a good leg up if you did want to create your own - it's BSD licensed.)
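If you do roll your own, the tab removal from the question's edit could be sketched like this, in Python for illustration (PHP's preg_split with PREG_SPLIT_DELIM_CAPTURE allows the same trick). The name strip_tabs_outside_pre is made up for this example, and it assumes <pre> blocks are properly closed and not nested.

```python
import re

# One capturing group, so split() keeps each <pre>...</pre> block as its own item.
PRE_BLOCK = re.compile(r"(<pre\b[^>]*>.*?</pre\s*>)", re.IGNORECASE | re.DOTALL)


def strip_tabs_outside_pre(markup: str) -> str:
    pieces = PRE_BLOCK.split(markup)
    # Odd indexes hold the captured <pre> blocks; leave those untouched.
    return "".join(
        piece if index % 2 else piece.replace("\t", "")
        for index, piece in enumerate(pieces)
    )


page = "<div>\n\t<p>a</p>\n\t<pre>\tx\n\ty</pre>\n\t</div>"
print(repr(strip_tabs_outside_pre(page)))
```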
Also, as Pete says, you'll benefit much more by using gzip compression for your HTML (and CSS/JS/etc), and won't get tripped up by problems such as the one Gordon mentioned in his comment.
What text to HTML converter for PHP would you recommend?
One of the examples would be Markdown, which is used here at SO. User just types some text into the text-box with some natural formatting: enters at the end of line, empty line at the end of paragraph, asterisk delimited bold text, etc. And this syntax is converted to HTML tags.
The simplicity is the main feature we are looking for, there does not need to be a lot of possibilities but those basic that are there should be very intuitive (automatic URL conversion to link, emoticons, paragraphs).
A big plus would be if there is a WYSIWYG editor for it. Half-WYSIWYG, just like here at SO, would be even better.
Extra points would be if it would fit with Zend Framework well.
Take your pick at http://en.wikipedia.org/wiki/Lightweight_markup_language.
As for Markdown, there's one PHP parser that I've been using called PHP Markdown, and I especially like the Extra extension.
I have actually taken a stab at extending it with my own (undocumented) features. It's available on GitHub (remember that it's the extra branch I've fixed, not master), if you're interested. I've intended to make it a 'proper fork' for a while, but that's another, largely off-topic, story.
The Zend Framework has a WYSIWYG editor bundled with its Dojo integration.
http://framework.zend.com/manual/en/zend.dojo.form.html#zend.dojo.form.elements.editor
... Bring on the extra points!
There's always Textile. It is widely implemented, and has a few basic similarities with Markdown. However, I have never seen a WYSIWYG editor for Textile.
You might find upflow useful.
If you want WYSIWYG, I'm a big fan of FCKeditor. It converts user input to HTML before submitting the form, not after, but has a nice PHP library for using it, and a PHP connector for handling file uploading/browsing (along with several other languages).
If you want something that can be read as plain-text but output as HTML, I vote for Markdown.
I will stick with my original idea of adopting Texy.
None of the products mentioned here actually beats it. I had a problem with Texy's syntax, but it seems to be quite standard and is present in other products too.
It is very lightweight, supports very natural syntax and has a great "half" WYSIWYG editor, Texyla (the wiki is in Czech only).