Cleaning up HTML formatted content for display within Flash?

I want to display HTML-formatted content from various sources inside a Flash/Flex application. Flash supports HTML formatting in its text fields, but that support is very limited compared to a web browser. Are there any scripts out there that will convert common HTML-formatted text into a format that Flash can handle? My particular use cases are:
Displaying HTML formatted emails inside Flash
Displaying RTF files inside Flash (after running an RTF2HTML conversion on the server)
Displaying random HTML content copied and pasted from other sources into Flash
I'm open to code that runs either on the client or the server, but server is probably preferable.

The subset of html tags supported is quite poor and has not changed in forever:
<a>, <b>, <br>, <font>, <img>, <i>, <li>, <p>, <textformat>, <u>
This means that, regardless of conversion quality, HTML cannot be rendered fully as intended; you could also be giving up a significant portion of CSS styling if you replace unsupported tags with more basic ones.
That being said, http://simplehtmldom.sourceforge.net/ (PHP) would work with some tweaks, and it's competent enough to cope with invalid markup as well (seeing how you're processing content from various sources, I'd say this feature alone would save a lot of pain in the long run). Then replace
<h1>,...,<h6> => <b>
<strong> => <b>
<em> => <i>
and plaintext the rest of it into paragraphs. You'd be surprised at how readable it would still be. You could be a bit fancy too, like so:
<h1> => <b class="header1">
and add some css as appropriate (although flash css support is pretty limited too)
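The replacements listed above can be sketched with nothing but PHP built-ins. This is only a rough illustration, not a robust converter: the function name flashify_html is made up for the example, and it uses regular expressions plus strip_tags() rather than the Simple HTML DOM parser, so it will be less forgiving of badly broken markup.

```php
<?php
// Hypothetical helper: reduce arbitrary HTML to the tag subset that
// Flash text fields accept (the list quoted earlier in this answer).
function flashify_html(string $html): string
{
    // Map common semantic tags onto the closest supported tag.
    // Attributes on the replaced tags are dropped.
    $map = [
        '~<(/?)h[1-6]\b[^>]*>~i' => '<${1}b>',
        '~<(/?)strong\b[^>]*>~i' => '<${1}b>',
        '~<(/?)em\b[^>]*>~i'     => '<${1}i>',
    ];
    $html = preg_replace(array_keys($map), array_values($map), $html);

    // Remove every tag outside the supported subset; the text between
    // the removed tags is kept.
    return strip_tags($html, '<a><b><br><font><img><i><li><p><textformat><u>');
}

echo flashify_html('<h1 class="t">Title</h1><div><strong>Hi</strong> <em>there</em></div>');
```

Note that strip_tags() keeps the text inside removed elements, so the contents of script or style blocks would survive as plain text unless they are removed beforehand.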
I've been saving this one for dessert - you'll either love it or hate it, but it would do the trick. Assuming your app is deployed in-browser (if not and I misread you, save me the embarrassment and stop reading right here), you could use an iframe to display your HTML, seriously.
JS<->AS communication is fairly straightforward and you could have it positioned over a predetermined area of your app, giving the illusion that it's part of it; just remember to set wmode on the Flash object/embed correctly so that the movie does not render on top of other page elements, then increase the iframe z-index.
I would not be surprised if this is seen as an "ugly" approach, but it's beautiful on the inside - you'll end up with verbatim html and real css support. As for user interactions, you could even intercept link clicks etc. in the iframe and request an action from the movieclip.

You can use HTMLPurifier and specify a whitelist of tags that you want to support.

AS3 HTML Parser Library is not quite what I'm looking for, since it does not convert the HTML but instead renders it within Flash, meaning that it won't be editable. But it may be useful in cases where I only want to display and not edit text.
Another option is to look at several samples of HTML that I'd like to be able to display, and then write regexes to convert them to the format Flash/TLF expects. But I feel like that may be a huge endeavor, due to the wide range of HTML out there.

Related

HTML escape only some characters and combination of characters?

I'm managing a blog where a select few people can submit their own articles and entries. I want them to be able to embed video via HTML (and bold, italicize, etc text at their choosing). How do I do this while maintaining site security?
If I don't HTML escape the actual article space, an open comment will ruin my site. Is there a way to selectively escape some combination of characters?
edit; hopefully without writing my own parser. I just want simple things like <b>, <i>, etc tags unescaped, as well as video and link embedding.

I use what SO uses. It is open source and has parsers for many languages.
The name is WMD and the question "Where's the WMD editor open source project?" has some QA material outlining this editor.
The question "running showdown.js serverside to conver Markdown to HTML (in PHP)" has some QA material outlining some Markdown libraries in PHP.

You can safely HTML escape everything. URLs for your videos will be unaffected by whatever escaping you want to do.

The simplest way to do this that most sites (such as SO) use is to introduce your own special markup, which is then translated into the features that you want.
For example, SO uses asterisks (*) to italicize and (**) bold (Edit: next to the HTML tags <b></b> itself, see source of this answer).
Other sites use [b] and [i] tags. You could have a [video=http://myvideo.com] tag, which your PHP then translates into the appropriate HTML markup.
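A minimal sketch of that idea, assuming the [b], [i] and [video=URL] syntax from the example above. The function name render_markup and the choice of a <video> element for the embed are assumptions made for illustration; a real site would emit whatever embed code its video host requires.

```php
<?php
// Hypothetical renderer: escape everything, then re-enable a tiny
// whitelist of custom tags.
function render_markup(string $text): string
{
    // Escape first, so any raw HTML in the submission is inert.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Only complete pairs are translated, so the output stays balanced.
    $rules = [
        '~\[b\](.*?)\[/b\]~s' => '<b>$1</b>',
        '~\[i\](.*?)\[/i\]~s' => '<i>$1</i>',
        // Accept http(s) URLs only; quotes were already escaped above,
        // so the URL cannot break out of the attribute.
        '~\[video=(https?://[^\]\s]+)\]~i' => '<video src="$1" controls></video>',
    ];

    return preg_replace(array_keys($rules), array_values($rules), $safe);
}

echo render_markup('[b]Hi[/b] <script>');
```

Because the escaping happens before the translation, anything that is not one of the whitelisted tags reaches the page as harmless text.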

Formatting HTML correctly from cURL requests

I am working on an applet that allows the user to input a URL to a news article or other webpage (in Japanese) and view the contents of that page within an iFrame in my page. The idea is that once the content is loaded into the page, the user can highlight words using their cursor, which stores the selected text in an array (for translating/adding to a personal dictionary of terms) and surrounds the text in a red box (div) according to a stylesheet defined on my domain. To do this, I use cURL to retrieve the HTML of the external page and dump it into the source of the iFrame.
However, I keep running into major formatting problems with the retrieved HTML. The big problem is preserving style sheets, and to fix this, I've used DOMDocument to add tags to the section of the retrieved HTML. This works for some pages/URLs, but there are still lots of style problems with the output HTML for many others. For example, div layers crash into each other, alignments are off, and backgrounds are missing. This is made a bit more problematic as I need to embed the output HTML into a new <div> in order to make the onClick JavaScript function for passing text selections in the embedded content work, which means the resulting source ends up looking like this:
<div onclick="parent.selectionFunction()" id ="studyContentn">

</div>
It seems like for the most part a lot of the formatting issues I keep running into are largely arbitrary. I've tried using php Tidy to clean output from HTML, but that also only works for some pages but not many others. I've got a slight suspicion it may have to do with CDATA declarations that get parsed oddly when working with DOMDocument, but I am not certain.
Is there a way I can guarantee that HTML output from cURL will be rendered correctly and faithfully in all instances? Or is there perhaps a better way of going about doing this? I've tried a bunch of different ways of approaching this issue, and each gets closer to a solution but brings its own new problems as well.
Thanks -- let me know if I can clarify anything.

If I understand correctly, you are trying to pull the HTML of a complete web page and display it under your domain, in your HTML. This is always going to be tricky: a lot of JavaScript will break, relative URLs will be wrong and, as you mentioned, styles as well. You're probably also changing the dimensions that the page is displayed in. These can all be worked around, but you're going to be fighting an uphill battle with each new site, or whenever a current site changes its design.
I'd probably take a different approach to the problem. You might want to write a browser plugin as the interface to the external web site instead. Then your applet can sit on top of the functional and tested (hopefully) site. Then you can focus on what you need to do for your applet rather than a never ending list of fiddly html issues.
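If you do stay with the cURL approach, the relative-URL problem mentioned above is one of the easier ones to work around: a <base> element in the fetched page makes the browser resolve relative links, images and style sheets against the original site. A rough sketch with DOMDocument follows; the function name add_base_href and the example URL are placeholders, not anything from the question.

```php
<?php
// Hypothetical helper: insert <base href="..."> as the first child of
// <head> so relative URLs resolve against the page's original address.
function add_base_href(string $html, string $pageUrl): string
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid; collect parser warnings
    // instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $head = $doc->getElementsByTagName('head')->item(0);
    if ($head === null) {
        // No <head> in the source: create one at the top of <html>.
        $head = $doc->createElement('head');
        $root = $doc->documentElement;
        $root->insertBefore($head, $root->firstChild);
    }

    $base = $doc->createElement('base');
    $base->setAttribute('href', $pageUrl);
    $head->insertBefore($base, $head->firstChild);

    return $doc->saveHTML();
}

echo add_base_href(
    '<html><head><title>t</title></head><body><a href="/a">x</a></body></html>',
    'http://example.com/'
);
```

This does nothing for broken scripts or changed dimensions. Character encoding is also a separate concern: loadHTML() falls back to ISO-8859-1 when the page does not declare its encoding, which matters for Japanese content.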

I am trying to do a similar thing. It is very difficult to preserve the formatting, and the JS scripts in the web page complicate things. I finally gave up on the idea of displaying the original format completely, and used a workaround instead:
Select only the headers, links, lists and paragraphs you are interested in.
Add the domain path of your own site to links.
You may wrap the headers, links, etc. in your own class.
Display it.
In your case you want to select text and store it, which is another topic. What I did was parse the HTML in two levels, after which the selection is easy. Keep in mind that IE and Firefox/Chrome need to be dealt with separately.
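The workaround steps above could look roughly like this with DOMDocument and DOMXPath. Everything named here (extract_items, the study-item class, the $siteRoot parameter) is invented for the sketch, and "add the domain path to links" is interpreted as prefixing root-relative links with a site root passed in by the caller.

```php
<?php
// Hypothetical sketch: keep only headings, paragraphs, list items and
// links, and re-emit each one wrapped in our own class.
function extract_items(string $html, string $siteRoot): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $items = [];

    foreach ($xpath->query('//h1 | //h2 | //h3 | //p | //li | //a[@href]') as $node) {
        $text = trim($node->textContent);
        if ($text === '') {
            continue;
        }
        if ($node->nodeName === 'a') {
            $href = $node->getAttribute('href');
            // Prefix root-relative links ("/about") with the site root.
            if (strpos($href, '/') === 0 && strpos($href, '//') !== 0) {
                $href = rtrim($siteRoot, '/') . $href;
            }
            $items[] = sprintf('<a class="study-item" href="%s">%s</a>',
                htmlspecialchars($href), htmlspecialchars($text));
        } else {
            $items[] = sprintf('<%1$s class="study-item">%2$s</%1$s>',
                $node->nodeName, htmlspecialchars($text));
        }
    }

    return $items;
}

print_r(extract_items(
    '<html><body><div><a href="/about">About</a></div><h1>Title</h1><p>Body text</p></body></html>',
    'http://example.com/'
));
```

A link nested inside a paragraph is reported twice here (once as part of the paragraph text, once as a link), so a real version would need to decide which of the two it wants.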

CKEditor output rules

Is there a way to add rules for the changes ckeditor makes to html?
For example, I would like to use <br /> instead of having it output <p>&nbsp;</p>, to not wrap <style></style> in <p> tags, and to have it not modify the white space and leave all the carriage returns as they were entered.
Most of all I'm looking for some way to allow php to be added. The CMS I am using it on needs php on some pages. I write all the code but the client has the ability to go in and edit the text, but she doesn't know html, hence ckeditor, and changes pages with php in it over to ckeditor sometimes and it completely garbles the code.
Is there any way to do any of this?

CKEditor offers a powerful and flexible output formatting system. It
gives developers full control over what the HTML code produced by the
editor will look like.
http://docs.cksource.com/CKEditor_3.x/Developers_Guide/Output_Formatting
Most of all I'm looking for some way to allow php to be added
PHP can be added; you just need to open the file in a plain textarea tag for writing and make sure it's handled properly when saving. If the content is held in a database you could use eval(), but that is not recommended.
http://php.net/manual/en/function.eval.php
If your client does not understand basic HTML, then opening up the page to more syntax errors will only cause you a greater headache.

Find important text in arbitrary HTML using PHP?

I have some random HTML layouts that contain important text I would like to extract. I cannot just strip_tags() as that will leave a bunch of extra junk from the sidebar/footer/header/etc.
I found a method built in Python and I was wondering if there is anything like this in PHP.
The concept is rather simple: use
information about the density of text
vs. HTML code to work out if a line of
text is worth outputting. (This isn’t
a novel idea, but it works!) The basic
process works as follows:
Parse the HTML code and keep track of the number of bytes processed.
Store the text output on a per-line, or per-paragraph basis.
Associate with each text line the number of bytes of HTML required to
describe it.
Compute the text density of each line by calculating the ratio of text
to bytes.
Then decide if the line is part of the content by using a neural network.
You can get pretty good results just
by checking if the line’s density is
above a fixed threshold (or the
average), but the system makes fewer
mistakes if you use machine learning -
not to mention that it’s easier to
implement!
Update: I started a bounty for an answer that could pull main content from a random HTML template. Since I can't share the documents I will be using - just pick any random blog sites and try to extract the body text from the layout. Remember that the header, sidebar(s), and footer may contain text also. See the link above for ideas.
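For reference, the fixed-threshold variant of the density idea quoted above is short enough to sketch in PHP. This is not a port of the Python script: the function name dense_lines, the 0.5 threshold and the 25-character minimum are all assumptions picked for illustration, and there is no neural network involved.

```php
<?php
// Hypothetical sketch: keep lines whose visible text makes up a large
// share of the bytes needed to describe them.
function dense_lines(string $html, float $threshold = 0.5, int $minChars = 25): array
{
    $kept = [];

    foreach (preg_split('~\R~', $html) as $line) {
        $bytes = strlen(trim($line));
        if ($bytes === 0) {
            continue;
        }

        // Visible text on this line, with tags removed and entities decoded.
        $text = trim(html_entity_decode(strip_tags($line), ENT_QUOTES, 'UTF-8'));

        // Ratio of text bytes to total bytes for the line.
        $density = strlen($text) / $bytes;

        if (strlen($text) >= $minChars && $density >= $threshold) {
            $kept[] = $text;
        }
    }

    return $kept;
}

$html = '<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>' . "\n"
      . '<p>This paragraph carries the actual article text that we want to keep.</p>' . "\n"
      . '<div class="footer"><span>(c) 2010</span></div>';

print_r(dense_lines($html));
```

Working line by line is crude: it depends on how the source happens to be wrapped, and tags that span several lines will confuse strip_tags(). Splitting on block-level elements instead of newlines would be the obvious next step.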

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.
UPDATE 2
DEMO: http://so.lucafilosofi.com/find-important-text-in-arbitrary-html-using-php/
Tested on a random list of blogs taken from Technorati Top 100 and Best Blogs of 2010:
many blogs make use of a CMS;
the blogs' HTML structure is the same most of the time.
Avoid common selectors like #sidebar, #header, #footer, #comments, etc.
Avoid any widget by tag name: script, iframe.
Clear well-known content like:
/\d+\scomment(?:[s])/im
/(read the rest|read more).*/im
/(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im
/[^a-z0-9]+/im
Search for well-known classes and ids:
typepad.com .entry-content
wordpress.org .post-entry .entry .post
movabletype.com .post
blogger.com .post-body .entry-content
drupal.com .content
tumblr.com .post
squarespace.com .journal-entry-text
expressionengine.com .entry
gawker.com .post-body
Ref: The blog platforms of choice among the top 100 blogs
$selectors = array('.post-body','.post','.journal-entry-text','.entry-content','.content');
// find() expects a single selector string, so join the list with commas
$doc = phpQuery::newDocumentFile('http://blog.com')->find(implode(',', $selectors))->children('p,div');
Search based on a common HTML structure that looks like this:
<div>
<h1|h2|h3|h4|a />
<p|div />
</div>
$doc = phpQuery::newDocumentFile('http://blog.com')->find('h1,h2,h3,h4')->parent()->children('p,div');

DOMDocument can be used to parse HTML documents, which can then be queried through PHP.
Edit: wikied

I worked on a similar project a while back. It's not as complex as the Python script but it will do a good job. Check out the PHP Simple HTML DOM Parser
http://simplehtmldom.sourceforge.net/

Depending on your HTML structure, and if you have ids or classes in place, you can get a little more complicated and use preg_match() to get the information between a specific start and end tag. This means you need to know how to write regular expressions.
You can also look into a browser emulation PHP class. I've done this for page scraping and it works well enough depending on how well formatted the DOM is. I personally like SimpleBrowser
http://www.simpletest.org/api/SimpleTest/WebTester/SimpleBrowser.html

I have developed a HTML parser and filter PHP package that can be used for that purpose.
It consists of a set of classes that can be chained together to perform a series of parsing, filtering and transformation operations in HTML/XML code.
It was meant to deal with real-world pages, so it can cope with malformed tags and data structures and preserve as much of the original document as possible.
One of the filter classes it comes with can do DTD validation. Another can discard insecure HTML tags and CSS to prevent XSS attacks. Another can simply extract all document links.
All those filter classes are optional. You can chain them together the way you want, if you need any at all.
So, to solve your problem, I do not think there is already a specific solution for that in PHP anywhere, but a special filter class could be developed for it. Take a look at the package. It is thoroughly documented.
If you need help, just check my profile and mail me; I may even develop a filter that does exactly what you need, possibly inspired by solutions that exist for other languages.

Is there a way to decode html e-mails?

I am writing support software and I figured for highlighting stuff it would be great to have HTML support.
Looking at Outlook's "HTML" I want to curl up into the fetal position and cry!
Is there a PHP class to unscramble HTML emails down to basic HTML? I don't want to display the e-mails in a frame because I want to work with the data and analyse it. I also don't want to support stupid things like changing the font; since it's a webapp, I want my webapp to say what the font is and not have some hippie who sends the support team e-mails in Comic Sans and yellow. I want to support bold, italic, underlined, stretched out and lists (http://dl.getdropbox.com/u/5910/Jing/2009-02-23_2100.png).
I also don't quite know the difference between rich-text and html since I always thought rich-text only allowed the functions I wanted but I seem to be able to do everything in rich-text which I can do in Html.
Also, I should add that I am using the Zend Framework because of the fabulous Zend_Mail.

You can pipe it through htmltidy and then further filter it with something like HtmlPurifier, but of course you may strip out something that is essential to understanding the contents. That's the problem with a visual format, like html.

You can use PHP's strip_tags() function and its optional "allowable_tags" parameter. This will allow you to strip out all the tags that are not <em> <b> <strong> <u> etc.
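For example, something along these lines, where the wrapper name basic_formatting and the exact tag list are just an illustration based on what the question asks to keep (bold, italic, underline, lists):

```php
<?php
// Hypothetical wrapper: keep only basic formatting tags.
function basic_formatting(string $html): string
{
    // The second argument lists the tags that survive; all other tags
    // are removed while their text content is kept.
    return strip_tags($html, '<b><strong><i><em><u><ul><ol><li>');
}

$mail = '<p class="MsoNormal"><span style="color:yellow"><b>Urgent:</b> printer <em>still</em> broken</span></p>';
echo basic_formatting($mail);
```

One caveat: strip_tags() leaves the attributes of allowed tags untouched, so style or onclick attributes on a kept tag pass straight through. It tidies up formatting but is not a security filter on its own.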
About RTF vs. HTML, my understanding is that when Outlook and Exchange communicate with non-RTF compliant systems they convert RTF to HTML. I'm not sure this is always true, or how consistent that function is, but that might explain why messages sent RTF appear to be HTML.

I'm pretty sure you'll have to write your own class... there is no real class like that in the PHP documents I've seen..

Or you could use the plain-text variant attached to the e-mail. If there is no plain-text variant you could use a stripped version of the html. I think using these steps you would have a nice result:
Remove newlines
Turn </p> and <br/> into newline
Strip all html tags
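Those three steps translate fairly directly into PHP. This is a sketch under a few assumptions: the function name html_to_plain is made up, newlines in the source are replaced by spaces rather than deleted so that words do not run together, and entity decoding plus whitespace cleanup are added on top of the steps listed.

```php
<?php
// Hypothetical helper following the three steps above.
function html_to_plain(string $html): string
{
    // 1. Remove newlines from the source (replaced by spaces so words
    //    on adjacent lines do not get glued together).
    $html = str_replace(["\r", "\n"], ' ', $html);

    // 2. Turn </p> and <br>, <br/>, <br /> into newlines.
    $html = preg_replace('~</p\s*>|<br\s*/?>~i', "\n", $html);

    // 3. Strip all remaining tags.
    $text = strip_tags($html);

    // Extra cleanup: decode entities and collapse stray whitespace.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = preg_replace('~[ \t]+~', ' ', $text);
    $text = preg_replace('~ ?\n ?~', "\n", $text);

    return trim($text);
}

echo html_to_plain("<p>Hello\nworld</p><p>Second<br/>line</p>");
```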

Pulling out the HTML from an Outlook mail may seem scary at first, but it's only HTML tags - just a whole lot of them!
So if you just locate a "<" and then find the next ">", you have a tag. If it is not something you want to keep, like "</strong>", just throw it away and repeat. Simple as that.
(I have done exactly this in a spelling and grammar checker which not only pulls out plain text from Outlook and checks it - it can then push all the user's changes back into the HTML without destroying any tags. The latter was not easy, though! ;-)
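The "find a <, find the next >, keep or discard" loop described above might look like this in PHP. It is only a sketch: keep_tags and its $keep list are hypothetical names, and the scanner does not handle a ">" inside attribute values, comments or script blocks.

```php
<?php
// Hypothetical scanner: walk the string tag by tag and drop every tag
// whose name is not in $keep. Text between tags is always kept.
function keep_tags(string $html, array $keep): string
{
    $out = '';
    $pos = 0;
    $len = strlen($html);

    while ($pos < $len) {
        $lt = strpos($html, '<', $pos);
        if ($lt === false) {
            $out .= substr($html, $pos);    // no more tags: keep the tail
            break;
        }
        $out .= substr($html, $pos, $lt - $pos);

        $gt = strpos($html, '>', $lt);
        if ($gt === false) {
            break;                          // unterminated tag: drop the rest
        }

        $tag = substr($html, $lt, $gt - $lt + 1);
        if (preg_match('~^</?([a-z][a-z0-9]*)~i', $tag, $m)
            && in_array(strtolower($m[1]), $keep, true)) {
            $out .= $tag;                   // wanted tag: copy it unchanged
        }
        $pos = $gt + 1;
    }

    return $out;
}

echo keep_tags('<p class="MsoNormal"><span style="x">Hi <b>there</b></span></p>', ['b', 'p']);
```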
