How to parse HTML for minification in PHP?

How to parse HTML for minification in PHP? - php

I'm looking to write an algorithm to compress HTML output for a CMS I'm writing in PHP, written with the CodeIgniter framework.
I was thinking of trying to remove whitespace between any angle brackets, except the <script>, <pre>, and <style> elements, and simply ignoring those elements for simplicity. I should clarify that this is whitespace between consecutive tags, with no text between them.
How should I go about parsing the HTML to find the whitespace I want to remove?
Edit:
To start off, I want to remove all tab characters that are not in <pre> tags. This can be done with regex, I'm sure, but what are the alternatives?

Don't. Whitespace is negligible. Better to be using output compression, with zlib or here for example

Is there something wrong with the existing HTML minification solutions?
Minify does HTML (as well as CSS and JS).
(That second link goes to the source code, which comments the steps it takes - should be a good leg up if you did want to create your own - it's BSD licensed.)
Also, as Pete says, you'll benefit much more by using gzip compression for your HTML (and CSS/JS/etc), and wont get tripped up by problems such as Gordon mentioned in his comment.

Related

Use WYSIWYG Editor with PHP escape Method

I am building a small/test CMS using Php and Mysql.
Everything is working amazingly on the adding, editing, deleting and displaying level, but after finishing my code, I wanted to add a WYSIWYG editor in the Admin back end.
My problem is that I am using escape method to hopefully make my form a bit more secure and try to escape injections, therefore when adding a styled text, image or any other HTML code in my Editor I am getting them printed as line codes on my page(Which is completely right to avoid attacks).
MY ESCAPE METHOD:
function e($text) {
return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');}
Is there any way to work around my escape method (which is think it should not be done because if I can do it every attacker could).
Or should I change my escape method to another method?

If I understand you correctly you are going to allow your users to put some formatting into the text they are going to create. For this you are going to add some WYSISWYG editor. But the question is how to distinguish the formatting and special characters which are allowed from what is not allowed. You need to clean up the text and leave only valid allowed formatting (HTML tags) and remove all malicious JavaScript or HTML.
This is not an easy task like it might sound at the first moment. I can see several approaches here.
Easiest solution to use strip_tags and specify what tags are allowed.
But please keep in mind that strip_tags is not perfect. Let me quote the manual here.
Because strip_tags() does not actually validate the HTML, partial or
broken tags can result in the removal of more text/data than expected.
This function does not modify any attributes on the tags that you
allow using allowable_tags, including the style and onmouseover
attributes that a mischievous user may abuse when posting text that
will be shown to other users.
This is a known issue. And libraries exist which do a better cleanup of HTML and JS to prevent breaks.
A bit more complicated solution would be to use some advanced library to cleanup the HTML code. For example this might be HTML Purifier
Quote from the documentation
HTML Purifier will not only remove all malicious code (better known as
XSS) with a thoroughly audited, secure yet permissive whitelist, it
will also make sure your documents are standards compliant, something
only achievable with a comprehensive knowledge of W3C's
specifications.
The other libraries exist which solve the same task. You can check for example this article where libraries are compared. And finally you might choose the best one.
Completely different approach is to avoid users from writing HTML tags. Ask them to write some other markup instead like this is done on StackOverflow or Basecamp or GitHub. Markdown might be a good approach.
Using simple markup for text allows you to complete avoid issues with broken HTML and JavaScript cause you can escape everything and build HTML markup on your own.
The editor might look like the one I'm using to write this message :)

You can use strip_tags() to remove the unwanted tags. Read about it on this manual:
http://php.net/manual/en/function.strip-tags.php
Example 1 (Based on the manual)
<?php
$text = '<p>Test paragraph, With link.</p>';
# Output: Test paragraph, With link. (Tags are stripped)
echo strip_tags($text);
echo "\n";
# Allow <p> and <a>
#Output: <p>Test paragraph, With link.</p>
echo strip_tags($text, '<p><a>');
?>
I hope this will help you!

Strip only valid html

I'm trying to strip HTML tags from a piece of text. However the trouble is that whatever I use - regex, strip_tags etc.. Comes up across the same problem: It will also strip text which is not HTML but looks like it.
Some <foo#bar.com> Content--> Some Content
Some <Content which looks like this --> Some
Is there a way I can get around this?

A fully correct solution would be a full-fledged HTML parser. See this legendary question for a full discussion.
A simple 80% solution would be to look for all known tags and strip them.
RegExp('</?(a|b|blockquote|cite|dd|dl|dt|...|u)\b.*?>')
The code would be more readable if you use an array of tags and build expressions as you loop through them. It will not handle comments nicely, so if you need more than hack quality, don't do it with a hack approach. If you need correctness, use an actual HTML parser (e.g. DOMDocument in PHP).

Have you tried the HTML purifier library? You can configure it to strip all tags out, I've found the library very reliable.

exploding <table>, <div> from sql content

I have a line in my template that I need help fixing.
If a user posts opening <div> tags or closing </div>, or any other structure related HTML tags into the content, it will result in a template mess on the page.
I'm using htmlentities on titles, and other forms. Unfortunately, I can't do that here, because the content field has a rich editor, and I need to keep the text styling tags intact (<b>, <u>, colors, <i> and such).
Right now, it's very easy for users to mess up the template on purpose and I want to prevent that.

The best way is to throw the text into a DOM parser and have it sort out any mess. Such a tool will probably be more robust than anything you can put together, including solving problems you didn't know you might have.

Now that you've clarified in the comments what the actual problem is, I may be able to help. The issue is basically that you don't want long words entered by the user to break the page layout.
The PHP wordwrap solution you've come up with already has numerous problems, of which the one you've found (breaking your HTML) is the most obvious.
However, since the issue is purely a question of not wanting to allow long words to break your page layout, there are several other solutions that could be used instead.
Do something specific to any excessively long words, rather than to the whole text. You could add manual breaks to them, spaces, or the HTML <wbr> tag (which is an optional line-break, ie for hyphenation purposes). Or you could even just block users from entering crazy long words in the first place.
CSS overflow-x:hidden
Using this, any text overflowing out of the side of the box will simply be hidden, rather than be printed on top of other parts of the page.
CSS word-wrap
There are several ways of doing this, and it gets a little tricky because of varying browser support. But here's a link that explains it all: http://blog.kenneth.io/blog/2012/03/04/word-wrapping-hypernation-using-css/
Hope that helps.

HTML escape only some characters and combination of characters?

I'm managing a blog where a select few people can submit their own articles and entries. I want them to be able to embed video via HTML (and bold, italicize, etc text at their choosing). How do I do this while maintaining site security?
If I don't HTML escape the actual article space, an open comment will ruin my site. Is there a way to selectively escape some combination of characters?
edit; hopefully without writing my own parser. I just want simple things like <b>, <i>, etc tags unescaped, as well as video and link embedding.

I use what SO uses. it is opensource and has parsers for many languages.
The name is WMD and the question "Where's the WMD editor open source project?" has some QA material outlining this editor.
The question "running showdown.js serverside to conver Markdown to HTML (in PHP)" has some QA material outlining some Markdown libraries in PHP.

You can safely HTML escape everything. URL's for your videos will be unaffected by whatever escaping you want to do.

The simplest way to do this that most sites (such as SO) use is to introduce your own special markup, which is then translated into the features that you want.
For example, SO uses asterisks (*) to italicize and (**) bold (Edit: next to the HTML tags <b></b> itself, see source of this answer).
Other sites use [b] and [i] tags. You could have a [video=http://myvideo.com] tag, which your PHP then translates into the appropriate HTML entity.

Is there a way to decode html e-mails?

I am writing support software and I figured for highlighting stuff it would be great to have HTML support.
Looking at Outlooks "HTML" I want to crawl up into the fetal position and cry!
Is there a php class to unscramble HTML emails to support basic HTML? I don't want to display the E-Mails in a frame because I want to work with the data and analyse it. I also don't want to support stupid things like changing font since its a webapp I want my webapp to say what the font is and not have some hippie who sends the support team e-mails in comic sans and yellow color. I want to support bold, italic, underlined, streched out and lists (http://dl.getdropbox.com/u/5910/Jing/2009-02-23_2100.png).
I also don't quite know the difference between rich-text and html since I always thought rich-text only allowed the functions I wanted but I seem to be able to do everything in rich-text which I can do in Html.
Also I should add I am using the Zend Framework because of the fabulous Zend_Mail

You can pipe it through htmltidy and then further filter it with something like HtmlPurifier, but of course you may strip out something that is essential to understanding the contents. That's the problem with a visual format, like html.

You can use PHP's strip_tags() function, and it's optional "allowable_tags" parameter. This will allow you to strip out all the tags that are not <em> <b> <strong> <u> etc.
About RTF vs. HTML, my understanding is that when Outlook and Exchange communicate with non-RTF compliant systems they convert RTF to HTML. I'm not sure this is always true, or how consistent that function is, but that might explain why messages sent RTF appear to be HTML.

I'm pretty sure you'll have to write your own class... there is no real class like that in the PHP documents I've seen..

Or you could use the plain-text variant attached to the e-mail. If there is no plain-text variant you could use a stripped version of the html. I think using these steps you would have a nice result:
Remove newlines
Turn </p> and <br/> into newline
Strip all html tags

Pulling out the HTML from an Outlook mail may seem scary at first, but it's only HTML tags - just a whole lot of them!
So if you just locate to a "<" and then find the next ">" you have a tag. If it is not something you want to have, like "</strong>" just throw it away and repeat Simple as that.
(I have done exactly this in a spelling and grammar checker which not only pulls out plain text from Outlook and checks it - it can then push all the user's changes back into the HTML without destroying any tags. The latter was not easy, though! ;-)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.