How to minify php page html output? - php

I am looking for a php script or class that can minify my php page html output like google page speed does.
How can I do this?

CSS and Javascript
Consider the following link to minify Javascript/CSS files: https://github.com/mrclay/minify
HTML
Tell Apache to deliver HTML with GZip - this generally reduces the response size by about 70%. (If you use Apache, the module configuring gzip depends on your version: Apache 1.3 uses mod_gzip while Apache 2.x uses mod_deflate.)
Accept-Encoding: gzip, deflate
Content-Encoding: gzip
Use the following snippet to remove white-spaces from the HTML with the help ob_start's buffer:
<?php
function sanitize_output($buffer) {
$search = array(
'/\>[^\S ]+/s', // strip whitespaces after tags, except space
'/[^\S ]+\</s', // strip whitespaces before tags, except space
'/(\s)+/s', // shorten multiple whitespace sequences
'/<!--(.|\s)*?-->/' // Remove HTML comments
);
$replace = array(
'>',
'<',
'\\1',
''
);
$buffer = preg_replace($search, $replace, $buffer);
return $buffer;
}
ob_start("sanitize_output");
?>

Turn on gzip if you want to do it properly. You can also just do something like this:
$this->output = preg_replace(
array(
'/ {2,}/',
'/<!--.*?-->|\t|(?:\r?\n[ \t]*)+/s'
),
array(
' ',
''
),
$this->output
);
This removes about 30% of the page size by turning your html into one line, no tabs, no new lines, no comments. Mileage may vary

This work for me.
function minify_html($html)
{
$search = array(
'/(\n|^)(\x20+|\t)/',
'/(\n|^)\/\/(.*?)(\n|$)/',
'/\n/',
'/\<\!--.*?-->/',
'/(\x20+|\t)/', # Delete multispace (Without \n)
'/\>\s+\</', # strip whitespaces between tags
'/(\"|\')\s+\>/', # strip whitespaces between quotation ("') and end tags
'/=\s+(\"|\')/'); # strip whitespaces between = "'
$replace = array(
"\n",
"\n",
" ",
"",
" ",
"><",
"$1>",
"=$1");
$html = preg_replace($search,$replace,$html);
return $html;
}

I've tried several minifiers and they either remove too little or too much.
This code removes redundant empty spaces and optional HTML (ending) tags. Also it plays it safe and does not remove anything that could potentially break HTML, JS or CSS.
Also the code shows how to do that in Zend Framework:
class Application_Plugin_Minify extends Zend_Controller_Plugin_Abstract {
public function dispatchLoopShutdown() {
$response = $this->getResponse();
$body = $response->getBody(); //actually returns both HEAD and BODY
//remove redundant (white-space) characters
$replace = array(
//remove tabs before and after HTML tags
'/\>[^\S ]+/s' => '>',
'/[^\S ]+\</s' => '<',
//shorten multiple whitespace sequences; keep new-line characters because they matter in JS!!!
'/([\t ])+/s' => ' ',
//remove leading and trailing spaces
'/^([\t ])+/m' => '',
'/([\t ])+$/m' => '',
// remove JS line comments (simple only); do NOT remove lines containing URL (e.g. 'src="http://server.com/"')!!!
'~//[a-zA-Z0-9 ]+$~m' => '',
//remove empty lines (sequence of line-end and white-space characters)
'/[\r\n]+([\t ]?[\r\n]+)+/s' => "\n",
//remove empty lines (between HTML tags); cannot remove just any line-end characters because in inline JS they can matter!
'/\>[\r\n\t ]+\</s' => '><',
//remove "empty" lines containing only JS's block end character; join with next line (e.g. "}\n}\n</script>" --> "}}</script>"
'/}[\r\n\t ]+/s' => '}',
'/}[\r\n\t ]+,[\r\n\t ]+/s' => '},',
//remove new-line after JS's function or condition start; join with next line
'/\)[\r\n\t ]?{[\r\n\t ]+/s' => '){',
'/,[\r\n\t ]?{[\r\n\t ]+/s' => ',{',
//remove new-line after JS's line end (only most obvious and safe cases)
'/\),[\r\n\t ]+/s' => '),',
//remove quotes from HTML attributes that does not contain spaces; keep quotes around URLs!
'~([\r\n\t ])?([a-zA-Z0-9]+)="([a-zA-Z0-9_/\\-]+)"([\r\n\t ])?~s' => '$1$2=$3$4', //$1 and $4 insert first white-space character found before/after attribute
);
$body = preg_replace(array_keys($replace), array_values($replace), $body);
//remove optional ending tags (see http://www.w3.org/TR/html5/syntax.html#syntax-tag-omission )
$remove = array(
'</option>', '</li>', '</dt>', '</dd>', '</tr>', '</th>', '</td>'
);
$body = str_ireplace($remove, '', $body);
$response->setBody($body);
}
}
But note that when using gZip compression your code gets compressed a lot more that any minification can do so combining minification and gZip is pointless, because time saved by downloading is lost by minification and also saves minimum.
Here are my results (download via 3G network):
Original HTML: 150kB 180ms download
gZipped HTML: 24kB 40ms
minified HTML: 120kB 150ms download + 150ms minification
min+gzip HTML: 22kB 30ms download + 150ms minification

All of the preg_replace() solutions above have issues of single line comments, conditional comments and other pitfalls. I'd recommend taking advantage of the well-tested Minify project rather than creating your own regex from scratch.
In my case I place the following code at the top of a PHP page to minify it:
function sanitize_output($buffer) {
require_once('min/lib/Minify/HTML.php');
require_once('min/lib/Minify/CSS.php');
require_once('min/lib/JSMin.php');
$buffer = Minify_HTML::minify($buffer, array(
'cssMinifier' => array('Minify_CSS', 'minify'),
'jsMinifier' => array('JSMin', 'minify')
));
return $buffer;
}
ob_start('sanitize_output');

Create a PHP file outside your document root. If your document root is
/var/www/html/
create the a file named minify.php one level above it
/var/www/minify.php
Copy paste the following PHP code into it
<?php
function minify_output($buffer){
$search = array('/\>[^\S ]+/s','/[^\S ]+\</s','/(\s)+/s');
$replace = array('>','<','\\1');
if (preg_match("/\<html/i",$buffer) == 1 && preg_match("/\<\/html\>/i",$buffer) == 1) {
$buffer = preg_replace($search, $replace, $buffer);
}
return $buffer;
}
ob_start("minify_output");?>
Save the minify.php file and open the php.ini file. If it is a dedicated server/VPS search for the following option, on shared hosting with custom php.ini add it.
auto_prepend_file = /var/www/minify.php
Reference: http://websistent.com/how-to-use-php-to-minify-html-output/

you can check out this set of classes:
https://code.google.com/p/minify/source/browse/?name=master#git%2Fmin%2Flib%2FMinify , you'll find html/css/js minification classes there.
you can also try this: http://code.google.com/p/htmlcompressor/
Good luck :)

You can look into HTML TIDY - http://uk.php.net/tidy
It can be installed as a PHP module and will (correctly, safely) strip whitespace and all other nastiness, whilst still outputting perfectly valid HTML / XHTML markup. It will also clean your code, which can be a great thing or a terrible thing, depending on how good you are at writing valid code in the first place ;-)
Additionally, you can gzip the output using the following code at the start of your file:
ob_start('ob_gzhandler');

First of all gzip can help you more than a Html Minifier
With nginx:
gzip on;
gzip_disable "msie6";
gzip_vary on;
gzip_proxied any;
gzip_comp_level 6;
gzip_buffers 16 8k;
gzip_http_version 1.1;
gzip_types text/plain text/css application/json application/x-javascript text/xml application/xml application/xml+rss text/javascript;
With apache you can use mod_gzip
Second: with gzip + Html Minification you can reduce the file size drastically!!!
I've created this HtmlMinifier for PHP.
You can retrieve it through composer: composer require arjanschouten/htmlminifier dev-master.
There is a Laravel service provider. If you're not using Laravel you can use it from PHP.
// create a minify context which will be used through the minification process
$context = new MinifyContext(new PlaceholderContainer());
// save the html contents in the context
$context->setContents('<html>My html...</html>');
$minify = new Minify();
// start the process and give the context with it as parameter
$context = $minify->run($context);
// $context now contains the minified version
$minifiedContents = $context->getContents();
As you can see you can extend a lot of things in here and you can pass various options. Check the readme to see all the available options.
This HtmlMinifier is complete and safe. It takes 3 steps for the minification process:
Replace critical content temporary with a placeholder.
Run the minification strategies.
Restore the original content.
I would suggest that you cache the output of you're views. The minification process should be a one time process. Or do it for example interval based.
Clear benchmarks are not created at the time. However the minifier can reduce the page size with 5-25% based on the your markup!
If you want to add you're own strategies you can use the addPlaceholder and the addMinifier methods.

I have a GitHub gist contains PHP functions to minify HTML, CSS and JS files → https://gist.github.com/taufik-nurrohman/d7b310dea3b33e4732c0
Here’s how to minify the HTML output on the fly with output buffer:
<?php
include 'path/to/php-html-css-js-minifier.php';
ob_start('minify_html');
?>
<!-- HTML code goes here ... -->
<?php echo ob_get_clean(); ?>

If you want to remove all new lines in the page, use this fast code:
ob_start(function($b){
if(strpos($b, "<html")!==false) {
return str_replace(PHP_EOL,"",$b);
} else {return $b;}
});

Thanks to Andrew.
Here's what a did to use this in cakePHP:
Download minify-2.1.7
Unpack the file and copy min subfolder to cake's Vendor folder
Creates MinifyCodeHelper.php in cake's View/Helper like this:
App::import('Vendor/min/lib/Minify/', 'HTML');
App::import('Vendor/min/lib/Minify/', 'CommentPreserver');
App::import('Vendor/min/lib/Minify/CSS/', 'Compressor');
App::import('Vendor/min/lib/Minify/', 'CSS');
App::import('Vendor/min/lib/', 'JSMin');
class MinifyCodeHelper extends Helper {
public function afterRenderFile($file, $data) {
if( Configure::read('debug') < 1 ) //works only e production mode
$data = Minify_HTML::minify($data, array(
'cssMinifier' => array('Minify_CSS', 'minify'),
'jsMinifier' => array('JSMin', 'minify')
));
return $data;
}
}
Enabled my Helper in AppController
public $helpers = array ('Html','...','MinifyCode');
5... Voila!
My conclusion: If apache's deflate and headers modules is disabled in your server your gain is 21% less size and 0.35s plus in request to compress (this numbers was in my case).
But if you had enable apache's modules the compressed response has no significant difference (1.3% to me) and the time to compress is the samne (0.3s to me).
So... why did I do that? 'couse my project's doc is all in comments (php, css and js) and my final user dont need to see this ;)

You can use a well tested Java minifier like HTMLCompressor by invoking it using passthru (exec).
Remember to redirect console using 2>&1
This however may not be useful, if speed is a concern. I use it for static php output

The easiest possible way would be using strtr and removing the whitespace.
That being said don't use javascript as it might break your code.
$html_minify = fn($html) => strtr($html, [PHP_EOL => '', "\t" => '', ' ' => '', '< ' => '<', '> ' => '>']);
echo $html_minify(<<<HTML
<li class="flex--item">
<a href="#"
class="-marketing-link js-gps-track js-products-menu"
aria-controls="products-popover"
data-controller="s-popover"
data-action="s-popover#toggle"
data-s-popover-placement="bottom"
data-s-popover-toggle-class="is-selected"
data-gps-track="top_nav.products.click({location:2, destination:1})"
data-ga="["top navigation","products menu click",null,null,null]">
Products
</a>
</li>
HTML);
// Response (echo): <li class="flex--item">Products</li>

This work for me on php apache wordpress + w total cache + amp
put it on the top of php page
<?Php
if(substr_count($_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip')){
ob_start ('ob_html_compress');
}else{
ob_start();
}
function ob_html_compress($buf) {
$search = array('/\>[^\S ]+/s', '/[^\S ]+\</s', '/(\s)+/s', '/<!--(.|\s)*?-->/');
$replace = array('>', '<', '\\1', '');
$buf = preg_replace($search, $replace, $buf);
return $buf;
return str_replace(array("\n","\r","\t"),'',$buf);
}
?>
then the rest of ur php or html stuff

Related

Compress Magento HTML Code

I'm trying to compress HTML code generated by Magento with this:
Observer.php
public function alterOutput($observer)
{
$lib_path = Mage::getBaseDir('lib').'/Razorphyn/html_compressor.php';
include_once($lib_path);
//Retrieve html body
$response = $observer->getResponse();
$html = $response->getBody();
$html=html_compress($html);
//Send Response
$response->setBody($html);
}
html_compressor.php:
function html_compress($string){
global $idarray;
$idarray=array();
//Replace PRE and TEXTAREA tags
$search=array(
'#(<)\s*?(pre\b[^>]*?)(>)([\s\S]*?)(<)\s*(/\s*?pre\s*?)(>)#', //Find PRE Tag
'#(<)\s*?(textarea\b[^>]*?)(>)([\s\S]*?)(<)\s*?(/\s*?textarea\s*?)(>)#' //Find TEXTAREA
);
$string=preg_replace_callback($search,
function($m){
$id='<!['.uniqid().']!>';
global $idarray;
$idarray[]=array($id,$m[0]);
return $id;
},
$string
);
//Remove blank useless space
$search = array(
'#( |\t|\f)+#', // Shorten multiple whitespace sequences
'#(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+#', //Remove blank lines
'#^(\s)+|( |\t|\0|\r\n)+$#' //Trim Lines
);
$replace = array(' ',"\\1",'');
$string = preg_replace($search, $replace, $string);
//Replace IE COMMENTS, SCRIPT, STYLE and CDATA tags
$search=array(
'#<!--\[if\s(?:[^<]+|<(?!!\[endif\]-->))*<!\[endif\]-->#', //Find IE Comments
'#(<)\s*?(script\b[^>]*?)(>)([\s\S]*?)(<)\s*?(/\s*?script\s*?)(>)#', //Find SCRIPT Tag
'#(<)\s*?(style\b[^>]*?)(>)([\s\S]*?)(<)\s*?(/\s*?style\s*?)(>)#', //Find STYLE Tag
'#(//<!\[CDATA\[([\s\S]*?)//]]>)#', //Find commented CDATA
'#(<!\[CDATA\[([\s\S]*?)]]>)#' //Find CDATA
);
$string=preg_replace_callback($search,
function($m){
$id='<!['.uniqid().']!>';
global $idarray;
$idarray[]=array($id,$m[0]);
return $id;
},
$string
);
//Remove blank useless space
$search = array(
'#(class|id|value|alt|href|src|style|title)=(\'\s*?\'|"\s*?")#', //Remove empty attribute
'#<!--([\s\S]*?)-->#', // Strip comments except IE
'#[\r\n|\n|\r]#', // Strip break line
'#[ |\t|\f]+#', // Shorten multiple whitespace sequences
'#(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+#', //Remove blank lines
'#^(\s)+|( |\t|\0|\r\n)+$#' //Trim Lines
);
$replace = array(' ','',' ',' ',"\\1",'');
$string = preg_replace($search, $replace, $string);
//Replace unique id with original tag
$c=count($idarray);
for($i=0;$i<$c;$i++){
$string = str_replace($idarray[$i][0], "\n".$idarray[$i][1]."\n", $string);
}
return $string;
}
My main concers are two:
Is this a heavy(or good) solution?
Is there a way to optimize this?
Has it really got sense to compress a Magento HTML page(taken resource and time vs real benefit)?
I will not comment or review your code. Deciphering regexes (in any flavor) is not my favorite hobby.
Yes, compressing HTML makes sense if you aim to provide professional services.
If I look at a HTML code of someone's site with lots of nonsense blank spaces and user-useless comments inside and the site disrespects Google's PageSpeed Insights Rules and does not help in making the web faster and eco-friendly then it says to me: be aware, don't trust, certainly don't give them your credit card number
My advice:
read answers to this question: Stack Overflow: HTML minification?
read other developer's code, this is the one I use: https://github.com/kangax/html-minifier
benchmark, e.g. run Google Chrome > Developer tools > Audits
test a lot if your minifier does not unintentionally destroy the pages
There is really no point in doing this if you have gzip compression (and you should have it) enabled. It's a waste of CPU cycles really. You should rather focus on image optimization, reducing number of http requests and setting proper cache headers.

How to replace bbcode tags with attributes

My shared server recently upgraded to php5.4, which broke PEAR HTMLBBcode.
I've tried to write a small function to replace the parser of some simple bbcode, with some code I found on a few forums.
The bbcode I want to parse includes tags such as the image tag with attributes:
[img src="" h="" w="" alt=""]
[*] //for bulleted lists
As my knowledge of regex is limited, maybe someone can explain how to add attributes to the img lines? I assume the # is to suppress errors from preg_replace() ?
How would you handle this tag [ * ] ?
// original function
function bbCode($string) {
$search = array(
'#\[(?i)img\](.*?)\[/(?i)img\]#si',
'#\[url\s*=\s*(.*?)\s*\](.*?)\[\/url\]#si'
);
$replace = array(
'<img src="\\1">',
'\\2'
);
return preg_replace($search , $replace, $string);
}
// test
function bbCode($string) {
$search = array(
'#\[img\s*=\s*(.*?)\s*\
\s*=[(0-9)+]
\s*=[(0-9)+]
\s*=\s*(.*?)\s*\]
(.*?)\[\/img\]#si',
'[*]'
);
$replace = array(
'<img src="\\1" height="\\2" width="\\3" alt="\\4">',
'<li></li>'
);
return preg_replace($search , $replace, $string);
}
Square Brackets are special characters in regular expressions used for creating groups. The asterisk is also a regular expression directive as well. Both need to be escaped if you are trying to match literal content. You would want to change your test code to something like this. But make sure to test my suggestion since I am doing it off the cuff and didn't confirm it first.
// test
function bbCode($string) {
$search = array(
'#\[img\s*=\s*(.*?)\s*\
\s*=[(0-9)+]
\s*=[(0-9)+]
\s*=\s*(.*?)\s*\]
(.*?)\[\/img\]#si',
'\[\*\]'
);
$replace = array(
'<img src="\\1" height="\\2" width="\\3" alt="\\4">',
'<li></li>'
);
return preg_replace($search , $replace, $string);
}
The regex needed to solve these issues requires a closer examination. The PEAR bbcode library is already ahead many steps in that direction.
Although there are still minor problems with the PEAR library, it is still better than what we were attempting here.
I added the path in php.ini and it returned errors for some reason
require "/path/to/pear";
My host support was able to fix the reference simply by
include "../path/to/pear_bbcode";

Remove HTML Entity if Incomplete

I have an issue where I have displayed up to 400 characters of a string that is pulled from the database, however, this string is required to contain HTML Entities.
By chance, the client has created the string to have the 400th character to sit right in the middle of a closing P tag, thus killing the tag, resulting in other errors for code after it.
I would prefer this closing P tag to be removed entirely as I have a "...read more" link attached to the end which would look cleaner if attached to the existing paragraph.
What would be the best approach for this to cover all HTML Entity issues? Is there a PHP function that will automatically close off/remove any erroneous HTML tags? I don't need a coded answer, just a direction will help greatly.
Thanks.
Here's a simple way you can do it with DOMDocument, its not perfect but it may be of interest:
<?php
function html_tidy($src){
libxml_use_internal_errors(true);
$x = new DOMDocument;
$x->loadHTML('<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'.$src);
$x->formatOutput = true;
$ret = preg_replace('~<(?:!DOCTYPE|/?(?:html|body|head))[^>]*>\s*~i', '', $x->saveHTML());
return trim(str_replace('<meta http-equiv="Content-Type" content="text/html;charset=utf-8">','',$ret));
}
$brokenHTML[] = "<p><span>This is some broken html</spa";
$brokenHTML[] = "<poken html</spa";
$brokenHTML[] = "<p><span>This is some broken html</spa</p>";
/*
<p><span>This is some broken html</span></p>
<poken html></poken>
<p><span>This is some broken html</span></p>
*/
foreach($brokenHTML as $test){
echo html_tidy($test);
}
?>
Though take note of Mike 'Pomax' Kamermans's comment.
why you don't take the last word in the paragraph or content and remove it, if the word is complete you remove it , if is not complete you also remove it, and you are sure that the content still clean, i show you an example for what code will be look like :
while($row = $req->fetch(PDO::FETCH_OBJ){
//extract 400 first characters from the content you need to show
$extraction = substr($row->text, 0, 400);
// find the last space in this extraction
$last_space = strrpos($extraction, ' ');
//take content from the first character to the last space and add (...)
echo substr($extraction, 0, $last_space) . ' ...';
}
just remove last broken tag and then strip_tags
$str = "<p>this is how we do</p";
$str = substr($str, 0, strrpos($str, "<"));
$str = strip_tags($str);

Modifying a PHP file from a Bash script

I need to perform some modifications to PHP files (PHTML files to be exact, but they are still valid PHP files), from a Bash script. My original thought was to use sed or similar utility with regex, but reading some of the replies here for other HTML parsing questions it seems that there might be a better solution.
The problem I was facing with the regex was a lack of support for detecting if the string I wanted to match: (src|href|action)=["']/ was in <?php ?> tags or not, so that I could then either perform string concatenation if the match was in PHP tags, or add in new PHP tags should it not be. For example:
(1) <img id="icon-loader-small" src="/css/images/loader-small.gif" style="vertical-align:middle; display:none;"/>
(2) <li><span class="name"><?php echo $this->loggedInAs()?></span> | Logout</li>
(3) <?php echo ($watched_dir->getExistsFlag())?"":"<span class='ui-icon-alert'><img src='/css/images/warning-icon.png'></span>"?><span><?php echo $watched_dir->getDirectory();?></span></span><span class="ui-icon ui-icon-close"></span>
(EDIT: 4) <form method="post" action="/Preference/stream-setting" enctype="application/x-www-form-urlencoded" onsubmit="return confirm('<?php echo $this->confirm_pypo_restart_text ?>');">
In (1) there a src="/css, and as it is not in PHP tags I want that to become src="<?php echo $baseUrl?>/css. In (2), there is a PHP tag but it is not around the href="/Login, so it also becomes href="<?php echo $baseUrl?>/Login.
Unfortunately, (3) has src='/css but inside the PHP tags (it is an echoed string). It is also quoted by " in the PHP code, so the modification needs to pick up on that too. The final result would look something like: src='".$baseUrl."/css.
All the other modifications to my HTML and PHP files have been done using a regex (I know, I know...). If regexes could support matching everything except a certain pattern, like [^(<\?php)(\?>)]* then I would be flying through this part. Unfortunately it seems that this is Type 2 grammar territory. So - what should I use?
Ideally it needs to be installed by default with the GNU suite, but other tools like PHP itself or other interpreters are fine too, just not preferred. Of course, if someone could structure a regex that would work on the above examples, then that would be excellent.
EDIT: (4) is the nasty match, where most regexes will fail.
The way I solved this problem was by separating my file into sections that were encapsulated by . The script kept track of the 'context' it was currently in - by default set to html but switching to php when it hit those tags. An operation (not necessarily a regex) then performs on that section, which is then appended to the output buffer. When the file is completely processed the output buffer is written back into the file.
I attempted to do this with sed, but I faced the problem of not being able to control where newlines would be printed. The context based logic was also hardcoded meaning it would be tedious to add in a new context, like ASP.NET support for example. My current solution is written in Perl and mitigates both problems, although I am having a bit of trouble getting my regex to actually do something, but this might just be me coding my regex incorrectly.
Script is as follows:
#!/usr/bin/perl -w
use strict;
#Prototypes
sub readFile(;\$);
sub writeFile($);
#Constants
my $file;
my $outputBuffer;
my $holdBuffer;
# Regexes should have s and g modifiers
# Pattern is in $_
my %contexts = (
html => {
operation => ''
},
php => {
openTag => '<\?php |<\? ', closeTag => '\?>', operation => ''
},
js => {
openTag => '<script>', closeTag => '<\/script>', operation => ''
}
);
my $currentContext = 'html';
my $combinedOpenTags;
#Initialisation
unshift(#ARGV, '-') unless #ARGV;
foreach my $key (keys %contexts) {
if($contexts{$key}{openTag}) {
if($combinedOpenTags) {
$combinedOpenTags .= "|".$contexts{$key}{openTag};
} else {
$combinedOpenTags = $contexts{$key}{openTag};
}
}
}
#Main loop
while(readFile($holdBuffer)) {
$outputBuffer = '';
while($holdBuffer) {
$currentContext = "html";
foreach my $key (keys %contexts) {
if(!$contexts{$key}{openTag}) {
next;
}
if($holdBuffer =~ /\A($contexts{$key}{openTag})/) {
$currentContext = $key;
last;
}
}
if($currentContext eq "html") {
$holdBuffer =~ s/\A(.*?)($combinedOpenTags|\z)/$2/s;
$_ = $1;
} else {
$holdBuffer =~ s/\A(.*?$contexts{$currentContext}{closeTag}|\z)//s;
$_ = $1;
}
eval($contexts{$currentContext}{operation});
$outputBuffer .= $_;
}
writeFile($outputBuffer);
}
# readFile: read file into $_
sub readFile(;\$) {
my $argref = #_ ? shift() : \$_;
return 0 unless #ARGV;
$file = shift(#ARGV);
open(WORKFILE, "<$file") || die("$0: can't open $file for reading ($!)\n");
local $/;
$$argref = <WORKFILE>;
close(WORKFILE);
return 1;
}
# writeFile: write $_[0] to file
sub writeFile($) {
open(WORKFILE, ">$file") || die("$0: can't open $file for writing ($!)\n");
print WORKFILE $_[0];
close(WORKFILE);
}
I hope that this can be used and modified by others to suit their needs.

strip_tags disallow some tags

Based on the strip_tags documentation, the second parameter takes the allowable tags. However in my case, I want to do the reverse. Say I'll accept the tags the script_tags normally (default) accept, but strip only the <script> tag. Any possible way for this?
I don't mean somebody to code it for me, but rather an input of possible ways on how to achieve this (if possible) is greatly appreciated.
EDIT
To use the HTML Purifier HTML.ForbiddenElements config directive, it seems you would do something like:
require_once '/path/to/HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.ForbiddenElements', array('script','style','applet'));
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);
http://htmlpurifier.org/docs
HTML.ForbiddenElements should be set to an array. What I don't know is what form the array members should take:
array('script','style','applet')
Or:
array('<script>','<style>','<applet>')
Or... Something else?
I think it's the first form, without delimiters; HTML.AllowedElements uses a form of configuration string somewhat common to TinyMCE's valid elements syntax:
tinyMCE.init({
...
valid_elements : "a[href|target=_blank],strong/b,div[align],br",
...
});
So my guess is it's just the term, and no attributes should be provided (since you're banning the element... although there is a HTML.ForbiddenAttributes, too). But that's a guess.
I'll add this note from the HTML.ForbiddenAttributes docs, as well:
Warning: This directive complements %HTML.ForbiddenElements,
accordingly, check out that directive for a discussion of why you
should think twice before using this directive.
Blacklisting is just not as "robust" as whitelisting, but you may have your reasons. Just beware and be careful.
Without testing, I'm not sure what to tell you. I'll keep looking for an answer, but I will likely go to bed first. It is very late. :)
Although I think you really should use HTML Purifier and utilize it's HTML.ForbiddenElements configuration directive, I think a reasonable alternative if you really, really want to use strip_tags() is to derive a whitelist from the blacklist. In other words, remove what you don't want and then use what's left.
For instance:
function blacklistElements($blacklisted = '', &$errors = array()) {
if ((string)$blacklisted == '') {
$errors[] = 'Empty string.';
return array();
}
$html5 = array(
"<menu>","<command>","<summary>","<details>","<meter>","<progress>",
"<output>","<keygen>","<textarea>","<option>","<optgroup>","<datalist>",
"<select>","<button>","<input>","<label>","<legend>","<fieldset>","<form>",
"<th>","<td>","<tr>","<tfoot>","<thead>","<tbody>","<col>","<colgroup>",
"<caption>","<table>","<math>","<svg>","<area>","<map>","<canvas>","<track>",
"<source>","<audio>","<video>","<param>","<object>","<embed>","<iframe>",
"<img>","<del>","<ins>","<wbr>","<br>","<span>","<bdo>","<bdi>","<rp>","<rt>",
"<ruby>","<mark>","<u>","<b>","<i>","<sup>","<sub>","<kbd>","<samp>","<var>",
"<code>","<time>","<data>","<abbr>","<dfn>","<q>","<cite>","<s>","<small>",
"<strong>","<em>","<a>","<div>","<figcaption>","<figure>","<dd>","<dt>",
"<dl>","<li>","<ul>","<ol>","<blockquote>","<pre>","<hr>","<p>","<address>",
"<footer>","<header>","<hgroup>","<aside>","<article>","<nav>","<section>",
"<body>","<noscript>","<script>","<style>","<meta>","<link>","<base>",
"<title>","<head>","<html>"
);
$list = trim(strtolower($blacklisted));
$list = preg_replace('/[^a-z ]/i', '', $list);
$list = '<' . str_replace(' ', '> <', $list) . '>';
$list = array_map('trim', explode(' ', $list));
return array_diff($html5, $list);
}
Then run it:
$blacklisted = '<html> <bogus> <EM> em li ol';
$whitelist = blacklistElements($blacklisted);
if (count($errors)) {
echo "There were errors.\n";
print_r($errors);
echo "\n";
} else {
// Do strip_tags() ...
}
http://codepad.org/LV8ckRjd
So if you pass in what you don't want to allow, it will give you back the HTML5 element list in an array form that you can then feed into strip_tags() after joining it into a string:
$stripped = strip_tags($html, implode('', $whitelist)));
Caveat Emptor
Now, I've kind've hacked this together and I know there are some issues I haven't thought out yet. For instance, from the strip_tags() man page for the $allowable_tags argument:
Note:
This parameter should not contain whitespace. strip_tags() sees a tag
as a case-insensitive string between < and the first whitespace or >.
It means that strip_tags("<br/>", "<br>") returns an empty string.
It's late and for some reason I can't quite figure out what this means for this approach. So I'll have to think about that tomorrow. I also compiled the HTML element list in the function's $html5 element from this MDN documentation page. Sharp-eyed reader's might notice all of the tags are in this form:
<tagName>
I'm not sure how this will effect the outcome, whether I need to take into account variations in the use of a shorttag <tagName/> and some of the, ahem, odder variations. And, of course, there are more tags out there.
So it's probably not production ready. But you get the idea.
First, see what others have said on this topic:
Strip <script> tags and everything in between with PHP?
and
remove script tag from HTML content
It seems you have 2 choices, one is a Regex solution, both the links above give them. The second is to use HTML Purifier.
If you are stripping the script tag for some other reason than sanitation of user content, the Regex could be a good solution. However, as everyone has warned, it is a good idea to use HTML Purifier if you are sanitizing input.
PHP(5 or greater) solution:
If you want to remove <script> tags (or any other), and also you want to remove the content inside tags, you should use:
OPTION 1 (simplest):
preg_replace('#<script(.*?)>(.*?)</script>#is', '', $text);
OPTION 2 (more versatile):
<?php
$html = "<p>Your HTML code</p><script>With malicious code</script>"
$dom = new DOMDocument();
$dom->loadHTML($html);
$script = $dom->getElementsByTagName('script');
$remove = [];
foreach($script as $item)
{
$item->parentNode->removeChild($item);
}
$html = $dom->saveHTML();
Then $html will be:
"<p>Your HTML code</p>"
This is what I use to strip out a list of forbidden tags, can do both removing of tags wrapping content and tags including content, Plus trim off leftover white space.
$description = trim(preg_replace([
# Strip tags around content
'/\<(.*)doctype(.*)\>/i',
'/\<(.*)html(.*)\>/i',
'/\<(.*)head(.*)\>/i',
'/\<(.*)body(.*)\>/i',
# Strip tags and content inside
'/\<(.*)script(.*)\>(.*)<\/script>/i',
], '', $description));
Input example:
$description = '<html>
<head>
</head>
<body>
<p>This distinctive Mini Chopper with Desire styling has a powerful wattage and high capacity which makes it a very versatile kitchen accessory. It also comes equipped with a durable glass bowl and lid for easy storage.</p>
<script type="application/javascript">alert('Hello world');</script>
</body>
</html>';
Output result:
<p>This distinctive Mini Chopper with Desire styling has a powerful wattage and high capacity which makes it a very versatile kitchen accessory. It also comes equipped with a durable glass bowl and lid for easy storage.</p>
I use the following:
function strip_tags_with_forbidden_tags($input, $forbidden_tags)
{
foreach (explode(',', $forbidden_tags) as $tag) {
$tag = preg_replace(array('/^</', '/>$/'), array('', ''), $tag);
$input = preg_replace(sprintf('/<%s[^>]*>([^<]+)<\/%s>/', $tag, $tag), '$1', $input);
}
return $input;
}
Then you can do:
echo strip_tags_with_forbidden_tags('<cancel>abc</cancel>xpto<p>def></p><g>xyz</g><t>xpto</t>', 'cancel,g');
Output: 'abcxpto<p>def></p>xyz<t>xpto</t>'
echo strip_tags_with_forbidden_tags('<cancel>abc</cancel> xpto <p>def></p> <g>xyz</g> <t>xpto</t>', 'cancel,g');
Outputs: 'abc xpto <p>def></p> xyz <t>xpto</t>'

Categories