I am compressing HTML on the fly with the function below:
function compress_page($buffer) {
$search = array(
'/\>[^\S ]+/s', /*strip whitespaces after tags, except space*/
'/[^\S ]+\</s', /*strip whitespaces before tags, except space*/
'/(\s)+/s', /*shorten multiple whitespace sequences*/
);
$replace = array(
'>',
'<',
'\\1',
);
$buffer = preg_replace($search, $replace, $buffer);
return $buffer;
}
The function works, but since I implemented it, German characters no longer display correctly. They show up as "�". Can you please help me find the problem?
I tried other ways to minify HTML but got the same problem.
This probably happens because your regexes do not use the Unicode modifier (u).
Anyway, here is the code I wrote to minify output:
function sanitize_output($buffer, $type = null) {
$search = array(
'/\>[^\S ]+/s', // strip whitespaces after tags, except space
'/[^\S ]+\</s', // strip whitespaces before tags, except space
'/(\s)+/s', // shorten multiple whitespace sequences
'/<!--(.|\s)*?-->/', // Remove HTML comments
'#/\*(.|\s)*\*/#Uu' // Remove JS comments
);
$replace = array(
'>',
'<',
' ',
'',
''
);
if( $type == 'html' ){
// Remove quotes around attribute values
$search[] = '#(\w+=)(?:"|\')((\S|\.|\-|/|_|\(|\)|\w){1,8})(?:"|\')#u';
$replace[] = '$1$2';
// Remove spaces between tags
$search[] = '#(>)\s+(<)#mu';
$replace[] = '$1$2';
}
$buffer = str_replace( PHP_EOL, '', preg_replace( $search, $replace, $buffer ) );
return $buffer;
}
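To illustrate the Unicode-modifier point, here is a minimal sketch of the three patterns from the question with the u modifier added. The function name compress_page_utf8 and the null fallback are my own assumptions, not part of the original code, and whether this cures the "�" output depends on the PCRE build and on the page really being UTF-8.

```php
<?php
// Hypothetical helper: same patterns as compress_page(), plus the `u` modifier
// so that PCRE treats the buffer as UTF-8 instead of as single bytes.
function compress_page_utf8(string $buffer): string
{
    $search = [
        '/>[^\S ]+/su',  // strip whitespace after tags, except space
        '/[^\S ]+</su',  // strip whitespace before tags, except space
        '/(\s)+/su',     // shorten multiple whitespace sequences
    ];
    $replace = ['>', '<', '\\1'];

    $result = preg_replace($search, $replace, $buffer);

    // With `u`, preg_replace() returns null when the input is not valid UTF-8;
    // in that case hand back the untouched buffer instead of an empty page.
    return $result === null ? $buffer : $result;
}

echo compress_page_utf8("<p>\n\tGrüße aus Köln,   schöne Straße</p>\n");
```

The null check matters when this is used as an ob_start() callback: returning null would otherwise blank the page if a single invalid byte sneaks into the output.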
After some research, I found this solution. It minifies the full HTML onto a single line.
function pt_html_minyfy_finish( $html ) {
$html = preg_replace('/<!--(?!\s*(?:\[if [^\]]+]|!|>))(?:(?!-->).)*-->/s', '', $html);
$html = str_replace(array("\r\n", "\r", "\n", "\t"), '', $html);
// collapse runs of spaces: search for two spaces, replace with one
while ( stristr($html, '  '))
$html = str_replace('  ', ' ', $html);
return $html;
}
Hope this will help someone!
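As a side note, the loop that collapses repeated spaces could probably be done in a single pass with a regex. This is only a sketch; collapse_spaces is a made-up name and not part of the solution above.

```php
<?php
// Hypothetical helper: replace every run of two or more spaces with one space.
function collapse_spaces(string $html): string
{
    // preg_replace() returns null on a PCRE error; keep the input in that case.
    return preg_replace('/ {2,}/', ' ', $html) ?? $html;
}

echo collapse_spaces('<p>a     b  c</p>');
```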
I'm writing a regex to filter content and format its typography. So far, my code seems to filter my content properly using preg_replace, but I can't figure out how to skip content wrapped within certain tags, say <pre>.
As a reference, this is to be used within WordPress's the_content filter, so my current code looks like so:
function my_typography( $str ) {
$ignore_elements = array("code", "pre");
$rules = array(
"?" => array("before"=> " ", "after"=>""),
// the others are stripped out for simplicity
);
foreach($rules as $rule=>$params) {
// Pseudo :
// if( !in_array( $parent_tag, $ignore_elements) {
// /Pseudo
$formatted = $params['before'] . $rule . $params['after'];
$str = preg_replace( $rule, $formatted, $str );
// Pseudo :
// }
// /Pseudo
}
return $str;
}
add_filter( 'the_content', 'my_typography' );
Basically:
<p>Was this filtered? I hope so</p>
<pre>Was this filtered? I hope not.</pre>
should become
<p>Was this filtered ? I hope so</p>
<pre>Was this filtered? I hope not.</pre>
You need to wrap the search regex in regex delimiters for preg_replace, and you must call preg_quote to escape all special regex characters such as ?, ., *, + etc.:
$str = preg_replace( '~' . preg_quote($rule, '~') . '~', $formatted, $str );
Full Code:
function my_typography( $str ) {
$ignore_elements = array("code", "pre");
$rules = array(
"?" => array("before"=> " ", "after"=>""),
// the others are stripped out for simplicity
);
foreach($rules as $rule=>$params) {
// Pseudo :
// if( !in_array( $parent_tag, $ignore_elements) {
// /Pseudo
$formatted = $params['before'] . $rule . $params['after'];
$str = preg_replace( '~' . preg_quote($rule, '~') . '~', $formatted, $str );
// Pseudo :
// }
// /Pseudo
}
return $str;
}
Output:
<p>Was this filtered ? I hope so</p>
<pre>Was this filtered ? I hope not.</pre>
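The output above still changes the text inside <pre>. One possible way to leave the ignored elements alone is PCRE's (*SKIP)(*FAIL) verbs: match a whole ignored element first and throw that match away, so the rule only fires outside of it. This is a sketch under assumptions: my_typography_skip is a made-up name, the elements are assumed not to be nested, and the add_filter() call is left out so the snippet runs without WordPress.

```php
<?php
// Hypothetical variant of my_typography() that skips <code> and <pre> blocks.
function my_typography_skip(string $str): string
{
    $ignore_elements = ['code', 'pre'];
    $rules = [
        '?' => ['before' => ' ', 'after' => ''],
    ];

    // Alternation of the ignored tag names, e.g. "code|pre".
    $tags = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '~');
    }, $ignore_elements));

    foreach ($rules as $rule => $params) {
        $formatted = $params['before'] . $rule . $params['after'];

        // First alternative: a complete ignored element, then (*SKIP)(*FAIL)
        // discards it and resumes after it. Second alternative: the rule itself.
        $pattern = '~<(' . $tags . ')\b[^>]*>.*?</\1>(*SKIP)(*FAIL)|'
                 . preg_quote($rule, '~') . '~is';

        $result = preg_replace($pattern, $formatted, $str);
        if ($result !== null) {
            $str = $result;
        }
    }
    return $str;
}

echo my_typography_skip("<p>Was this filtered? I hope so</p>\n<pre>Was this filtered? I hope not.</pre>");
```

A regex like this does not understand HTML, so markup with nested or unclosed <pre>/<code> elements may need a real parser instead.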
My first question here:
I am using this code to show excerpts on my WordPress site. It lets me keep tags within the excerpt and it works OK, but when I have a word that is linked, I always get a space after it, before the comma.
Example:
Google, Yahoo
Output:
Google , Yahoo
The space after the comma is on purpose and should be there. The one before it is one too many.
Any tips on how to fix this?
Code that I use in order to define excerpt:
function new_wp_trim_excerpt($text) {
$raw_excerpt = $text;
if ( '' == $text ) {
$text = get_the_content('');
$text = strip_shortcodes( $text );
$text = apply_filters('the_content', $text);
$text = str_replace(']]>', ']]&gt;', $text);
$text = strip_tags($text, '<a>');
$excerpt_length = apply_filters('excerpt_length', 70);
$words = preg_split('/(<a.*?a>)|\n|\r|\t|\s/', $text, $excerpt_length + 1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE );
if ( count($words) > $excerpt_length ) {
array_pop($words);
$text = implode(' ', $words);
$text = $text . $excerpt_more;
} else {
$text = implode(' ', $words);
}
}
return apply_filters('new_wp_trim_excerpt', $text, $raw_excerpt);
}
remove_filter('get_the_excerpt', 'wp_trim_excerpt');
add_filter('get_the_excerpt', 'new_wp_trim_excerpt');
It appears to be an issue with your regex. Does it help if you add a comma after the <a.*?a>? Here's the regex line with the comma included:
$words = preg_split('/(<a.*?a>,)|\n|\r|\t|\s/', $text, $excerpt_length + 1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE );
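To see what the split does without WordPress, here is a small sketch. rejoin_words is a made-up helper that mimics the preg_split/implode steps of the excerpt function. Note that the second pattern uses an optional comma (,?) rather than a required one; that is my own assumption, so that links not followed by a comma are still captured as one piece.

```php
<?php
// Hypothetical helper: split on the pattern and rejoin with single spaces,
// the same way new_wp_trim_excerpt() builds $text from $words.
function rejoin_words(string $text, string $pattern): string
{
    $words = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    return implode(' ', $words);
}

$text = '<a href="#">Google</a>, <a href="#">Yahoo</a>';

// Original pattern: the link is one piece and the comma a separate "word",
// so implode() puts a space between them.
echo rejoin_words($text, '/(<a.*?a>)|\n|\r|\t|\s/'), "\n";

// Variant: a comma directly after the link stays inside the captured piece.
echo rejoin_words($text, '/(<a.*?a>,?)|\n|\r|\t|\s/'), "\n";
```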
I'm using the buffer sanitizer, as seen in a PHP manual comment, but having trouble with double newlines in textareas.
When pulling a string out from my database, containing double/triple/quadruple newlines, and putting it into a textarea, the newlines are reduced to only a single newline.
Therefore: Is it possible to have the function exclude all output between <pre>, <textarea> and </pre>, </textarea>?
Having seen the question "How to minify php html output without removing IE conditional comments?", I think I need to use preg_match, but I'm not sure how to implement it in this function.
The function I'm using is
function sanitize_output($buffer) {
$search = array(
'/\>[^\S ]+/s', // strip whitespaces after tags, except space
'/[^\S ]+\</s', // strip whitespaces before tags, except space
'/(\s)+/s' // shorten multiple whitespace sequences
);
$replace = array(
'>',
'<',
'\\1'
);
$buffer = preg_replace($search, $replace, $buffer);
return $buffer;
}
ob_start("sanitize_output");
And yeah I'm using both this sanitizer and GZIP to get the smallest size possible.
Here is an implementation of the approach mentioned in the comments:
function sanitize_output($buffer) {
// Searching textarea and pre
preg_match_all('#\<textarea.*\>.*\<\/textarea\>#Uis', $buffer, $foundTxt);
preg_match_all('#\<pre.*\>.*\<\/pre\>#Uis', $buffer, $foundPre);
// replacing both with <textarea>$index</textarea> / <pre>$index</pre>
$buffer = str_replace($foundTxt[0], array_map(function($el){ return '<textarea>'.$el.'</textarea>'; }, array_keys($foundTxt[0])), $buffer);
$buffer = str_replace($foundPre[0], array_map(function($el){ return '<pre>'.$el.'</pre>'; }, array_keys($foundPre[0])), $buffer);
// your stuff
$search = array(
'/\>[^\S ]+/s', // strip whitespaces after tags, except space
'/[^\S ]+\</s', // strip whitespaces before tags, except space
'/(\s)+/s' // shorten multiple whitespace sequences
);
$replace = array(
'>',
'<',
'\\1'
);
$buffer = preg_replace($search, $replace, $buffer);
// Replacing back with content
$buffer = str_replace(array_map(function($el){ return '<textarea>'.$el.'</textarea>'; }, array_keys($foundTxt[0])), $foundTxt[0], $buffer);
$buffer = str_replace(array_map(function($el){ return '<pre>'.$el.'</pre>'; }, array_keys($foundPre[0])), $foundPre[0], $buffer);
return $buffer;
}
There is always room for optimization, but this works.
There is a simple solution for PRE that does not work for TEXTAREA: replace the spaces with &nbsp;, then use nl2br() to replace the newlines with BR elements before outputting the values. It's not elegant, but it works:
<pre><?php
echo(nl2br(str_replace(' ', '&nbsp;', htmlspecialchars($value))));
?></pre>
Unfortunately, it cannot be used for TEXTAREA because the browsers display <br /> as text.
Maybe this will give you the result you need.
In general, though, I do not recommend this kind of sanitizing; it is not good for performance. These days there is no real need to strip whitespace characters from HTML output.
function sanitize_output($buffer) {
$ignoreTags = array("textarea", "pre");
# find tags that must be ignored and replace it with a placeholder
$tmpReplacements = array();
foreach($ignoreTags as $tag){
preg_match_all("~<$tag.*?>.*?</$tag>~is", $buffer, $match);
if($match && $match[0]){
foreach($match[0] as $key => $value){
if(!isset($tmpReplacements[$tag])) $tmpReplacements[$tag] = array();
$index = count($tmpReplacements[$tag]);
// include the tag name so placeholders of different tags cannot collide
$replacementValue = "<tmp-replacement>{$tag}-{$index}</tmp-replacement>";
$tmpReplacements[$tag][$index] = array($value, $replacementValue);
$buffer = str_replace($value, $replacementValue, $buffer);
}
}
}
$search = array(
'/\>[^\S ]+/s', // strip whitespaces after tags, except space
'/[^\S ]+\</s', // strip whitespaces before tags, except space
'/(\s)+/s' // shorten multiple whitespace sequences
);
$replace = array(
'>',
'<',
'\\1'
);
$buffer = preg_replace($search, $replace, $buffer);
# re-insert previously ignored tags
foreach($tmpReplacements as $tag => $rows){
foreach($rows as $values){
$buffer = str_replace($values[1], $values[0], $buffer);
}
}
return $buffer;
}
function nl2ascii($str){
return str_replace(array("\n","\r"), array("&#10;","&#13;"), $str);
}
$StrTest = "test\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\rtest";
ob_start("sanitize_output");
?>
<textarea><?php echo nl2ascii($StrTest); ?></textarea>
<textarea><?php echo $StrTest; ?></textarea>
<pre style="border: 1px solid red"><?php echo nl2ascii($StrTest); ?></pre>
<pre style="border: 1px solid red"><?php echo $StrTest; ?></pre>
<?php
ob_flush();
raw output
<textarea>test
test</textarea>
<textarea>test
test</textarea>
<pre style="border: 1px solid red">test
test</pre>
<pre style="border: 1px solid red">test
test</pre>
This is my version of the HTML sanitizer. I have commented the code, so it should be clear what it does.
function comprimeer($html = '', $arr_tags = ['textarea', 'pre']) {
$arr_found = [];
$arr_back = [];
$arr_temp = [];
// for each tag, collect every occurrence of the tag with its content;
// preg_match_all() puts the full matches in $arr_temp[0]
foreach ($arr_tags as $tag) {
if(preg_match_all('#\<' . $tag . '.*\>.*\<\/' . $tag . '\>#Uis', $html, $arr_temp)) {
// the tag is present
foreach($arr_temp[0] as $key => $item) {
// keep every matched item of the tag
$arr_found[$tag][] = $item;
// make a numbered replacement, e.g. <tag>1</tag>
$arr_back[$tag][] = '<' . $tag . '>' . $key . '</' . $tag . '>';
}
// replace all the present tags with the numbered ones
$html = str_replace((array) $arr_found[$tag], (array) $arr_back[$tag], $html);
}
} // end foreach
// clean the html
$arr_search = [
'/\>[^\S ]+/s', // strip whitespaces after tags, except space
'/[^\S ]+\</s', // strip whitespaces before tags, except space
'/(\s)+/s' // shorten multiple whitespace sequences
];
$arr_replace = [
'>',
'<',
'\\1'
];
$clean = preg_replace($arr_search, $arr_replace, $html);
// put the kept items back
foreach ($arr_tags as $tag) {
if(isset($arr_found[$tag])) {
// the tag was present replace them back
$clean = str_replace($arr_back[$tag], $arr_found[$tag], $clean);
}
} // end foreach
// give the cleaned html back
return $clean;
} // end function
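For comparison, the same goal can probably be reached without placeholders by splitting the buffer around the protected elements and cleaning only the pieces in between. This is a sketch, not a drop-in replacement: sanitize_output_keep is a made-up name and the elements are assumed not to be nested.

```php
<?php
// Hypothetical placeholder-free variant: split around protected elements
// and run the whitespace patterns only on the text between them.
function sanitize_output_keep(string $buffer, array $keepTags = ['textarea', 'pre']): string
{
    // One alternative per protected tag: <textarea ...>...</textarea>|<pre ...>...</pre>
    $alternatives = array_map(function ($tag) {
        $tag = preg_quote($tag, '~');
        return "<{$tag}\\b[^>]*>.*?</{$tag}>";
    }, $keepTags);

    // With one capture group and PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // even indexes are ordinary HTML, odd indexes are protected elements.
    $parts = preg_split('~(' . implode('|', $alternatives) . ')~is', $buffer, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $buffer;
    }

    $search = [
        '/>[^\S ]+/s',  // strip whitespace after tags, except space
        '/[^\S ]+</s',  // strip whitespace before tags, except space
        '/(\s)+/s',     // shorten multiple whitespace sequences
    ];
    $replace = ['>', '<', '\\1'];

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace($search, $replace, $part) ?? $part;
        }
    }
    return implode('', $parts);
}

echo sanitize_output_keep("<div>\n\n  <textarea>a\n\n\nb</textarea>\n\n</div>");
```

One difference from the placeholder versions: whitespace right after a protected closing tag sits in a separate piece, so the first pattern cannot see the preceding ">" and that whitespace is only shortened, not removed, unless another tag follows.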
I have this BBCode tag "remover", which should replace the BBCode tags in my test text with HTML tags.
All I get is nothing: just a blank page where the converted text should be.
What's wrong with it? And does anyone have a better script to share?
$str = 'This [b]is just[/b] a [i]test[/i] text!';
function forum_text($str)
{
$str = htmlspecialchars($str);
$str = preg_replace( "#\[url\](?:http:\/\/)?(.+?)\[/url\]#is", "$1", $str );
$str = preg_replace( "#\[img\](?:http:\/\/)?(.+?)\[/img\]#is", "<img src=\"http://$1\" />", $str );
$str = preg_replace( "#\[b\](.+?)\[/b\]#is", "<strong>$1</strong>", $str );
$str = preg_replace( "#\[i\](.+?)\[/i\]#is", "<i>$1</i>", $str );
$str = preg_replace( "#\[u\](.+?)\[/u\]#is", "<u>$1</u>", $str );
return $str;
}
The following is your code, with some code in front of it (to make sure any errors are shown) and some code at the back (that actually calls your function).
If this doesn't work for you, your problem is not here, unless you don't have a working PCRE.
error_reporting(-1); ini_set('display_errors', 'On');
$str = 'This [b]is just[/b] a [i]test[/i] text!';
function forum_text($str)
{
$str = htmlspecialchars($str);
$str = preg_replace( "#\[url\](?:http:\/\/)?(.+?)\[/url\]#is", "$1", $str );
$str = preg_replace( "#\[img\](?:http:\/\/)?(.+?)\[/img\]#is", "<img src=\"http://$1\" />", $str );
$str = preg_replace( "#\[b\](.+?)\[/b\]#is", "<strong>$1</strong>", $str );
$str = preg_replace( "#\[i\](.+?)\[/i\]#is", "<i>$1</i>", $str );
$str = preg_replace( "#\[u\](.+?)\[/u\]#is", "<u>$1</u>", $str );
return $str;
}
echo forum_text($str);
I have a function that slugifies text. It works well, except that I need to replace ":" with "/". Currently it replaces every run of non-letter, non-digit characters with "-". Here it is:
function slugify($text)
{
// replace non letter or digits by -
$text = preg_replace('~[^\\pL\d]+~u', '-', $text);
// trim
$text = trim($text, '-');
// transliterate
if (function_exists('iconv'))
{
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
}
// lowercase
$text = strtolower($text);
// remove unwanted characters
$text = preg_replace('~[^-\w]+~', '', $text);
if (empty($text))
{
return 'n-a';
}
return $text;
}
I made just a couple of modifications. I passed arrays of search and replace patterns, so that almost everything is still replaced with -, but : is replaced with /:
$search = array( '~[^\\pL\d:]+~u', '~:~' );
$replace = array( '-', '/' );
$text = preg_replace( $search, $replace, $text);
Later on, the last preg_replace was replacing our / with an empty string, so I permitted forward slashes in the character class.
$text = preg_replace('~[^-\w\/]+~', '', $text);
Which outputs the following:
// antiques/antiquities
echo slugify( "Antiques:Antiquities" );
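Put together, the two modifications might look like the sketch below. The name slugify_with_slash and the check on iconv's return value are my additions; the rest follows the original function.

```php
<?php
// Hypothetical assembled version of slugify() with ":" mapped to "/".
function slugify_with_slash(string $text): string
{
    // replace runs of anything but letters, digits and ":" by "-", then ":" by "/"
    $search  = ['~[^\pL\d:]+~u', '~:~'];
    $replace = ['-', '/'];
    $text = (string) preg_replace($search, $replace, $text);

    // trim
    $text = trim($text, '-');

    // transliterate; iconv() returns false on failure, so keep the old value then
    if (function_exists('iconv')) {
        $converted = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
        if ($converted !== false) {
            $text = $converted;
        }
    }

    // lowercase
    $text = strtolower($text);

    // remove unwanted characters, but keep "-" and "/"
    $text = (string) preg_replace('~[^-\w/]+~', '', $text);

    if (empty($text)) {
        return 'n-a';
    }
    return $text;
}

echo slugify_with_slash('Antiques:Antiquities');
```

Spaces around the colon are still turned into dashes, so an input like "A : B" would give "a-/-b"; trimming those would need one more replacement.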