Let's say I have a string from the user ($input). I can go and strip tags, to allow only allowed tags in. I can convert to text with htmlspecialchars(). I can even replace all tags I don't want with text.
function html($input) {
$input = '<bl>'.htmlspecialchars($input).'</bl>'; // bl is a custom tag that I style (stands for block)
global $open;
$open = []; //Array of open tags
for ($i = 0; $i < strlen($input); $i++) {
if (!in_array('code', $open) && !in_array('codebl', $open)) { //If we are parsing
$input = preg_replace_callback('#^(.{'.$i.'})<(em|i|del|sub|sup|sml|code|kbd|pre|codebl|quote|bl|sbl)>\s*#s', function($match) {
global $open; //...then add new tags to the array
array_push($open,$match[2]);
return $match[1].'<'.$match[2].'>'; //And replace them
}, $input);
$input = preg_replace_callback('#^(.{'.$i.'})(https?):\/\/([^\s"\(\)<>]+)#', function($m) {
return $m[1].''.$m[3].'';
}, $input, -1, $num); //Simple linking
$i += $num * 9;
$input = preg_replace_callback('#^(.{'.$i.'})\n\n#', function($m) {
return $m[1].'</bl><bl>';
}, $input); // More of this bl element
}
if (end($open)) { //Close tags
$input = preg_replace_callback('#^(.{'.$i.'})</('.end($open).')>#s', function($match) {
global $open;
array_pop($open);
return trim($match[1]).'</'.$match[2].'>';
}, $input);
}
}
while ($open) { //Handle unclosed tags
$input .= '</'.end($open).'>';
array_pop($open);
}
return $input;
}
The problem is that after that, there is no way to write literally <i&lgt;</i>, because it will be automatically parsed into either <i></i> (if you write <i></i>), or &lt;i&gt;&lt;/i&gt; (if you write <i></i>). I want the user to be able to enter < (or any other HTML entity) and get < back. If I just send it straight to the browser unparsed, it would (obviously) be vulnerable to whatever sorcery the hackers are trying (and I'm letting) to (be) put on my site. So, How can I let the user use any of the pre-defined set of HTML tags, while still letting them use html entities?
This is what I eventually used:
function html($input) {
$input = preg_replace(["#&([^A-z])#","#<([^A-z/])#","#&$#","#<$#"], ['&$1','<$1','&','<'], $input); //Fix single "<"s and "&"s
$open = []; //Array of open tags
$close = false; //Is the current tag a close tag?
for ($i = 0; $i <= strlen($input); $i++) { //Start the loop
if ($tag) { //Are we in a tag?
if (preg_match("/[^a-z]/", $input[$i])) { //The tag has ended
if ($close) {
$close = false;
$sPos = strrpos(substr($input,0,$i), '<') + 2; //start position of tag
$tag = substr($input,$sPos,$i-$sPos); //tag name
if (end($open) == $tag) {
array_pop($open); //Good, it's a valid XML closing
} else {
$input = substr($input, 0, $sPos-2) . '</' . $tag . substr($input, $i); //BAD! Convert tag to text (open tag will be handled later)
}
} else {
$sPos = strrpos(substr($input,0,$i), '<') + 1; //start position of tag
$tag = substr($input,$sPos,$i-$sPos); //tag name
if (in_array($tag, ['em','i','del','sub','sup','sml','code','kbd','pre','codebl','bl','sbl'])) { //Is it an acceptable tag?
array_push($open, $tag); //Add it to the array
$j = $i + 1;
while (preg_match("/\s/", $input[$j])) { //Get rid of whitespace
$j++;
}
$input = substr($input, 0, $sPos - 1) . '<' . $tag . '>' . substr($input, $j); //Seems legit
} else {
$input = substr($input, 0, $sPos - 1) . '<' . $tag . substr($input, $i); //BAD! Convert tag to text
}
}
$tag = false;
}
} else if (!in_array('code', $open) && !in_array('codebl', $open) && !in_array('pre', $open)) { //Standard parsing of text
if ($input[$i] == '<') { //Is it a tag?
$tag = true;
if ($input[$i+1] == '/') { //Is it a close tag?
$i++;
$close = true;
}
} else if (substr($input, $i, 4) == 'http') { //Link
if (preg_match('#^.{'.$i.'}(https?):\/\/([^\s"\(\)<>]+)#', $input, $m)) {
$insert = ''.$m[2].'';
$input = substr($input, 0, $i) . $insert . substr($input, $i + strlen($m[1].'://'.$m[2]));
$i += strlen($insert);
}
} else if ($input[$i] == "\n" && $input[$i+1] == "\n") { //Insert <bl> tag? (I use this to separate sections of text)
$input = substr($input, 0, $i + 1) . '</bl><bl>' . substr($input, $i + 1);
}
} else { // We're in a code tag
if (substr($input, $i+1, strlen(end($open)) + 3) == '</'.current($open).'>') {
array_pop($open);
$i += 2;
} elseif ($input[$i] == '<') {
$input = substr($input, 0, $i) . '<' . substr($input, $i + 1);
$i += 3; //Code tags have raw text
} elseif (in_array('code', $open) && $input[$i] == "\n") { //No linebreaks are allowed in inline tags, convert to <codebl>
$open[count($open) - 1] = 'codebl';
$input = substr($input, 0, strrpos($input,'<code>')) . '<codebl>' . substr($input, strrpos($input,'<code>') + 6, strpos(substr($input, strrpos($input,'<code>')),'</code>') - 6) . '</codebl>' . substr($input, strpos(substr($input, strrpos($input,'<code>')),'</code>') + strrpos($input,'<code>') + 7);
$i += 4;
}
}
}
while ($open) { //Handle open tags
$input .= '</'.end($open).'>';
array_pop($open);
}
return '<bl>'.$input.'</bl>';
}
I know it's a bit more risky, but you can first assume the input's good, then filter out the stuff explicitly found as bad.
Related
I am trying to generate html from a given string pattern, similar to a plugin.
There are three patterns, a no arg, a single arg and a multi arg string pattern. I can't change this pattern since it's from a CMS.
{pluginName} or {pluginName=3} or {pluginName id=3|view=simple|arg999=asv}
An example:
<p>Hi this is a html page</p>
<p>The following line should generate html</p>
{pluginName=3}
<p>The following line also should generate html</p>
{pluginName id=3|view=simple|arg999=asv}
My goal is to replace those "tags" with something (it's not relavant for this question the processing per say). However I want to be able to pass the args given to a class/function that should handle that logic.
This is my first attempt, without using regexes since I don't know how I could approach this problem with them (and mainly because they are slower).
<?php
function processPlugins($text, $pos = 0, $start = '{', $end = '}') {
$plugins = array('plugin1', 'plugin2');
while(($pos = strpos($text, $start, $pos)) !== false) {
$startPos = $pos;
$pos += strlen($start);
foreach($plugins as $plugin) {
if(substr($text, $pos, strlen($plugin)) === $plugin
&& ($endPos = strpos($text, $end, $pos + strlen($plugin))) !== false) {
$char = substr($text, $pos + strlen($plugin), 1); // 1 is strlen of (= or ' ')
$pos += strlen($plugin) + 1; // 1 is strlen of (= or ' ')
$argString = substr($text, $pos, $endPos - $pos);
if($char === ' ') { //Multi arg
$params = explode('|', trim($argString));
$paramDict = array();
foreach ($params as $param) {
list($k, $v) = array_pad(explode('=', $param), 2, null);
$paramDict[$k] = $v;
}
//$output = $plugin->processDictionary($paramDict);
var_dump($paramDict);
} elseif ($char === '=') { //One arg
//$output = $plugin->processArg($argString);
echo $argString . "\n";
} elseif ($char === $end) { //No arg
//$output = $plugin->processNoArg();
echo $plugin. "\n";
}
$pos = $endPos + strlen($end);
break;
}
}
}
}
processPlugins('{plugin1}');
processPlugins('{plugin2=3}');
processPlugins('{plugin2 arg1=b|arg2=d}');
The previous code works in a PHP sandbox.
This code seems to work (for now) but it seems sketchy. Would you approach this problem differently? Could I refactor this code somehow?
If you opt for string manipulation functions over regex, why not use explode for stripping the input down to the significant part?
Here is an alternative implementation:
function processPlugins($text, $pos = 0, $start = '{', $end = '}') {
$t = substr($text, $pos);
if($pos > 0) {
echo "$pos chracters removed from the begining: $t" . PHP_EOL;
} else {
echo "Starting with '$t'" . PHP_EOL;
}
$parts = explode($start, $t);
$t = $parts[1];
$parts = explode($end, $t);
$t = $parts[0];
echo "The part between curly braces: '$t'" . PHP_EOL;
$t = str_replace(['plugin1', 'plugin2'], '', $t);
echo "After plugin name has been removed: '$t'" . PHP_EOL;
$n = strlen($t);
if(!$n) {
echo "Processing complete: " . trim($parts[0]) . PHP_EOL . PHP_EOL;
return;
}
$params = explode('|', $t);
echo 'Key-Values: ' . json_encode($params) . PHP_EOL;
$kv = [];
foreach($params as $p) {
list($k, $v) = explode('=', trim($p));
echo " Item: '$p', Key: '$k', Value: '$v'" . PHP_EOL;
if($k === '') {
echo "Processing complete: $v" . PHP_EOL . PHP_EOL;
return;
}
$kv[$k] = $v;
}
echo "Processing complete: " . json_encode($kv) . PHP_EOL . PHP_EOL;
}
echo '<pre>';
processPlugins('{plugin1}');
processPlugins('{plugin2=3}');
processPlugins('{plugin2 arg1=b|arg2=d}');
Of course the echo lines could be thrown away. With them in place we get this output:
Starting with '{plugin1}'
The part between curly braces: 'plugin1'
After plugin name has been removed: ''
Processing complete: plugin1
Starting with '{plugin2=3}'
The part between curly braces: 'plugin2=3'
After plugin name has been removed: '=3'
Key-Values: ["=3"]
Item: '=3', Key: '', Value: '3'
Processing complete: 3
Starting with '{plugin2 arg1=b|arg2=d}'
The part between curly braces: 'plugin2 arg1=b|arg2=d'
After plugin name has been removed: 'arg1=b|arg2=d'
Key-Values: [" arg1=b","arg2=d"]
Item: ' arg1=b', Key: 'arg1', Value: 'b'
Item: 'arg2=d', Key: 'arg2', Value: 'd'
Processing complete: {"arg1":"b","arg2":"d"}
This version works with inputs having more than one plugin token.
function processPlugins($text, $pos = 0, $start = '{', $end = '}') {
$processed = [];
$t = substr($text, $pos);
$parts = explode($start, $t);
array_shift($parts);
foreach($parts as $part) {
$pparts = explode($end, $part);
$t = trim($pparts[0]);
$t = str_replace(['plugin1', 'plugin2'], '', $t);
$n = strlen($t);
if(!$n) {
$processed[] = trim($pparts[0]);
continue;
}
$params = explode('|', $t);
$kv = [];
foreach($params as $p) {
list($k, $v) = explode('=', trim($p));
if(trim($k) === '') {
$processed[] = trim($v);
continue 2;
}
$kv[trim($k)] = trim($v);
}
$processed[] = $kv;
}
return $processed;
}
function test($case) {
$p = processPlugins($case);
echo "$case => " . json_encode($p) . PHP_EOL;
}
$cases = [
'{plugin1}',
'{plugin2=3}',
'{plugin2 arg1=b|arg2=d}',
'text here {plugin1} and more{plugin2=55}here {plugin2 arg1=b|arg2=d} till the end'
];
foreach($cases as $case) {
test($case);
}
The output:
{plugin1} => ["plugin1"]
{plugin2=3} => ["3"]
{plugin2 arg1=b|arg2=d} => [{"arg1":"b","arg2":"d"}]
text here {plugin1} and more{plugin2=55}here {plugin2 arg1=b|arg2=d} till the end => ["plugin1","55",{"arg1":"b","arg2":"d"}]
I'm looking for a travel auto-link detection.
I'm trying to make a social media website and when my users post URLs I need it so like shows instead of just normal text.
Try Autologin for Laravel by dwightwatson, which provides you to generate URLs that will provide automatic login to your application and then redirect to the appropriate location
As far as I know, there's no equivalent in the Laravel's core for the auto_link() funtion helper from Code Igniter (assuming you are refering to the CI version).
Anyway, it's very simple to grab that code and use it in Laravel for a quick an dirty workaround. I just did casually looking for the same issue.
Put in your App directory a container class for your helpers (or any containter for the matter, it's just need to be discovered by the framework), in this case I put a UrlHelpers.php file. Then, inside of it put this two static functions grabbed for the CI version:
class UrlHelpers
{
static function auto_link($str, $type = 'both', $popup = FALSE)
{
// Find and replace any URLs.
if ($type !== 'email' && preg_match_all('#(\w*://|www\.)[^\s()<>;]+\w#i', $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER)) {
// Set our target HTML if using popup links.
$target = ($popup) ? ' target="_blank"' : '';
// We process the links in reverse order (last -> first) so that
// the returned string offsets from preg_match_all() are not
// moved as we add more HTML.
foreach (array_reverse($matches) as $match) {
// $match[0] is the matched string/link
// $match[1] is either a protocol prefix or 'www.'
//
// With PREG_OFFSET_CAPTURE, both of the above is an array,
// where the actual value is held in [0] and its offset at the [1] index.
$a = '<a href="' . (strpos($match[1][0], '/') ? '' : 'http://') . $match[0][0] . '"' . $target . '>' . $match[0][0] . '</a>';
$str = substr_replace($str, $a, $match[0][1], strlen($match[0][0]));
}
}
// Find and replace any emails.
if ($type !== 'url' && preg_match_all('#([\w\.\-\+]+#[a-z0-9\-]+\.[a-z0-9\-\.]+[^[:punct:]\s])#i', $str, $matches, PREG_OFFSET_CAPTURE)) {
foreach (array_reverse($matches[0]) as $match) {
if (filter_var($match[0], FILTER_VALIDATE_EMAIL) !== FALSE) {
$str = substr_replace($str, static::safe_mailto($match[0]), $match[1], strlen($match[0]));
}
}
}
return $str;
}
static function safe_mailto($email, $title = '', $attributes = '')
{
$title = (string)$title;
if ($title === '') {
$title = $email;
}
$x = str_split('<a href="mailto:', 1);
for ($i = 0, $l = strlen($email); $i < $l; $i++) {
$x[] = '|' . ord($email[$i]);
}
$x[] = '"';
if ($attributes !== '') {
if (is_array($attributes)) {
foreach ($attributes as $key => $val) {
$x[] = ' ' . $key . '="';
for ($i = 0, $l = strlen($val); $i < $l; $i++) {
$x[] = '|' . ord($val[$i]);
}
$x[] = '"';
}
} else {
for ($i = 0, $l = strlen($attributes); $i < $l; $i++) {
$x[] = $attributes[$i];
}
}
}
$x[] = '>';
$temp = array();
for ($i = 0, $l = strlen($title); $i < $l; $i++) {
$ordinal = ord($title[$i]);
if ($ordinal < 128) {
$x[] = '|' . $ordinal;
} else {
if (count($temp) === 0) {
$count = ($ordinal < 224) ? 2 : 3;
}
$temp[] = $ordinal;
if (count($temp) === $count) {
$number = ($count === 3)
? (($temp[0] % 16) * 4096) + (($temp[1] % 64) * 64) + ($temp[2] % 64)
: (($temp[0] % 32) * 64) + ($temp[1] % 64);
$x[] = '|' . $number;
$count = 1;
$temp = array();
}
}
}
$x[] = '<';
$x[] = '/';
$x[] = 'a';
$x[] = '>';
$x = array_reverse($x);
$output = "<script type=\"text/javascript\">\n"
. "\t//<![CDATA[\n"
. "\tvar l=new Array();\n";
for ($i = 0, $c = count($x); $i < $c; $i++) {
$output .= "\tl[" . $i . "] = '" . $x[$i] . "';\n";
}
$output .= "\n\tfor (var i = l.length-1; i >= 0; i=i-1) {\n"
. "\t\tif (l[i].substring(0, 1) === '|') document.write(\"&#\"+unescape(l[i].substring(1))+\";\");\n"
. "\t\telse document.write(unescape(l[i]));\n"
. "\t}\n"
. "\t//]]>\n"
. '</script>';
return $output;
}
}
The function safe_mailto is used in case there are email links in your string. If you don't need it you are free to modify the code.
Then you could use the helper class like this in any part of your Laravel code as usually (here inside a blade template, but the principle is the same):
<p>{!! \App\Helpers\Helpers::auto_link($string) !!}</p>
Quick and dirty, and It works. Hope to have helped. ¡Good luck!
I have various HTML strings to cut to 100 characters (of the stripped content, not the original) without stripping tags and without breaking HTML.
Original HTML string (288 characters):
$content = "<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the air
<span>everywhere</span>, it's a HTML taggy kind of day.</strong></div>";
Standard trim: Trim to 100 characters and HTML breaks, stripped content comes to ~40 characters:
$content = substr($content, 0, 100)."..."; /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove... */
Stripped HTML: Outputs correct character count but obviously looses formatting:
$content = substr(strip_tags($content)), 0, 100)."..."; /* output:
With a span over here and a nested div over there and a lot of other nested
texts and tags in the ai... */
Partial solution: using HTML Tidy or purifier to close off tags outputs clean HTML but 100 characters of HTML not displayed content.
$content = substr($content, 0, 100)."...";
$tidy = new tidy; $tidy->parseString($content); $tidy->cleanRepair(); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove</div></div>... */
Challenge: To output clean HTML and n characters (excluding character count of HTML elements):
$content = cutHTML($content, 100); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the
ai</strong></div>...";
Similar Questions
How to clip HTML fragments without breaking up tags
Cutting HTML strings without breaking HTML tags
Not amazing, but works.
function html_cut($text, $max_length)
{
$tags = array();
$result = "";
$is_open = false;
$grab_open = false;
$is_close = false;
$in_double_quotes = false;
$in_single_quotes = false;
$tag = "";
$i = 0;
$stripped = 0;
$stripped_text = strip_tags($text);
while ($i < strlen($text) && $stripped < strlen($stripped_text) && $stripped < $max_length)
{
$symbol = $text{$i};
$result .= $symbol;
switch ($symbol)
{
case '<':
$is_open = true;
$grab_open = true;
break;
case '"':
if ($in_double_quotes)
$in_double_quotes = false;
else
$in_double_quotes = true;
break;
case "'":
if ($in_single_quotes)
$in_single_quotes = false;
else
$in_single_quotes = true;
break;
case '/':
if ($is_open && !$in_double_quotes && !$in_single_quotes)
{
$is_close = true;
$is_open = false;
$grab_open = false;
}
break;
case ' ':
if ($is_open)
$grab_open = false;
else
$stripped++;
break;
case '>':
if ($is_open)
{
$is_open = false;
$grab_open = false;
array_push($tags, $tag);
$tag = "";
}
else if ($is_close)
{
$is_close = false;
array_pop($tags);
$tag = "";
}
break;
default:
if ($grab_open || $is_close)
$tag .= $symbol;
if (!$is_open && !$is_close)
$stripped++;
}
$i++;
}
while ($tags)
$result .= "</".array_pop($tags).">";
return $result;
}
Usage example:
$content = html_cut($content, 100);
I'm not claiming to have invented this, but there is a very complete Text::truncate() method in CakePHP which does what you want:
function truncate($text, $length = 100, $ending = '...', $exact = true, $considerHtml = false) {
if (is_array($ending)) {
extract($ending);
}
if ($considerHtml) {
if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
return $text;
}
$totalLength = mb_strlen($ending);
$openTags = array();
$truncate = '';
preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
foreach ($tags as $tag) {
if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
array_unshift($openTags, $tag[2]);
} else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
$pos = array_search($closeTag[1], $openTags);
if ($pos !== false) {
array_splice($openTags, $pos, 1);
}
}
}
$truncate .= $tag[1];
$contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
if ($contentLength + $totalLength > $length) {
$left = $length - $totalLength;
$entitiesLength = 0;
if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
foreach ($entities[0] as $entity) {
if ($entity[1] + 1 - $entitiesLength <= $left) {
$left--;
$entitiesLength += mb_strlen($entity[0]);
} else {
break;
}
}
}
$truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
break;
} else {
$truncate .= $tag[3];
$totalLength += $contentLength;
}
if ($totalLength >= $length) {
break;
}
}
} else {
if (mb_strlen($text) <= $length) {
return $text;
} else {
$truncate = mb_substr($text, 0, $length - strlen($ending));
}
}
if (!$exact) {
$spacepos = mb_strrpos($truncate, ' ');
if (isset($spacepos)) {
if ($considerHtml) {
$bits = mb_substr($truncate, $spacepos);
preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
if (!empty($droppedTags)) {
foreach ($droppedTags as $closingTag) {
if (!in_array($closingTag[1], $openTags)) {
array_unshift($openTags, $closingTag[1]);
}
}
}
}
$truncate = mb_substr($truncate, 0, $spacepos);
}
}
$truncate .= $ending;
if ($considerHtml) {
foreach ($openTags as $tag) {
$truncate .= '</'.$tag.'>';
}
}
return $truncate;
}
Use PHP's DOMDocument class to normalize an HTML fragment:
$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo($dom->saveXml($body->item(0)));
This question is similar to an earlier question and I've copied and pasted one solution here. If the HTML is submitted by users you'll also need to filter out potential Javascript attack vectors like onmouseover="do_something_evil()" or .... Tools like HTML Purifier were designed to catch and solve these problems and are far more comprehensive than any code that I could post.
I made another function to do it, it supports UTF-8:
/**
* Limit string without break html tags.
* Supports UTF8
*
* #param string $value
* #param int $limit Default 100
*/
function str_limit_html($value, $limit = 100)
{
if (mb_strwidth($value, 'UTF-8') <= $limit) {
return $value;
}
// Strip text with HTML tags, sum html len tags too.
// Is there another way to do it?
do {
$len = mb_strwidth($value, 'UTF-8');
$len_stripped = mb_strwidth(strip_tags($value), 'UTF-8');
$len_tags = $len - $len_stripped;
$value = mb_strimwidth($value, 0, $limit + $len_tags, '', 'UTF-8');
} while ($len_stripped > $limit);
// Load as HTML ignoring errors
$dom = new DOMDocument();
#$dom->loadHTML('<?xml encoding="utf-8" ?>'.$value, LIBXML_HTML_NODEFDTD);
// Fix the html errors
$value = $dom->saveHtml($dom->getElementsByTagName('body')->item(0));
// Remove body tag
$value = mb_strimwidth($value, 6, mb_strwidth($value, 'UTF-8') - 13, '', 'UTF-8'); // <body> and </body>
// Remove empty tags
return preg_replace('/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/', '', $value);
}
SEE DEMO.
I recommend use html_entity_decode at the start of function, so it preserves the UTF-8 characters:
$value = html_entity_decode($value);
Use a HTML parser and stop after 100 characters of text.
You should use Tidy HTML. You cut the string and then you run Tidy to close the tags.
(Credits where credits are due)
Regardless of the 100 count issues you state at the beginning, you indicate in the challenge the following:
output the character count of
strip_tags (the number of characters
in the actual displayed text of the
HTML)
retain HTML formatting close
any unfinished HTML tag
Here is my proposal:
Bascially, I parse through each character counting as I go. I make sure NOT to count any characters in any HTML tag. I also check at the end to make sure I am not in the middle of a word when I stop. Once I stop, I back track to the first available SPACE or > as a stopping point.
$position = 0;
$length = strlen($content)-1;
// process the content putting each 100 character section into an array
while($position < $length)
{
$next_position = get_position($content, $position, 100);
$data[] = substr($content, $position, $next_position);
$position = $next_position;
}
// show the array
print_r($data);
function get_position($content, $position, $chars = 100)
{
$count = 0;
// count to 100 characters skipping over all of the HTML
while($count <> $chars){
$char = substr($content, $position, 1);
if($char == '<'){
do{
$position++;
$char = substr($content, $position, 1);
} while($char !== '>');
$position++;
$char = substr($content, $position, 1);
}
$count++;
$position++;
}
echo $count."\n";
// find out where there is a logical break before 100 characters
$data = substr($content, 0, $position);
$space = strrpos($data, " ");
$tag = strrpos($data, ">");
// return the position of the logical break
if($space > $tag)
{
return $space;
} else {
return $tag;
}
}
This will also count the return codes etc. Considering they will take space, I have not removed them.
Here is a function I'm using in one of my projects. It's based on DOMDocument, works with HTML5 and is about 2x faster than other solutions I've tried (at least on my machine, 0.22 ms vs 0.43 ms using html_cut($text, $max_length) from the top answer on a 500 text-node-characters string with a limit of 400).
function cut_html ($html, $limit) {
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding("<div>{$html}</div>", "HTML-ENTITIES", "UTF-8"), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
cut_html_recursive($dom->documentElement, $limit);
return substr($dom->saveHTML($dom->documentElement), 5, -6);
}
function cut_html_recursive ($element, $limit) {
if($limit > 0) {
if($element->nodeType == 3) {
$limit -= strlen($element->nodeValue);
if($limit < 0) {
$element->nodeValue = substr($element->nodeValue, 0, strlen($element->nodeValue) + $limit);
}
}
else {
for($i = 0; $i < $element->childNodes->length; $i++) {
if($limit > 0) {
$limit = cut_html_recursive($element->childNodes->item($i), $limit);
}
else {
$element->removeChild($element->childNodes->item($i));
$i--;
}
}
}
}
return $limit;
}
Here is my try at the cutter. Maybe you guys can catch some bugs. The problem, i found with the other parsers, is that they don't close tags properly and they cut in the middle of a word (blah)
function cutHTML($string, $length, $patternsReplace = false) {
$i = 0;
$count = 0;
$isParagraphCut = false;
$htmlOpen = false;
$openTag = false;
$tagsStack = array();
while ($i < strlen($string)) {
$char = substr($string, $i, 1);
if ($count >= $length) {
$isParagraphCut = true;
break;
}
if ($htmlOpen) {
if ($char === ">") {
$htmlOpen = false;
}
} else {
if ($char === "<") {
$j = $i;
$char = substr($string, $j, 1);
while ($j < strlen($string)) {
if($char === '/'){
$i++;
break;
}
elseif ($char === ' ') {
$tagsStack[] = substr($string, $i, $j);
}
$j++;
}
$htmlOpen = true;
}
}
if (!$htmlOpen && $char != ">") {
$count++;
}
$i++;
}
if ($isParagraphCut) {
$j = $i;
while ($j > 0) {
$char = substr($string, $j, 1);
if ($char === " " || $char === ";" || $char === "." || $char === "," || $char === "<" || $char === "(" || $char === "[") {
break;
} else if ($char === ">") {
$j++;
break;
}
$j--;
}
$string = substr($string, 0, $j);
foreach($tagsStack as $tag){
$tag = strtolower($tag);
if($tag !== "img" && $tag !== "br"){
$string .= "</$tag>";
}
}
$string .= "...";
}
if ($patternsReplace) {
foreach ($patternsReplace as $value) {
if (isset($value['pattern']) && isset($value["replace"])) {
$string = preg_replace($value["pattern"], $value["replace"], $string);
}
}
}
return $string;
}
try this function
// trim the string function
function trim_word($text, $length, $startPoint=0, $allowedTags=""){
$text = html_entity_decode(htmlspecialchars_decode($text));
$text = strip_tags($text, $allowedTags);
return $text = substr($text, $startPoint, $length);
}
and
echo trim_word("<h2 class='zzzz'>abcasdsdasasdas</h2>","6");
I know this is quite old, but I've recently made a small class for cutting HTML for previews: https://github.com/Simbiat/HTMLCut/
Why would you want to use that instead of the other suggestions? Here are a few things that come to my mind (taken from readme):
Preserve HTML tags, unless they are empty.
Preserve words.
Remove some orphaned punctuation signs at the end of the cut string.
Remove HTML tags, that you would not want in a preview (optional).
Limit number of paragraphs (optional).
Add an ellipsis if text was cut (optional).
Class operates with DOM, but also uses Regex in some places (mainly for cutting and trimming). Perhaps it can be of use to some.
Try the following:
<?php echo strip_tags(mb_strimwidth($VARIABLE_HERE, 0, 160, "...")); ?>
This will strip the HTML (strip_tags) amd limit characters (mb_strimwidth) to 160 characters
I have been trying to improve this script in PHP so it can give me the value of a to 9999999999.....9999999999 (up to 72 characters) to insert in MySQL. So far it stops at 999. I have increased Apache's memory and the script exuction time but it still stays the same. Here is my script:
<?php
function evol($length = 1, $deb_chaine = '') {
$tab=array("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z","0","1","2","3","4","5","6","7","8","9");
$str = '';
if(strlen($deb_chaine) <= ($length - 1)) {
foreach($tab as $lettre) {
$str .= ' '. $deb_chaine . $lettre;
}
if($deb_chaine == '') {
$str .= evol($length, 'a');
}
else { // sinon
$last = substr($deb_chaine, -1);
$reste = substr($deb_chaine, 0, -1);
if($last == "9") {
$i = strlen($deb_chaine) - 1;
$reste = "";
while($i >= 0) {
if($deb_chaine[$i] == "9") {
$reste = 'a'. $reste;
}
else {
$reste = $tab[(array_search($deb_chaine[$i], $tab) + 1)] . substr($reste, 0, -1);
break 1;
}
$i--;
}
$new = 'a';
}
else {
$new = $tab[(array_search($last, $tab) + 1)];
}
$str .= evol($length, ($reste . $new));
}
}
return $str;
}
echo evol(72);
?>
This code sets the value of a to 999.
Is there a way to do this without writing my own function?
For example:
$text = 'Test <span><a>something</a> something else</span>.';
$text = cutText($text, 2, null, 20, true);
//result: Test <span><a>something</a></span>
I need to make this function indestructible
My problem is similar to
This thread
but I need a better solution. I would like to keep nested tags untouched.
So far my algorithm is:
function cutText($content, $max_words, $max_chars, $max_word_len, $html = false) {
$len = strlen($content);
$res = '';
$word_count = 0;
$word_started = false;
$current_word = '';
$current_word_len = 0;
if ($max_chars == null) {
$max_chars = $len;
}
$inHtml = false;
$openedTags = array();
for ($i = 0; $i<$max_chars;$i++) {
if ($content[$i] == '<' && $html) {
$inHtml = true;
}
if ($inHtml) {
$max_chars++;
}
if ($html && !$inHtml) {
if ($content[$i] != ' ' && !$word_started) {
$word_started = true;
$word_count++;
}
$current_word .= $content[$i];
$current_word_len++;
if ($current_word_len == $max_word_len) {
$current_word .= '- ';
}
if (($content[$i] == ' ') && $word_started) {
$word_started = false;
$res .= $current_word;
$current_word = '';
$current_word_len = 0;
if ($word_count == $max_words) {
return $res;
}
}
}
if ($content[$i] == '<' && $html) {
$inHtml = true;
}
}
return $res;
}
But of course it won't work. I thought about remembering opened tags and closing them if they were not closed but maybe there is a better way?
This works perfectly for me:
function trimContent ($str, $trimAtIndex) {
$beginTags = array();
$endTags = array();
for($i = 0; $i < strlen($str); $i++) {
if( $str[$i] == '<' )
$beginTags[] = $i;
else if($str[$i] == '>')
$endTags[] = $i;
}
foreach($beginTags as $k=>$index) {
// Trying to trim in between tags. Trim after the last tag
if( ( $trimAtIndex >= $index ) && ($trimAtIndex <= $endTags[$k]) ) {
$trimAtIndex = $endTags[$k];
}
}
return substr($str, 0, $trimAtIndex);
}
Try something like this
function cutText($inputText, $start, $length) {
$temp = $inputText;
$res = array();
while (strpos($temp, '>')) {
$ts = strpos($temp, '<');
$te = strpos($temp, '>');
if ($ts > 0) $res[] = substr($temp, 0, $ts);
$res[] = substr($temp, $ts, $te - $ts + 1);
$temp = substr($temp, $te + 1, strlen($temp) - $te);
}
if ($temp != '') $res[] = $temp;
$pointer = 0;
$end = $start + $length - 1;
foreach ($res as &$part) {
if (substr($part, 0, 1) != '<') {
$l = strlen($part);
$p1 = $pointer;
$p2 = $pointer + $l - 1;
$partx = "";
if ($start <= $p1 && $end >= $p2) $partx = "";
else {
if ($start > $p1 && $start <= $p2) $partx .= substr($part, 0, $start-$pointer);
if ($end >= $p1 && $end < $p2) $partx .= substr($part, $end-$pointer+1, $l-$end+$pointer);
if ($partx == "") $partx = $part;
}
$part = $partx;
$pointer += $l;
}
}
return join('', $res);
}
Parameters:
$inputText - input text
$start - position of first character
$length - how menu characters we want to remove
Example #1 - Removing first 3 characters
$text = 'Test <span><a>something</a> something else</span>.';
$text = cutText($text, 0, 3);
var_dump($text);
Output (removed "Tes")
string(47) "t <span><a>something</a> something else</span>."
Removing first 10 characters
$text = cutText($text, 0, 10);
Output (removed "Test somet")
string(40) "<span><a>hing</a> something else</span>."
Example 2 - Removing inner characters - "es" from "Test "
$text = cutText($text, 1, 2);
Output
string(48) "Tt <span><a>something</a> something else</span>."
Removing "thing something el"
$text = cutText($text, 9, 18);
Output
string(32) "Test <span><a>some</a>se</span>."
Hope this helps.
Well, maybe this is not the best solution but it's everything I can do at the moment.
Ok I solved this thing.
I divided this in 2 parts.
First cutting text without destroying html:
function cutHtml($content, $max_words, $max_chars, $max_word_len) {
$len = strlen($content);
$res = '';
$word_count = 0;
$word_started = false;
$current_word = '';
$current_word_len = 0;
if ($max_chars == null) {
$max_chars = $len;
}
$inHtml = false;
$openedTags = array();
$i = 0;
while ($i < $max_chars) {
//skip any html tags
if ($content[$i] == '<') {
$inHtml = true;
while (true) {
$res .= $content[$i];
$i++;
while($content[$i] == ' ') { $res .= $content[$i]; $i++; }
//skip any values
if ($content[$i] == "'") {
$res .= $content[$i];
$i++;
while(!($content[$i] == "'" && $content[$i-1] != "\\")) {
$res .= $content[$i];
$i++;
}
}
//skip any values
if ($content[$i] == '"') {
$res .= $content[$i];
$i++;
while(!($content[$i] == '"' && $content[$i-1] != "\\")) {
$res .= $content[$i];
$i++;
}
}
if ($content[$i] == '>') { $res .= $content[$i]; $i++; break;}
}
$inHtml = false;
}
if (!$inHtml) {
while($content[$i] == ' ') { $res .= $content[$i]; $letter_count++; $i++; } //skip spaces
$word_started = false;
$current_word = '';
$current_word_len = 0;
while (!in_array($content[$i], array(' ', '<', '.', ','))) {
if (!$word_started) {
$word_started = true;
$word_count++;
}
$current_word .= $content[$i];
$current_word_len++;
if ($current_word_len == $max_word_len) {
$current_word .= '-';
$current_word_len = 0;
}
$i++;
}
if ($letter_count > $max_chars) {
return $res;
}
if ($word_count < $max_words) {
$res .= $current_word;
$letter_count += strlen($current_word);
}
if ($word_count == $max_words) {
$res .= $current_word;
$letter_count += strlen($current_word);
return $res;
}
}
}
return $res;
}
And next thing is closing unclosed tags:
function cleanTags(&$html) {
$count = strlen($html);
$i = -1;
$openedTags = array();
while(true) {
$i++;
if ($i >= $count) break;
if ($html[$i] == '<') {
$tag = '';
$closeTag = '';
$reading = false;
//reading whole tag
while($html[$i] != '>') {
$i++;
while($html[$i] == ' ') $i++; //skip any spaces (need to be idiot proof)
if (!$reading && $html[$i] == '/') { //closing tag
$i++;
while($html[$i] == ' ') $i++; //skip any spaces
$closeTag = '';
while($html[$i] != ' ' && $html[$i] != '>') { //start reading first actuall string
$reading = true;
$html[$i] = strtolower($html[$i]); //tags to lowercase
$closeTag .= $html[$i];
$i++;
}
$c = count($openedTags);
if ($c > 0 && $openedTags[$c-1] == $closeTag) array_pop($openedTags);
}
if (!$reading) //read only tag
while($html[$i] != ' ' && $html[$i] != '>') { //start reading first actuall string
$reading = true;
$html[$i] = strtolower($html[$i]); //tags to lowercase
$tag .= $html[$i];
$i++;
}
//skip any values
if ($html[$i] == "'") {
$i++;
while(!($html[$i] == "'" && $html[$i-1] != "\\")) {
$i++;
}
}
//skip any values
if ($html[$i] == '"') {
$i++;
while(!($html[$i] == '"' && $html[$i-1] != "\\")) {
$i++;
}
}
if ($reading && $html[$i] == '/') { //self closed tag
$tag = '';
break;
}
}
if (!empty($tag)) $openedTags[] = $tag;
}
}
while (count($openedTags) > 0) {
$tag = array_pop($openedTags);
$html .= "</$tag>";
}
}
It's not idiot proof but tinymce will clear this thing out so further cleaning is not necessary.
It may be a little long but i don't think it will eat a lot of resources and it should be faster than regex.