I'm trying to have a feature that acts like Facebook's show more behaviour.
I want it to trim the string if:
its length is more than 200 characters.
there are more than 5 /n occurrences.
It sounds simple and I already have an initial function (that does it only by length, I haven't implemented the /n occurrences yet):
function contentShowMore($string, $max_length) {
if(mb_strlen($string, 'utf-8') <= $max_length) {
return $string; // return the original string if haven't passed $max_length
} else {
$teaser = mb_substr($string, 0, $max_length); // trim to max length
$dots = '<span class="show-more-dots"> ...</span>'; // add dots
$show_more_content = mb_substr($string, $max_length); // get the hidden content
$show_more_wrapper = '<span class="show-more-content">'.$show_more_content.'</span>'; // wrap it
return $teaser.$dots.$show_more_wrapper; // connect all together for usage on HTML.
}
}
The problem is that the string might include URLs, so it breaks them. I need to find a way to make a functional show-more button that checks length, newlines and won't cut URLs.
Thank you!
Example:
input: contentShowMore("hello there http://google.com/ good day!", 20).
output:
hello there http://g
<span class="show-more-dots"> ...</span>
<span class="show-more-content">oogle.com/ good day!</span>
the output i want:
hello there http://google.com/
<span class="show-more-dots"> ...</span>
<span class="show-more-content"> good day!</span>
found a solution!
function contentShowMore($string, $max_length, $max_newlines) {
$trim_str = trim($string);
if(mb_strlen($trim_str, 'utf-8') <= $max_length && substr_count($trim_str, "\n") < $max_newlines) { // return the original if short, or less than X newlines
return $trim_str;
} else {
$teaser = mb_substr($trim_str, 0, $max_length); // text to show
$show_more_content = mb_substr($trim_str, $max_length);
// the read more might have cut a string (or worse - an URL) in the middle of it.
// so we will take all the rest of the string before the next whitespace and will add it back to the teaser.
$content_parts = explode(' ', $show_more_content, 2); // [0] - before first space, [1] - after first space
$teaser .= $content_parts[0];
if(isset($content_parts[1])) { // if there are still leftover strings, its on show more! :)
$show_more_content = $content_parts[1];
}
// NOW WERE CHEKING MAX NEWLINES.
$teaser_parts = explode("\n", $teaser); // break to array.
$teaser = implode("\n", array_slice($teaser_parts, 0, $max_newlines)); // take the first $max_newlines lines and use them as teaser.
$show_more_content = implode("\n", array_slice($teaser_parts, $max_newlines)) . ' ' . $show_more_content; // connect the rest to the hidden content.
if(mb_strlen($show_more_content, "UTF-8") === 0) {
return $trim_str; // nothing to hide - return original.
} else {
$show_more_wrapper = '<span class="show-more-content">'.$show_more_content.'</span>';
$dots = '<span class="show-more-dots"> ...</span>'; // dots will be visible between the teaser and the hidden.
$button = ' <span class="show-more">Show more</span>';
return $teaser.$dots.$button.$show_more_wrapper; // connect ingredients
}
}
}
I am using the following code to place some ad code inside my content .
<?php
$content = apply_filters('the_content', $post->post_content);
$content = explode (' ', $content);
$halfway_mark = ceil(count($content) / 2);
$first_half_content = implode(' ', array_slice($content, 0, $halfway_mark));
$second_half_content = implode(' ', array_slice($content, $halfway_mark));
echo $first_half_content.'...';
echo ' YOUR ADS CODE';
echo $second_half_content;
?>
How can i modify this so that the 2 paragraphs (top and bottom) enclosing the ad code should not be the one having images. If the top or bottom paragraph has image then try for next 2 paragraphs.
Example: Correct Implementation on the right.
preg_replace version
This code steps through every paragraph ignoring those that contain image tags. The $pcount variable is incremented for every paragraph found without an image, if an image is encountered however, $pcount is reset to zero. Once $pcount reaches the point where it would hit two, the advert markup is inserted just before that paragraph. This should leave the advert markup between two safe paragraphs. The advert markup variable is then nullified so only one advert is inserted.
The following code is just for set up and could be modified to split the content differently, you could also modify the regular expression that is used — just in case you are using double BRs or something else to delimit your paragraphs.
/// set our advert content
$advert = '<marquee>BUY THIS STUFF!!</marquee>' . "\n\n";
/// calculate mid point
$mpoint = floor(strlen($content) / 2);
/// modify back to the start of a paragraph
$mpoint = strripos($content, '<p', -$mpoint);
/// split html so we only work on second half
$first = substr($content, 0, $mpoint);
$second = substr($content, $mpoint);
$pcount = 0;
$regexp = '/<p>.+?<\/p>/si';
The rest is the bulk of the code that runs the replacement. This could be modified to insert more than one advert, or to support more involved image checking.
$content = $first . preg_replace_callback($regexp, function($matches){
global $pcount, $advert;
if ( !$advert ) {
$return = $matches[0];
}
else if ( stripos($matches[0], '<img ') !== FALSE ) {
$return = $matches[0];
$pcount = 0;
}
else if ( $pcount === 1 ) {
$return = $advert . $matches[0];
$advert = '';
}
else {
$return = $matches[0];
$pcount++;
}
return $return;
}, $second);
After this code has been executed the $content variable will contain the enhanced HTML.
PHP versions prior to 5.3
As your chosen testing area does not support PHP 5.3, and so does not support anonymous functions, you need to use a slightly modified and less succinct version; that makes use of a named function instead.
Also, in order to support content that may not actually leave space for the advert in it's second half I have modified the $mpoint so that it is calculated to be 80% from the end. This will have the effect of including more in the $second part — but will also mean your adverts will be generally placed higher up in the mark-up. This code has not had any fallback implemented into it, because your question does not mention what should happen in the event of failure.
$advert = '<marquee>BUY THIS STUFF!!</marquee>' . "\n\n";
$mpoint = floor(strlen($content) * 0.8);
$mpoint = strripos($content, '<p', -$mpoint);
$first = substr($content, 0, $mpoint);
$second = substr($content, $mpoint);
$pcount = 0;
$regexp = '/<p>.+?<\/p>/si';
function replacement_callback($matches){
global $pcount, $advert;
if ( !$advert ) {
$return = $matches[0];
}
else if ( stripos($matches[0], '<img ') !== FALSE ) {
$return = $matches[0];
$pcount = 0;
}
else if ( $pcount === 1 ) {
$return = $advert . $matches[0];
$advert = '';
}
else {
$return = $matches[0];
$pcount++;
}
return $return;
}
echo $first . preg_replace_callback($regexp, 'replacement_callback', $second);
You could try this:
<?php
$ad_code = 'SOME SCRIPT HERE';
// Your code.
$content = apply_filters('the_content', $post->post_content);
// Split the content at the <p> tags.
$content = explode ('<p>', $content);
// Find the mid of the article.
$content_length = count($content);
$content_mid = floor($content_length / 2);
// Save no image p's index.
$last_no_image_p_index = NULL;
// Loop beginning from the mid of the article to search for images.
for ($i = $content_mid; $i < $content_length; $i++) {
// If we do not find an image, let it go down.
if (stripos($content[$i], '<img') === FALSE) {
// In case we already have a last no image p, we check
// if it was the one right before this one, so we have
// two p tags with no images in there.
if ($last_no_image_p_index === ($i - 1)) {
// We break here.
break;
}
else {
$last_no_image_p_index = $i;
}
}
}
// If no none image p tag was found, we use the last one.
if (is_null($last_no_image_p_index)) {
$last_no_image_p_index = ($content_length - 1);
}
// Add ad code here with trailing <p>, so the implode later will work correctly.
$content = array_slice($content, $last_no_image_p_index, 0, $ad_code . '</p>');
$content = implode('<p>', $content);
?>
It will try to find a place for the ad from the mid of your article and if none is found the ad is put to the end.
Regards
func0der
I think this will work:
First explode the paragraphs, then you have to loop it and check if you find img inside them.
If you find it inside, you try the next.
Think of this as psuedo-code, since it's not tested. You will have to make a loop too, comments in the code :) Sorry if it contains bugs, it's written in Notepad.
<?php
$i = 0; // counter
$arrBoolImg = array(); // array for the paragraph booleans
$content = apply_filters('the_content', $post->post_content);
$contents = str_replace ('<p>', '<explode><p>', $content); // here we add a custom tag, so we can explode
$contents = explode ('<explode>', $contents); // then explode it, so we can iterate the paragraphs
// fill array with boolean array returned
$arrBoolImg = hasImages($contents);
$halfway_mark = ceil(count($contents) / 2);
/*
TODO (by you):
---
When you have $arrBoolImg filled, you can itarate through it.
You then simply loop from the middle of the array $contents (plural), that is exploded from above.
The startingpoing for your loop is the middle, the upper bounds is the +2 or what ever :-)
Then you simply insert your magic.. And then glue it back together, as you did before.
I think this will work. even though the code may have some bugs, since I wrote it in Notepad.
*/
function hasImages($contents) {
/*
This function loops through the $contents array and checks if they have images in them
The return value, is an array with boolean values, so one can iterate through it.
*/
$arrRet = array(); // array for the paragraph booleans
if (count($content)>=1) {
foreach ($contents as $v) { // iterate the content
if (strpos($v, '<img') === false) { // did not find img
$arrRet[$i] = false;
}
else { // found img
$arrRet[$i] = true;
}
$i++;
} // end for each loop
return $arrRet;
} // end if count
} // end hasImages func
?>
[This is just an idea, I don't have enough reputation to comment...]
After calling #Olavxxx's method and filling your boolean array you could just loop through that array in an alternating manner starting in the middle: Let's assume your array is 8 entries long. Calculating the middle using your method you get 4. So you check the combination of values 4 + 3, if that doesn't work, you check 4 + 5, after that 3 + 2, ...
So your loop looks somewhat like
$middle = ceil(count($content) / 2);
$i = 1;
while ($i <= $middle) {
$j = $middle + (-1) ^ $i * $i;
$k = $j + 1;
if (!$hasImagesArray[$j] && !$hasImagesArray[$k])
break; // position found
$i++;
}
It's up to you to implement further constraints to make sure the add is not shown to far up or down in the article...
Please note that you need to take care of special cases like too short arrays too in order to prevent IndexOutOfBounds-Exceptions.
I have a string that has php code in it, I need to remove the php code from the string, for example:
<?php $db1 = new ps_DB() ?><p>Dummy</p>
Should return <p>Dummy</p>
And a string with no php for example <p>Dummy</p> should return the same string.
I know this can be done with a regular expression, but after 4h I haven't found a solution.
<?php
function filter_html_tokens($a){
return is_array($a) && $a[0] == T_INLINE_HTML ?
$a[1]:
'';
}
$htmlphpstring = '<a>foo</a> something <?php $db1 = new ps_DB() ?><p>Dummy</p>';
echo implode('',array_map('filter_html_tokens',token_get_all($htmlphpstring)));
?>
As ircmaxell pointed out: this would require valid PHP!
A regex route would be (allowing for no 'php' with short tags. no ending ?> in the string / file (for some reason Zend recommends this?) and of course an UNgreedy & DOTALL pattern:
preg_replace('/<\\?.*(\\?>|$)/Us', '',$htmlphpstring);
Well, you can use DomDocument to do it...
function stripPHPFromHTML($html) {
$dom = new DomDocument();
$dom->loadHtml($html);
removeProcessingInstructions($dom);
$simple = simplexml_import_dom($d->getElementsByTagName('body')->item(0));
return $simple->children()->asXml();
}
function removeProcessingInstructions(DomNode &$node) {
foreach ($node->childNodes as $child) {
if ($child instanceof DOMProcessingInstruction) {
$node->removeChild($child);
} else {
removeProcessingInstructions($child);
}
}
}
Those two functions will turn
$str = '<?php echo "foo"; ?><b>Bar</b>';
$clean = stripPHPFromHTML($str);
$html = '<b>Bar</b>';
Edit: Actually, after looking at Wrikken's answer, I realized that both methods have a disadvantage... Mine requires somewhat valid HTML markup (Dom is decent, but it won't parse <b>foo</b><?php echo $bar). Wrikken's requires valid PHP (any syntax errors and it'll fail). So perhaps a combination of the two (try one first. If it fails, try the other. If both fail, there's really not much you can do without trying to figure out the exact reason they failed)...
A simple solution is to explode into arrays using the php tags to remove any content between and implode back to a string.
function strip_php($str) {
$newstr = '';
//split on opening tag
$parts = explode('<?',$str);
if(!empty($parts)) {
foreach($parts as $part) {
//split on closing tag
$partlings = explode('?>',$part);
if(!empty($partlings)) {
//remove content before closing tag
$partlings[0] = '';
}
//append to string
$newstr .= implode('',$partlings);
}
}
return $newstr;
}
This is slower than regex but doesn't require valid html or php; it only requires all php tags to be closed.
For files which don't always include a final closing tag and for general error checking you could count the tags and append a closing tag if it's missing or notify if the opening and closing tags don't add up as expected, e.g. add the code below at the start of the function. This would slow it down a bit more though :)
$tag_diff = (substr_count($str,'<?') - (substr_count($str,'?>');
//Append if there's one less closing tag
if($tag_diff == 1) $str .= '?>';
//Parse error if the tags don't add up
if($tag_diff < 0 || $tag_diff > 1) die('Error: Tag mismatch.
(Opening minus closing tags = '.$tag_diff.')<br><br>
Dumping content:<br><hr><br>'.htmlentities($str));
This is an enhanced version of strip_php suggested by #jon that is able to replace php part of code with another string:
/**
* Remove PHP code part from a string.
*
* #param string $str String to clean
* #param string $replacewith String to use as replacement
* #return string Result string without php code
*/
function dolStripPhpCode($str, $replacewith='')
{
$newstr = '';
//split on each opening tag
$parts = explode('<?php',$str);
if (!empty($parts))
{
$i=0;
foreach($parts as $part)
{
if ($i == 0) // The first part is never php code
{
$i++;
$newstr .= $part;
continue;
}
//split on closing tag
$partlings = explode('?>', $part);
if (!empty($partlings))
{
//remove content before closing tag
if (count($partlings) > 1) $partlings[0] = '';
//append to out string
$newstr .= $replacewith.implode('',$partlings);
}
}
}
return $newstr;
}
If you are using PHP, you just need to use a regular expression to replace anything that matches PHP code.
The following statement will remove the PHP tag:
preg_replace('/^<\?php.*\?\>/', '', '<?php $db1 = new ps_DB() ?><p>Dummy</p>');
If it doesn't find any match, it won't replace anything.
I've convinced my boss to do the typesetting stuff using PHP(PHP Version 5.2.8).
And this is what I got so far(set Character encoding to Unicode(UTF-8) if you see misrendered Japanese characters):
demo page at my personal website
Basically, if you copy and paste the latin sample paragraph into the textarea and click the button, everything works well, you can verify that by pasting the result into Notepad for a check(albeit the fact that I haven't done anything to use hyphens to denote words separated by new lines).
However, when it comes with non-latin/Asian characters, nothing got printed out. I didn't get any error message generated, just cannot see anything at all...
The following is my code:
<?php
$words = typesetWords($_POST['words']);
echo json_encode(array('feedback' => $words));
function typesetWords($words, $lineLength = 70)
{
try
{
$result = '';
$paragraphs = explode("\n\n", $words);
foreach($paragraphs as $paragraph)
{
$paragraph = str_replace("\n", "", $paragraph);
$length = strlen($paragraph);
$numberOfLines = intval($length / $lineLength);
$tmp = '';
if($numberOfLines > 0)
{
for($i = 0; $i < $numberOfLines; $i++)
$tmp .= substr($paragraph, $i * $lineLength, $lineLength)."\n";
$tmp .= substr($paragraph, -1 * ($length % $lineLength))."\n\n";
$result .= $tmp;
}
else $result .= $paragraph."\n\n";
}
}
catch(Exception $e)
{
return $e->getMessage();
}
return $result;
}
?>
I tried to return what was sent by the form directly back, and I did see the Japanese sample paragraph without problems. So I reckon one of the PHP library functions must have
caused the error, but I couldn't tell which one and how to fix it...
Many thanks in advance!
strlen() will return the number of characters from a string formatted for ANSI/ASCII, not UTF-8. Try mb_strlen() instead.
I want to truncate some text (loaded from a database or text file), but it contains HTML so as a result the tags are included and less text will be returned. This can then result in tags not being closed, or being partially closed (so Tidy may not work properly and there is still less content). How can I truncate based on the text (and probably stopping when you get to a table as that could cause more complex issues).
substr("Hello, my <strong>name</strong> is <em>Sam</em>. I´m a web developer.",0,26)."..."
Would result in:
Hello, my <strong>name</st...
What I would want is:
Hello, my <strong>name</strong> is <em>Sam</em>. I´m...
How can I do this?
While my question is for how to do it in PHP, it would be good to know how to do it in C#... either should be OK as I think I would be able to port the method over (unless it is a built in method).
Also note that I have included an HTML entity ´ - which would have to be considered as a single character (rather than 7 characters as in this example).
strip_tags is a fallback, but I would lose formatting and links and it would still have the problem with HTML entities.
Assuming you are using valid XHTML, it's simple to parse the HTML and make sure tags are handled properly. You simply need to track which tags have been opened so far, and make sure to close them again "on your way out".
<?php
header('Content-type: text/plain; charset=utf-8');
function printTruncated($maxLength, $html, $isUtf8=true)
{
$printedLength = 0;
$position = 0;
$tags = array();
// For UTF-8, we need to count multibyte sequences as one character.
$re = $isUtf8
? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
: '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';
while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
{
list($tag, $tagPosition) = $match[0];
// Print text leading up to the tag.
$str = substr($html, $position, $tagPosition - $position);
if ($printedLength + strlen($str) > $maxLength)
{
print(substr($str, 0, $maxLength - $printedLength));
$printedLength = $maxLength;
break;
}
print($str);
$printedLength += strlen($str);
if ($printedLength >= $maxLength) break;
if ($tag[0] == '&' || ord($tag) >= 0x80)
{
// Pass the entity or UTF-8 multibyte sequence through unchanged.
print($tag);
$printedLength++;
}
else
{
// Handle the tag.
$tagName = $match[1][0];
if ($tag[1] == '/')
{
// This is a closing tag.
$openingTag = array_pop($tags);
assert($openingTag == $tagName); // check that tags are properly nested.
print($tag);
}
else if ($tag[strlen($tag) - 2] == '/')
{
// Self-closing tag.
print($tag);
}
else
{
// Opening tag.
print($tag);
$tags[] = $tagName;
}
}
// Continue after the tag.
$position = $tagPosition + strlen($tag);
}
// Print any remaining text.
if ($printedLength < $maxLength && $position < strlen($html))
print(substr($html, $position, $maxLength - $printedLength));
// Close any open tags.
while (!empty($tags))
printf('</%s>', array_pop($tags));
}
printTruncated(10, '<b><Hello></b> <img src="world.png" alt="" /> world!'); print("\n");
printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");
printTruncated(10, "<em><b>Hello</b>w\xC3\xB8rld!</em>"); print("\n");
Encoding note: The above code assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported, just pass false as the third argument. Other multibyte encodings are not supported, though you may hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement.
(You should always be using UTF-8, though.)
Edit: Updated to handle character entities and UTF-8. Fixed bug where the function would print one character too many, if that character was a character entity.
I've written a function that truncates HTML just as yous suggest, but instead of printing it out it puts it just keeps it all in a string variable. handles HTML Entities, as well.
/**
* function to truncate and then clean up end of the HTML,
* truncates by counting characters outside of HTML tags
*
* #author alex lockwood, alex dot lockwood at websightdesign
*
* #param string $str the string to truncate
* #param int $len the number of characters
* #param string $end the end string for truncation
* #return string $truncated_html
*
* **/
public static function truncateHTML($str, $len, $end = '…'){
//find all tags
$tagPattern = '/(<\/?)([\w]*)(\s*[^>]*)>?|&[\w#]+;/i'; //match html tags and entities
preg_match_all($tagPattern, $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER );
//WSDDebug::dump($matches); exit;
$i =0;
//loop through each found tag that is within the $len, add those characters to the len,
//also track open and closed tags
// $matches[$i][0] = the whole tag string --the only applicable field for html enitities
// IF its not matching an &htmlentity; the following apply
// $matches[$i][1] = the start of the tag either '<' or '</'
// $matches[$i][2] = the tag name
// $matches[$i][3] = the end of the tag
//$matces[$i][$j][0] = the string
//$matces[$i][$j][1] = the str offest
while($matches[$i][0][1] < $len && !empty($matches[$i])){
$len = $len + strlen($matches[$i][0][0]);
if(substr($matches[$i][0][0],0,1) == '&' )
$len = $len-1;
//if $matches[$i][2] is undefined then its an html entity, want to ignore those for tag counting
//ignore empty/singleton tags for tag counting
if(!empty($matches[$i][2][0]) && !in_array($matches[$i][2][0],array('br','img','hr', 'input', 'param', 'link'))){
//double check
if(substr($matches[$i][3][0],-1) !='/' && substr($matches[$i][1][0],-1) !='/')
$openTags[] = $matches[$i][2][0];
elseif(end($openTags) == $matches[$i][2][0]){
array_pop($openTags);
}else{
$warnings[] = "html has some tags mismatched in it: $str";
}
}
$i++;
}
$closeTags = '';
if (!empty($openTags)){
$openTags = array_reverse($openTags);
foreach ($openTags as $t){
$closeTagString .="</".$t . ">";
}
}
if(strlen($str)>$len){
// Finds the last space from the string new length
$lastWord = strpos($str, ' ', $len);
if ($lastWord) {
//truncate with new len last word
$str = substr($str, 0, $lastWord);
//finds last character
$last_character = (substr($str, -1, 1));
//add the end text
$truncated_html = ($last_character == '.' ? $str : ($last_character == ',' ? substr($str, 0, -1) : $str) . $end);
}
//restore any open tags
$truncated_html .= $closeTagString;
}else
$truncated_html = $str;
return $truncated_html;
}
100% accurate, but pretty difficult approach:
Iterate charactes using DOM
Use DOM methods to remove remaining elements
Serialize the DOM
Easy brute-force approach:
Split string into tags (not elements) and text fragments using preg_split('/(<tag>)/') with PREG_DELIM_CAPTURE.
Measure text length you want (it'll be every second element from split, you might use html_entity_decode() to help measure accurately)
Cut the string (trim &[^\s;]+$ at the end to get rid of possibly chopped entity)
Fix it with HTML Tidy
I used a nice function found at http://alanwhipple.com/2011/05/25/php-truncate-string-preserving-html-tags-words, apparently taken from CakePHP
The following is a simple state-machine parser which handles you test case successfully. I fails on nested tags though as it doesn't track the tags themselves. I also chokes on entities within HTML tags (e.g. in an href-attribute of an <a>-tag). So it cannot be considered a 100% solution to this problem but because it's easy to understand it could be the basis for a more advanced function.
function substr_html($string, $length)
{
$count = 0;
/*
* $state = 0 - normal text
* $state = 1 - in HTML tag
* $state = 2 - in HTML entity
*/
$state = 0;
for ($i = 0; $i < strlen($string); $i++) {
$char = $string[$i];
if ($char == '<') {
$state = 1;
} else if ($char == '&') {
$state = 2;
$count++;
} else if ($char == ';') {
$state = 0;
} else if ($char == '>') {
$state = 0;
} else if ($state === 0) {
$count++;
}
if ($count === $length) {
return substr($string, 0, $i + 1);
}
}
return $string;
}
you can use tidy as well:
function truncate_html($html, $max_length) {
return tidy_repair_string(substr($html, 0, $max_length),
array('wrap' => 0, 'show-body-only' => TRUE), 'utf8');
}
Could use DomDocument in this case with a nasty regex hack, worst that would happen is a warning, if there's a broken tag :
$dom = new DOMDocument();
$dom->loadHTML(substr("Hello, my <strong>name</strong> is <em>Sam</em>. I´m a web developer.",0,26));
$html = preg_replace("/\<\/?(body|html|p)>/", "", $dom->saveHTML());
echo $html;
Should give output : Hello, my <strong>**name**</strong>.
I've made light changes to Søren Løvborg printTruncated function making it UTF-8 compatible:
/* Truncate HTML, close opened tags
*
* #param int, maxlength of the string
* #param string, html
* #return $html
*/
function html_truncate($maxLength, $html){
mb_internal_encoding("UTF-8");
$printedLength = 0;
$position = 0;
$tags = array();
ob_start();
while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){
list($tag, $tagPosition) = $match[0];
// Print text leading up to the tag.
$str = mb_strcut($html, $position, $tagPosition - $position);
if ($printedLength + mb_strlen($str) > $maxLength){
print(mb_strcut($str, 0, $maxLength - $printedLength));
$printedLength = $maxLength;
break;
}
print($str);
$printedLength += mb_strlen($str);
if ($tag[0] == '&'){
// Handle the entity.
print($tag);
$printedLength++;
}
else{
// Handle the tag.
$tagName = $match[1][0];
if ($tag[1] == '/'){
// This is a closing tag.
$openingTag = array_pop($tags);
assert($openingTag == $tagName); // check that tags are properly nested.
print($tag);
}
else if ($tag[mb_strlen($tag) - 2] == '/'){
// Self-closing tag.
print($tag);
}
else{
// Opening tag.
print($tag);
$tags[] = $tagName;
}
}
// Continue after the tag.
$position = $tagPosition + mb_strlen($tag);
}
// Print any remaining text.
if ($printedLength < $maxLength && $position < mb_strlen($html))
print(mb_strcut($html, $position, $maxLength - $printedLength));
// Close any open tags.
while (!empty($tags))
printf('</%s>', array_pop($tags));
$bufferOuput = ob_get_contents();
ob_end_clean();
$html = $bufferOuput;
return $html;
}
Bounce added multi-byte character support to Søren Løvborg's solution - I've added:
support for unpaired HTML tags (e.g. <hr>, <br> <col> etc. don't get closed - in HTML a '/' is not required at the end of these (in is for XHTML though)),
customisable truncation indicator (defaults to &hellips; i.e. … ),
return as a string without using output buffer, and
unit tests with 100% coverage.
All this at Pastie.
Another light changes to Søren Løvborg printTruncated function making it UTF-8 (Needs mbstring) compatible and making it return string not print one. I think it's more useful.
And my code not use buffering like Bounce variant, just one more variable.
UPD: to make it work properly with utf-8 chars in tag attributes you need mb_preg_match function, listed below.
Great thanks to Søren Løvborg for that function, it's very good.
/* Truncate HTML, close opened tags
*
* #param int, maxlength of the string
* #param string, html
* #return $html
*/
function htmlTruncate($maxLength, $html)
{
mb_internal_encoding("UTF-8");
$printedLength = 0;
$position = 0;
$tags = array();
$out = "";
while ($printedLength < $maxLength && mb_preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
{
list($tag, $tagPosition) = $match[0];
// Print text leading up to the tag.
$str = mb_substr($html, $position, $tagPosition - $position);
if ($printedLength + mb_strlen($str) > $maxLength)
{
$out .= mb_substr($str, 0, $maxLength - $printedLength);
$printedLength = $maxLength;
break;
}
$out .= $str;
$printedLength += mb_strlen($str);
if ($tag[0] == '&')
{
// Handle the entity.
$out .= $tag;
$printedLength++;
}
else
{
// Handle the tag.
$tagName = $match[1][0];
if ($tag[1] == '/')
{
// This is a closing tag.
$openingTag = array_pop($tags);
assert($openingTag == $tagName); // check that tags are properly nested.
$out .= $tag;
}
else if ($tag[mb_strlen($tag) - 2] == '/')
{
// Self-closing tag.
$out .= $tag;
}
else
{
// Opening tag.
$out .= $tag;
$tags[] = $tagName;
}
}
// Continue after the tag.
$position = $tagPosition + mb_strlen($tag);
}
// Print any remaining text.
if ($printedLength < $maxLength && $position < mb_strlen($html))
$out .= mb_substr($html, $position, $maxLength - $printedLength);
// Close any open tags.
while (!empty($tags))
$out .= sprintf('</%s>', array_pop($tags));
return $out;
}
function mb_preg_match(
$ps_pattern,
$ps_subject,
&$pa_matches,
$pn_flags = 0,
$pn_offset = 0,
$ps_encoding = NULL
) {
// WARNING! - All this function does is to correct offsets, nothing else:
//(code is independent of PREG_PATTER_ORDER / PREG_SET_ORDER)
if (is_null($ps_encoding)) $ps_encoding = mb_internal_encoding();
$pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
$ret = preg_match($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);
if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE))
foreach($pa_matches as &$ha_match) {
$ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
}
return $ret;
}
Use the function truncateHTML() from:
https://github.com/jlgrall/truncateHTML
Example: truncate after 9 characters including the ellipsis:
truncateHTML(9, "<p><b>A</b> red ball.</p>", ['wholeWord' => false]);
// => "<p><b>A</b> red ba…</p>"
Features: UTF-8, configurable ellipsis, include/exclude length of ellipsis, self-closing tags, collapsing spaces, invisible elements (<head>, <script>, <noscript>, <style>, <!-- comments -->), HTML $entities;, truncating at last whole word (with option to still truncate very long words), PHP 5.6 and 7.0+, 240+ unit tests, returns a string (doesn't use the output buffer), and well commented code.
I wrote this function, because I really liked Søren Løvborg's function above (especially how he managed encodings), but I needed a bit more functionality and flexibility.
The CakePHP framework has a HTML-aware truncate() function in the Text Helper that works for me. See Text. MIT license. Link to source (provided by #Quentin).
This is very difficult to do without using a validator and a parser, the reason being that imagine if you have
<div id='x'>
<div id='y'>
<h1>Heading</h1>
500
lines
of
html
...
etc
...
</div>
</div>
How do you plan to truncate that and end up with valid HTML?
After a brief search, I found this link which could help.