Facebook-like "show more" button for a string with URLs - php

I'm trying to have a feature that acts like Facebook's show more behaviour.
I want it to trim the string if:
its length is more than 200 characters, or
there are more than 5 \n occurrences.
It sounds simple and I already have an initial function (it only checks the length; I haven't implemented the \n occurrences yet):
function contentShowMore($string, $max_length) {
if(mb_strlen($string, 'utf-8') <= $max_length) {
return $string; // return the original string if haven't passed $max_length
} else {
$teaser = mb_substr($string, 0, $max_length); // trim to max length
$dots = '<span class="show-more-dots"> ...</span>'; // add dots
$show_more_content = mb_substr($string, $max_length); // get the hidden content
$show_more_wrapper = '<span class="show-more-content">'.$show_more_content.'</span>'; // wrap it
return $teaser.$dots.$show_more_wrapper; // connect all together for usage on HTML.
}
}
The problem is that the string might include URLs, so it breaks them. I need to find a way to make a functional show-more button that checks length, newlines and won't cut URLs.
Thank you!
Example:
input: contentShowMore("hello there http://google.com/ good day!", 20).
output:
hello there http://g
<span class="show-more-dots"> ...</span>
<span class="show-more-content">oogle.com/ good day!</span>
The output I want:
hello there http://google.com/
<span class="show-more-dots"> ...</span>
<span class="show-more-content"> good day!</span>
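The two trimming conditions described in the question can be checked up front. A minimal sketch, assuming the thresholds from the question (200 characters, 5 newlines); the function name needsShowMore and its parameters are hypothetical and not part of the original code:

```php
<?php
// Hypothetical helper (not from the original post): decides whether a
// string should be collapsed behind a "show more" button.
function needsShowMore($string, $maxLength = 200, $maxNewlines = 5) {
    // Count characters, not bytes, so multibyte text is measured correctly.
    $tooLong = mb_strlen($string, 'UTF-8') > $maxLength;
    // Count literal newline characters in the text.
    $tooManyLines = substr_count($string, "\n") > $maxNewlines;
    return $tooLong || $tooManyLines;
}
```

A caller would only build the teaser and the hidden part when this returns true, and output the original string otherwise.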

Found a solution!
function contentShowMore($string, $max_length, $max_newlines) {
$trim_str = trim($string);
if(mb_strlen($trim_str, 'utf-8') <= $max_length && substr_count($trim_str, "\n") < $max_newlines) { // return the original if it is short and has fewer than $max_newlines newlines
return $trim_str;
} else {
$teaser = mb_substr($trim_str, 0, $max_length); // text to show
$show_more_content = mb_substr($trim_str, $max_length);
// The cut might have landed in the middle of a word (or worse, a URL),
// so take everything up to the next space and add it back to the teaser.
$content_parts = explode(' ', $show_more_content, 2); // [0] - before first space, [1] - after first space
$teaser .= $content_parts[0];
// Whatever follows the first space stays hidden; if there is no space, nothing is left to hide.
$show_more_content = isset($content_parts[1]) ? $content_parts[1] : '';
// Now check the maximum number of newlines.
$teaser_parts = explode("\n", $teaser); // break into an array of lines.
$teaser = implode("\n", array_slice($teaser_parts, 0, $max_newlines)); // take the first $max_newlines lines and use them as the teaser.
$show_more_content = implode("\n", array_slice($teaser_parts, $max_newlines)) . ' ' . $show_more_content; // connect the rest to the hidden content.
if(mb_strlen(trim($show_more_content), "UTF-8") === 0) {
return $trim_str; // nothing to hide - return original.
} else {
$show_more_wrapper = '<span class="show-more-content">'.$show_more_content.'</span>';
$dots = '<span class="show-more-dots"> ...</span>'; // dots will be visible between the teaser and the hidden.
$button = ' <span class="show-more">Show more</span>';
return $teaser.$dots.$button.$show_more_wrapper; // connect ingredients
}
}
}
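The core idea of the solution above (never cut inside a whitespace-delimited token such as a URL) can be isolated into a small helper. This is only a sketch under that assumption; the name splitWithoutBreakingWords is hypothetical, and it uses a regular expression instead of explode so that newlines and tabs also count as token boundaries:

```php
<?php
// Hypothetical helper: splits $string into a visible teaser and a hidden
// remainder, extending the teaser so that no token (e.g. a URL) is cut.
function splitWithoutBreakingWords($string, $maxLength) {
    if (mb_strlen($string, 'UTF-8') <= $maxLength) {
        return array($string, '');
    }
    $teaser = mb_substr($string, 0, $maxLength, 'UTF-8');
    $rest = mb_substr($string, $maxLength, null, 'UTF-8');
    // If the remainder starts with non-whitespace, the cut landed inside a
    // token: move the rest of that token over to the teaser.
    if (preg_match('/^\S+/u', $rest, $m)) {
        $teaser .= $m[0];
        $rest = substr($rest, strlen($m[0]));
    }
    return array($teaser, $rest);
}
```

With the example from the question, the teaser ends after http://google.com/ and the remainder is " good day!", which can then be wrapped in the show-more-content span.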

Related

Insert text in content after 300 words but after closing tag of a Paragraph

I am looking for a way to insert an ad or text after X amount of words and after the closing tag of the paragraph the last word appears in.
So far, I have only been able to do this after the X amount of characters. The problem with this approach is that HTML characters are counted which gives inaccurate results.
function chars1($content) {
// only inject google ads if post is longer than 2500 characters
$enable_length1 = 2500;
// insert after the 2100th character
$after_character1 = 2100;
if (is_single() && strlen($content) > $enable_length1) {
$before_content1 = substr($content, 0, $after_character1);
$after_content1 = substr($content, $after_character1);
$after_content1 = explode('</p>', $after_content1);
ob_start();
dynamic_sidebar('single-image-ads-1');
$text1 = ob_get_contents();
ob_end_clean();
array_splice($after_content1, 1, 0, $text1);
$after_content1 = implode('', $after_content1);
return $before_content1 . $after_content1;
} else {
return $content;
}
}
//add filter to WordPress with priority 49
add_filter('the_content', 'chars1',49);
Another approach I have tried is using:
strip_tags($content)
and counting the words using:
str_word_count()
The problem with this is that I have no way of returning the $content with the HTML tags.
Depending on the size of the post, I will insert up to 5 ad units. With the functions I have above, I would need to create a function for each ad. If there is a way to insert all 5 ads using one function, that would be great.
Any help is appreciated.
Deciding what is a word or not can oftentimes be very hard. But if you're alright with an approximate solution, like defining a word as text between two whitespaces, I suggest you implement a simple function yourself.
This may be achieved by iterating over the characters of the string until the desired number of words (300 in the code below) has been counted and then jumping to the end of the current paragraph. Insert an ad and then repeat until you've added sufficiently many.
Implementing this in your function might look like this:
function chars1($content) {
// only inject google ads if post is longer than 2500 characters
$enable_length1 = 2500;
// Insert at the end of the paragraph every 300 words
$after_word1 = 300;
// Maximum of 5 ads
$max_ads = 5;
if (strlen($content) > $enable_length1) {
$len = strlen($content);
$i=0;
// Keep adding until the end of the content or until $max_ads ads have been inserted
while($i<$len && $max_ads-->0) {
// Work our way to the appropriate length
$word_count = 0;
$in_tag = false;
while(++$i < $len && $word_count < $after_word1) {
if(!$in_tag && ctype_space($content[$i])) {
// Whitespace
$word_count++;
}
else if(!$in_tag && $content[$i] == '<') {
// Begin tag
$in_tag = true;
$word_count++;
}
else if($in_tag && $content[$i] == '>') {
// End tag
$in_tag = false;
}
}
// Find the next '</p>'
$i = strpos($content, "</p>", $i);
if($i === false) {
// No more paragraph endings
break;
}
else {
// Add the length of </p>
$i += 4;
// Get ad as string
ob_start();
dynamic_sidebar('single-image-ads-1');
$ad = ob_get_contents();
ob_end_clean();
$content = substr($content, 0, $i) . $ad . substr($content, $i);
// Move $i past the inserted ad and refresh the cached length
$i+= strlen($ad);
$len = strlen($content);
}
}
}
return $content;
}
With this approach, it's easy to add new rules.
I've just had to do this myself. This is how I did it. First explode the content on </p> tags. Loop over the resulting array, put the end </p> back onto the paragraph, do a count on the paragraph with the tags stripped and add it to the global count. Compare the global word count against our word positions. If it's greater, append the content and unset that word position. Stringify and return.
function insert_after_words( $content, $words_positions = array(), $content_to_insert = 'Insert Me' ) {
$total_words_count = 0;
// Explode content on paragraphs.
$content_exploded = explode( '</p>', $content );
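foreach ( $content_exploded as $key => $paragraph ) {
// Skip the empty chunk left after the final </p> so no stray closing tag is added.
if ( '' === trim( $paragraph ) ) {
continue;
}
// Put the paragraph tags back.
$content_exploded[ $key ] .= '</p>';
$total_words_count += str_word_count( strip_tags( $content_exploded[ $key ] ) );
// Check the total word count against the word positioning.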
foreach ( $words_positions as $words_key => $words_count ) {
if ( $total_words_count >= $words_count ) {
$content_exploded[ $key ] .= PHP_EOL . $content_to_insert;
unset( $words_positions[ $words_key ] );
}
}
}
// Stringify content.
return implode( '', $content_exploded );
}
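To see where insertions would land before wiring this into a filter, the running word count per paragraph can be computed on its own. A minimal sketch of the counting step only; the function name paragraphWordTotals is hypothetical, and it counts words with str_word_count on tag-stripped text as the answer above does:

```php
<?php
// Hypothetical helper: returns the cumulative word count after each
// paragraph, counting only visible text (tags are stripped first).
function paragraphWordTotals($html) {
    $totals = array();
    $running = 0;
    foreach (explode('</p>', $html) as $chunk) {
        // explode() leaves an empty chunk after the final </p>; skip it.
        if (trim($chunk) === '') {
            continue;
        }
        $running += str_word_count(strip_tags($chunk));
        $totals[] = $running;
    }
    return $totals;
}
```

An ad position of 300 words would then fall after the first paragraph whose total reaches 300.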

How to split content after a space

I'm new to PHP and I'm having trouble with a function that splits content.
I have a string that looks like this:
$string = "World Cup 2014 draw: England's chances of landing tough group rise";
My split-content code:
if(strlen($string)<$width){
return $string;
}
$string = substr($string,0,$width);
$string = $string.'...';
return $string;
The result is:
World Cup 2014 draw: England's chances of lan...
When I insert
$string = substr($string,0,strrpos($string,' '));
it looks like this:
World Cup 2014 draw: England's chances of...
Now I want my string to look like this:
World Cup 2014 draw: England's chances of landing...
What should I do? Thanks for any help.
This will split the content at $width chars, but if that position lands inside a word, it will match through to the end of the word (so the real length is $width plus the number of characters remaining in that word). A "word" is defined as letters, numbers, underscores, hyphens or single quotes. This treats things like "England", "England's" or "pre-text" as whole words.
$width = 20;
$text = "World Cup 2014 draw: England's chances of landi-ng tough group rise";
preg_match("~^.{".($width-1)."}([\w'-]+)?~i",$text,$m);
$newtext = $m[0].'...';
echo $newtext.'<br/>';
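The same regular-expression idea can be wrapped in a reusable function. This is a sketch of a slight variant, not the answer's exact pattern: it matches exactly $width characters and then any remaining word characters, and it returns short strings untouched. The name truncateAtWordEnd and the $ending parameter are assumptions:

```php
<?php
// Hypothetical wrapper: cut at $width characters, but finish the word the
// cut lands in. A "word" is letters, digits, underscore, hyphen or quote.
function truncateAtWordEnd($text, $width, $ending = '...') {
    if (mb_strlen($text, 'UTF-8') <= $width) {
        return $text; // nothing to truncate
    }
    // .{N} consumes the first N characters, [\w'-]* the rest of the word.
    $pattern = '~^.{' . (int) $width . '}[\w\'-]*~us';
    if (!preg_match($pattern, $text, $m)) {
        return $text; // e.g. invalid UTF-8: leave the input alone
    }
    return $m[0] . $ending;
}
```

For the string in the question and a width of 45, the cut lands inside "landing", so the word is completed before the dots are appended.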
Maybe this function will help you. It truncates content or a string without cutting words in the middle. It also takes HTML tags into account.
function truncate_content( $text, $length = 100, $ending = '...', $exact = false, $considerHtml = true)
{
if ($considerHtml)
{
// if the plain text is shorter than the maximum length, return the whole text
if (strlen(preg_replace('/<.*?>/', '', $text)) <= $length)
{
return $text;
}
// splits all html-tags to scanable lines
preg_match_all('/(<.+?>)?([^<>]*)/s', $text, $lines, PREG_SET_ORDER);
$total_length = strlen($ending);
$open_tags = array();
$truncate = '';
foreach ($lines as $line_matchings)
{
// if there is any html-tag in this line, handle it and add it (uncounted) to the output
if (!empty($line_matchings[1]))
{
// if it's an "empty element" with or without xhtml-conform closing slash
if (preg_match('/^<(\s*.+?\/\s*|\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)(\s.+?)?)>$/is', $line_matchings[1]))
{
// do nothing
// if tag is a closing tag
}
else if (preg_match('/^<\s*\/([^\s]+?)\s*>$/s', $line_matchings[1], $tag_matchings))
{
// delete tag from $open_tags list
$pos = array_search($tag_matchings[1], $open_tags);
if ($pos !== false)
{
unset($open_tags[$pos]);
}
// if tag is an opening tag
}
else if (preg_match('/^<\s*([^\s>!]+).*?>$/s', $line_matchings[1], $tag_matchings))
{
// add tag to the beginning of $open_tags list
array_unshift($open_tags, strtolower($tag_matchings[1]));
}
// add html-tag to $truncate'd text
$truncate .= $line_matchings[1];
}
// calculate the length of the plain text part of the line; handle entities as one character
$content_length = strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $line_matchings[2]));
if ($total_length+$content_length> $length)
{
// the number of characters which are left
$left = $length - $total_length;
$entities_length = 0;
// search for html entities
if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $line_matchings[2], $entities, PREG_OFFSET_CAPTURE)) {
// calculate the real length of all entities in the legal range
foreach ($entities[0] as $entity)
{
if ($entity[1]+1-$entities_length <= $left)
{
$left--;
$entities_length += strlen($entity[0]);
}
else
{
// no more characters left
break;
}
}
}
$truncate .= substr($line_matchings[2], 0, $left+$entities_length);
// maximum length is reached, so get off the loop
break;
}
else
{
$truncate .= $line_matchings[2];
$total_length += $content_length;
}
// if the maximum length is reached, get off the loop
if($total_length>= $length) {
break;
}
}
}
else
{
if (strlen($text) <= $length)
{
return $text;
}
else
{
$truncate = substr($text, 0, $length - strlen($ending));
}
}
// if the words shouldn't be cut in the middle...
if (!$exact)
{
// ...search the last occurrence of a space...
$spacepos = strrpos($truncate, ' ');
if ($spacepos !== false)
{
// ...and cut the text in this position
$truncate = substr($truncate, 0, $spacepos);
}
}
// add the defined ending to the text
$truncate .= $ending;
if($considerHtml)
{
// close all unclosed html-tags
foreach ($open_tags as $tag)
{
$truncate .= '</' . $tag . '>';
}
}
return $truncate;
}
In this function:
$text is the content or string.
$length is the length at which you want to cut the string.
$ending is the ending sequence to be added if the content is cut.
$exact specifies whether to cut the string at exactly $length without regard to word boundaries. If you set $exact to true, words may be cut in the middle.
$considerHtml specifies whether HTML should be taken into account, so that tags are not broken when the content is split.
So in your case you can just use:
$string = "World Cup 2014 draw: England's chances of landing tough group rise";
echo truncate_content($string, $width);
Hope this helps.

Excerpt isn't working with file contents

$excerpt= excerpt(file_get_contents("data/file.txt"), 30);
echo $excerpt;
function excerpt($str, $chars){
$index = strripos($str, ' ');
return substr($str, 0, $index)."...";
}
It doesn't return the text cut at 30 characters or fewer. It returns the whole text without the last word, with the dots added, but if you use a manually typed string it works perfectly.
Why isn't this working when the content is loaded from a text file? I think the \n's are breaking strripos.
You want to use stripos, not strripos.
<?php
$excerpt= excerpt(file_get_contents("data/file.txt"), 30);
echo $excerpt;
function excerpt($str, $chars){
$index = stripos($str, " ", $chars);
return substr($str, 0, $index)."...";
}
?>
The problem is not related to the stripos usage. As far as I can see, you're trying to trim the string at 30 characters without cutting words in half. In order to do that you need to correct your excerpt function:
function excerpt($str, $chars) {
// no need to trim, already shorter than the wanted length
if (strlen($str) <= $chars) {
return $str;
}
//find last space within wanted dimension
$last_space = strrpos(substr($str, 0, $chars), ' ');
$trimmed_text = substr($str, 0, $last_space);
return $trimmed_text . '...';
}
and yes, your function doesn't even use the $chars param...
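Since the question suspects the newlines in the file, one option is to collapse all whitespace before looking for the last space. A minimal sketch under that assumption; the name excerptFromText and the $ending parameter are hypothetical:

```php
<?php
// Hypothetical variant: normalise whitespace first, then cut at the last
// space that fits within $chars bytes.
function excerptFromText($str, $chars, $ending = '...') {
    // Turn newlines, tabs and runs of spaces into single spaces.
    $str = trim(preg_replace('/\s+/', ' ', $str));
    if (strlen($str) <= $chars) {
        return $str; // short enough, return it unchanged
    }
    $cut = substr($str, 0, $chars);
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace !== false) {
        // Drop the trailing word or word fragment.
        $cut = substr($cut, 0, $lastSpace);
    }
    return $cut . $ending;
}
```

It would be called as excerptFromText(file_get_contents("data/file.txt"), 30). Like the answer above, it is byte based, so multibyte text would need the mb_ equivalents.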
I guess you want an excerpt with as many whole words as possible.
Some tips:
If you only want the first 30 chars, you should not read the whole file!
What you should do: read only up to the maximum excerpt length and then format it.
function readExcerpt($path){
$fhand = fopen($path,"r");
$excerpt = fread($fhand ,30);
fclose($fhand);
return $excerpt;
}
function formatExcerpt($excerpt){
//remove last word/word fragment
$index = strripos($excerpt,' ');
if($index!==false){
$excerpt= substr($excerpt,0,$index);
}
return $excerpt.'...';
}
echo formatExcerpt(readExcerpt("D:\hotfix.txt"));

How can I shorten a very long string in PHP

I have a problem with a PHP breadcrumb function I am using: when the page name is very long, it overflows out of the box, which looks really ugly.
My question is, how can I turn "This is a very long string" into "This is..." with PHP?
Any other ideas on how I could handle this problem would also be appreciated. Thanks in advance!
Here is the breadcrumb function:
function breadcrumbs() {
// Breadcrumb navigation
if (is_page() && !is_front_page() || is_single() || is_category()) {
echo '<ul class="breadcrumbs">';
echo '<li class="front_page">'.get_bloginfo('name').' <span style="color: #FFF;">»</span> </li>';
if (is_page()) {
$ancestors = get_post_ancestors($post);
if ($ancestors) {
$ancestors = array_reverse($ancestors);
foreach ($ancestors as $crumb) {
echo '<li>'.get_the_title($crumb).' <span style="color: #FFF;">»</span> </li>';
}
}
}
if (is_single()) {
$category = get_the_category();
echo '<li>'.$category[0]->cat_name.'</li>';
}
if (is_category()) {
$category = get_the_category();
echo '<li>'.$category[0]->cat_name.'</li>';
}
// Current page
if (is_page() || is_single()) {
echo '<li class="current">'.get_the_title().'</li>';
}
echo '</ul>';
} elseif (is_front_page()) {
// Front page
echo '<ul class="breadcrumbs">';
echo '<li class="front_page">'.get_bloginfo('name').'</li>';
echo '<li class="current">Home Page</li>';
echo '</ul>';
}
}
If you want a nicer (word-limited) truncation, you can use explode to split the string by spaces and then append each word (array entry) until you've reached your max limit.
Something like:
define("MAX_LEN", 15);
$sentence = "Hello this is a long sentence";
$words = explode(' ', $sentence);
$newStr = "";
foreach($words as $word) {
if(strlen($newStr." ".$word) >= MAX_LEN) {
break;
}
$newStr = $newStr." ".$word;
}
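The loop above can be packaged as a function that also avoids the leading space and appends an ellipsis. A sketch only; the name limitWords and the $ending parameter are assumptions, and lengths are counted in bytes as in the snippet above:

```php
<?php
// Hypothetical helper: keep adding whole words while the result still
// fits within $maxLen bytes, then append $ending.
function limitWords($sentence, $maxLen, $ending = '...') {
    if (strlen($sentence) <= $maxLen) {
        return $sentence; // already short enough
    }
    $result = '';
    foreach (explode(' ', $sentence) as $word) {
        // Only put a separator in front of the second and later words.
        $candidate = ($result === '') ? $word : $result . ' ' . $word;
        if (strlen($candidate) > $maxLen) {
            break;
        }
        $result = $candidate;
    }
    return $result . $ending;
}
```

Note that if the very first word is longer than $maxLen, only the ending is returned; a caller may want to fall back to a hard cut in that case.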
If you are working with UTF-8 as the charset, I suggest using the mb_strimwidth function, as it is multibyte safe and won't mess up multibyte chars. It also appends a placeholder string like ... automatically; with substr you'd have to do that in an additional step.
Usage sample:
echo mb_strimwidth("Hello World", 0, 10, "...", "UTF-8"); // .. or some other charset
// outputs Hello W...
You can safely use substr,
and possibly wordwrap() to break long words.
$string = "This is a very long string";
$newString = substr( $string, 0, 7)."...";
// Output = This is...
Ideally, it should be done on the client side. You can use CSS/JS for the same.
Set this CSS property: text-overflow: ellipsis.
However, it will work only in IE. To use the same in Firefox as well, you can do something like this.
If you do not mind JavaScript plugins, use one of the jQuery ellipsis plugins.
Edit: These methods will work even when dealing with Unicode, which can be a bit tricky if you try to handle this using PHP (for example with the substr function).
Edit 2: If your problem is just the overflowing text and you do not mind not having the "..." at the end, then it is even simpler. Simply use the CSS: overflow: hidden;.
You can truncate the string at max length and then search for the last space:
Multibyte safe (requires PHP >= 4.2):
function mb_TruncateString($string, $length = 40, $marker = "...")
{
if (mb_strlen($string) <= $length)
return $string;
// Trim at given length
$string = mb_substr($string, 0, $length);
// Get the text before the last space
if(mb_ereg("(.*)\s", $string, $matches))
$string = $matches[1];
return $string . $marker;
}
The following is not multibyte safe:
function TruncateString($string, $length = 40, $marker = "...")
{
if (strlen($string) <= $length)
return $string;
// Trim at given length
$string = substr($string, 0, $length);
// Get the text before the last space
if(preg_match("/(.*)\s/i", $string, $matches))
$string = $matches[1];
return $string . $marker;
}
You're after a truncate function. This is what I use:
/**
* @param string $str
* @param int $length
* @return string
*/
function truncate($str, $length=100)
{
$str = substr($str, 0, $length); // keep only the first $length characters
$words = explode(' ', $str); // separate words into an array
array_pop($words); // discard last item, as 9/10 times it's a partial word
$str = implode(' ', $words); // re-glue the string
return $str;
}
And usage:
echo truncate('This is a very long page name that will eventually be truncated', 15);

Truncate text containing HTML, ignoring tags

I want to truncate some text (loaded from a database or text file), but it contains HTML, so the tags are counted and less text is returned. This can also leave tags unclosed or partially closed (so Tidy may not work properly, and there is still less content). How can I truncate based on the text only (and probably stop when reaching a table, as that could cause more complex issues)?
substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."
Would result in:
Hello, my <strong>name</st...
What I would want is:
Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...
How can I do this?
While my question is about how to do it in PHP, it would be good to know how to do it in C# as well; either should be OK, as I think I would be able to port the method over (unless it is a built-in method).
Also note that I have included an HTML entity, &acute;, which would have to be considered as a single character (rather than 7 characters as in this example).
strip_tags is a fallback, but I would lose formatting and links and it would still have the problem with HTML entities.
Assuming you are using valid XHTML, it's simple to parse the HTML and make sure tags are handled properly. You simply need to track which tags have been opened so far, and make sure to close them again "on your way out".
<?php
header('Content-type: text/plain; charset=utf-8');
function printTruncated($maxLength, $html, $isUtf8=true)
{
$printedLength = 0;
$position = 0;
$tags = array();
// For UTF-8, we need to count multibyte sequences as one character.
$re = $isUtf8
? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
: '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';
while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
{
list($tag, $tagPosition) = $match[0];
// Print text leading up to the tag.
$str = substr($html, $position, $tagPosition - $position);
if ($printedLength + strlen($str) > $maxLength)
{
print(substr($str, 0, $maxLength - $printedLength));
$printedLength = $maxLength;
break;
}
print($str);
$printedLength += strlen($str);
if ($printedLength >= $maxLength) break;
if ($tag[0] == '&' || ord($tag) >= 0x80)
{
// Pass the entity or UTF-8 multibyte sequence through unchanged.
print($tag);
$printedLength++;
}
else
{
// Handle the tag.
$tagName = $match[1][0];
if ($tag[1] == '/')
{
// This is a closing tag.
$openingTag = array_pop($tags);
assert($openingTag == $tagName); // check that tags are properly nested.
print($tag);
}
else if ($tag[strlen($tag) - 2] == '/')
{
// Self-closing tag.
print($tag);
}
else
{
// Opening tag.
print($tag);
$tags[] = $tagName;
}
}
// Continue after the tag.
$position = $tagPosition + strlen($tag);
}
// Print any remaining text.
if ($printedLength < $maxLength && $position < strlen($html))
print(substr($html, $position, $maxLength - $printedLength));
// Close any open tags.
while (!empty($tags))
printf('</%s>', array_pop($tags));
}
printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!'); print("\n");
printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");
printTruncated(10, "<em><b>Hello</b>w\xC3\xB8rld!</em>"); print("\n");
Encoding note: The above code assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported, just pass false as the third argument. Other multibyte encodings are not supported, though you may hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement.
(You should always be using UTF-8, though.)
Edit: Updated to handle character entities and UTF-8. Fixed bug where the function would print one character too many, if that character was a character entity.
I've written a function that truncates HTML just as you suggest, but instead of printing the result it keeps it all in a string variable. It handles HTML entities as well.
/**
* function to truncate and then clean up end of the HTML,
* truncates by counting characters outside of HTML tags
*
* @author alex lockwood, alex dot lockwood at websightdesign
*
* @param string $str the string to truncate
* @param int $len the number of characters
* @param string $end the end string for truncation
* @return string $truncated_html
*
* **/
public static function truncateHTML($str, $len, $end = '…'){
//find all tags
$tagPattern = '/(<\/?)([\w]*)(\s*[^>]*)>?|&[\w#]+;/i'; //match html tags and entities
preg_match_all($tagPattern, $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER );
//WSDDebug::dump($matches); exit;
$i =0;
//loop through each found tag that is within the $len, add those characters to the len,
//also track open and closed tags
// $matches[$i][0] = the whole tag string -- the only applicable field for HTML entities
// If it's not matching an &htmlentity;, the following apply:
// $matches[$i][1] = the start of the tag, either '<' or '</'
// $matches[$i][2] = the tag name
// $matches[$i][3] = the end of the tag
// $matches[$i][$j][0] = the string
// $matches[$i][$j][1] = the string offset
while(!empty($matches[$i]) && $matches[$i][0][1] < $len){
$len = $len + strlen($matches[$i][0][0]);
if(substr($matches[$i][0][0],0,1) == '&' )
$len = $len-1;
//if $matches[$i][2] is undefined then its an html entity, want to ignore those for tag counting
//ignore empty/singleton tags for tag counting
if(!empty($matches[$i][2][0]) && !in_array($matches[$i][2][0],array('br','img','hr', 'input', 'param', 'link'))){
//double check
if(substr($matches[$i][3][0],-1) !='/' && substr($matches[$i][1][0],-1) !='/')
$openTags[] = $matches[$i][2][0];
elseif(end($openTags) == $matches[$i][2][0]){
array_pop($openTags);
}else{
$warnings[] = "html has some tags mismatched in it: $str";
}
}
$i++;
}
$closeTagString = '';
if (!empty($openTags)){
$openTags = array_reverse($openTags);
foreach ($openTags as $t){
$closeTagString .="</".$t . ">";
}
}
if(strlen($str)>$len){
// Finds the last space from the string new length
$lastWord = strpos($str, ' ', $len);
if ($lastWord) {
//truncate with new len last word
$str = substr($str, 0, $lastWord);
//finds last character
$last_character = (substr($str, -1, 1));
//add the end text
$truncated_html = ($last_character == '.' ? $str : ($last_character == ',' ? substr($str, 0, -1) : $str) . $end);
}
//restore any open tags
$truncated_html .= $closeTagString;
}else
$truncated_html = $str;
return $truncated_html;
}
100% accurate, but pretty difficult approach:
Iterate characters using DOM
Use DOM methods to remove remaining elements
Serialize the DOM
Easy brute-force approach:
Split string into tags (not elements) and text fragments using preg_split('/(<tag>)/') with PREG_SPLIT_DELIM_CAPTURE.
Measure text length you want (it'll be every second element from split, you might use html_entity_decode() to help measure accurately)
Cut the string (trim &[^\s;]+$ at the end to get rid of possibly chopped entity)
Fix it with HTML Tidy
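The first three steps of the brute-force approach can be sketched as follows. This is an illustration under assumptions, not a complete solution: the function name truncateHtmlText is hypothetical, the tag pattern /(<[^>]*>)/ is a simplification, and the final repair step (HTML Tidy) is left out, so the result may still contain unclosed tags:

```php
<?php
// Hypothetical sketch of the brute-force steps: split, measure, cut.
function truncateHtmlText($html, $maxChars) {
    // Step 1: split into text and tags. With PREG_SPLIT_DELIM_CAPTURE the
    // even indexes hold text fragments and the odd indexes hold tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    $used = 0;
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $out .= $part; // tags are copied through uncounted
            continue;
        }
        // Step 2: measure the visible length, counting an entity as one.
        $textLen = mb_strlen(html_entity_decode($part, ENT_QUOTES, 'UTF-8'), 'UTF-8');
        if ($used + $textLen <= $maxChars) {
            $out .= $part;
            $used += $textLen;
            continue;
        }
        // Step 3: cut this fragment and drop a possibly chopped entity.
        $piece = mb_substr($part, 0, $maxChars - $used, 'UTF-8');
        $out .= preg_replace('/&[^\s;]+$/', '', $piece);
        break;
    }
    return $out;
}
```

The cut in step 3 counts the raw characters of the fragment, so a fragment containing entities is cut slightly early; the output would then be passed to a repair step such as Tidy to close any open tags.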
I used a nice function found at http://alanwhipple.com/2011/05/25/php-truncate-string-preserving-html-tags-words, apparently taken from CakePHP
The following is a simple state-machine parser which handles your test case successfully. It fails on nested tags, though, as it doesn't track the tags themselves. It also chokes on entities within HTML tags (e.g. in an href attribute of an <a> tag). So it cannot be considered a 100% solution to this problem, but because it's easy to understand it could be the basis for a more advanced function.
function substr_html($string, $length)
{
$count = 0;
/*
* $state = 0 - normal text
* $state = 1 - in HTML tag
* $state = 2 - in HTML entity
*/
$state = 0;
for ($i = 0; $i < strlen($string); $i++) {
$char = $string[$i];
if ($char == '<') {
$state = 1;
} else if ($char == '&') {
$state = 2;
$count++;
} else if ($char == ';') {
$state = 0;
} else if ($char == '>') {
$state = 0;
} else if ($state === 0) {
$count++;
}
if ($count === $length) {
return substr($string, 0, $i + 1);
}
}
return $string;
}
You can use Tidy as well:
function truncate_html($html, $max_length) {
return tidy_repair_string(substr($html, 0, $max_length),
array('wrap' => 0, 'show-body-only' => TRUE), 'utf8');
}
You could use DOMDocument in this case with a nasty regex hack; the worst that would happen is a warning if there's a broken tag:
$dom = new DOMDocument();
$dom->loadHTML(substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26));
$html = preg_replace("/\<\/?(body|html|p)>/", "", $dom->saveHTML());
echo $html;
Should give output: Hello, my <strong>name</strong>.
I've made light changes to Søren Løvborg's printTruncated function, making it UTF-8 compatible:
/* Truncate HTML, close opened tags
*
* @param int, maxlength of the string
* @param string, html
* @return $html
*/
function html_truncate($maxLength, $html){
mb_internal_encoding("UTF-8");
$printedLength = 0;
$position = 0;
$tags = array();
ob_start();
while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){
list($tag, $tagPosition) = $match[0];
// Print text leading up to the tag.
$str = mb_strcut($html, $position, $tagPosition - $position);
if ($printedLength + mb_strlen($str) > $maxLength){
print(mb_strcut($str, 0, $maxLength - $printedLength));
$printedLength = $maxLength;
break;
}
print($str);
$printedLength += mb_strlen($str);
if ($tag[0] == '&'){
// Handle the entity.
print($tag);
$printedLength++;
}
else{
// Handle the tag.
$tagName = $match[1][0];
if ($tag[1] == '/'){
// This is a closing tag.
$openingTag = array_pop($tags);
assert($openingTag == $tagName); // check that tags are properly nested.
print($tag);
}
else if ($tag[mb_strlen($tag) - 2] == '/'){
// Self-closing tag.
print($tag);
}
else{
// Opening tag.
print($tag);
$tags[] = $tagName;
}
}
// Continue after the tag.
$position = $tagPosition + mb_strlen($tag);
}
// Print any remaining text.
if ($printedLength < $maxLength && $position < mb_strlen($html))
print(mb_strcut($html, $position, $maxLength - $printedLength));
// Close any open tags.
while (!empty($tags))
printf('</%s>', array_pop($tags));
// Return the buffered output as a string.
$html = ob_get_clean();
return $html;
}
Bounce added multi-byte character support to Søren Løvborg's solution. I've added:
support for unpaired HTML tags (e.g. <hr>, <br>, <col> etc. don't get closed; in HTML a '/' is not required at the end of these, though it is for XHTML),
customisable truncation indicator (defaults to &hellip;, i.e. … ),
return as a string without using output buffer, and
unit tests with 100% coverage.
All this at Pastie.
Another light change to Søren Løvborg's printTruncated function, making it UTF-8 compatible (needs mbstring) and making it return a string instead of printing one. I think it's more useful.
And my code doesn't use buffering like Bounce's variant, just one more variable.
UPD: to make it work properly with UTF-8 chars in tag attributes you need the mb_preg_match function, listed below.
Many thanks to Søren Løvborg for that function, it's very good.
/* Truncate HTML, close opened tags
*
* @param int, maxlength of the string
* @param string, html
* @return $html
*/
function htmlTruncate($maxLength, $html)
{
mb_internal_encoding("UTF-8");
$printedLength = 0;
$position = 0;
$tags = array();
$out = "";
while ($printedLength < $maxLength && mb_preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
{
list($tag, $tagPosition) = $match[0];
// Print text leading up to the tag.
$str = mb_substr($html, $position, $tagPosition - $position);
if ($printedLength + mb_strlen($str) > $maxLength)
{
$out .= mb_substr($str, 0, $maxLength - $printedLength);
$printedLength = $maxLength;
break;
}
$out .= $str;
$printedLength += mb_strlen($str);
if ($tag[0] == '&')
{
// Handle the entity.
$out .= $tag;
$printedLength++;
}
else
{
// Handle the tag.
$tagName = $match[1][0];
if ($tag[1] == '/')
{
// This is a closing tag.
$openingTag = array_pop($tags);
assert($openingTag == $tagName); // check that tags are properly nested.
$out .= $tag;
}
else if ($tag[mb_strlen($tag) - 2] == '/')
{
// Self-closing tag.
$out .= $tag;
}
else
{
// Opening tag.
$out .= $tag;
$tags[] = $tagName;
}
}
// Continue after the tag.
$position = $tagPosition + mb_strlen($tag);
}
// Print any remaining text.
if ($printedLength < $maxLength && $position < mb_strlen($html))
$out .= mb_substr($html, $position, $maxLength - $printedLength);
// Close any open tags.
while (!empty($tags))
$out .= sprintf('</%s>', array_pop($tags));
return $out;
}
function mb_preg_match(
$ps_pattern,
$ps_subject,
&$pa_matches,
$pn_flags = 0,
$pn_offset = 0,
$ps_encoding = NULL
) {
// WARNING! - All this function does is to correct offsets, nothing else:
//(code is independent of PREG_PATTERN_ORDER / PREG_SET_ORDER)
if (is_null($ps_encoding)) $ps_encoding = mb_internal_encoding();
$pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
$ret = preg_match($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);
if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE))
foreach($pa_matches as &$ha_match) {
$ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
}
return $ret;
}
Use the function truncateHTML() from:
https://github.com/jlgrall/truncateHTML
Example: truncate after 9 characters including the ellipsis:
truncateHTML(9, "<p><b>A</b> red ball.</p>", ['wholeWord' => false]);
// => "<p><b>A</b> red ba…</p>"
Features: UTF-8, configurable ellipsis, include/exclude length of ellipsis, self-closing tags, collapsing spaces, invisible elements (<head>, <script>, <noscript>, <style>, <!-- comments -->), HTML &entities;, truncating at last whole word (with option to still truncate very long words), PHP 5.6 and 7.0+, 240+ unit tests, returns a string (doesn't use the output buffer), and well commented code.
I wrote this function, because I really liked Søren Løvborg's function above (especially how he managed encodings), but I needed a bit more functionality and flexibility.
The CakePHP framework has an HTML-aware truncate() function in the Text Helper that works for me. See Text. MIT license. Link to source (provided by @Quentin).
This is very difficult to do without using a validator and a parser. Imagine you have:
<div id='x'>
<div id='y'>
<h1>Heading</h1>
500
lines
of
html
...
etc
...
</div>
</div>
How do you plan to truncate that and end up with valid HTML?
After a brief search, I found this link which could help.
