How to truncate string to first n words in PHP

How to truncate string to first n words in PHP - php

I would like to truncate a very long string, formatted via html elements.
I need the first 500 words (somehow I have to avoid html tags <p>, <br> while my function truncating the string), but in the result I have to keep/use those html elements because the result also should be formatted by html tags like the "original whole" text.
What's the best way to truncate my string?
Example:
Original text
> <p>The Huffington Post (via <a
> href="/t/daily-mail">Daily Mail</a>) is reporting that <a
> href="/t/misty">Misty</a> has been returned to a high kill shelter for
> farting too much! She appeared on Greenville County Pet Rescue’s
> “urgent” list, which means if she doesn’t get readopted, she will be
> euthanized!</p>
I need the first n words (n=10)
> <p>The Huffington Post (via <a
> href="/t/daily-mail">Daily Mail</a>) is reporting that.. </p>

A brute force method would be to just split all elements on blanks, then iterate over them. You count only non-tag elements up to a maximum, while you output tags nonetheless. Something along these lines:
$string = "your string here";
$output = "";
$count = 0;
$max = 10;
$tokens = preg_split('/ /', $string);
foreach ($tokens as $token)
{
if (preg_match('/<.*?>/', $token)) {
$output .= "$token ";
} else if ($count < $max) {
$output .= "$token ";
$count += 1;
}
}
print $output;

You could have found something like this with some Googling.
// Original PHP code by Chirp Internet: www.chirp.com.au
// Please acknowledge use of this code by including this header.
function restoreTags($input)
{
$opened = array();
// loop through opened and closed tags in order
if(preg_match_all("/<(\/?[a-z]+)>?/i", $input, $matches)) {
foreach($matches[1] as $tag) {
if(preg_match("/^[a-z]+$/i", $tag, $regs)) {
// a tag has been opened
if(strtolower($regs[0]) != 'br') $opened[] = $regs[0];
} elseif(preg_match("/^\/([a-z]+)$/i", $tag, $regs)) {
// a tag has been closed
unset($opened[array_pop(array_keys($opened, $regs[1]))]);
}
}
}
// close tags that are still open
if($opened) {
$tagstoclose = array_reverse($opened);
foreach($tagstoclose as $tag) $input .= "</$tag>";
}
return $input;
}
When you combine it with another function mentioned in the article:
function truncateWords($input, $numwords, $padding="")
{
$output = strtok($input, " \n");
while(--$numwords > 0) $output .= " " . strtok(" \n");
if($output != $input) $output .= $padding;
return $output;
}
Then you can just achieve what you're looking for by doing this:
$originalText = '...'; // some original text in HTML format
$output = truncateWords($originalText, 500); // This truncates to 500 words (ish...)
$output = restoreTags($output); // This fixes any open tags

Related

Facebook-like "show more" button for a string with URLs

I'm trying to have a feature that acts like Facebook's show more behaviour.
I want it to trim the string if:
its length is more than 200 characters.
there are more than 5 /n occurrences.
It sounds simple and I already have an initial function (that does it only by length, I haven't implemented the /n occurrences yet):
function contentShowMore($string, $max_length) {
if(mb_strlen($string, 'utf-8') <= $max_length) {
return $string; // return the original string if haven't passed $max_length
} else {
$teaser = mb_substr($string, 0, $max_length); // trim to max length
$dots = '<span class="show-more-dots"> ...</span>'; // add dots
$show_more_content = mb_substr($string, $max_length); // get the hidden content
$show_more_wrapper = '<span class="show-more-content">'.$show_more_content.'</span>'; // wrap it
return $teaser.$dots.$show_more_wrapper; // connect all together for usage on HTML.
}
}
The problem is that the string might include URLs, so it breaks them. I need to find a way to make a functional show-more button that checks length, newlines and won't cut URLs.
Thank you!
Example:
input: contentShowMore("hello there http://google.com/ good day!", 20).
output:
hello there http://g
<span class="show-more-dots"> ...</span>
<span class="show-more-content">oogle.com/ good day!</span>
the output i want:
hello there http://google.com/
<span class="show-more-dots"> ...</span>
<span class="show-more-content"> good day!</span>

found a solution!
function contentShowMore($string, $max_length, $max_newlines) {
$trim_str = trim($string);
if(mb_strlen($trim_str, 'utf-8') <= $max_length && substr_count($trim_str, "\n") < $max_newlines) { // return the original if short, or less than X newlines
return $trim_str;
} else {
$teaser = mb_substr($trim_str, 0, $max_length); // text to show
$show_more_content = mb_substr($trim_str, $max_length);
// the read more might have cut a string (or worse - an URL) in the middle of it.
// so we will take all the rest of the string before the next whitespace and will add it back to the teaser.
$content_parts = explode(' ', $show_more_content, 2); // [0] - before first space, [1] - after first space
$teaser .= $content_parts[0];
if(isset($content_parts[1])) { // if there are still leftover strings, its on show more! :)
$show_more_content = $content_parts[1];
}
// NOW WERE CHEKING MAX NEWLINES.
$teaser_parts = explode("\n", $teaser); // break to array.
$teaser = implode("\n", array_slice($teaser_parts, 0, $max_newlines)); // take the first $max_newlines lines and use them as teaser.
$show_more_content = implode("\n", array_slice($teaser_parts, $max_newlines)) . ' ' . $show_more_content; // connect the rest to the hidden content.
if(mb_strlen($show_more_content, "UTF-8") === 0) {
return $trim_str; // nothing to hide - return original.
} else {
$show_more_wrapper = '<span class="show-more-content">'.$show_more_content.'</span>';
$dots = '<span class="show-more-dots"> ...</span>'; // dots will be visible between the teaser and the hidden.
$button = ' <span class="show-more">Show more</span>';
return $teaser.$dots.$button.$show_more_wrapper; // connect ingredients
}
}
}

Insert text in content after 300 words but after closing tag of a Paragraph

I am looking for a way to insert an ad or text after X amount of words and after the closing tag of the paragraph the last word appears in.
So far, I have only been able to do this after the X amount of characters. The problem with this approach is that HTML characters are counted which gives inaccurate results.
function chars1($content) {
// only inject google ads if post is longer than 2500 characters
$enable_length1 = 2500;
// insert after the 210th character
$after_character1 = 2100;
if (is_single() && strlen($content) > $enable_length1) {
$before_content1 = substr($content, 0, $after_character1);
$after_content1 = substr($content, $after_character1);
$after_content1 = explode('</p>', $after_content1);
ob_start();
dynamic_sidebar('single-image-ads-1');
$text1 = ob_get_contents();
ob_end_clean();
array_splice($after_content1, 1, 0, $text1);
$after_content1 = implode('', $after_content1);
return $before_content1 . $after_content1;
} else {
return $content;
}
}
//add filter to WordPress with priority 49
add_filter('the_content', 'chars1',49);
Another approach I have tried is using:
strip_tags($content)
and counted the words using:
st_word_count()
The problem with this is that I have no way of returning the $content with the HTML tags
Depending on the size of the post, I will insert up to 5 ad units, with the functions I have above I would need to create a function for each ad. If there is a way to insert all 5 ads using one function that would be great.
Any help is appreciated.

Deciding what is a word or not can oftentimes be very hard. But if you're alright with an approximate solution, like defining a word as text between two whitespaces, I suggest you implement a simple function yourself.
This may be achieved by iterating over the characters of the string until 150 words are counted and then jumping to the end of the current paragraph. Insert an ad and then repeat until you've added sufficiently many.
Implementing this in your function might look like this
function chars1($content) {
// only inject google ads if post is longer than 2500 characters
$enable_length1 = 2500;
// Insert at the end of the paragraph every 300 words
$after_word1 = 300;
// Maximum of 5 ads
$max_ads = 5;
if (strlen($content) > $enable_length1) {
$len = strlen($content);
$i=0;
// Keep adding untill end of content or $max_ads number of ads has ben inserted
while($i<$len && $max_ads-->0) {
// Work our way untill the apropriate length
$word_cout = 0;
$in_tag = false;
while(++$i < $len && $word_cout < $after_word1) {
if(!$in_tag && ctype_space($content[$i])) {
// Whitespace
$word_cout++;
}
else if(!$in_tag && $content[$i] == '<') {
// Begin tag
$in_tag = true;
$word_cout++;
}
else if($in_tag && $content[$i] == '>') {
// End tag
$in_tag = false;
}
}
// Find the next '</p>'
$i = strpos($content, "</p>", $i);
if($i === false) {
// No more paragraph endings
break;
}
else {
// Add the length of </p>
$i += 4;
// Get ad as string
ob_start();
dynamic_sidebar('single-image-ads-1');
$ad = ob_get_contents();
ob_end_clean();
$content = substr($content, 0, $i) . $ad . substr($content, $i);
// Set the correct i
$i+= strlen($ad);
}
}
}
return $content;
}
With this approach, it's easy to add new rules.

I've just had to do this myself. This is how I did it. First explode the content on </p> tags. Loop over the resulting array, put the end </p> back onto the paragraph, do a count on the paragraph with the tags stripped and add it to the global count. Compare the global word count against our word positions. If it's greater, append the content and unset that word position. Stringify and return.
function insert_after_words( $content, $words_positions = array(), $content_to_insert = 'Insert Me' ) {
$total_words_count = 0;
// Explode content on paragraphs.
$content_exploded = explode( '</p>', $content );
foreach ( $content_exploded as $key => $content ) {
// Put the paragraph tags back.
$content_exploded[ $key ] .= '</p>';
$total_words_count += str_word_count( strip_tags( $content_exploded[ $key ] ) );
// Check the total word count against the word positoning.
foreach ( $words_positions as $words_key => $words_count ) {
if ( $total_words_count >= $words_count ) {
$content_exploded[ $key ] .= PHP_EOL . $content_to_insert;
unset( $words_positions[ $words_key ] );
}
}
}
// Stringify content.
return implode( '', $content_exploded );
}

Place content in between paragraphs without images

I am using the following code to place some ad code inside my content .
<?php
$content = apply_filters('the_content', $post->post_content);
$content = explode (' ', $content);
$halfway_mark = ceil(count($content) / 2);
$first_half_content = implode(' ', array_slice($content, 0, $halfway_mark));
$second_half_content = implode(' ', array_slice($content, $halfway_mark));
echo $first_half_content.'...';
echo ' YOUR ADS CODE';
echo $second_half_content;
?>
How can i modify this so that the 2 paragraphs (top and bottom) enclosing the ad code should not be the one having images. If the top or bottom paragraph has image then try for next 2 paragraphs.
Example: Correct Implementation on the right.

preg_replace version
This code steps through every paragraph ignoring those that contain image tags. The $pcount variable is incremented for every paragraph found without an image, if an image is encountered however, $pcount is reset to zero. Once $pcount reaches the point where it would hit two, the advert markup is inserted just before that paragraph. This should leave the advert markup between two safe paragraphs. The advert markup variable is then nullified so only one advert is inserted.
The following code is just for set up and could be modified to split the content differently, you could also modify the regular expression that is used — just in case you are using double BRs or something else to delimit your paragraphs.
/// set our advert content
$advert = '<marquee>BUY THIS STUFF!!</marquee>' . "\n\n";
/// calculate mid point
$mpoint = floor(strlen($content) / 2);
/// modify back to the start of a paragraph
$mpoint = strripos($content, '<p', -$mpoint);
/// split html so we only work on second half
$first = substr($content, 0, $mpoint);
$second = substr($content, $mpoint);
$pcount = 0;
$regexp = '/<p>.+?<\/p>/si';
The rest is the bulk of the code that runs the replacement. This could be modified to insert more than one advert, or to support more involved image checking.
$content = $first . preg_replace_callback($regexp, function($matches){
global $pcount, $advert;
if ( !$advert ) {
$return = $matches[0];
}
else if ( stripos($matches[0], '<img ') !== FALSE ) {
$return = $matches[0];
$pcount = 0;
}
else if ( $pcount === 1 ) {
$return = $advert . $matches[0];
$advert = '';
}
else {
$return = $matches[0];
$pcount++;
}
return $return;
}, $second);
After this code has been executed the $content variable will contain the enhanced HTML.
PHP versions prior to 5.3
As your chosen testing area does not support PHP 5.3, and so does not support anonymous functions, you need to use a slightly modified and less succinct version; that makes use of a named function instead.
Also, in order to support content that may not actually leave space for the advert in it's second half I have modified the $mpoint so that it is calculated to be 80% from the end. This will have the effect of including more in the $second part — but will also mean your adverts will be generally placed higher up in the mark-up. This code has not had any fallback implemented into it, because your question does not mention what should happen in the event of failure.
$advert = '<marquee>BUY THIS STUFF!!</marquee>' . "\n\n";
$mpoint = floor(strlen($content) * 0.8);
$mpoint = strripos($content, '<p', -$mpoint);
$first = substr($content, 0, $mpoint);
$second = substr($content, $mpoint);
$pcount = 0;
$regexp = '/<p>.+?<\/p>/si';
function replacement_callback($matches){
global $pcount, $advert;
if ( !$advert ) {
$return = $matches[0];
}
else if ( stripos($matches[0], '<img ') !== FALSE ) {
$return = $matches[0];
$pcount = 0;
}
else if ( $pcount === 1 ) {
$return = $advert . $matches[0];
$advert = '';
}
else {
$return = $matches[0];
$pcount++;
}
return $return;
}
echo $first . preg_replace_callback($regexp, 'replacement_callback', $second);

You could try this:
<?php
$ad_code = 'SOME SCRIPT HERE';
// Your code.
$content = apply_filters('the_content', $post->post_content);
// Split the content at the <p> tags.
$content = explode ('<p>', $content);
// Find the mid of the article.
$content_length = count($content);
$content_mid = floor($content_length / 2);
// Save no image p's index.
$last_no_image_p_index = NULL;
// Loop beginning from the mid of the article to search for images.
for ($i = $content_mid; $i < $content_length; $i++) {
// If we do not find an image, let it go down.
if (stripos($content[$i], '<img') === FALSE) {
// In case we already have a last no image p, we check
// if it was the one right before this one, so we have
// two p tags with no images in there.
if ($last_no_image_p_index === ($i - 1)) {
// We break here.
break;
}
else {
$last_no_image_p_index = $i;
}
}
}
// If no none image p tag was found, we use the last one.
if (is_null($last_no_image_p_index)) {
$last_no_image_p_index = ($content_length - 1);
}
// Add ad code here with trailing <p>, so the implode later will work correctly.
$content = array_slice($content, $last_no_image_p_index, 0, $ad_code . '</p>');
$content = implode('<p>', $content);
?>
It will try to find a place for the ad from the mid of your article and if none is found the ad is put to the end.
Regards
func0der

I think this will work:
First explode the paragraphs, then you have to loop it and check if you find img inside them.
If you find it inside, you try the next.
Think of this as psuedo-code, since it's not tested. You will have to make a loop too, comments in the code :) Sorry if it contains bugs, it's written in Notepad.
<?php
$i = 0; // counter
$arrBoolImg = array(); // array for the paragraph booleans
$content = apply_filters('the_content', $post->post_content);
$contents = str_replace ('<p>', '<explode><p>', $content); // here we add a custom tag, so we can explode
$contents = explode ('<explode>', $contents); // then explode it, so we can iterate the paragraphs
// fill array with boolean array returned
$arrBoolImg = hasImages($contents);
$halfway_mark = ceil(count($contents) / 2);
/*
TODO (by you):
---
When you have $arrBoolImg filled, you can itarate through it.
You then simply loop from the middle of the array $contents (plural), that is exploded from above.
The startingpoing for your loop is the middle, the upper bounds is the +2 or what ever :-)
Then you simply insert your magic.. And then glue it back together, as you did before.
I think this will work. even though the code may have some bugs, since I wrote it in Notepad.
*/
function hasImages($contents) {
/*
This function loops through the $contents array and checks if they have images in them
The return value, is an array with boolean values, so one can iterate through it.
*/
$arrRet = array(); // array for the paragraph booleans
if (count($content)>=1) {
foreach ($contents as $v) { // iterate the content
if (strpos($v, '<img') === false) { // did not find img
$arrRet[$i] = false;
}
else { // found img
$arrRet[$i] = true;
}
$i++;
} // end for each loop
return $arrRet;
} // end if count
} // end hasImages func
?>

[This is just an idea, I don't have enough reputation to comment...]
After calling #Olavxxx's method and filling your boolean array you could just loop through that array in an alternating manner starting in the middle: Let's assume your array is 8 entries long. Calculating the middle using your method you get 4. So you check the combination of values 4 + 3, if that doesn't work, you check 4 + 5, after that 3 + 2, ...
So your loop looks somewhat like
$middle = ceil(count($content) / 2);
$i = 1;
while ($i <= $middle) {
$j = $middle + (-1) ^ $i * $i;
$k = $j + 1;
if (!$hasImagesArray[$j] && !$hasImagesArray[$k])
break; // position found
$i++;
}
It's up to you to implement further constraints to make sure the add is not shown to far up or down in the article...
Please note that you need to take care of special cases like too short arrays too in order to prevent IndexOutOfBounds-Exceptions.

PHP: Display the first 500 characters of HTML

I have a huge HTML code in a PHP variable like :
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
I want to display only first 500 characters of this code. This character count must consider the text in HTML tags and should exclude HTMl tags and attributes while measuring the length.
but while triming the code, it should not affect DOM structure of HTML code.
Is there any tuorial or working examples available?

If its the text you want, you can do this with the following too
substr(strip_tags($html_code),0,500);

Ooohh... I know this I can't get it exactly off the top of my head but you want to load the text you've got as a DOMDOCUMENT
http://www.php.net/manual/en/class.domdocument.php
then grab the text from the entire document node (as a DOMnode http://www.php.net/manual/en/class.domnode.php)
This won't be exactly right, but hopefully this will steer you onto the right track.
Try something like:
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$dom = new DOMDocument();
$dom->loadHTML($html_code);
$text_to_strip = $dom->textContent;
$stripped = mb_substr($text_to_strip,0,500);
echo "$stripped"; // The Sameple text.Another sample text.....
edit ok... that should work. just tested locally
edit2
Now that I understand you want to keep the tags, but limit the text, lets see. You're going to want to loop the content until you get to 500 characters. This is probably going to take a few edits and passes for me to get right, but hopefully I can help. (sorry I can't give undivided attention)
First case is when the text is less than 500 characters. Nothing to worry about. Starting with the above code we can do the following.
if (strlen($stripped) > 500) {
// this is where we do our work.
$characters_so_far = 0;
foreach ($dom->child_nodes as $ChildNode) {
// should check if $ChildNode->hasChildNodes();
// probably put some of this stuff into a function
$characters_in_next_node += str_len($ChildNode->textcontent);
if ($characters_so_far+$characters_in_next_node > 500) {
// remove the node
// try using
// $ChildNode->parentNode->removeChild($ChildNode);
}
$characters_so_far += $characters_in_next_node
}
//
$final_out = $dom->saveHTML();
} else {
$final_out = $html_code;
}

i'm pasting below a php class i wrote a long time ago, but i know it works. its not exactly what you're after, as it deals with words instead of a character count, but i figure its pretty close and someone might find it useful.
class HtmlWordManipulator
{
var $stack = array();
function truncate($text, $num=50)
{
if (preg_match_all('/\s+/', $text, $junk) <= $num) return $text;
$text = preg_replace_callback('/(<\/?[^>]+\s+[^>]*>)/','_truncateProtect', $text);
$words = 0;
$out = array();
$text = str_replace('<',' <',str_replace('>','> ',$text));
$toks = preg_split('/\s+/', $text);
foreach ($toks as $tok)
{
if (preg_match_all('/<(\/?[^\x01>]+)([^>]*)>/',$tok,$matches,PREG_SET_ORDER))
foreach ($matches as $tag) $this->_recordTag($tag[1], $tag[2]);
$out[] = trim($tok);
if (! preg_match('/^(<[^>]+>)+$/', $tok))
{
if (!strpos($tok,'=') && !strpos($tok,'<') && strlen(trim(strip_tags($tok))) > 0)
{
++$words;
}
else
{
/*
echo '<hr />';
echo htmlentities('failed: '.$tok).'<br /)>';
echo htmlentities('has equals: '.strpos($tok,'=')).'<br />';
echo htmlentities('has greater than: '.strpos($tok,'<')).'<br />';
echo htmlentities('strip tags: '.strip_tags($tok)).'<br />';
echo str_word_count($text);
*/
}
}
if ($words > $num) break;
}
$truncate = $this->_truncateRestore(implode(' ', $out));
return $truncate;
}
function restoreTags($text)
{
foreach ($this->stack as $tag) $text .= "</$tag>";
return $text;
}
private function _truncateProtect($match)
{
return preg_replace('/\s/', "\x01", $match[0]);
}
private function _truncateRestore($strings)
{
return preg_replace('/\x01/', ' ', $strings);
}
private function _recordTag($tag, $args)
{
// XHTML
if (strlen($args) and $args[strlen($args) - 1] == '/') return;
else if ($tag[0] == '/')
{
$tag = substr($tag, 1);
for ($i=count($this->stack) -1; $i >= 0; $i--) {
if ($this->stack[$i] == $tag) {
array_splice($this->stack, $i, 1);
return;
}
}
return;
}
else if (in_array($tag, array('p', 'li', 'ul', 'ol', 'div', 'span', 'a')))
$this->stack[] = $tag;
else return;
}
}
truncate is what you want, and you pass it the html and the number of words you want it trimmed down to. it ignores html while counting words, but then rewraps everything in html, even closing trailing tags due to the truncation.
please don't judge me on the complete lack of oop principles. i was young and stupid.
edit:
so it turns out the usage is more like this:
$content = $manipulator->restoreTags($manipulator->truncate($myHtml,$numOfWords));
stupid design decision. allowed me to inject html inside the unclosed tags though.

I'm not up to coding a real solution, but if someone wants to, here's what I'd do (in pseudo-PHP):
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$aggregate = '';
$document = XMLParser($html_code);
foreach ($document->getElementsByTagName('*') as $element) {
$aggregate .= $element->text(); // This is the text, not HTML. It doesn't
// include the children, only the text
// directly in the tag.
}

Truncate text containing HTML, ignoring tags

I want to truncate some text (loaded from a database or text file), but it contains HTML so as a result the tags are included and less text will be returned. This can then result in tags not being closed, or being partially closed (so Tidy may not work properly and there is still less content). How can I truncate based on the text (and probably stopping when you get to a table as that could cause more complex issues).
substr("Hello, my <strong>name</strong> is <em>Sam</em>. I´m a web developer.",0,26)."..."
Would result in:
Hello, my <strong>name</st...
What I would want is:
Hello, my <strong>name</strong> is <em>Sam</em>. I´m...
How can I do this?
While my question is for how to do it in PHP, it would be good to know how to do it in C#... either should be OK as I think I would be able to port the method over (unless it is a built in method).
Also note that I have included an HTML entity ´ - which would have to be considered as a single character (rather than 7 characters as in this example).
strip_tags is a fallback, but I would lose formatting and links and it would still have the problem with HTML entities.

Assuming you are using valid XHTML, it's simple to parse the HTML and make sure tags are handled properly. You simply need to track which tags have been opened so far, and make sure to close them again "on your way out".
<?php
header('Content-type: text/plain; charset=utf-8');
function printTruncated($maxLength, $html, $isUtf8=true)
{
$printedLength = 0;
$position = 0;
$tags = array();
// For UTF-8, we need to count multibyte sequences as one character.
$re = $isUtf8
? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
: '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';
while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
{
list($tag, $tagPosition) = $match[0];
// Print text leading up to the tag.
$str = substr($html, $position, $tagPosition - $position);
if ($printedLength + strlen($str) > $maxLength)
{
print(substr($str, 0, $maxLength - $printedLength));
$printedLength = $maxLength;
break;
}
print($str);
$printedLength += strlen($str);
if ($printedLength >= $maxLength) break;
if ($tag[0] == '&' || ord($tag) >= 0x80)
{
// Pass the entity or UTF-8 multibyte sequence through unchanged.
print($tag);
$printedLength++;
}
else
{
// Handle the tag.
$tagName = $match[1][0];
if ($tag[1] == '/')
{
// This is a closing tag.
$openingTag = array_pop($tags);
assert($openingTag == $tagName); // check that tags are properly nested.
print($tag);
}
else if ($tag[strlen($tag) - 2] == '/')
{
// Self-closing tag.
print($tag);
}
else
{
// Opening tag.
print($tag);
$tags[] = $tagName;
}
}
// Continue after the tag.
$position = $tagPosition + strlen($tag);
}
// Print any remaining text.
if ($printedLength < $maxLength && $position < strlen($html))
print(substr($html, $position, $maxLength - $printedLength));
// Close any open tags.
while (!empty($tags))
printf('</%s>', array_pop($tags));
}
printTruncated(10, '<b><Hello></b> <img src="world.png" alt="" /> world!'); print("\n");
printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");
printTruncated(10, "<em><b>Hello</b>w\xC3\xB8rld!</em>"); print("\n");
Encoding note: The above code assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported, just pass false as the third argument. Other multibyte encodings are not supported, though you may hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement.
(You should always be using UTF-8, though.)
Edit: Updated to handle character entities and UTF-8. Fixed bug where the function would print one character too many, if that character was a character entity.

I've written a function that truncates HTML just as yous suggest, but instead of printing it out it puts it just keeps it all in a string variable. handles HTML Entities, as well.
/**
* function to truncate and then clean up end of the HTML,
* truncates by counting characters outside of HTML tags
*
* #author alex lockwood, alex dot lockwood at websightdesign
*
* #param string $str the string to truncate
* #param int $len the number of characters
* #param string $end the end string for truncation
* #return string $truncated_html
*
* **/
public static function truncateHTML($str, $len, $end = '…'){
//find all tags
$tagPattern = '/(<\/?)([\w]*)(\s*[^>]*)>?|&[\w#]+;/i'; //match html tags and entities
preg_match_all($tagPattern, $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER );
//WSDDebug::dump($matches); exit;
$i =0;
//loop through each found tag that is within the $len, add those characters to the len,
//also track open and closed tags
// $matches[$i][0] = the whole tag string --the only applicable field for html enitities
// IF its not matching an &htmlentity; the following apply
// $matches[$i][1] = the start of the tag either '<' or '</'
// $matches[$i][2] = the tag name
// $matches[$i][3] = the end of the tag
//$matces[$i][$j][0] = the string
//$matces[$i][$j][1] = the str offest
while($matches[$i][0][1] < $len && !empty($matches[$i])){
$len = $len + strlen($matches[$i][0][0]);
if(substr($matches[$i][0][0],0,1) == '&' )
$len = $len-1;
//if $matches[$i][2] is undefined then its an html entity, want to ignore those for tag counting
//ignore empty/singleton tags for tag counting
if(!empty($matches[$i][2][0]) && !in_array($matches[$i][2][0],array('br','img','hr', 'input', 'param', 'link'))){
//double check
if(substr($matches[$i][3][0],-1) !='/' && substr($matches[$i][1][0],-1) !='/')
$openTags[] = $matches[$i][2][0];
elseif(end($openTags) == $matches[$i][2][0]){
array_pop($openTags);
}else{
$warnings[] = "html has some tags mismatched in it: $str";
}
}
$i++;
}
$closeTags = '';
if (!empty($openTags)){
$openTags = array_reverse($openTags);
foreach ($openTags as $t){
$closeTagString .="</".$t . ">";
}
}
if(strlen($str)>$len){
// Finds the last space from the string new length
$lastWord = strpos($str, ' ', $len);
if ($lastWord) {
//truncate with new len last word
$str = substr($str, 0, $lastWord);
//finds last character
$last_character = (substr($str, -1, 1));
//add the end text
$truncated_html = ($last_character == '.' ? $str : ($last_character == ',' ? substr($str, 0, -1) : $str) . $end);
}
//restore any open tags
$truncated_html .= $closeTagString;
}else
$truncated_html = $str;
return $truncated_html;
}

100% accurate, but pretty difficult approach:
Iterate charactes using DOM
Use DOM methods to remove remaining elements
Serialize the DOM
Easy brute-force approach:
Split string into tags (not elements) and text fragments using preg_split('/(<tag>)/') with PREG_DELIM_CAPTURE.
Measure text length you want (it'll be every second element from split, you might use html_entity_decode() to help measure accurately)
Cut the string (trim &[^\s;]+$ at the end to get rid of possibly chopped entity)
Fix it with HTML Tidy

I used a nice function found at http://alanwhipple.com/2011/05/25/php-truncate-string-preserving-html-tags-words, apparently taken from CakePHP

The following is a simple state-machine parser which handles you test case successfully. I fails on nested tags though as it doesn't track the tags themselves. I also chokes on entities within HTML tags (e.g. in an href-attribute of an <a>-tag). So it cannot be considered a 100% solution to this problem but because it's easy to understand it could be the basis for a more advanced function.
function substr_html($string, $length)
{
$count = 0;
/*
* $state = 0 - normal text
* $state = 1 - in HTML tag
* $state = 2 - in HTML entity
*/
$state = 0;
for ($i = 0; $i < strlen($string); $i++) {
$char = $string[$i];
if ($char == '<') {
$state = 1;
} else if ($char == '&') {
$state = 2;
$count++;
} else if ($char == ';') {
$state = 0;
} else if ($char == '>') {
$state = 0;
} else if ($state === 0) {
$count++;
}
if ($count === $length) {
return substr($string, 0, $i + 1);
}
}
return $string;
}

you can use tidy as well:
function truncate_html($html, $max_length) {
return tidy_repair_string(substr($html, 0, $max_length),
array('wrap' => 0, 'show-body-only' => TRUE), 'utf8');
}

Could use DomDocument in this case with a nasty regex hack, worst that would happen is a warning, if there's a broken tag :
$dom = new DOMDocument();
$dom->loadHTML(substr("Hello, my <strong>name</strong> is <em>Sam</em>. I´m a web developer.",0,26));
$html = preg_replace("/\<\/?(body|html|p)>/", "", $dom->saveHTML());
echo $html;
Should give output : Hello, my <strong>**name**</strong>.

I've made light changes to Søren Løvborg printTruncated function making it UTF-8 compatible:
/* Truncate HTML, close opened tags
*
* #param int, maxlength of the string
* #param string, html
* #return $html
*/
function html_truncate($maxLength, $html){
mb_internal_encoding("UTF-8");
$printedLength = 0;
$position = 0;
$tags = array();
ob_start();
while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){
list($tag, $tagPosition) = $match[0];
// Print text leading up to the tag.
$str = mb_strcut($html, $position, $tagPosition - $position);
if ($printedLength + mb_strlen($str) > $maxLength){
print(mb_strcut($str, 0, $maxLength - $printedLength));
$printedLength = $maxLength;
break;
}
print($str);
$printedLength += mb_strlen($str);
if ($tag[0] == '&'){
// Handle the entity.
print($tag);
$printedLength++;
}
else{
// Handle the tag.
$tagName = $match[1][0];
if ($tag[1] == '/'){
// This is a closing tag.
$openingTag = array_pop($tags);
assert($openingTag == $tagName); // check that tags are properly nested.
print($tag);
}
else if ($tag[mb_strlen($tag) - 2] == '/'){
// Self-closing tag.
print($tag);
}
else{
// Opening tag.
print($tag);
$tags[] = $tagName;
}
}
// Continue after the tag.
$position = $tagPosition + mb_strlen($tag);
}
// Print any remaining text.
if ($printedLength < $maxLength && $position < mb_strlen($html))
print(mb_strcut($html, $position, $maxLength - $printedLength));
// Close any open tags.
while (!empty($tags))
printf('</%s>', array_pop($tags));
$bufferOuput = ob_get_contents();
ob_end_clean();
$html = $bufferOuput;
return $html;
}

Bounce added multi-byte character support to Søren Løvborg's solution - I've added:
support for unpaired HTML tags (e.g. <hr>, <br> <col> etc. don't get closed - in HTML a '/' is not required at the end of these (in is for XHTML though)),
customisable truncation indicator (defaults to &hellips; i.e. … ),
return as a string without using output buffer, and
unit tests with 100% coverage.
All this at Pastie.

Another light changes to Søren Løvborg printTruncated function making it UTF-8 (Needs mbstring) compatible and making it return string not print one. I think it's more useful.
And my code not use buffering like Bounce variant, just one more variable.
UPD: to make it work properly with utf-8 chars in tag attributes you need mb_preg_match function, listed below.
Great thanks to Søren Løvborg for that function, it's very good.
/* Truncate HTML, close opened tags
*
* #param int, maxlength of the string
* #param string, html
* #return $html
*/
function htmlTruncate($maxLength, $html)
{
mb_internal_encoding("UTF-8");
$printedLength = 0;
$position = 0;
$tags = array();
$out = "";
while ($printedLength < $maxLength && mb_preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
{
list($tag, $tagPosition) = $match[0];
// Print text leading up to the tag.
$str = mb_substr($html, $position, $tagPosition - $position);
if ($printedLength + mb_strlen($str) > $maxLength)
{
$out .= mb_substr($str, 0, $maxLength - $printedLength);
$printedLength = $maxLength;
break;
}
$out .= $str;
$printedLength += mb_strlen($str);
if ($tag[0] == '&')
{
// Handle the entity.
$out .= $tag;
$printedLength++;
}
else
{
// Handle the tag.
$tagName = $match[1][0];
if ($tag[1] == '/')
{
// This is a closing tag.
$openingTag = array_pop($tags);
assert($openingTag == $tagName); // check that tags are properly nested.
$out .= $tag;
}
else if ($tag[mb_strlen($tag) - 2] == '/')
{
// Self-closing tag.
$out .= $tag;
}
else
{
// Opening tag.
$out .= $tag;
$tags[] = $tagName;
}
}
// Continue after the tag.
$position = $tagPosition + mb_strlen($tag);
}
// Print any remaining text.
if ($printedLength < $maxLength && $position < mb_strlen($html))
$out .= mb_substr($html, $position, $maxLength - $printedLength);
// Close any open tags.
while (!empty($tags))
$out .= sprintf('</%s>', array_pop($tags));
return $out;
}
function mb_preg_match(
$ps_pattern,
$ps_subject,
&$pa_matches,
$pn_flags = 0,
$pn_offset = 0,
$ps_encoding = NULL
) {
// WARNING! - All this function does is to correct offsets, nothing else:
//(code is independent of PREG_PATTER_ORDER / PREG_SET_ORDER)
if (is_null($ps_encoding)) $ps_encoding = mb_internal_encoding();
$pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
$ret = preg_match($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);
if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE))
foreach($pa_matches as &$ha_match) {
$ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
}
return $ret;
}

Use the function truncateHTML() from:
https://github.com/jlgrall/truncateHTML
Example: truncate after 9 characters including the ellipsis:
truncateHTML(9, "<p><b>A</b> red ball.</p>", ['wholeWord' => false]);
// => "<p><b>A</b> red ba…</p>"
Features: UTF-8, configurable ellipsis, include/exclude length of ellipsis, self-closing tags, collapsing spaces, invisible elements (<head>, <script>, <noscript>, <style>, <!-- comments -->), HTML $entities;, truncating at last whole word (with option to still truncate very long words), PHP 5.6 and 7.0+, 240+ unit tests, returns a string (doesn't use the output buffer), and well commented code.
I wrote this function, because I really liked Søren Løvborg's function above (especially how he managed encodings), but I needed a bit more functionality and flexibility.

The CakePHP framework has a HTML-aware truncate() function in the Text Helper that works for me. See Text. MIT license. Link to source (provided by #Quentin).

This is very difficult to do without using a validator and a parser, the reason being that imagine if you have
<div id='x'>
<div id='y'>
<h1>Heading</h1>
500
lines
of
html
...
etc
...
</div>
</div>
How do you plan to truncate that and end up with valid HTML?
After a brief search, I found this link which could help.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

How to truncate string to first n words in PHP - php

Related

Facebook-like "show more" button for a string with URLs

Insert text in content after 300 words but after closing tag of a Paragraph

Place content in between paragraphs without images

PHP: Display the first 500 characters of HTML

Truncate text containing HTML, ignoring tags

Categories

Resources