Display post excerpts, limited by word count - php

I am working on my PHP website (not a WordPress site). On the main index I display the two newest posts. The problem is that the description shows the entire article; I need to display post excerpts instead, with a limit of maybe 35 words.
<?=$line["m_description"]?>
<?
$qresult3 = mysql_query("SELECT * FROM t_users WHERE u_id=".$line["m_userid"]." LIMIT 1");
if (mysql_num_rows($qresult3)<1) { ?>

<?php
// just the excerpt
function first_n_words($text, $number_of_words) {
// Where excerpts are concerned, HTML tends to behave
// like the proverbial ogre in the china shop, so best to strip that
$text = strip_tags($text);
// \w[\w'-]* allows for any word character (a-zA-Z0-9_) and also contractions
// and hyphenated words like 'range-finder' or "it's"
// the /s flag means that . matches \n, so this can match multiple lines
$text = preg_replace("/^\W*((\w[\w'-]*\b\W*){1,$number_of_words}).*/ms", '\\1', $text);
// strip out newline characters from our excerpt
return str_replace("\n", "", $text);
}
// excerpt plus link if shortened
function truncate_to_n_words($text, $number_of_words, $url, $readmore = 'read more') {
$text = strip_tags($text);
$excerpt = first_n_words($text, $number_of_words);
// we can't just look at the length or try == because we strip carriage returns
if( str_word_count($text) !== str_word_count($excerpt) ) {
$excerpt .= '... <a href="'.$url.'">'.$readmore.'</a>';
}
return $excerpt;
}
$src = <<<EOF
<b>My cool story</b>
<p>Here it is. It's really cool. I like it. I like lots of stuff.</p>
<p>I also like to read and write and carry on forever</p>
EOF;
echo first_n_words($src, 10);
echo "\n\n-----------------------------\n\n";
echo truncate_to_n_words($src, 10, 'http://www.google.com');
EDIT: Added functional example and accounted for punctuation and numbers in text
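As a simpler alternative to the regex above, here is a minimal sketch that counts whitespace-separated words. The name excerpt_words and the '...' suffix are assumptions made for illustration, not part of the answer above, and "word" here just means any run of non-whitespace characters.

```php
<?php
// Hypothetical helper: returns the first $limit words of $html as plain text.
function excerpt_words(string $html, int $limit = 35, string $suffix = '...'): string
{
    // Put a space before every tag so "</p><p>" does not glue two words
    // together, then drop the markup and collapse whitespace runs.
    $text = strip_tags(str_replace('<', ' <', $html));
    $text = trim(preg_replace('/\s+/', ' ', $text));
    if ($text === '') {
        return '';
    }
    // Ask for one extra chunk so we can tell whether anything was cut off.
    $words = explode(' ', $text, $limit + 1);
    if (count($words) <= $limit) {
        return $text; // short enough, nothing to truncate
    }
    return implode(' ', array_slice($words, 0, $limit)) . $suffix;
}

echo excerpt_words('<p>Here it is. It\'s really cool.</p><p>I like it.</p>', 5), PHP_EOL;
```

In the question's template this would be called as excerpt_words($line["m_description"], 35).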

I have a function for this. Other people may say it's not good, because I'm still learning PHP (tips welcome), but it will give you what you are looking for. It may need better coding if anyone has suggestions.
function Short($text, $length, $url, $more){
// note: $length is a character count, not a word count
$short = mb_substr($text, 0, $length);
if($short != $text) {
// cut back to the last complete word (multibyte-safe, to match mb_substr above)
$lastspace = mb_strrpos($short, ' ');
if($lastspace !== false) {
$short = mb_substr($short, 0, $lastspace);
}
if(!$more){
$more = "Read Full Post";
} // end if more is blank
$short .= "...[<a href='$url'>$more</a>]";
} // end if content != short
$short = str_replace("’","'", $short);
$short = stripslashes($short);
$short = nl2br($short);
return $short;
} // end short function
To Use:
say your article content is the variable $content
$short = Short($content, 35, "http://domain.com/article_post", "Read Full Story");
echo $short;
Similarly, you can adjust the function to remove $url and $more from it and just have the excerpt with ... at the end.

Related

php preg_match excluding text within html tags/attributes to find correct place to cut a string

I am trying to determine the absolute position of certain words within a block of html, but only if they are outside of an actual html tag. For instance, if I wanted to determine the position of the word "join" using preg_match in this text:
<p>There are 14 more days until our <a href="/somepage.html" target="_blank" rel="noreferrer noopener" aria-label="join us">holiday special</a> so come join us!</p>
I could use:
preg_match('/join/', $post_content, $matches, PREG_OFFSET_CAPTURE, $offset);
The problem is that this is matching the word within the aria-label attribute, when what I need is the one just after the link. It would be fine to match between the <a> and </a>, just not inside the brackets themselves.
My actual end goal, most of which (I think) I have working aside from this last element: I am trimming a block of html (not a full document) to cut off at a specific word count. I am trying to determine which character that last word ends at, and then joining the left side of the html block with only the html tags from the right side, so all html tags close gracefully. I thought I had it working until I ran into an example like the one I showed, where the last word also appeared within an html attribute, causing me to split the string at the wrong location. This is my code so far:
$post_content = strip_tags ( $p->post_content, "<a><br><p><ul><li>" );
$post_content_stripped = strip_tags ( $p->post_content );
$post_content_stripped = preg_replace("/[^A-Za-z0-9 ]/", ' ', $post_content_stripped);
$post_content_stripped = preg_replace("/\s+/", ' ', $post_content_stripped);
$post_content_stripped_array = explode ( " " , trim($post_content_stripped) );
$excerpt_wordcount = count( $post_content_stripped_array );
$cutpos = 0;
while($excerpt_wordcount>48){
$thiswordrev = "/" . strrev($post_content_stripped_array[$excerpt_wordcount - 1]) . "/";
preg_match($thiswordrev, strrev($post_content), $matches, PREG_OFFSET_CAPTURE, $cutpos);
$cutpos = $matches[0][1] + (strlen($thiswordrev) - 2);
array_pop($post_content_stripped_array);
$excerpt_wordcount = count( $post_content_stripped_array );
}
if($pwordcount>$excerpt_wordcount){
preg_match_all('/<\/?[^>]*>/', substr( $post_content, strlen($post_content) - $cutpos ), $closetags_result);
$excerpt_closetags = "" . $closetags_result[0][0];
$post_excerpt = substr( $post_content, 0, strlen($post_content) - $cutpos ) . $excerpt_closetags;
}else{
$post_excerpt = $post_content;
}
I am actually searching the string in reverse in this case, since I am walking word by word backwards from the end of the string, so I know that my html brackets are backwards, eg:
>p/<!su nioj emoc os >a/<laiceps yadiloh>"su nioj"=lebal-aira "renepoon rerreferon"=ler "knalb_"=tegrat "lmth.egapemos/"=ferh a< ruo litnu syad erom 41 era erehT>p<
But it's easy enough to flip all of the brackets before doing the preg_match, or I am assuming should be easy enough to have the preg_match account for that.
Do not use regex to parse HTML.
You have a simple objective: limit the text content to a given number of words, ensuring that the HTML remains valid.
To this end, I would suggest looping through text nodes until you count a certain number of words, and then removing everything after that.
$dom = new DOMDocument();
$dom->loadHTML($post_content);
$xpath = new DOMXPath($dom);
$all_text_nodes = $xpath->query("//text()");
$words_left = 48;
foreach( $all_text_nodes as $text_node) {
$text = $text_node->textContent;
$words = explode(" ", $text); // TODO: maybe preg_split on /\s/ to support more whitespace types
$word_count = count($words);
if( $word_count < $words_left) {
$words_left -= $word_count;
continue;
}
// reached the threshold
$words_that_fit = implode(" ", array_slice($words, 0, $words_left));
// If the above TODO is implemented, this will need to be adjusted to keep the specific whitespace characters
$text_node->textContent = $words_that_fit;
$remove_after = $text_node;
while( $remove_after->parentNode) {
while( $remove_after->nextSibling) {
$remove_after->parentNode->removeChild($remove_after->nextSibling);
}
$remove_after = $remove_after->parentNode;
}
break;
}
$output = substr($dom->saveHTML($dom->getElementsByTagName("body")->item(0)), strlen("<body>"), -strlen("</body>"));
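The TODO in the loop above (splitting on more kinds of whitespace while keeping the original whitespace characters) could be handled along these lines. This is only a sketch: take_words is a hypothetical helper name, and it returns both the kept text and the number of words it used so that the caller can decrement $words_left.

```php
<?php
// Hypothetical helper: keep at most $limit words of $text, preserving the
// original whitespace between the words that are kept.
// Returns [kept text, number of words kept].
function take_words(string $text, int $limit): array
{
    // Split on whitespace runs, but keep the delimiters in the result.
    $parts = preg_split('/(\s+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $kept = '';
    $count = 0;
    foreach ($parts as $part) {
        // A part is a word if it contains any non-whitespace character.
        $is_word = preg_match('/\S/', $part) === 1;
        if ($is_word && $count === $limit) {
            break; // the next word would exceed the limit
        }
        $kept .= $part;
        if ($is_word) {
            $count++;
        }
    }
    // Drop whitespace left dangling after the last kept word.
    return [rtrim($kept), $count];
}

[$fits, $used] = take_words("one  two\nthree four", 3);
var_dump($fits, $used);
```

Because empty strings and pure whitespace are not counted, whitespace-only text nodes between tags would no longer eat into the word budget.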
Ok, I figured out a workaround. I don't know if this is the most elegant solution, so if someone sees a better one I would still love to hear it, but for now I realized that I don't have to actually have the html in the string I am searching to determine the position to cut, I just need it to be the same length. I grabbed all of the html elements and just created a dummy string replacing all of them with the same number of asterisks:
// create faux string with placeholders instead of html for search purposes
preg_match_all('/<\/?[^>]*>/', $post_content, $alltags_result);
$tagcount = count( $alltags_result );
$post_content_dummy = $post_content;
foreach($alltags_result[0] as $thistag){
$post_content_dummy = str_replace($thistag, str_repeat("*",strlen($thistag)), $post_content_dummy);
}
Then I just use $post_content_dummy in the while loop instead of $post_content, in order to find the cut position, and then $post_content for the actual cut. So far seems to be working fine.
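The same masking idea can be done in a single pass with preg_replace_callback, which avoids the separate preg_match_all and str_replace loop. A minimal sketch, assuming the same tag pattern as above; mask_tags is a hypothetical name, and like the original pattern it will misbehave if an attribute value contains a literal ">".

```php
<?php
// Hypothetical helper: replace every HTML tag with asterisks of equal length,
// so character offsets in the masked string match offsets in the original.
function mask_tags(string $html): string
{
    return preg_replace_callback(
        '/<\/?[^>]*>/',
        function (array $m): string {
            return str_repeat('*', strlen($m[0]));
        },
        $html
    );
}

$html   = '<p>come <a aria-label="join us">join</a> us</p>';
$masked = mask_tags($html);
// The "join" inside aria-label is now masked, so the first hit is the visible word.
var_dump(strpos($masked, 'join'));
```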

PHP: Strip everything except a match

The following native function of my script strips some content from my article, such as bbcodes, html tags, and http or https URLs. I'd like to modify it so that it strips everything except whatever matches "https://my.website.com/variablepath".
function _bbcode_strip($text)
{
static $patterns = array();
if ($this->tp_bbcode)
{
// use text inside [topicpreview] bbcode as the topic preview
if (preg_match('#\[(topicpreview[^\[\]]+)\].*\[/\1\]#Usi', $text, $matches))
{
$text = $matches[0];
}
}
$text = smiley_text($text, true); // display smileys as text :)
$text = ($this->tp_line_breaks ? str_replace("\n", '&#13;&#10;', $text) : $text); // preserve line breaks
// Loop through text stripping inner most nested BBCodes until all have been removed
$regex = '#\[(' . $this->strip_bbcodes . ')[^\[\]]+\]((?:(?!\[\1[^\[\]]+\]).)+)\[\/\1[^\[\]]+\]#Usi';
while(preg_match($regex, $text))
{
$text = preg_replace($regex, '', $text);
}
if (empty($patterns))
{
$patterns = array(
'#<!-- [lmw] --><a class="postlink[^>]*>(.*<\/a[^>]*>)?<!-- [lmw] -->#Usi', // Magic URLs
'#<[^>]*>(.*<[^>]*>)?#Usi', // HTML code
'#\[/?[^\[\]]+\]#mi', // Strip all bbcode tags
'#(http|https|ftp|mailto)(:|\&\#58;)\/\/[^\s]+#i', // Strip remaining URLs
'#"#', // Possible quotes from older board conversions
'#[\s]+#' // Multiple spaces
);
}
return trim(preg_replace($patterns, ' ', $text));
}
I tried working on this part of the script myself, the regex array that decides what to strip:
'#<!-- [lmw] --><a class="postlink[^>]*>(.*<\/a[^>]*>)?<!-- [lmw] -->#Usi', // Magic URLs
'#<[^>]*>(.*<[^>]*>)?#Usi', // HTML code
'#\[/?[^\[\]]+\]#mi', // Strip all bbcode tags
'#(http|https|ftp|mailto)(:|\&\#58;)\/\/[^\s]+#i', // Strip remaining URLs
'#"#', // Possible quotes from older board conversions
'#[\s]+#' // Multiple spaces
but I didn't manage to do what I'm trying to, so I came here to ask for help. Thank you.
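"Strip everything except a match" is often easier to express as "extract the matches". The following is only a sketch of that idea, not a drop-in patch for _bbcode_strip(): extract_site_urls is a hypothetical helper, and the set of characters treated as the end of a URL (whitespace, quotes, angle and square brackets) is an assumption.

```php
<?php
// Hypothetical helper: return only the URLs that start with $base.
function extract_site_urls(string $text, string $base = 'https://my.website.com/'): array
{
    // preg_quote() escapes the dots in the host name; the character class
    // stops the match at whitespace, quotes, and HTML/BBCode brackets.
    $pattern = '#' . preg_quote($base, '#') . '[^\s"\'<>\[\]]*#i';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$text = 'See [url]https://my.website.com/a/b[/url] and http://other.example/x';
print_r(extract_site_urls($text));
```

The preview text could then be built with implode(' ', extract_site_urls($text)) in place of the final preg_replace call.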

How Can I Check First Few Characters, Add Characters After First Few if First Few Match x in PHP?

I want to perform a three step process:
Check the first few characters (ffc) of a string variable.
If ffc = x (another string of characters) Then
Insert x after ffc but before any other content in ffc.
How can I accomplish this in PHP?
Actual Use Case:
I am using WordPress and grabbing the_content() and moving it into a variable $content.
I want some characters to appear before the text in $content, but WP auto adds <p> tags (if wpautop is on, and I'd like to avoid turning it off), which means the characters I add appear above rather than on the same line as $content.
The goal here is to check if $content starts with <p> and if it does to insert after <p> the characters "Summary: ".
Here is what I have thus far (it isn't working):
<?php
$content = get_the_content();
echo $content;
$hasP = substr($content, 0, 3);
echo $hasP;
If ($hasP == '<p>') {
echo "Yes!";
$newString = substr($string, 3);
echo $newString;
};
?>
Unfortunately, it seems that WP just re-adds the <p> when I echo $newString.
So, this took a long time for me to figure out, but here is the solution I came up with (err, make that stole from other people):
<?php
add_filter('the_content', 'before_after');
the_content(); // the_content() applies the filter and echoes the result itself
?>
And then the actual magic happens here:
function before_after($content) {
$content = preg_replace('/<p>/', '<span>', $content, 1);
$content = preg_replace('/<\/p>/', '</span>', $content, 1);
return $content;
}
In the above I actually went beyond what I had initially stated and replaced both the opening and closing <p>.
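For the goal as originally stated (insert "Summary: " right after a leading <p>), a plain string check may be enough. A minimal sketch, assuming the content starts with exactly "<p>" and no attributes; insert_after_opening_p is a hypothetical name, and in WordPress it would still need to run as a the_content filter with a priority after wpautop.

```php
<?php
// Hypothetical helper: insert $prefix right after a leading "<p>",
// or in front of the content if it does not start with "<p>".
function insert_after_opening_p(string $content, string $prefix): string
{
    if (strncmp($content, '<p>', 3) === 0) {
        // Offset 3, length 0: insert without replacing anything.
        return substr_replace($content, $prefix, 3, 0);
    }
    return $prefix . $content;
}

echo insert_after_opening_p('<p>Hello world</p>', 'Summary: '), PHP_EOL;
```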

remove HTML from displaying in PHP

I have this text : http://pastebin.com/2Zgbs7hi
I want to remove the HTML code from it and display just the plain text, but keep at least one line break wherever there are currently several line breaks.
I have tried:
$ticket["summary"] = 'pastebin example';
$TicketSummaryDisplay = nl2br($ticket["summary"]);
$TicketSummaryDisplay = stripslashes($TicketSummaryDisplay);
$TicketSummaryDisplay = trim(strip_tags($TicketSummaryDisplay));
$TicketSummaryDisplay = preg_replace('/\n\s+$/m', '', $TicketSummaryDisplay);
echo $TicketSummaryDisplay;
That displays as plain text, but it shows everything as one big block of text with no line breaks at all.
Maybe this will save you some time.
<?php
libxml_use_internal_errors(true); //crazy o tags
$html = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$dom = new DOMDocument;
$dom->loadHTML($html);
$result='';
foreach ($dom->getElementsByTagName('p') as $node) {
if (strstr($node->nodeValue, 'Legal Disclaimer:')){
break;
}
$result .= $node->nodeValue;
}
echo $result;
This example should store the text from the HTML in an array of strings.
After stripping all the tags, you can use preg_split with the \R escape sequence (which matches any newline sequence) to convert the string into an array. That array will contain several blank values as well as some HTML non-breaking space entities, so we filter it with array_filter(), which removes every item that does not satisfy the callback, in our case the empty ones. The &nbsp; entity is a problem here: &nbsp; and the space character are not the same, they have different character codes, so trim() will not remove it. There are two possible solutions: the first, uncommented part only replaces &nbsp; and then checks for whitespace, while the second, commented-out part decodes all HTML entities and then also checks for spaces.
PHP:
$text = file_get_contents( 'http://pastebin.com/raw.php?i=2Zgbs7hi' );
$text = strip_tags( $text );
$array = array_filter(
preg_split( '/\R/', $text ),
function( &$item ) {
$item = str_replace( '&nbsp;', ' ', $item );
return trim( $item );
// $item = html_entity_decode( $item );
// return trim( str_replace( "\xC2\xA0", ' ', $item ) );
}
);
foreach( $array as $value ) {
echo $value . '<br />';
}
Array output:
Array
(
[8] => Hi,
[11] => Ashley has explained that I need to ask for another line and broadband for the wifi to work, please can you arrange this.
[13] => Regards
[23] => Legal Disclaimer:
[24] => This email and its attachments are confidential. If you received it by mistake, please don’t share it. Let us know and then delete it. Its content does not necessarily represent the views of The Dragon Enterprise
[25] => Centre and we cannot guarantee the information it contains is complete. All emails are monitored and may be seen by another member of The Dragon Enterprise Centre's staff for internal use
)
Now you should have a clean array containing only the items that have a value. By the way, line breaks in HTML are expressed with <br />, not with \n; your example still contains the \n characters when served to a browser, but they are only visible in the page source. I hope I did not miss the point of the question.
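If a single string is preferred over an array, the same steps can be combined so that each gap between paragraphs ends up as exactly one line break. This is a sketch under assumptions: html_to_lines is a hypothetical name, and the list of tags that count as a line break is a guess about the input.

```php
<?php
// Hypothetical helper: HTML to plain text, keeping one line break per gap.
function html_to_lines(string $html): string
{
    // Turn <br> and closing block tags into newlines before stripping tags.
    $text = preg_replace('#<br\s*/?>|</(p|div|li|tr|h[1-6])>#i', "\n", $html);
    $text = strip_tags($text);
    // Decode entities, then map the non-breaking space (U+00A0) to a space.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = str_replace("\u{00A0}", ' ', $text);
    // Normalise line endings, then collapse every whitespace run that
    // contains a newline into a single newline.
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    $text = preg_replace('/[ \t]*\n\s*/', "\n", $text);
    return trim($text);
}

$html = "<p>Hi,</p>\r\n<p>&nbsp;</p>\n\n<p>Line two<br />Regards</p>";
echo nl2br(html_to_lines($html)), PHP_EOL;
```

nl2br() is applied only at output time, so the stored text stays plain.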
Try this to get text output with line breaks:
<?php
$ticket["summary"] = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$TicketSummaryDisplay = nl2br($ticket["summary"]);
echo strip_tags($TicketSummaryDisplay,'<br>');
?>
You are asking on how to add line-breaks to your "one big block of text with no line breaks at all".
Short answer
After you stripped the HTML tags, apply wordwrap with a desired text-block length
$text = wordwrap($text, 90, "<br />\n");
I really wonder, why nobody suggested that function before.
There is also chunk_split, which doesn't take words into account and simply splits after a certain number of characters, breaking words in the process, but I guess that's not what you want.
PHP
<?php
$text = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
/**
* Returns string without html tags, also
* takes control chars, spaces and "&nbsp;" into account.
*/
function dropHtmlTags($string) {
// remove html tags
//$string = preg_replace ('/<[^>]*>/', ' ', $string);
$string = strip_tags($string);
// control characters and "&nbsp;"
$string = str_replace("\r", '', $string); // remove
$string = str_replace("\n", ' ', $string); // replace with space
$string = str_replace("\t", ' ', $string); // replace with space
$string = str_replace("&nbsp;", ' ', $string); // replace with space
// remove multiple spaces
$string = preg_replace('/ {2,}/', ' ', $string);
$string = trim($string);
return $string;
}
$text = dropHtmlTags($text);
// The Answer: insert line breaks after 95 chars,
// to get rid of the "one big block of text with no line breaks at all"
$text = wordwrap($text, 95, "<br />\n");
// if you want to insert line-breaks before the legal disclaimer,
// uncomment the next line
//$text = str_replace("Regards Legal Disclaimer", "<br /><br />Regards Legal Disclaimer", $text);
echo $text;
?>
Hello it can be done as follows:
$abc= file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$abc = strip_tags($abc); // the string comes first; the optional second argument lists allowed tags
echo $abc;
Please, let me know whether it works
you may use
<?php
$a= file_get_contents('a.txt');
echo nl2br(htmlspecialchars($a));
?>
<?php
$handle = @fopen("pastebin.html", "r");
if ($handle) {
while (!feof($handle)) {
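// fgetss() was removed in PHP 8.0; on older versions use fgetss($handle, 4096).
// strip_tags() per line is a close substitute unless a tag spans several lines.
$buffer = strip_tags((string) fgets($handle, 4096));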
echo $buffer;
}
fclose($handle);
}
?>
output is
Hi,
Ashley has explained that I need to ask for another line and broadband for the wifi to work, please can you arrange this.
Regards
Legal Disclaimer:
This email and its attachments are confidential. If you received it by mistake, please don’t share it. Let us know and then delete it. Its content does not necessarily represent the views of The Dragon Enterprise
Centre and we cannot guarantee the information it contains is complete. All emails are monitored and may be seen by another member of The Dragon Enterprise Centre's staff for internal use
You can probably write additional code to convert &nbsp; to spaces etc.
I'm not sure I did understand everything correctly but this seems to be your expected result:
$txt = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
var_dump(preg_replace("/(\&nbsp\;(\s{1,})?)+/", "\n", trim(strip_tags(preg_replace("/(\s){1,}/", " ", $txt)))));
//more readable
$txt = preg_replace("/(\s){1,}/", " ", $txt);
$txt = trim(strip_tags($txt));
$txt = preg_replace("/(\&nbsp\;(\s{1,})?)+/", "\n", $txt);
The strip_tags() function strips HTML and PHP tags from a string, if that is what you are trying to accomplish.
Examples from the docs:
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
The above example will output:
Test paragraph. Other text
<p>Test paragraph.</p> Other text

Removing title content from page html

Here I am creating a preview for a URL, which shows:
Url title
Url description (the title should not appear in this)
Here is my attempt:
<?php
function plaintext($html)
{
$plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html);
// remove title
//$plaintext = preg_match('#<title>(.*?)</title>#', $html);
// remove comments and any content found in the the comment area (strip_tags only removes the actual tags).
$plaintext = preg_replace('#<!--.*?-->#s', '', $plaintext);
// put a space between list items (strip_tags just removes the tags).
$plaintext = preg_replace('#</li>#', ' </li>', $plaintext);
// remove all script and style tags
$plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', "", $plaintext);
// remove br tags (missed by strip_tags)
$plaintext = preg_replace("#<br[^>]*?>#", " ", $plaintext);
// remove all remaining html
$plaintext = strip_tags($plaintext);
return $plaintext;
}
function get_title($html)
{
return preg_match('!<title>(.*?)</title>!i', $html, $matches) ? $matches[1] : '';
}
function trim_display($size,$string)
{
$trim_string = substr($string, 0, $size);
$trim_string = $trim_string . "...";
return $trim_string;
}
$url = "http://www.nextbigwhat.com/indian-startups/";
$data = file_get_contents($url);
//$url = trim_url(5,$url);
$title = get_title($data);
echo "title is ; $title";
$content = plaintext($data);
$Preview = trim_display(100,$content);
echo '<br/>';
echo "preview is: $Preview";
?>
The URL title appears correctly. But although I have excluded the title content from the description, it still appears there.
I used $plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html); to exclude the title from the plain text.
As far as I can tell the regex is correct, yet it does not exclude the title content.
What is the problem here?
output we get here is:
title is ; Indian Startups Archives - NextBigWhat.com
preview is: Indian Startups Archives : NextBigWhat.com [whatever rest text]...
The text that appears in the title should not appear again in the preview. That's why I want to exclude it and display only the remaining text in the preview.
How can I solve this mystery?
If you look closely at the title and the preview, they're different. Let's look at the output:
echo plaintext($data);
Well, it seems it has two titles:
<title>
Indian Startups Archives : NextBigWhat.com</title>
and
<title>Indian Startups Archives - NextBigWhat.com</title>
So get_title() retrieves the second title, while plaintext() leaves the first one alone. What's the difference between them? The line break! Your regex doesn't match titles that contain newline characters, which is exactly what the /s modifier in regular expressions is for.
tl;dr
Your regex is wrong, add 's' to it.
$plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#s', ' ', $html);
instead of
$plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html);
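One caveat worth checking: with /s the greedy (.*) runs from the first <title to the last </title>, so on a page with two title elements everything between them is removed as well. A minimal sketch of a non-greedy variant follows; strip_titles is a hypothetical name, and the added /i flag is an assumption that tag case may vary.

```php
<?php
// Hypothetical helper: remove every <title>...</title> element from $html.
function strip_titles(string $html): string
{
    // s: "." also matches newlines; i: case-insensitive; .*? stops at the
    // nearest closing tag instead of the last one in the document.
    return preg_replace('#<title\b[^>]*>.*?</title>#is', ' ', $html);
}

$html = "<title>\nFirst : Site</title><p>Body</p><title>Second - Site</title>";
var_dump(strip_titles($html));
```

With this input the paragraph between the two titles survives, whereas the greedy pattern with /s replaces the whole string with a single space.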
