Removing title content from page html - php

I am building a preview for a URL. It should show:
The URL title
The URL description (the title text should not appear in it)
Here is my attempt:
<?php
function plaintext($html)
{
$plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html);
// remove title
//$plaintext = preg_match('#<title>(.*?)</title>#', $html);
// remove comments and any content found in the comment area (strip_tags only removes the actual tags).
$plaintext = preg_replace('#<!--.*?-->#s', '', $plaintext);
// put a space between list items (strip_tags just removes the tags).
$plaintext = preg_replace('#</li>#', ' </li>', $plaintext);
// remove all script and style tags
$plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', "", $plaintext);
// remove br tags (missed by strip_tags)
$plaintext = preg_replace("#<br[^>]*?>#", " ", $plaintext);
// remove all remaining html
$plaintext = strip_tags($plaintext);
return $plaintext;
}
function get_title($html)
{
return preg_match('!<title>(.*?)</title>!i', $html, $matches) ? $matches[1] : '';
}
function trim_display($size,$string)
{
$trim_string = substr($string, 0, $size);
$trim_string = $trim_string . "...";
return $trim_string;
}
$url = "http://www.nextbigwhat.com/indian-startups/";
$data = file_get_contents($url);
//$url = trim_url(5,$url);
$title = get_title($data);
echo "title is ; $title";
$content = plaintext($data);
$Preview = trim_display(100,$content);
echo '<br/>';
echo "preview is: $Preview";
?>
The URL title appears correctly. But even though I exclude the title content from the description, it still appears there.
I used $plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html); to exclude the title from the plain text.
The regex looks correct to me, yet it does not exclude the title content.
What is the problem here?
The output we get is:
title is ; Indian Startups Archives - NextBigWhat.com
preview is: Indian Startups Archives : NextBigWhat.com [whatever rest text]...
The text that appears in the title should not appear again in the preview. That's why I want to exclude it and display only the remaining text in the preview.

How to solve the mystery
If you look closely at the title and the preview, they are different. Let's look at what the fetched page actually contains.
echo plaintext($data);
Well, it seems it has two titles:
<title>
Indian Startups Archives : NextBigWhat.com</title>
and
<title>Indian Startups Archives - NextBigWhat.com</title>
So get_title retrieves the second title, and plaintext removes only that one, leaving the first untouched. What's the difference between them? The line break! By default . does not match newline characters, so your regex cannot match a title that spans more than one line. That is exactly what the s modifier is for: it makes . match newlines as well.
tl;dr
Your regex is wrong, add 's' to it.
$plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#s', ' ', $html);
instead of
$plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html);
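One caveat with the s modifier, sketched here on the assumption that the page really contains two title elements as shown above: once . matches newlines, the greedy (.*) can run from the first <title to the last </title> and remove everything in between. A lazy quantifier avoids that. The helper name remove_titles and the sample HTML are illustrative, not taken from the question.

```php
<?php
// Hypothetical helper (not from the original post): removes every
// <title>...</title> element, including ones that span several lines.
function remove_titles(string $html): string
{
    // i: case-insensitive tag names
    // s: let . match newlines
    // .*? is lazy, so each title element is matched on its own
    return preg_replace('#<title\b[^>]*>.*?</title>#is', ' ', $html) ?? $html;
}

// Illustrative sample with two titles, the first containing a line break.
$html = "<title>\nFirst : Example</title><p>Body text</p><title>Second - Example</title>";

echo strip_tags(remove_titles($html));
```

With the lazy pattern the paragraph between the two titles survives, which would not be the case with a greedy (.*) and the s modifier.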

Related

PHP only removing specific words without affecting rest of the string

I'm using PHP to remove the prefix "Tag:" from my title. It works fine, except that when the following word starts with "T", that letter is removed as well.
This is how I've set up my code:
<?php $tag = "Tag: ";
if( str_replace( $tag, "", $title) == true ):
else:echo ltrim($title, $tag);
endif ?>
So when my title is Tag: Home it returns Home just fine,
but if my title is something like Tag: Teachers it returns eachers instead.
How do I make it display any title starting with T without that letter being removed?
Try this. It returns only Teacher (with a leading space, because only "Tag:" is removed):
$str = "Tag: Teacher";
$str = str_replace("Tag:", "", $str);
echo $str;
Use the following to remove the Tag: prefix:
<?php
$title = 'Tag: Teachers';
echo removeTag($title); // Teachers
function removeTag(string $str, string $tag = "Tag: "): string
{
$str = str_replace($tag, "", $str);
return $str;
}
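The reason for the missing T is that the second argument of ltrim() is a list of characters, not a prefix: every leading character that appears in "Tag: " is stripped, and the T of Teachers is one of them. Below is a minimal sketch that removes the prefix only when the string actually starts with it. The name strip_prefix is hypothetical and does not come from the answers above.

```php
<?php
// Hypothetical helper: removes $prefix only if $str begins with it.
function strip_prefix(string $str, string $prefix): string
{
    // strncmp compares the first strlen($prefix) bytes of both strings.
    if ($prefix !== '' && strncmp($str, $prefix, strlen($prefix)) === 0) {
        return substr($str, strlen($prefix));
    }
    return $str;
}

echo strip_prefix('Tag: Teachers', 'Tag: '), "\n"; // Teachers
echo strip_prefix('Teachers', 'Tag: '), "\n";      // Teachers (unchanged)
echo ltrim('Tag: Teachers', 'Tag: '), "\n";        // eachers
```

Unlike str_replace(), this leaves any later occurrence of "Tag: " inside the title untouched.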

PHP: Strip everything except a match

The following native function of my script strips some content from my article, such as BBCodes, HTML tags, and http or https URLs. I'd like to modify it so that it strips everything except whatever matches "https://my.website.com/variablepath".
function _bbcode_strip($text)
{
static $patterns = array();
if ($this->tp_bbcode)
{
// use text inside [topicpreview] bbcode as the topic preview
if (preg_match('#\[(topicpreview[^\[\]]+)\].*\[/\1\]#Usi', $text, $matches))
{
$text = $matches[0];
}
}
$text = smiley_text($text, true); // display smileys as text :)
$text = ($this->tp_line_breaks ? str_replace("\n", '
', $text) : $text); // preserve line breaks
// Loop through text stripping inner most nested BBCodes until all have been removed
$regex = '#\[(' . $this->strip_bbcodes . ')[^\[\]]+\]((?:(?!\[\1[^\[\]]+\]).)+)\[\/\1[^\[\]]+\]#Usi';
while(preg_match($regex, $text))
{
$text = preg_replace($regex, '', $text);
}
if (empty($patterns))
{
$patterns = array(
'#<!-- [lmw] --><a class="postlink[^>]*>(.*<\/a[^>]*>)?<!-- [lmw] -->#Usi', // Magic URLs
'#<[^>]*>(.*<[^>]*>)?#Usi', // HTML code
'#\[/?[^\[\]]+\]#mi', // Strip all bbcode tags
'#(http|https|ftp|mailto)(:|\&\#58;)\/\/[^\s]+#i', // Strip remaining URLs
'#"#', // Possible quotes from older board conversions
'#[\s]+#' // Multiple spaces
);
}
return trim(preg_replace($patterns, ' ', $text));
}
I tried myself working on this part of the script which is the regex which decides what to strip:
'#<!-- [lmw] --><a class="postlink[^>]*>(.*<\/a[^>]*>)?<!-- [lmw] -->#Usi', // Magic URLs
'#<[^>]*>(.*<[^>]*>)?#Usi', // HTML code
'#\[/?[^\[\]]+\]#mi', // Strip all bbcode tags
'#(http|https|ftp|mailto)(:|\&\#58;)\/\/[^\s]+#i', // Strip remaining URLs
'#"#', // Possible quotes from older board conversions
'#[\s]+#' // Multiple spaces
but I didn't manage to do what I'm trying to, so I came here to ask for help. Thank you.
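One possible direction, sketched under assumptions and not as a tested change to the script above: since the goal is to keep only the matching URLs, it may be simpler to collect them with preg_match_all() than to extend the strip patterns. The helper name keep_matching_urls and the rule that a URL ends at whitespace, a quote, or an angle or square bracket are illustrative choices.

```php
<?php
// Hypothetical helper: returns only the URLs that start with $base.
function keep_matching_urls(string $text, string $base): string
{
    // preg_quote escapes the special characters of the base URL;
    // the character class stops at whitespace, quotes and brackets.
    $regex = '#' . preg_quote($base, '#') . '[^\s"\'<>\[\]]*#i';
    preg_match_all($regex, $text, $matches);
    return implode(' ', $matches[0]);
}

// Illustrative input mixing BBCode, HTML and an unrelated URL.
$text = '[b]See[/b] <a href="https://my.website.com/a/b">this</a> and http://other.example/x';
echo keep_matching_urls($text, 'https://my.website.com/');
```

In _bbcode_strip() this would replace the final preg_replace() call only if nothing else from the text should survive.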

text only from title to make seo url

I am working on code where I upload HTML, and the same HTML is stored as the content, with its first characters used as the title and the SEO URL.
My problem is with building the title: I am unable to get only plain text from the HTML string to use as the title and SEO URL.
Below is my code to get the title from the HTML text:
$title = getplaintextintrofromhtml($str,100);
$title = str_replace(PHP_EOL, '', $title);
$title = str_replace(" "," ", $title);
$title = str_replace(str_split('\\/:*?"<>|,+=-'), '', $title);
$title = str_replace("'","", $title);
$title = str_replace("<br>","", $title);
$title = str_replace("\n","", $title);
$title = trim($title);
SEO URL:
$newurltitle=str_replace(" ","-",$title);
and the function:
function getplaintextintrofromhtml($html, $numchars) {
// Remove the HTML tags
$html = strip_tags($html);
// Convert HTML entities to single characters
$html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
// Make the string the desired number of characters
// Note that substr is not good as it counts by bytes and not characters
$html = mb_substr($html, 0, $numchars, 'UTF-8');
// Add an elipsis
return $html;
}
Even with the code above, I still get titles containing newlines. I cannot figure out why this happens: I am getting plain text, but the newlines are still there, so I cannot use the titles to build SEO URLs either.
You can use the following code to remove newlines, extra spaces, and line feeds:
$title = preg_replace('/\s+/', ' ', $title);
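Building on that, here is a rough sketch of a single slug function. It assumes the title is valid UTF-8; the name make_slug is hypothetical. The non-breaking space is listed explicitly because html_entity_decode() turns &nbsp; into U+00A0, which a plain \s without the u modifier does not match.

```php
<?php
// Hypothetical helper: turns a plain-text title into a URL slug.
function make_slug(string $title): string
{
    // Collapse every run of whitespace, including newlines, tabs and
    // the non-breaking space left behind by html_entity_decode().
    $title = preg_replace('/[\s\x{00A0}]+/u', ' ', $title);
    // Drop everything except letters, digits, spaces and hyphens.
    $title = preg_replace('/[^\p{L}\p{N} -]+/u', '', $title);
    // Spaces become hyphens; repeated hyphens are merged.
    $slug = preg_replace('/[ -]+/', '-', trim($title));
    return trim($slug, '-');
}

echo make_slug("My  first\r\npost:\u{00A0}hello, world!");
```

A whitelist of allowed characters tends to be easier to maintain than a growing list of str_replace() calls for characters to remove.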

remove HTML from displaying in PHP

I have this text: http://pastebin.com/2Zgbs7hi
I want to remove the HTML code from it and display just the plain text, but keep at least one line break wherever there are currently several line breaks.
I have tried:
$ticket["summary"] = 'pastebin example';
$TicketSummaryDisplay = nl2br($ticket["summary"]);
$TicketSummaryDisplay = stripslashes($TicketSummaryDisplay);
$TicketSummaryDisplay = trim(strip_tags($TicketSummaryDisplay));
$TicketSummaryDisplay = preg_replace('/\n\s+$/m', '', $TicketSummaryDisplay);
echo $TicketSummaryDisplay;
That displays plain text, but as one big block with no line breaks at all.
Maybe this will save you some time.
<?php
libxml_use_internal_errors(true); //crazy o tags
$html = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$dom = new DOMDocument;
$dom->loadHTML($html);
$result='';
foreach ($dom->getElementsByTagName('p') as $node) {
if (strstr($node->nodeValue, 'Legal Disclaimer:')){
break;
}
$result .= $node->nodeValue;
}
echo $result;
This example stores the text from the HTML in an array of strings.
After stripping all the tags, you can use preg_split with the \R escape sequence (it matches any newline sequence) to convert the string into an array. That array will contain several blank values as well as some HTML non-breaking space entities, so we filter it with array_filter(), which removes every item for which the callback returns a falsy value; in our case, an empty string. There is one problem with the &nbsp; entity: &nbsp; and the space character are not the same and have different character codes, so trim() will not remove it. Here are two possible solutions: the first, uncommented part only replaces &nbsp; before checking for whitespace, while the second, commented-out part decodes all HTML entities and then also checks for non-breaking spaces.
PHP:
$text = file_get_contents( 'http://pastebin.com/raw.php?i=2Zgbs7hi' );
$text = strip_tags( $text );
$array = array_filter(
preg_split( '/\R/', $text ),
function( $item ) {
// array_filter() passes items by value, so only this check sees the cleaned copy
$item = str_replace( '&nbsp;', ' ', $item );
return trim( $item );
// $item = html_entity_decode( $item );
// return trim( str_replace( "\xC2\xA0", ' ', $item ) );
}
);
foreach( $array as $value ) {
echo $value . '<br />';
}
Array output:
Array
(
[8] => Hi,
[11] => Ashley has explained that I need to ask for another line and broadband for the wifi to work, please can you arrange this.
[13] => Regards
[23] => Legal Disclaimer:
[24] => This email and its attachments are confidential. If you received it by mistake, please don’t share it. Let us know and then delete it. Its content does not necessarily represent the views of The Dragon Enterprise
[25] => Centre and we cannot guarantee the information it contains is complete. All emails are monitored and may be seen by another member of The Dragon Enterprise Centre's staff for internal use
)
Now you should have a clean array containing only the items that have a value. By the way, line breaks in HTML are expressed with <br />, not with \n; your example still contains the \n characters when viewed as a response in a web browser, but they are only visible in the page source. I hope I did not miss the point of the question.
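The steps described above can also be sketched as one function that returns text with single line breaks. This is an illustration under assumptions, not the code from the answer: the name html_to_text, the list of block-level tags, and the sample HTML are made up for the example.

```php
<?php
// Hypothetical helper: plain text from HTML, keeping single line breaks.
function html_to_text(string $html): string
{
    // Turn <br> and the end of common block elements into newlines
    // before the tags are stripped.
    $html = preg_replace('#<br\s*/?>|</(p|div|li|tr|h[1-6])>#i', "\n", $html);
    $text = strip_tags($html);
    // Decode entities; &nbsp; becomes U+00A0, which is replaced by a space.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = str_replace("\u{00A0}", ' ', $text);
    // Trim each line, drop the empty ones, and rejoin with one newline.
    $lines = array_filter(array_map('trim', preg_split('/\R/', $text)), 'strlen');
    return implode("\n", $lines);
}

// Illustrative sample with an empty paragraph and repeated line breaks.
$html = "<p>Hi,</p>\n<p>&nbsp;</p>\n\n<p>Please arrange this.<br />Regards</p>";
echo nl2br(html_to_text($html));
```

Because the result uses \n, nl2br() (or a <pre> element) is still needed to make the breaks visible in a browser.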
Try this to get text output with line breaks:
<?php
$ticket["summary"] = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$TicketSummaryDisplay = nl2br($ticket["summary"]);
echo strip_tags($TicketSummaryDisplay,'<br>');
?>
You are asking on how to add line-breaks to your "one big block of text with no line breaks at all".
Short answer
After you stripped the HTML tags, apply wordwrap with a desired text-block length
$text = wordwrap($text, 90, "<br />\n");
I really wonder, why nobody suggested that function before.
There is also chunk_split, which doesn't take words into account and just splits after a certain number of characters, breaking words. But I guess that's not what you want.
PHP
<?php
$text = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
/**
* Returns the string without HTML tags; also
* takes control chars, spaces and "&nbsp;" into account.
*/
function dropHtmlTags($string) {
// remove html tags
//$string = preg_replace ('/<[^>]*>/', ' ', $string);
$string = strip_tags($string);
// control characters and "&nbsp;"
$string = str_replace("\r", '', $string); // remove
$string = str_replace("\n", ' ', $string); // replace with space
$string = str_replace("\t", ' ', $string); // replace with space
$string = str_replace("&nbsp;", ' ', $string);
// remove multiple spaces
$string = preg_replace('/ {2,}/', ' ', $string);
$string = trim($string);
return $string;
}
$text = dropHtmlTags($text);
// The Answer: insert line breaks after 95 chars,
// to get rid of the "one big block of text with no line breaks at all"
$text = wordwrap($text, 95, "<br />\n");
// if you want to insert line-breaks before the legal disclaimer,
// uncomment the next line
//$text = str_replace("Regards Legal Disclaimer", "<br /><br />Regards Legal Disclaimer", $text);
echo $text;
?>
Result
first section shows your text block
second section shows the text with wordwrap applied (code from above)
Hello it can be done as follows:
$abc= file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$abc = strip_tags($abc);
echo $abc;
Please, let me know whether it works
You may use:
<?php
$a= file_get_contents('a.txt');
echo nl2br(htmlspecialchars($a));
?>
<?php
$handle = @fopen("pastebin.html", "r");
if ($handle) {
while (!feof($handle)) {
// fgetss() reads a line and strips HTML tags (deprecated in PHP 7.3, removed in PHP 8.0)
$buffer = fgetss($handle, 4096);
echo $buffer;
}
fclose($handle);
}
?>
output is
Hi,
Ashley has explained that I need to ask for another line and broadband for the wifi to work, please can you arrange this.
Regards
Legal Disclaimer:
This email and its attachments are confidential. If you received it by mistake, please don’t share it. Let us know and then delete it. Its content does not necessarily represent the views of The Dragon Enterprise
Centre and we cannot guarantee the information it contains is complete. All emails are monitored and may be seen by another member of The Dragon Enterprise Centre's staff for internal use
You can probably write additional code to convert &nbsp; to spaces etc.
I'm not sure I understood everything correctly, but this seems to be your expected result:
$txt = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
var_dump(preg_replace("/(\&nbsp\;(\s{1,})?)+/", "\n", trim(strip_tags(preg_replace("/(\s){1,}/", " ", $txt)))));
//more readable
$txt = preg_replace("/(\s){1,}/", " ", $txt);
$txt = trim(strip_tags($txt));
$txt = preg_replace("/(\&nbsp\;(\s{1,})?)+/", "\n", $txt);
The strip_tags() function strips HTML and PHP tags from a string, if that is what you are trying to accomplish.
Examples from the docs:
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
The above example will output:
Test paragraph. Other text
<p>Test paragraph.</p> Other text

Display post excerpts, limited by word count

I am working on my PHP website (not a WordPress site). On the main index page I display the two newest posts. The problem is that the description shows the entire article; I need to display post excerpts instead, with a limit of maybe 35 words.
<?=$line["m_description"]?>
<?
$qresult3 = mysql_query("SELECT * FROM t_users WHERE u_id=".$line["m_userid"]." LIMIT 1");
if (mysql_num_rows($qresult3)<1) { ?>
<?php
// just the excerpt
function first_n_words($text, $number_of_words) {
// Where excerpts are concerned, HTML tends to behave
// like the proverbial ogre in the china shop, so best to strip that
$text = strip_tags($text);
// \w[\w'-]* allows for any word character (a-zA-Z0-9_) and also contractions
// and hyphenated words like 'range-finder' or "it's"
// the /s flag means that . matches \n, so this can match multiple lines
$text = preg_replace("/^\W*((\w[\w'-]*\b\W*){1,$number_of_words}).*/ms", '\\1', $text);
// strip out newline characters from our excerpt
return str_replace("\n", "", $text);
}
// excerpt plus link if shortened
function truncate_to_n_words($text, $number_of_words, $url, $readmore = 'read more') {
$text = strip_tags($text);
$excerpt = first_n_words($text, $number_of_words);
// we can't just look at the length or try == because we strip carriage returns
if( str_word_count($text) !== str_word_count($excerpt) ) {
$excerpt .= '... <a href="'.$url.'">'.$readmore.'</a>';
}
return $excerpt;
}
$src = <<<EOF
<b>My cool story</b>
<p>Here it is. It's really cool. I like it. I like lots of stuff.</p>
<p>I also like to read and write and carry on forever</p>
EOF;
echo first_n_words($src, 10);
echo "\n\n-----------------------------\n\n";
echo truncate_to_n_words($src, 10, 'http://www.google.com');
EDIT: Added functional example and accounted for punctuation and numbers in text
I have a function for this. Other people may say it's not good, because I'm still learning PHP too (tips welcome, people), but it will give you what you are looking for. It may need better coding if anyone has suggestions.
function Short($text, $length, $url, $more){
    $short = mb_substr($text, 0, $length);
    if($short != $text) {
        // cut back to the last space so that a word is not split
        $lastspace = mb_strrpos($short, ' ');
        if($lastspace !== false){
            $short = mb_substr($short, 0, $lastspace);
        }
        if(!$more){
            $more = "Read Full Post";
        } // end if more is blank
        $short .= "...[<a href='$url'>$more</a>]";
    } // end if content != short
    $short = str_replace("’","'", $short);
    $short = stripslashes($short);
    $short = nl2br($short);
    return $short;
} // end short function
To Use:
Say your article content is in the variable $content (note that $length counts characters, not words):
$short = Short($content, 35, "http://domain.com/article_post", "Read Full Story");
echo $short;
Similarly, you can adjust the function to remove $url and $more from it and just have the excerpt with ... at the end.
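Since the question asks for a limit in words instead of characters, here is a minimal sketch of that variant. It assumes that words are separated by whitespace; the name excerpt_words and its default values are hypothetical.

```php
<?php
// Hypothetical helper: returns at most $limit words of $text as plain text.
function excerpt_words(string $text, int $limit = 35, string $suffix = '...'): string
{
    // Strip tags and split on any run of whitespace.
    $words = preg_split('/\s+/', trim(strip_tags($text)), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) <= $limit) {
        // Short enough already: no suffix needed.
        return implode(' ', $words);
    }
    return implode(' ', array_slice($words, 0, $limit)) . $suffix;
}

echo excerpt_words('<p>Here it is. It\'s really cool. I like it.</p>', 5);
// Here it is. It's really...
```

In the template from the question, the call would take the place of the raw description, for example excerpt_words($line["m_description"], 35).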
