PHP: Strip everything except a match - php

The following native function of my script does strip some content from my article such as bbcodes, html tags, and http or https websites, and I'd like to modify it so that it does strip everything except anything which matches "https://my.website.com/variablepath".
function _bbcode_strip($text)
{
static $patterns = array();
if ($this->tp_bbcode)
{
// use text inside [topicpreview] bbcode as the topic preview
if (preg_match('#\[(topicpreview[^\[\]]+)\].*\[/\1\]#Usi', $text, $matches))
{
$text = $matches[0];
}
}
$text = smiley_text($text, true); // display smileys as text :)
$text = ($this->tp_line_breaks ? str_replace("\n", '
', $text) : $text); // preserve line breaks
// Loop through text stripping inner most nested BBCodes until all have been removed
$regex = '#\[(' . $this->strip_bbcodes . ')[^\[\]]+\]((?:(?!\[\1[^\[\]]+\]).)+)\[\/\1[^\[\]]+\]#Usi';
while(preg_match($regex, $text))
{
$text = preg_replace($regex, '', $text);
}
if (empty($patterns))
{
$patterns = array(
'#<!-- [lmw] --><a class="postlink[^>]*>(.*<\/a[^>]*>)?<!-- [lmw] -->#Usi', // Magic URLs
'#<[^>]*>(.*<[^>]*>)?#Usi', // HTML code
'#\[/?[^\[\]]+\]#mi', // Strip all bbcode tags
'#(http|https|ftp|mailto)(:|\&\#58;)\/\/[^\s]+#i', // Strip remaining URLs
'#"#', // Possible quotes from older board conversions
'#[\s]+#' // Multiple spaces
);
}
return trim(preg_replace($patterns, ' ', $text));
}
I tried myself working on this part of the script which is the regex which decides what to strip:
'#<!-- [lmw] --><a class="postlink[^>]*>(.*<\/a[^>]*>)?<!-- [lmw] -->#Usi', // Magic URLs
'#<[^>]*>(.*<[^>]*>)?#Usi', // HTML code
'#\[/?[^\[\]]+\]#mi', // Strip all bbcode tags
'#(http|https|ftp|mailto)(:|\&\#58;)\/\/[^\s]+#i', // Strip remaining URLs
'#"#', // Possible quotes from older board conversions
'#[\s]+#' // Multiple spaces
but I didn't manage to do what I'm trying to, so I came here to ask for help. Thank you.

Related

php strip_tags to allow comment

I need to strip all html tags but retain comment lines to extract for info.
Is it even possible?
$content = strip_tags($content, '<!-->');
This doesn't work and i have tried a few different variants.
you can protect your comment before strip them using following code
// create a random string for using in replace strings
$random = strtoupper(dechex(rand(0,10000000000)));
// replace comment starts
$html = preg_replace('/<!--/', '#MARKER-START-'. $random.'#', $html);
// replace comment ends
$html = preg_replace('/-->/', '#MARKER-END-'. $random.'#', $html);
// strip all html tags
$html = strip_tags($html);
// replace back comment starts
$html = preg_replace('/#MARKER-START-'. $random.'#/', '<!--', $html);
// replace back comment ends
$html = preg_replace('/#MARKER-END-'. $random.'#/', '-->', $html);
Instead of using strip_tags() use this regular expression:
$szRetVal = preg_replace( '%</?[a-z][a-z0-9]*[^<>]*>%sim','',$szHTML );

remove HTML from displaying in PHP

I have this text : http://pastebin.com/2Zgbs7hi
And i want to be able to remove the HTML code from it and just display the plain text but i want to keep at least one line break where there are currently a few line breaks
i have tried:
$ticket["summary"] = 'pastebin example';
$TicketSummaryDisplay = nl2br($ticket["summary"]);
$TicketSummaryDisplay = stripslashes($TicketSummaryDisplay);
$TicketSummaryDisplay = trim(strip_tags($TicketSummaryDisplay));
$TicketSummaryDisplay = preg_replace('/\n\s+$/m', '', $TicketSummaryDisplay);
echo $TicketSummaryDisplay;
that is displaying as plain text, but it shows it all as one big block of text with no line breaks at all
Maybe this will earn you some time.
<?php
libxml_use_internal_errors(true); //crazy o tags
$html = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$dom = new DOMDocument;
$dom->loadHTML($html);
$result='';
foreach ($dom->getElementsByTagName('p') as $node) {
if (strstr($node->nodeValue, 'Legal Disclaimer:')){
break;
}
$result .= $node->nodeValue;
}
echo $result;
This example should successfully store text from html into an array of strings.
After stripping all the tags, you can use preg_split with \R special character ( matches any newline sequence ) to convert string into array. That array will now have several blank values, and there will be also some amount of html non-breaking space entities, so we will check the array for empty values with array_filter() function ( it will remove all items that do not satisfy the filter conditions, in our case, an empty value ). Here are a problem with entity, because and space characters are not the same, they have different ASCII code, so trim() function will not remove spaces. Here are two possible solutions, the first uncommented part will only replace &nbsp and check for white space characters, while the second commented one will decode all html entities and also check for spaces.
PHP:
$text = file_get_contents( 'http://pastebin.com/raw.php?i=2Zgbs7hi' );
$text = strip_tags( $text );
$array = array_filter(
preg_split( '/\R/', $text ),
function( &$item ) {
$item = str_replace( ' ', ' ', $item );
return trim( $item );
// $item = html_entity_decode( $item );
// return trim( str_replace( "\xC2\xA0", ' ', $item ) );
}
);
foreach( $array as $value ) {
echo $value . '<br />';
}
Array output:
Array
(
[8] => Hi,
[11] => Ashley has explained that I need to ask for another line and broadband for the wifi to work, please can you arrange this.
[13] => Regards
[23] => Legal Disclaimer:
[24] => This email and its attachments are confidential. If you received it by mistake, please don’t share it. Let us know and then delete it. Its content does not necessarily represent the views of The Dragon Enterprise
[25] => Centre and we cannot guarantee the information it contains is complete. All emails are monitored and may be seen by another member of The Dragon Enterprise Centre's staff for internal use
)
Now you should have clear array with only items with value in it. By the way, newlines in HTML are expressed through <br />, not through \n, your example as response in a web browser still has them, but they are only visible in page source code. I hope I did not missed the point of the question.
try this get text output with line brakes
<?php
$ticket["summary"] = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$TicketSummaryDisplay = nl2br($ticket["summary"]);
echo strip_tags($TicketSummaryDisplay,'<br>');
?>
You are asking on how to add line-breaks to your "one big block of text with no line breaks at all".
Short answer
After you stripped the HTML tags, apply wordwrap with a desired text-block length
$text = wordwrap($text, 90, "<br />\n");
I really wonder, why nobody suggested that function before.
there is also chunk_split around, which doesn't take words into account and just splits after a certain number of chars. breaking words - but that's not what you want, i guess.
PHP
<?php
$text = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
/**
* Returns string without html tags, also
* removes takes control chars, spaces and " " into account.
*/
function dropHtmlTags($string) {
// remove html tags
//$string = preg_replace ('/<[^>]*>/', ' ', $string);
$string = strip_tags($string);
// control characters and "&nbsp"
$string = str_replace("\r", '', $string); // remove
$string = str_replace("\n", ' ', $string); // replace with space
$string = str_replace("\t", ' ', $string); // replace with space
$string = str_replace(" ", ' ', $string);
// remove multiple spaces
$string = preg_replace('/ {2,}/', ' ', $string);
$string = trim($string);
return $string;
}
$text = dropHtmlTags($text);
// The Answer: insert line breaks after 95 chars,
// to get rid of the "one big block of text with no line breaks at all"
$text = wordwrap($text, 95, "<br />\n");
// if you want to insert line-breaks before the legal disclaimer,
// uncomment the next line
//$text = str_replace("Regards Legal Disclaimer", "<br /><br />Regards Legal Disclaimer", $text);
echo $text;
?>
Result
first section shows your text block
second section shows the text with wordwrap applied (code from above)
Hello it can be done as follows:
$abc= file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$abc = strip_tags("\n", $abc);
echo $abc;
Please, let me know whether it works
you may use
<?php
$a= file_get_contents('a.txt');
echo nl2br(htmlspecialchars($a));
?>
<?php
$handle = #fopen("pastebin.html", "r");
if ($handle) {
while (!feof($handle)) {
$buffer = fgetss($handle, 4096);
echo $buffer;
}
fclose($handle);
}
?>
output is
Hi,
Ashley has explained that I need to ask for another line and broadband for the wifi to work, please can you arrange this.
Regards
Legal Disclaimer:
This email and its attachments are confidential. If you received it by mistake, please don’t share it. Let us know and then delete it. Its content does not necessarily represent the views of The Dragon Enterprise
Centre and we cannot guarantee the information it contains is complete. All emails are monitored and may be seen by another member of The Dragon Enterprise Centre's staff for internal use
You can probably write additional code to convert to spaces etc.
I'm not sure I did understand everything correctly but this seems to be your expected result:
$txt = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
var_dump(preg_replace("/(\&nbsp\;(\s{1,})?)+/", "\n", trim(strip_tags(preg_replace("/(\s){1,}/", " ", $txt)))));
//more readable
$txt = preg_replace("/(\s){1,}/", " ", $txt);
$txt = trim(strip_tags($txt));
$txt = preg_replace("/(\&nbsp\;(\s{1,})?)+/", "\n", $txt);
The strip_tags() function strips HTML and PHP tags from a string, if that is what you are trying to accomplish.
Examples from the docs:
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
The above example will output:
Test paragraph. Other text
<p>Test paragraph.</p> Other text

PHP: How to use tidy to just minify a string and no repair it

I just want to remove comments and white space from an html string before saving in DB. I don't want it to be repaired and add head tags etc.
I've spent hours searching for this but can't find anything, can someone who has done this tell me what config I need and which php tidy function will just "minify" and not try and make a valid html document from an html string?
Below example may help you:
<?php
function html2txt($document){
$search = array('#<script[^>]*?>.*?</script>#si', // Strip out javascript
'#<[\/\!]*?[^<>]*?>#si', // Strip out HTML tags
'#<style[^>]*?>.*?</style>#siU', // Strip style tags properly
'#<![\s\S]*?--[ \t\n\r]*>#' // Strip multi-line comments including CDATA
);
$text = preg_replace($search, '', $document);
return $text;
}
?>
You can get more info on http://php.net/manual/en/function.strip-tags.php
Can you try this,
below function is used to remove unwanted HTML comments & WhiteSpace,
function remove_html_comments_white_spaces($content = '') {
$content = preg_replace('~>\s+<~', '><', $content);
$content = preg_replace('/<!--(.|\s)*?-->/', '', $content);
return $content;
}
Even if you want to remove tags, then you can use PHP inbuilt function strip_tags();

Display post excerpts, limited by word count

I am working on my php website (Not a Wordpress site) on the main index I display the two newest post. The thing is on the description it shows the entire article I find myself needing to display post excerpts maybe 35 word limit.
<?=$line["m_description"]?>
<?
$qresult3 = mysql_query("SELECT * FROM t_users WHERE u_id=".$line["m_userid"]." LIMIT 1");
if (mysql_num_rows($qresult3)<1) { ?>
<?php
// just the excerpt
function first_n_words($text, $number_of_words) {
// Where excerpts are concerned, HTML tends to behave
// like the proverbial ogre in the china shop, so best to strip that
$text = strip_tags($text);
// \w[\w'-]* allows for any word character (a-zA-Z0-9_) and also contractions
// and hyphenated words like 'range-finder' or "it's"
// the /s flags means that . matches \n, so this can match multiple lines
$text = preg_replace("/^\W*((\w[\w'-]*\b\W*){1,$number_of_words}).*/ms", '\\1', $text);
// strip out newline characters from our excerpt
return str_replace("\n", "", $text);
}
// excerpt plus link if shortened
function truncate_to_n_words($text, $number_of_words, $url, $readmore = 'read more') {
$text = strip_tags($text);
$excerpt = first_n_words($text, $number_of_words);
// we can't just look at the length or try == because we strip carriage returns
if( str_word_count($text) !== str_word_count($excerpt) ) {
$excerpt .= '... '.$readmore.'';
}
return $excerpt;
}
$src = <<<EOF
<b>My cool story</b>
<p>Here it is. It's really cool. I like it. I like lots of stuff.</p>
<p>I also like to read and write and carry on forever</p>
EOF;
echo first_n_words($src, 10);
echo "\n\n-----------------------------\n\n";
echo truncate_to_n_words($src, 10, 'http://www.google.com');
EDIT: Added functional example and accounted for punctuation and numbers in text
I have a function though other people may say it's not good because I'm still good at PHP too (tips welcome people) but this will give you what you are looking for, it may need better coding if anyone has suggestions.
function Short($text, $length, $url, $more){
$short = mb_substr($text, 0, $length);
if($short != $text) {
$lastspace = strrpos($short, ' ');
$short = substr($short , 0, $lastspace);
if(!$more){
$more = "Read Full Post";
} // end if more is blank
$short .= "...[<a href='$url'>$more</a>]";
} // end if content != short
$short = str_replace("’","'", $short);
$short = stripslashes($short);
$short = nl2br($short);
} // end short function
To Use:
say your article content is the variable $content
function($content, "35", "http://domain.com/article_post", "Read Full Story");
echo $short;
Similarly, you can adjust the function to remove $url and $more from it and just have the excerpt with ... at the end.

How can I retrieve clean text inside HTML tags using PHP?

I have a form which is accepts HTML data, but we need only their respective text, not anything else. Is there any particular way to extract the text out of the HTML in PHP?
Use strip_tags().
Surely it can be done:
Just look at this function and use it as you like:
function html2txt ($document)
{
$search = array (
"'<script[^>]*?>.*?</script>'si", // Strip out JavaScript code
"'<[\/\!]*?[^<>]*?>'si", // Strip out HTML tags
"'([\r\n])[\s]+'", // Strip out white space
"'#<![\s\S]*?�[ \t\n\r]*>#'",
"'&(quot|#34|#034|#x22);'i", // Replace HTML entities
"'&(amp|#38|#038|#x26);'i", // Added hexadecimal values
"'&(lt|#60|#060|#x3c);'i",
"'&(gt|#62|#062|#x3e);'i",
"'&(nbsp|#160|#xa0);'i",
"'&(iexcl|#161);'i",
"'&(cent|#162);'i",
"'&(pound|#163);'i",
"'&(copy|#169);'i",
"'&(reg|#174);'i",
"'&(deg|#176);'i",
"'&(#39|#039|#x27);'",
"'&(euro|#8364);'i", // Europe
"'&a(uml|UML);'", // German
"'&o(uml|UML);'",
"'&u(uml|UML);'",
"'&A(uml|UML);'",
"'&O(uml|UML);'",
"'&U(uml|UML);'",
"'ß'i",
);
$replace = array ( "",
"",
" ",
"\"",
"&",
"<",
">",
" ",
chr(161),
chr(162),
chr(163),
chr(169),
chr(174),
chr(176),
chr(39),
chr(128),
"ä",
"ö",
"ü",
"�",
"�",
"�",
"�",
);
$text = preg_replace($search, $replace, $document);
return trim ($text);
}
You can parse the HTML using DOMDocument::loadHTMLFile and extract what you need.
$doc = new DOMDocument();
$doc->loadHTMLFile("data.html");
$metaTags = $doc->getElementsByTagName('meta');
// Process $metaTags

Categories