I have the following function that I use in a PHP application to remove white space and line breaks from the source of a page.
It's based on some examples I have read on Stack Overflow, with some amendments to handle JS and HTML comments. Note: I've not used an existing library because I wanted something simple without all the additional features that others include, and with this code I have fine-grained control over what is stripped and what is not.
protected function MinifyHTML($str) {
$str = preg_replace("/(?<!\S)\/\/\s*[^\r\n]*/", "", $str); // strip JS/CSS comments
$str = preg_replace("/<!--(.*)-->/Uis", "", $str); // strip HTML comments
$protected_parts = array('<pre>,</pre>','<textarea>,</textarea>','<,>');
$extracted_values = array();
$i = 0;
foreach ($protected_parts as $part) {
$finished = false;
$search_offset = $first_offset = 0;
$end_offset = 1;
$startend = explode(',', $part);
if (count($startend) === 1) $startend[1] = $startend[0];
$len0 = strlen($startend[0]); $len1 = strlen($startend[1]);
while ($finished === false) {
$first_offset = strpos($str, $startend[0], $search_offset);
if ($first_offset === false) $finished = true;
else {
$search_offset = strpos($str, $startend[1], $first_offset + $len0);
$extracted_values[$i] = substr($str, $first_offset + $len0, $search_offset - $first_offset - $len0);
$str = substr($str, 0, $first_offset + $len0).'$$#'.$i.'$$'.substr($str, $search_offset);
$search_offset += $len1 + strlen((string)$i) + 5 - strlen($extracted_values[$i]);
++$i;
}
}
}
$str = preg_replace("/\s/", " ", $str);
$str = preg_replace("/\s{2,}/", " ", $str);
$replace = array('> <'=>'><', ' >'=>'>','< '=>'<','</ '=>'</');
$str = str_replace(array_keys($replace), array_values($replace), $str);
for ($d = 0; $d < $i; ++$d)
$str = str_replace('$$#'.$d.'$$', $extracted_values[$d], $str);
return $str;
}
However if I get a scenario like:
<a href="...">Link</a> <a href="...">Link</a>
It will remove that space between the two anchor tags.
I've added '</a> <a' to my $protected_parts in an attempt to stop this, but it still strips out the space between them. So I end up with LinkLink in the source which isn't what I want.
The same also happens with:
<p>This is <span class="">some</span> <span class="">styled</span> text.</p>
Also, it seems the $protected_parts aren't working, as my textareas are being minified too, so all the content inside them is compressed down into one line...
Any ideas on the fixes? I've also not been able to find alternatives to use instead that don't implement caching, gzipping and other features I don't want. I purely want a simple solution that strips spaces, line breaks and comments and that's it.
UPDATED 2014/02/25 (late):
Here's another workaround. Instead of touching $protected_parts I'm just adding another replace operation at the end that adds a space after every </a> -- again a workaround, but this shouldn't screw up any of your original operability, and the penalty this time is only one space character after every anchor tag, not bad. Here it is: http://phpfiddle.org/main/code/5qj-13z
UPDATED 2014/02/25:
I added '</a> ' to $protected_parts and it does not strip the space. I threw it into phpfiddle over here, http://phpfiddle.org/lite/code/dms-cud. This is only a workaround tested on a few lines of synthetic HTML... I'm not sure what kind of real-world markup you're running through your function. Obviously this workaround is not a universal fix either.
Original
I added '</a>',' <a ', to $protected_parts and it does not strip the space. I threw it into phpfiddle over here, http://phpfiddle.org/lite/code/ztz-5hf.
Your function is scary to me, but I like some of the basic functionality, like stripping HTML, JS and CSS comments. I'd still recommend using an Apache extension or library. Using other people's open source code is the most powerful witchcraft a programmer can wield. :)
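For what it's worth, here is a minimal sketch of the "protect, collapse, restore" idea from the question, written from scratch rather than as a patch to MinifyHTML(). It rests on two guesses about the original problem: real <textarea> tags usually carry attributes, so a literal '<textarea>' search may never match, and the '> <' => '><' replacement is what removes the space between adjacent anchors. The function name minify_html_sketch() and the @@KEEPn@@ placeholder are made up for this example.

```php
<?php
// Hypothetical helper, not the asker's MinifyHTML(): collapse whitespace
// while leaving <pre> and <textarea> blocks untouched.
function minify_html_sketch(string $html): string
{
    $kept = [];
    // Swap each protected block for a placeholder. The pattern allows
    // attributes on the opening tag, e.g. <textarea name="x">.
    $html = preg_replace_callback(
        '~<(pre|textarea)\b[^>]*>.*?</\1>~is',
        function (array $m) use (&$kept): string {
            $kept[] = $m[0];
            return '@@KEEP' . (count($kept) - 1) . '@@';
        },
        $html
    );
    // Strip HTML comments, then collapse every whitespace run to one space.
    // A single space between tags is kept on purpose, because it is
    // significant between inline elements such as <a> and <span>.
    $html = preg_replace('~<!--.*?-->~s', '', $html);
    $html = preg_replace('~\s+~', ' ', $html);
    // Put the protected blocks back verbatim.
    foreach ($kept as $i => $block) {
        $html = str_replace('@@KEEP' . $i . '@@', $block, $html);
    }
    return trim($html);
}
```

The trade-off is a few extra bytes: the space between block-level tags survives as well, but nothing that affects rendering gets removed. It does not touch inline JS or CSS comments at all.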
Related
I want to create a function that labels the location of certain HTML tags (e.g., italics tags) in a string with respect to the locations of characters in a tagless version of the string.
(I intend to use this label data to train a neural network for tag recovery from data that has had the tags stripped out.)
The magic function I want to create is label_italics() in the below code.
$string = 'Disney movies: <i>Aladdin</i>, <i>Beauty and the Beast</i>.';
$string_all_tags_stripped_but_italics = strip_tags($string, '<i>'); // same as $string in this example
$string_all_tags_stripped = strip_tags($string); // 'Disney movies: Aladdin, Beauty and the Beast.'
$featr_string = $string_all_tags_stripped.' '; // Add a single space at the end
$label_string = label_italics($string_all_tags_stripped_but_italics);
echo $featr_string; // 'Disney movies: Aladdin, Beauty and the Beast. '
echo $label_string; // '0000000000000001000000101000000000000000000010'
If a character is supposed to have an <i> or </i> tag immediately preceding it, it is labeled with a 1 in $label_string; otherwise, it is labeled with a 0 in $label_string. (I'm thinking I don't need to worry about the difference between <i> and </i> because the recoverer will simply alternate between <i> and </i> so as to maintain well-formed markup, but I'm open to reasons as to why I'm wrong about this.)
I'm just not sure what the best way to create label_italics() is.
I wrote this function that seems to work in most cases, but it also seems a little clunky and I'm posting here in hopes that there is a better way. (If this turns out to be the best way, the below function would be easily generalizable to any HTML tag passed in as a second argument to the function, which could be renamed label_tag().)
function label_italics($stripped) {
while ((stripos($stripped, '<i>') || stripos($stripped, '</i>')) !== FALSE) {
$position = stripos($stripped, '<i>');
if (is_numeric($position)) {
for ($c = 0; $c < $position; $c++) {
$output .= '0';
}
$output .= '1';
}
$stripped = substr($stripped, $position + 4, NULL);
$position = stripos($stripped, '</i>');
if (is_numeric($position)) {
for ($c = 0; $c < $position; $c++) {
$output .= '0';
}
$output .= '1';
}
$stripped = substr($stripped, $position + 5, NULL);
}
for ($c = 0; $c <= strlen($stripped); $c++) {
$output .= '0';
}
return $output;
}
The function produces bad output if the tags are surplus or the markup is badly formed in the input. For example, for the following input:
$string = 'Disney movies: <i><i>Aladdin</i>, <i>Beauty and the Beast</i>.';
The following misaligned output is given.
Disney movies: Aladdin, Beauty and the Beast.
0000000000000001000000000101000000000000000000010
(I'm also open to reasons why I'm going about the creation of the label data all wrong.)
I think I've got something. How about this:
function label_italics($string) {
return preg_replace(['/<i>/', '/<\/i>/', '/[^#]/', '/##0/', '/#0/'],
['#', '#', '0', '2', '1'], $string);
}
see: https://3v4l.org/cKG46
Note that you need to supply the string with the tags in it.
How does it work?
I use preg_replace() because it supports regular expressions, which I need for one of the patterns. The function goes through the two arrays and executes each replacement in order. First it replaces all occurrences of <i> and </i> with #, and anything else with 0. Then it replaces ##0 with 2 and #0 with 1. The 2 is there to handle an empty <i></i> pair. You can remove it, and simplify the function, if you don't need it.
The use of the # is arbitrary. You should use anything that doesn't clash with the content of your string.
Here's an updated version. It copes with tags at the end of the line and it ignores any # characters in the line.
function label_italics($string) {
return preg_replace(['/[^<\/i\>]/', '/<i>/', '/<\/i>/', '/i/', '/##0/', '/#0/'],
['0', '#', '#', '0', '2', '1'], $string . ' ');
}
See: https://3v4l.org/BTnLc
After some additional experimentation, this is what I arrived at:
$label_string = mb_ereg_replace('#0', '1', mb_ereg_replace('(#)\1+0', '1', mb_ereg_replace('\/', '0', mb_ereg_replace('i', '0', mb_ereg_replace('<\/i>', '#', mb_ereg_replace('<i>', '#', mb_ereg_replace('[^<\/i\>]', '0', mb_strtolower($featr_string))))))));
I couldn't get @KIKO Software's preg_replace()-based solution to work with multibyte strings, so I switched to this slightly ungainly, but better-working, mb_ereg_replace()-based solution instead.
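Another way to approach it, sketched below, is to split the string on the tags and walk the pieces, so that no placeholder character is needed at all. This is only an illustration: label_tag_sketch() is a made-up name, it assumes the mbstring extension and UTF-8 input, and it follows the question's convention of one label per character plus one for the trailing space. Opening and closing tags are treated alike, and several tags in a row produce a single 1.

```php
<?php
// Hypothetical helper: returns one '0'/'1' label per character of the
// tag-stripped text, plus one label for the trailing space.
function label_tag_sketch(string $html, string $tag = 'i'): string
{
    $tagPattern = '</?' . preg_quote($tag, '~') . '>';
    // Keep the tags as separate pieces; drop empty pieces between them.
    $pieces = preg_split(
        '~(' . $tagPattern . ')~i',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    $labels  = '';
    $pending = false; // true when a tag precedes the next character
    foreach ($pieces as $piece) {
        if (preg_match('~^' . $tagPattern . '$~i', $piece)) {
            $pending = true;
            continue;
        }
        // Count characters, not bytes, so multibyte text stays aligned.
        $length  = mb_strlen($piece, 'UTF-8');
        $labels .= ($pending ? '1' : '0') . str_repeat('0', $length - 1);
        $pending = false;
    }
    // Label for the single space appended to the feature string.
    return $labels . ($pending ? '1' : '0');
}
```

Because a run of tags only sets a flag, input such as <i><i>Aladdin</i> should stay aligned with the stripped text instead of shifting the labels.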
I am trying to determine the absolute position of certain words within a block of html, but only if they are outside of an actual html tag. For instance, if I wanted to determine the position of the word "join" using preg_match in this text:
<p>There are 14 more days until our <a href="/somepage.html" target="_blank" rel="noreferrer noopener" aria-label="join us">holiday special</a> so come join us!</p>
I could use:
preg_match('/join/', $post_content, $matches, PREG_OFFSET_CAPTURE, $offset);
The problem is that this is matching the word within the aria-label attribute, when what I need is the one just after the link. It would be fine to match between the <a> and </a>, just not inside the brackets themselves.
My actual end goal, most of which (I think) I have working aside from this last piece: I am trimming a block of html (not a full document) to cut off at a specific word count. I am trying to determine which character that last word ends at, and then joining the left side of the html block with only the html from the right side, so all html tags close gracefully. I thought I had it working until I ran into an example like the one I showed, where the last word was also within an html attribute, causing me to split the string at the wrong location. This is my code so far:
$post_content = strip_tags ( $p->post_content, "<a><br><p><ul><li>" );
$post_content_stripped = strip_tags ( $p->post_content );
$post_content_stripped = preg_replace("/[^A-Za-z0-9 ]/", ' ', $post_content_stripped);
$post_content_stripped = preg_replace("/\s+/", ' ', $post_content_stripped);
$post_content_stripped_array = explode ( " " , trim($post_content_stripped) );
$excerpt_wordcount = count( $post_content_stripped_array );
$cutpos = 0;
while($excerpt_wordcount>48){
$thiswordrev = "/" . strrev($post_content_stripped_array[$excerpt_wordcount - 1]) . "/";
preg_match($thiswordrev, strrev($post_content), $matches, PREG_OFFSET_CAPTURE, $cutpos);
$cutpos = $matches[0][1] + (strlen($thiswordrev) - 2);
array_pop($post_content_stripped_array);
$excerpt_wordcount = count( $post_content_stripped_array );
}
if($pwordcount>$excerpt_wordcount){
preg_match_all('/<\/?[^>]*>/', substr( $post_content, strlen($post_content) - $cutpos ), $closetags_result);
$excerpt_closetags = "" . $closetags_result[0][0];
$post_excerpt = substr( $post_content, 0, strlen($post_content) - $cutpos ) . $excerpt_closetags;
}else{
$post_excerpt = $post_content;
}
I am actually searching the string in reverse in this case, since I am walking word by word backwards from the end of the string, so I know that my html brackets are backwards, eg:
>p/<!su nioj emoc os >a/<laiceps yadiloh>"su nioj"=lebal-aira "renepoon rerreferon"=ler "knalb_"=tegrat "lmth.egapemos/"=ferh a< ruo litnu syad erom 41 era erehT>p<
But it's easy enough to flip all of the brackets before doing the preg_match, or I am assuming should be easy enough to have the preg_match account for that.
Do not use regex to parse HTML.
You have a simple objective: limit the text content to a given number of words, ensuring that the HTML remains valid.
To this end, I would suggest looping through text nodes until you count a certain number of words, and then removing everything after that.
$dom = new DOMDocument();
$dom->loadHTML($post_content);
$xpath = new DOMXPath($dom);
$all_text_nodes = $xpath->query("//text()");
$words_left = 48;
foreach( $all_text_nodes as $text_node) {
$text = $text_node->textContent;
$words = explode(" ", $text); // TODO: maybe preg_split on /\s/ to support more whitespace types
$word_count = count($words);
if( $word_count < $words_left) {
$words_left -= $word_count;
continue;
}
// reached the threshold
$words_that_fit = implode(" ", array_slice($words, 0, $words_left));
// If the above TODO is implemented, this will need to be adjusted to keep the specific whitespace characters
$text_node->textContent = $words_that_fit;
$remove_after = $text_node;
while( $remove_after->parentNode) {
while( $remove_after->nextSibling) {
$remove_after->parentNode->removeChild($remove_after->nextSibling);
}
$remove_after = $remove_after->parentNode;
}
break;
}
$output = substr($dom->saveHTML($dom->getElementsByTagName("body")->item(0)), strlen("<body>"), -strlen("</body>"));
Live demo
Ok, I figured out a workaround. I don't know if this is the most elegant solution, so if someone sees a better one I would still love to hear it, but for now I realized that I don't have to actually have the html in the string I am searching to determine the position to cut, I just need it to be the same length. I grabbed all of the html elements and just created a dummy string replacing all of them with the same number of asterisks:
// create faux string with placeholders instead of html for search purposes
preg_match_all('/<\/?[^>]*>/', $post_content, $alltags_result);
$tagcount = count( $alltags_result );
$post_content_dummy = $post_content;
foreach($alltags_result[0] as $thistag){
$post_content_dummy = str_replace($thistag, str_repeat("*",strlen($thistag)), $post_content_dummy);
}
Then I just use $post_content_dummy in the while loop instead of $post_content, in order to find the cut position, and then $post_content for the actual cut. So far seems to be working fine.
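The same masking idea can be sketched with a single preg_replace_callback() call, which replaces each tag where it stands instead of searching for its text again with str_replace(). The function name mask_tags() and the shortened sample markup are assumptions made for this example, and the simple [^>]* pattern would still be fooled by a ">" inside an attribute value.

```php
<?php
// Hypothetical helper: overwrite every tag with asterisks of the same
// length, so offsets found in the masked copy are valid in the original.
function mask_tags(string $html): string
{
    return preg_replace_callback(
        '~</?[^>]*>~',
        function (array $m): string {
            return str_repeat('*', strlen($m[0]));
        },
        $html
    );
}

$html   = '<p>There are 14 more days until our <a href="/somepage.html" '
        . 'aria-label="join us">holiday special</a> so come join us!</p>';
$masked = mask_tags($html);

// The "join" inside aria-label is now asterisks, so the first match is
// the one in the visible text.
preg_match('/\bjoin\b/', $masked, $match, PREG_OFFSET_CAPTURE);
$offset = $match[0][1]; // byte offset, usable on $html as well
```

Since both strings have the same length, substr($html, 0, $offset) cuts the real markup at the position found in the masked copy.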
I'm counting words in an article and removing common words such as "and" or "the".
I'm removing them using preg_replace().
After that is done, I do a quick clean-up of extra white space using:
$search_body = preg_replace('/\s+/',' ',$search_body);
However I've got some very stubborn white space that will not go away. I've tried
if($word == "" OR $word == " "){
//chop it's head off
}
But the if statement does not see $word as being just whitespace. I've also tried printing it to the screen to get the raw data type of it and it's still just showing up blank.
Here is the full regex that I'm using.
$pattern = array(
'/\"\;/',
'/[0-9]/',
'/\,/',
'/\./',
'/\!/',
'/\#/',
'/\#/',
'/\$/',
'/\%/',
'/\^/',
'/\&/',
'/\*/',
'/\(/',
'/\)/',
'/\_/',
'/\"/',
'/\'/',
'/\:/',
'/\;/',
'/\?/',
'/\`/',
'/\~/',
'/\[/',
'/\]/',
'/\{/',
'/\}/',
'/\|/',
'/\+/',
'/\=/',
'/\-/',
'/–/',
'/°/',
'/\bthe\b/',
'/\band\b/',
'/\bthat\b/',
'/\bhave\b/',
'/\bfor\b/',
'/\bnot\b/',
'/\bwith\b/',
'/\byou\b/',
'/\bthis\b/',
'/\bbut\b/',
'/\bhis\b/',
'/\bfrom\b/',
'/\bthey\b/',
'/\bsay\b/',
'/\bher\b/',
'/\bshe\b/',
'/\bwill\b/',
'/\bone\b/',
'/\ball\b/',
'/\bwould\b/',
'/\bthere\b/',
'/\btheir\b/',
'/\bwhat\b/',
'/\bout\b/',
'/\babout\b/',
'/\bwho\b/',
'/\bget\b/',
'/\bwhich\b/',
'/\bwhen\b/',
'/\bmake\b/',
'/\bcan\b/',
'/\blike\b/',
'/\btime\b/',
'/\bjust\b/',
'/\bhim\b/',
'/\bknow\b/',
'/\btake\b/',
'/\bpeople\b/',
'/\binto\b/',
'/\byear\b/',
'/\byour\b/',
'/\bgood\b/',
'/\bsome\b/',
'/\bcould\b/',
'/\bthem\b/',
'/\bsee\b/',
'/\bother\b/',
'/\bthan\b/',
'/\bthen\b/',
'/\bnow\b/',
'/\blook\b/',
'/\bonly\b/',
'/\bcome\b/',
'/\bits\b/', //it's?
'/\bover\b/',
'/\bthink\b/',
'/\balso\b/',
'/\bback\b/',
'/\bafter\b/',
'/\buse\b/',
'/\btwo\b/',
'/\bhow\b/',
'/\bour\b/',
'/\bwork\b/',
'/\bfirst\b/',
'/\bwell\b/',
'/\bway\b/',
'/\beven\b/',
'/\bnew\b/',
'/\bwant\b/',
'/\bbecause\b/',
'/\bany\b/',
'/\bthese\b/',
'/\bgive\b/',
'/\bday\b/',
'/\bmost\b/',
'/\bare\b/',
'/\bwas\b/',
'/\<\w+\>/', '/\<\/\w+\>/',
'/\b\w{1}\b/', //1 letter word
'/\b\w{2}\b/', //2 letter word
'/\//',
'/\</',
'/\>/'
);
$search_body = strip_tags($body);
$search_body = strtolower($search_body);
$search_body = preg_replace($pattern, ' ', $search_body);
$search_body = preg_replace('/\s+/',' ',$search_body);
$search_body = explode(" ", $search_body);
When exploded blank values show up left and right
The example text I am using is too long to post here, but I copied and pasted this article to test it, and it showed 32 counts of white space, not including the white space in front of or behind other words, even after using trim().
Here's a js.fiddle of the raw data that is being handled by php.
htmlentities and htmlspecialchars also show nothing.
Here's the code that counts all the values and merges duplicates into one entry.
$inhere = array();
$body_hold = array();
foreach($search_body as $value){
$value = trim($value);
if(in_array($value, $inhere) && $value != ""){
$key = array_search($value, $inhere);
$body_hold[$key]['count'] = $body_hold[$key]['count']+1;
}elseif($value != ""){
$inhere[] = $value;
$body_hold[] = array(
'count' => 1,
'word' => $value
);
}
}
rsort($body_hold);
Basic foreach to see values.
foreach($body_hold as $value){
$count = $value['count'];
$word = trim($value['word']);
echo "Count: ".$count;
echo " Word: ".$word;
echo '<br>';
}
Here's a PHP example of what it's returning
Are you sure you put the exact same data you're processing in the js.fiddle? Or did you get it from a subsequent post-processed step?
It's obviously a Wikipedia article. I went to that article on Wikipedia, opened it in Edit mode, and saw that there are &nbsp;s in the raw wikitext. However, those nbsp's don't appear in your js.fiddle data.
TL;DR: Check for &nbsp; in your processing (and convert to spaces, etc.).
Character 160 (the non-breaking space) looks like a space but isn't one. Replacing every occurrence with a regular space (32) and then collapsing the double spaces will fix your problem.
$search_body = str_replace(chr(160), chr(32), $search_body);
$search_body = trim(preg_replace('/\s+/', ' ', $search_body));
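One caveat worth checking: chr(160) is a single byte, which matches a non-breaking space in ISO-8859-1 text. If the article is UTF-8 (an assumption, the question doesn't say), the same character is the two-byte sequence "\xC2\xA0", and replacing only the second byte would leave a stray "\xC2" behind. A rough sketch for the UTF-8 case, with normalize_spaces() as a made-up name:

```php
<?php
// Hypothetical helper for UTF-8 input: turn non-breaking spaces into
// ordinary spaces, then collapse whitespace runs.
function normalize_spaces(string $text): string
{
    // "\xC2\xA0" is U+00A0 (no-break space) encoded as UTF-8; the
    // literal entity is handled too in case it was never decoded.
    $text = str_replace(['&nbsp;', "\xC2\xA0"], ' ', $text);
    // The u modifier makes PCRE treat the subject as UTF-8. It returns
    // null on invalid UTF-8, so fall back to the uncollapsed text.
    $collapsed = preg_replace('/\s+/u', ' ', $text);
    return trim($collapsed ?? $text);
}

$words = explode(' ', normalize_spaces("foo\xC2\xA0\xC2\xA0bar&nbsp; baz "));
```

After this, explode(' ', ...) should no longer produce the "blank" entries that trim() could not remove.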
I'm working in PHP and I want to create a function that, given a text of arbitrary length and height, returns a restricted version of the same text with a maximum of 500 characters and 10 lines.
This is what I have so far:
function preview($str)
{
$partialPreview = explode("\n", substr($str, 0, 500));
$partialPreviewHeight = count($partialPreview);
$finalPreview = "";
// if it has more than 10 lines
if ($partialPreviewHeight > 10) {
for ($i = 0; $i < 10; $i++) {
$finalPreview .= $partialPreview[$i];
}
} else {
$finalPreview = substr($str, 0, 500);
}
return $finalPreview;
}
I have two questions:
1. Is using \n the proper way to detect line feeds? I know that some systems use \n, others \r\n and others \r, but \n is the most common.
2. Sometimes, if there's an HTML entity like &quot; (quotation mark) at the end, it's left truncated (e.g. &quo), and therefore it's not valid HTML. How can I prevent this?
First replace <br /> tags with <br />\n and </p><p> or </div><div> with </p>\n<p> and </div>\n<div> respectively.
Then use PHP's strip_tags(), which should yield plain text with newlines everywhere a newline should be.
Then you could replace \r\n with \n for consistency. Only after that should you extract the desired length of text.
You may want to use word wrapping to achieve your 10-line goal. For wordwrap() to work you need to define a number of characters per line, and wordwrap() takes care of not breaking mid-word.
You may want to use html_entity_decode() before using wordwrap(), as @PeeHaa suggested.
Is using \n proper to detect new line feeds? I know that some systems use \n, other \r\n and others \r, but \n is the most common.
It depends where the data is coming from. Different operating systems have different line breaks.
Windows uses \r\n, *nix (including mac OS) uses \n, (very) old macs used \r. If the data is coming from the web (e.g. a textarea) it will (/ should) always be \r\n. Because that's what the spec states user agents should do.
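As a quick illustration of handling all three conventions before counting lines, here is a small sketch. The function name first_lines() is invented for this example, and it only covers the line limit, not the character limit.

```php
<?php
// Hypothetical helper: normalise line endings, then keep the first N lines.
function first_lines(string $text, int $maxLines = 10): string
{
    // \r\n (Windows, form submissions) and a lone \r (old Mac) both
    // become \n. Matching \r\n as one unit avoids doubling the breaks.
    $text  = preg_replace('/\r\n?/', "\n", $text);
    $lines = explode("\n", $text);
    return implode("\n", array_slice($lines, 0, $maxLines));
}

echo first_lines("a\r\nb\rc\nd", 3);
```

Unlike the loop in the question, implode() puts the line breaks back, so the kept lines don't run together.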
Sometimes, if there's an HTML entity like &quot; (quotation mark) at the end, it's left truncated (e.g. &quo), and therefore it's not valid HTML. How can I prevent this?
Before cutting the text you may want to convert HTML entities back to normal text, using either htmlspecialchars_decode() or html_entity_decode() depending on your needs. That way you won't have the problem of breaking entities (don't forget to encode the text again if needed).
Another option would be to only break the text on whitespace characters rather than a hard character limit. This way you will only have whole words in your "summary".
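Breaking on whitespace could look roughly like the sketch below. It assumes the mbstring extension, UTF-8 text, and that entities have already been decoded; truncate_at_word() is a hypothetical name, and only the plain space character is treated as a word separator.

```php
<?php
// Hypothetical helper: cut at the last space that fits within the limit.
function truncate_at_word(string $text, int $maxChars = 500): string
{
    if (mb_strlen($text, 'UTF-8') <= $maxChars) {
        return $text;
    }
    // Look one character past the limit: if that character is a space,
    // the last word inside the limit is already complete.
    $window    = mb_substr($text, 0, $maxChars + 1, 'UTF-8');
    $lastSpace = mb_strrpos($window, ' ', 0, 'UTF-8');
    if ($lastSpace === false || $lastSpace === 0) {
        // One very long word: fall back to a hard cut.
        return mb_substr($text, 0, $maxChars, 'UTF-8');
    }
    return rtrim(mb_substr($window, 0, $lastSpace, 'UTF-8'));
}

echo truncate_at_word('The quick brown fox', 12);
```

The result can be shorter than the limit by up to one word, which is usually an acceptable price for not ending mid-word.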
I've created a class which should deal with most issues. As I already stated when the data is coming from a textarea it will always be \r\n, but to be able to parse other linebreaks I came up with something like the following (untested):
class Preview
{
protected $maxCharacters;
protected $maxLines;
protected $encoding;
protected $lineBreaks;
public function __construct($maxCharacters = 500, $maxLines = 10, $encoding = 'UTF-8', array $lineBreaks = array("\r\n", "\r", "\n"))
{
$this->maxCharacters = $maxCharacters;
$this->maxLines = $maxLines;
$this->encoding = $encoding;
$this->lineBreaks = $lineBreaks;
}
public function makePreview($text)
{
$text = $this->normalizeLinebreaks($text);
// decode first so the cut cannot land in the middle of an entity such as &quot;
$text = html_entity_decode($text, ENT_QUOTES, $this->encoding);
$text = $this->limitLines($text);
if (mb_strlen($text, $this->encoding) > $this->maxCharacters) {
$text = $this->limitCharacters($text);
}
// encode again so the preview is safe to output as HTML
return htmlentities($text, ENT_QUOTES, $this->encoding);
}
protected function normalizeLinebreaks($text)
{
return str_replace($this->lineBreaks, "\n", $text);
}
protected function limitLines($text)
{
$lines = explode("\n", $text);
$limitedLines = array_slice($lines, 0, $this->maxLines);
return implode("\n", $limitedLines);
}
protected function limitCharacters($text)
{
return mb_substr($text, 0, $this->maxCharacters, $this->encoding);
}
}
$preview = new Preview();
echo $preview->makePreview('Some text which will be turned into a preview.');
I have a script that outputs status updates and I need to write a script that automatically changes something like www.example.com into a hyper link in a chunk of text like Twitter and Facebook do. What functions can I use for this in PHP? If you know a tutorial please post it.
$string = " fasfasd http://webarto.com fasfsafa";
echo preg_replace("#http://([\S]+?)#Uis", '<a rel="nofollow" href="http://\\1">\\1</a>', $string);
Output:
fasfasd <a rel="nofollow" href="http://webarto.com">webarto.com</a> fasfsafa
You can use a regex to replace the url with a link. Look at the answers on this thread: PHP - Add link to a URL in a string.
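As a rough sketch of that approach: escape the text first, then wrap anything that looks like a URL. Everything here is an assumption made for illustration, including the autolink_sketch() name, the choice to link only http(s):// and www. prefixes, and the rel="nofollow" attribute borrowed from the snippet above. The pattern is deliberately simple and will miss many valid URLs.

```php
<?php
// Hypothetical helper: escape user text, then turn URL-like runs into links.
function autolink_sketch(string $text): string
{
    // Escape first so status updates cannot inject their own markup.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    return preg_replace_callback(
        // After escaping, a literal & appears as &amp;, so that entity is
        // allowed inside a URL while &quot;, &lt; and &gt; end the match.
        '~\b(?:https?://|www\.)(?:[^\s<&]|&amp;)+~i',
        function (array $m): string {
            $url   = $m[0];
            $trail = '';
            // Keep sentence punctuation at the end out of the link.
            if (preg_match('~[.,!?)]+$~', $url, $t)) {
                $trail = $t[0];
                $url   = substr($url, 0, -strlen($trail));
            }
            // Bare www. addresses need a scheme to work as an href.
            $href = (stripos($url, 'www.') === 0) ? 'http://' . $url : $url;
            return '<a rel="nofollow" href="' . $href . '">' . $url . '</a>' . $trail;
        },
        $safe
    );
}

echo autolink_sketch('Visit www.example.com.');
```

Doing the escaping before the replacement means the only markup in the output is the anchors the function adds itself.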
Great solution!
I wanted to auto-link web links and also to truncate the displayed URL text, because long URLs were breaking out of the layout on some platforms.
After much fiddling around with regex, I realised the solution is actually CSS - this site gives a simple solution using CSS white-space.
Here is the working Function
function AutoLinkUrls($str,$popup = FALSE){
if (preg_match_all("#(^|\s|\()((http(s?)://)|(www\.))(\w+[^\s\)\<]+)#i", $str, $matches)){
$pop = ($popup == TRUE) ? " target=\"_blank\" " : "";
for ($i = 0; $i < count($matches['0']); $i++){
$period = '';
if (preg_match("|\.$|", $matches['6'][$i])){
$period = '.';
$matches['6'][$i] = substr($matches['6'][$i], 0, -1);
}
$str = str_replace($matches['0'][$i],
$matches['1'][$i].'</xmp><a href="http'.
$matches['4'][$i].'://'.
$matches['5'][$i].
$matches['6'][$i].'"'.$pop.'>http'.
$matches['4'][$i].'://'.
$matches['5'][$i].
$matches['6'][$i].'</a><xmp>'.
$period, $str);
}//end for
}//end if
return $str; }