Problem with PHP strip tag - php

I am trying to get my site feed working.
I need to select some content and display it in my feed. After selecting, I strip the tags and then display it.
The problem is this:
The data still displays as if the tags still exist (but with no visible HTML tags). E.g., after stripping, in my source I'll have:
Hello (just illustrating)
----There will be a gap in between, as if HTML characters still exist, but I can't see any when I view my source-----
Hi
How can I fix this? Thanks.
EDIT:
To make it clearer, after stripping I still get text like this:
This is my first line
This is my second line with a gap in between the first line and second line as if there is a paragraph tag
UPDATE
I am using this:
$body=substr(strip_tags(preg_replace('/\n{2,}/',"\n",$row["post_content"])),0,150);
When I echo $body, it still keeps the newlines.

You may have a \n that was at the end of the paragraphs, after the closing tags you stripped.
preg_replace('/[\p{Z}\s]{2,}/s',' ',$string);
will replace every run of two or more whitespace characters (spaces, tabs, newlines) with a single space.
\s matches any white-space character. The .NET regex documentation describes it this way (PHP's PCRE differs in the details): equivalent to the Unicode character categories [\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \s is equivalent to [ \f\n\r\t\v].

strip_tags will literally strip the tags, leaving any other whitespace behind.
You could get rid of extra newlines and whitespace with regular expressions, but depending on your content, you might mangle it.
Collapse repeated newlines into one:
$string = preg_replace('/\n{2,}/',"\n",$string);
Collapse repeated spaces into one:
$string = preg_replace('/ {2,}/',' ',$string);
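A minimal sketch of the idea above, stripping the tags first and only then collapsing whitespace. The helper name feedExcerpt() and the sample post are assumptions made for illustration, not code from the thread. One possible reason the asker's line keeps the gaps is that it collapses newlines before the tags are removed, when the newlines are still separated by markup.

```php
<?php
// Hypothetical helper (not from the thread): build a plain-text feed excerpt.
function feedExcerpt(string $html, int $length = 150): string
{
    // 1. Remove the tags; the whitespace that sat between them is left behind.
    $text = strip_tags($html);
    // 2. Collapse every run of whitespace (spaces, tabs, \r, \n) into one space.
    $text = preg_replace('/\s+/', ' ', $text);
    // 3. Trim the ends and cut to length. substr() counts bytes, so a
    //    multibyte character could be cut in half at the boundary.
    return substr(trim($text), 0, $length);
}

// Made-up sample content.
$post = "<p>This is my first line</p>\r\n\r\n<p>This is my second line</p>";
echo feedExcerpt($post);
```

Depending on the content, collapsing all whitespace this way may be too aggressive, for example if line breaks inside the excerpt are meaningful.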

I was experiencing something annoyingly similar. Solved it with trim:
$body=strip_tags(trim($row["post_content"]));

Related

Regular expressions - remove all non-alpha-numeric characters CRLF problem

First off, if it's not clear from the tag, I'm doing this in PHP - but that probably doesn't matter much.
I have this code:
$inputStr = strip_tags($inputStr);
$inputStr = preg_replace("/[^a-zA-Z\s]/", " ", $inputStr);
Which seems to remove all HTML tags and virtually all special and non-alphabetic characters perfectly. The one problem is, for some reason, it doesn't filter out carriage return/line feeds (just the combination).
If I add this line:
$inputStr = preg_replace("/\s+/", " ", $inputStr);
at the end, however, it works great. Can someone tell me:
Why doesn't the first preg_replace filter out the CR/LFs?
What is this second preg_replace actually doing? I understand the first one for the most part, but the second one is confusing me - it works but I don't know why.
Can I combine them into 1 line somehow?
You told it to remove everything except letters and whitespace. Newlines are whitespace, so they don't get removed. You could use \h instead of \s to only exclude horizontal whitespace.
It simply means "replace every sequence of one or more whitespace characters (\s+) with a single space."
preg_replace("/[^A-Za-z]+/", " ", ...) might do.
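A small sketch comparing the two patterns discussed above. The sample input is made up for illustration; the variable names $kept and $combined are assumptions too.

```php
<?php
// Made-up sample input containing a CRLF.
$inputStr = "<b>Hello</b>, world!\r\nLine 2";
$inputStr = strip_tags($inputStr); // "Hello, world!\r\nLine 2"

// Original pattern: \s is inside the negated class, so whitespace
// (including \r and \n) is excluded from replacement and survives.
$kept = preg_replace("/[^a-zA-Z\s]/", " ", $inputStr);

// Single-pattern alternative: every run of non-letters, whitespace
// included, becomes one space.
$combined = preg_replace("/[^A-Za-z]+/", " ", $inputStr);

var_dump(strpos($kept, "\r\n") !== false); // the CRLF is still there
var_dump($combined);
```

Note that the combined pattern leaves a trailing space when the input ends in a non-letter; wrapping the call in trim() would remove it.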
Your first regex is removing all characters that are not letters or whitespace. CRLFs are whitespace, so they aren't filtered out.
The second one is replacing whitespace with a space character. Essentially it condenses sequences of whitespace into a single space (due to the quantifier being greedy).
I suggest removing the \s from the first regex, see if that works.
\s matches whitespace such as \n.
It is replacing all whitespace characters with a space.
You could make it one unreadable line, but probably not one regex.

A regex to remove whitespace and line breaks from HTML document

I am using this regular expression to remove white space and line breaks from an HTML document.
However, it doesn't seem to handle line breaks very well.
preg_replace('/(?:(?<=\>)|(?<=\/\>))(\s+)(?=\<\/?)/', '', $HTML);
How can I improve the above?
I am only trying to remove spaces between beginning and end of HTML tags.
How about this regex? It's not perfect (it only handles whitespace at the beginning and end of the line) but it works for me.
$html = preg_replace('/[\t\s\n]*(<.*>)[\t\s\n]*/', '$1', $html);
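For the narrower goal stated in the question (only removing whitespace between one tag's end and the next tag's start), a simpler pattern may be enough. This is a sketch under assumptions, not the asker's or answerer's code: the sample markup is made up, and the pattern also touches whitespace inside pre and textarea elements and between inline elements, where it can be significant.

```php
<?php
// Made-up sample markup.
$html = "<ul>\n  <li>One</li>\n  <li>Two words</li>\n</ul>";

// Remove whitespace runs (including line breaks) that sit directly
// between a ">" and the following "<". Text inside elements is untouched.
$compact = preg_replace('/>\s+</', '><', $html);

echo $compact;
```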

Regex whitespace

preg_match('/<div class="prices">/s<h3>(.+?)<\/h3>/is', $response, $matches);
There's whitespace and potentially new lines between the prices div and the h3 tag. How do I use /s to match that?
You don't use /s, you use \s*.
It's a backslash (\), not a slash (/).
The * afterwards means that it matches zero or more whitespace characters.
Also, please consider using an HTML parser if you are trying to find HTML tags. A proper HTML parser will be able to correctly handle whitespace, HTML comments and other features of HTML that your regular expression cannot handle.
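The \s* fix described above can be sketched like this. The sample $response is made up for illustration; only the pattern change comes from the answer.

```php
<?php
// Made-up sample response with a line break and indentation between the tags.
$response = "<div class=\"prices\">\n   <h3>19.99</h3>\n</div>";

// \s* (backslash-s) matches zero or more whitespace characters,
// including newlines, between the div and the h3.
if (preg_match('/<div class="prices">\s*<h3>(.+?)<\/h3>/is', $response, $matches)) {
    echo $matches[1];
}
```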

regex: find new line character in string that isn't in textarea

heya, so I'm looking for a regex that would allow me to basically replace a newline character with whatever (eg. 'xxx'), but only if the newline character isn't within textarea tags.
For example, the following:
<strong>abcd
efg</strong>
<textarea>curious
george
</textarea>
<span>happy</span>
Would become:
<strong>abcdxxxefg</strong>xxx<textarea>curious
george
</textarea>xxx<span>happy</span>
Anyone have any idea on where I should start? I'm kinda clueless here :(
Thanks for any help possible.
I've got it, but you're not gonna like it. ;)
$result = preg_replace(
'~[\r\n]++(?=(?>[^<]++|<(?!/?textarea\b))*+(?!</textarea\b))~',
'XYZ', $source);
After matching a line break, the lookahead scans ahead, consuming any character that's not a left angle bracket, or any left angle bracket that's not the beginning of a <textarea> or </textarea> tag. When it runs out of those, the next thing it sees has to be one of those tags or the end of the string. If it's a </textarea> tag, that means the line break was found inside a textarea element, so the match fails, and that line break is not replaced.
I've included an expanded version below, and you can see it in action on ideone. You can adapt it to handle those other tags too, if you really want to. But it sounds to me like what you need is an HTML minimizer (or minifier); there are plenty of those available.
$re=<<<EOT
~
[\r\n]++
(?=
(?>
[^<]++ # not left angle brackets, or
|
<(?!/?textarea\b) # bracket if not for TA tag (opening or closing)
)*+
(?!</textarea\b) # first TA tag found must be opening, not closing
)
~x
EOT;
If you still want to go with regexp, you may try this - escape newlines inside special tags, delete newlines and then unescape:
<?php //5.3 syntax here
//Regex matches everything within textarea, pre or code tags
//(lazy .*? so that each element is matched on its own, not first-open to last-close)
$str = preg_replace_callback('#<(?P<tag>textarea|pre|code)[^>]*?>.*?</(?P=tag)>#si',
function ($matches) {
//and then replaces every newline by some escape sequence
return str_replace("\n", "%ESCAPED_NEWLINE%", $matches[0]);
}, $str);
//after all we can safely remove newlines
//and then replace escape sequences by newlines
$str = str_replace(array("\n", "%ESCAPED_NEWLINE%"), array('', "\n"), $str);
Why use a regex for this? Why not use a very simple state machine to do it? Work through the string looking for opening <textarea> tags, and when inside them look for the closing tag instead. When you come across a newline, convert it or not based on whether you're currently inside a <textarea> or not.
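The state-machine idea above could look roughly like this. It is a minimal sketch, not a full HTML parser: the function name replaceNewlinesOutsideTextarea() is hypothetical, and the tag detection is a plain case-insensitive prefix check that ignores comments, attributes containing "<textarea", and similar edge cases.

```php
<?php
// Hypothetical helper (not from the thread): replace line breaks with
// $replacement unless they occur inside <textarea>...</textarea>.
function replaceNewlinesOutsideTextarea(string $html, string $replacement): string
{
    $out = '';
    $insideTextarea = false; // the single piece of state we track
    $len = strlen($html);

    for ($i = 0; $i < $len; $i++) {
        // State transitions: look for the opening tag when outside,
        // and for the closing tag when inside.
        if (!$insideTextarea && substr_compare($html, '<textarea', $i, 9, true) === 0) {
            $insideTextarea = true;
        } elseif ($insideTextarea && substr_compare($html, '</textarea', $i, 10, true) === 0) {
            $insideTextarea = false;
        }

        $ch = $html[$i];
        if (!$insideTextarea && ($ch === "\r" || $ch === "\n")) {
            // Treat "\r\n" as a single line break.
            if ($ch === "\r" && $i + 1 < $len && $html[$i + 1] === "\n") {
                $i++;
            }
            $out .= $replacement;
        } else {
            $out .= $ch;
        }
    }
    return $out;
}

$source = "<strong>abcd\nefg</strong>\n<textarea>curious\ngeorge\n</textarea>\n<span>happy</span>";
echo replaceNewlinesOutsideTextarea($source, 'xxx');
```

Unlike the question's example, this version replaces each line break separately rather than each run of line breaks; that choice is an assumption and easy to change.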
What you are doing is parsing HTML. You cannot parse HTML with a regular expression.

Matching duplicate whitespace with preg_replace

I'm writing a WordPress plugin, and one of the features is removing duplicate whitespace.
My code looks like this:
return preg_replace('/\s\s+/u', ' ', $text, -1, $count);
I don't understand why I need the u modifier. I've seen other plugins that use preg_replace and don't need to modify it for Unicode. I believe I have a default installation of WordPress.
Without the modifier, the code replaces all the spaces with Unicode replacement glyphs instead of spaces.
With the u modifier, I don't get the glyphs, and it doesn't replace all the whitespace.
Each gap below has from 1-10 spaces. The regex only removes one space from each group.
Before:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
After:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
$count = 9
How can I make the regex replace the whole match with the one space?
Update: If I try this with regular php, it works fine
$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
It only breaks when I use it within the wordpress plugin.
I'm using this function in a filter:
function jje_test( $text ) {
$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
echo "Count: $count";
return $new_text;
}
add_filter('the_content', 'jje_test');
I have tried:
Removing all other filters on the_content
remove_all_filters('the_content');
Changing the priority of the filter added to the_content, earlier or later
All kinds of permutations of \s+, \s\s+, [ ]+ etc.
Even replacing all single spaces with an empty string, will not replace the spaces
This will replace all sequences of two or more spaces, tabs, and/or line breaks with a single space:
return preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);
You need the /u flag if $text holds text encoded as UTF-8. Even if there are no Unicode characters in your regex, PCRE has to interpret $text correctly.
I added \p{Z} to the character class because PCRE only matches ASCII characters when using shorthands such as \s, even when using /u. Adding \p{Z} makes sure all Unicode whitespace is matched. There might be other spaces such as non-breaking spaces in your string.
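A quick check of the pattern above on a string that mixes non-breaking spaces with ordinary whitespace. This is only a sketch: the sample text is made up, and it assumes $text is valid UTF-8 (with /u, preg_replace returns null on invalid input).

```php
<?php
// Made-up sample: two non-breaking spaces (U+00A0) plus a space,
// a tab/newline/space run, and three plain spaces.
$text = "Extra\u{00A0}\u{00A0} space,\t\n lots   of it.";

// Each run of two or more whitespace/separator characters becomes one space.
// Single spaces (as in "of it.") are left alone.
$clean = preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);

echo $clean;
```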
I'm not sure if using echo in a WordPress filter is a good idea.
The u modifier simply puts it into UTF-8 mode, which is useful if you need to do anything specific with characters that have a code point above 0x7f. You can still work on UTF-8 encoded strings without using that modifier, you just won't be able to specifically match or transform such characters easily.
There are some whitespace characters in Unicode that are above 0x7f. It's pretty rare to encounter them in most data. But you may see, for example, a non-breaking space character, which is Unicode U+00A0, or some rarer characters.
I don't know why using it would cause Unicode "replacement" glyphs to be output. I'd say it would be a problem elsewhere... what character encoding are you outputting your script as?
To answer jjeaton's follow-up question in the comments to my first reply, the following replaces each sequence of spaces, tabs, and/or line breaks with the first character in that sequence. Effectively, this deletes the second and following whitespace characters in each sequence of two or more whitespace characters. A run of spaces is replaced with a single space, a run of tabs is replaced with a single tab, etc. A run of a space and a tab (in that order) is replaced with a space, and a run of a tab and a space is replaced with a tab, etc.
return preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);
This regex works by first matching one whitespace character and capturing it with a capturing group, followed by one or more further whitespace characters. The replacement text simply reinserts the text matched by the first (and only) capturing group.
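A short sketch of the behaviour just described, where each run keeps its first character. The sample string is made up for illustration and assumed to be valid UTF-8.

```php
<?php
// Made-up sample: a run starting with a tab, a run of two spaces,
// and a run starting with a space followed by a tab.
$text = "tab\t\t first,  space \tfirst";

// $1 is the first whitespace character of each run; the rest of the run is dropped.
$result = preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);

var_dump($result);
```

Here the first run collapses to a tab and the other two collapse to a space, matching the "tab then space gives tab, space then tab gives space" rule from the answer.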
Don't know about any modifiers, but this did the trick:
<?php
$text = ' Hi, my name is Andrés. ';
echo preg_replace(array('/^\s+/', '/\s+$/', '/\s{2,}/'), ' ', $text);
/*
Hi, my name is Andrés.
*/
?>
preg_replace('!\s+!', ' ', 'This sentence has extra space. This doesn’t. Extra space, Lots of extra space.');
