A regex to remove whitespace and line breaks from HTML document - php

I am using this regular expression to remove whitespace and line breaks from an HTML document.
However, it doesn't seem to handle line breaks very well.
preg_replace('/(?:(?<=\>)|(?<=\/\>))(\s+)(?=\<\/?)/', '', $HTML);
How can I improve the above?
I am only trying to remove whitespace between the end of one HTML tag and the beginning of the next.

How about this regex? It's not perfect (it only handles whitespace at the beginning and end of the line) but it works for me.
$html = preg_replace('/[\t\s\n]*(<.*>)[\t\s\n]*/', '$1', $html);
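A minimal sketch of a narrower alternative, assuming the goal is only to drop whitespace that sits between a closing > and the next <. The function name strip_inter_tag_whitespace is hypothetical, and the pattern does not protect whitespace-sensitive elements such as <pre> or <textarea>.

```php
<?php
// Hypothetical helper: removes whitespace-only gaps between two tags.
function strip_inter_tag_whitespace(string $html): string
{
    // '>' followed by one or more whitespace characters (including line
    // breaks) and then '<' becomes '><'. Whitespace inside text is untouched.
    return preg_replace('/>\s+</', '><', $html);
}

$html = "<ul>\n  <li>one</li>\n  <li>two words</li>\n</ul>";
echo strip_inter_tag_whitespace($html), "\n";
// prints: <ul><li>one</li><li>two words</li></ul>
```

Because \s already covers tabs and newlines, no separate \t or \n is needed in the pattern.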

Related

Strip leading and trailing whitespaces within a string matched using regex in php

I have a html string
$html = "<p>I'm a para</p><b> I'm bold </b>";
Now I can replace the bold tags with wiki markup(*) using php regex as:
$html = preg_replace('/<b>(.*?)<\/b>/', '*\1*', $html);
Apart from this, I also want the leading and trailing whitespace inside the bold tag removed in the same preg_replace call.
Can anyone help me on how to do this?
Thanks in advance,
Varun
Try using:
$html = preg_replace('~<b>\s*(.*?)\s*</b>\s*~i', '*\1*', $html);
The \s* between each tag and the captured group consumes the whitespace to be trimmed, so it stays out of \1. The i flag makes the match case-insensitive, and ~ is used as the delimiter so you don't have to escape forward slashes.
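As an illustration of that approach, here is a small sketch run against the string from the question. It drops the trailing \s* after </b> so that whitespace following the tag is kept; that choice is an assumption about the desired output, not part of the answer above.

```php
<?php
$html = "<p>I'm a para</p><b> I'm bold </b>";

// \s* on both sides of the lazy group trims the captured text;
// ~ delimiters avoid escaping the slash in </b>, and i ignores tag case.
$wiki = preg_replace('~<b>\s*(.*?)\s*</b>~i', '*$1*', $html);

echo $wiki, "\n";
// prints: <p>I'm a para</p>*I'm bold*
```

Without the s modifier, .*? does not cross line breaks, so bold text that spans several lines would be left unconverted.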
Apart from this I also want the leading and trailing whitespaces in the bold tag removed
Easy enough, this will do it just fine (\s* rather than \s+, so bold text without surrounding whitespace is still converted):
$html = preg_replace('/<b>\s*(.*?)\s*<\/b>/', '*\1*', $html);
Use \s symbol for that (\s* means that there could be 0 or more occurrences):
$html = preg_replace('/\<b\>\s*(.*?)\s*\<\/b\>/i', '*\1*', $html);
I also suggest using the i modifier, since HTML tag names are case-insensitive. Escaping < and > is optional: they only have a special meaning inside certain regex constructs (named groups, lookbehinds), so your regex will work without escaping them.
(edit): it seems I've misunderstood 'trailing/leading' sense.

How to remove multiple spaces and new lines from a string in PHP?

I have a form with a text area, and I need to remove any runs of multiple spaces and multiple new lines from the string entered there.
I have written this function to remove the multiple spaces
function fix_multi_spaces($string)
{
    $reg_exp = '/\s+/';
    return preg_replace($reg_exp, " ", $string);
}
This function works well for spaces, but it also replaces the new lines, turning them into a single space.
I need to change multiple spaces into 1 space and multiple new lines into 1 new line.
How can I do that?
Use
preg_replace('/(( )+|(\\n)+)/', '$2$3', $string);
This will work specifically for spaces and newlines; you will have to add other whitespace characters (such as \t for tabs) to the regex if you want to target them as well.
This regex works by matching either one or more spaces or one or more newlines and replacing the match with a space (but only if spaces were matched) and a newline (but only if newlines were matched).
Update: Turns out there's some regex functionality tailored for such cases which I didn't know about (many thanks to craniumonempty for the comment!). You can write the regex perhaps more appropriately as
preg_replace('/(?|( )+|(\\n)+)/', '$1', $string);
Keep in mind that \s in a regex matches all whitespace: spaces, newlines, tabs, etc.
If you would like to replace multiple spaces by one and multiple newlines by one, you would have to rewrite the function to call preg_replace twice: once replacing spaces and once replacing newlines.
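The two-pass idea can be sketched as follows. The function name collapse_spaces_and_newlines is hypothetical, and treating tabs like spaces and \r\n like \n are assumptions that go slightly beyond the question.

```php
<?php
// Hypothetical helper: collapses runs of spaces/tabs and runs of line breaks
// separately, so line breaks are never turned into spaces.
function collapse_spaces_and_newlines(string $text): string
{
    // Pass 1: any run of spaces or tabs becomes one space.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    // Pass 2: any run of line breaks (\r\n, \r or \n) becomes one \n.
    return preg_replace('/(?:\r\n|\r|\n)+/', "\n", $text);
}

$input = "first   line\n\n\nsecond \t line\r\n\r\nthird";
echo collapse_spaces_and_newlines($input), "\n";
// prints three lines: "first line", "second line", "third"
```

Line breaks separated by spaces (for example "\n \n") are not merged by this sketch; trimming each line first would be one way to handle that.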
You can use the following function to replace multiple spaces and line breaks with a single space:
function test($content_area) {
    // Newlines and tabs to a single space
    $content_area = str_replace(array("\r\n", "\r", "\n", "\t"), ' ', $content_area);
    // Multiple spaces to a single space (preg_replace; ereg_replace was removed in PHP 7)
    $content_area = preg_replace('/ {2,}/', ' ', $content_area);
    return $content_area;
}

Problem with PHP strip tag

I am trying to get my site feed working.
I need to select some content and display it in my feed. After selecting, I strip the tags and then display it.
The problem is this:
The data still displays as if the tags were still there (although no HTML tag is visible). For example, after stripping, my source looks like this:
Hello (just illustrating)
----There will be gap in between as if html character still exist, but cant see any when i view my source-----
Hi
How can I fix this? Thanks.
EDIT:
To make it clearer, after stripping i still get text like this:
This is my first line
This is my second line with a gap in between the first line and second line as if there is a paragraph tag
UPDATE
I am using this:
$body=substr(strip_tags(preg_replace('/\n{2,}/',"\n",$row["post_content"])),0,150);
When I echo $body, it still contains the new lines.
You may have a \n that was at the end of the paragraphs, after the closing tags you stripped.
preg_replace('/[\p{Z}\s]{2,}/s',' ',$string);
will replace every run of two or more whitespace characters (spaces, tabs, new lines, Unicode separators) with a single space. A single newline on its own is left alone; use + instead of {2,} to replace those too.
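Note on \s: the description "matches any white-space character, equivalent to [\f\n\r\t\v\x85\p{Z}]" comes from the .NET regex documentation, as does the ECMAScript option under which \s is equivalent to [ \f\n\r\t\v]. In PHP's PCRE, \s matches only ASCII whitespace by default, which is why \p{Z} is listed explicitly in the character class.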
strip_tags will literally strip the tags, leaving any other whitespace behind.
You could get rid of extra newlines and whitespace with regular expressions, but depending on your content, you might mangle it.
Collapse repeated newlines:
$string = preg_replace('/\n{2,}/',"\n",$string);
Collapse repeated spaces:
$string = preg_replace('/ {2,}/',' ',$string);
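Putting those pieces together for the feed case, a sketch might strip the tags first and tidy the leftover whitespace afterwards. The function name feed_excerpt and the choice to flatten everything to single spaces are assumptions; the 150-character limit comes from the question.

```php
<?php
// Hypothetical helper for a feed excerpt: strip tags first, then tidy
// the whitespace that the removed tags leave behind.
function feed_excerpt(string $html, int $length = 150): string
{
    $text = strip_tags($html);
    // Collapse every run of whitespace (spaces, tabs, line breaks) to one space.
    $text = preg_replace('/\s+/', ' ', $text);
    // Drop whitespace left at either end, then cut to the requested length.
    return substr(trim($text), 0, $length);
}

$post = "<p>This is my first line</p>\n\n<p>This is my second line</p>\n";
echo feed_excerpt($post), "\n";
// prints: This is my first line This is my second line
```

substr counts bytes, so multibyte text may be cut in the middle of a character; mb_substr is the usual alternative when the mbstring extension is available.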
I was experiencing something annoyingly similar. Solved it with trim:
$body=strip_tags(trim($row["post_content"]));

Regex whitespace

preg_match('/<div class="prices">/s<h3>(.+?)<\/h3>/is', $response, $matches);
There's whitespace and potentially new lines between the prices div and the h3 tag. How do I use /s to match that?
You don't use /s, you use \s*.
It's a backslash (\), not a slash (/).
The * afterwards means that it matches zero or more whitespace characters.
Also, please consider using an HTML parser if you are trying to find HTML tags. A proper HTML parser will be able to correctly handle whitespace, HTML comments and other features of HTML that your regular expression cannot handle.
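For the regex route, the \s* fix can be sketched like this. The sample markup in $response is an assumption standing in for the real page.

```php
<?php
// Sample input standing in for $response; the markup is an assumption.
$response = "<div class=\"prices\">\n    <h3>19.99</h3>\n</div>";

// \s* allows any amount of whitespace, including line breaks, between the tags.
// # is used as the delimiter so the slash in </h3> needs no escaping.
if (preg_match('#<div class="prices">\s*<h3>(.+?)</h3>#is', $response, $matches)) {
    echo $matches[1], "\n"; // prints: 19.99
}
```

The trailing s after the closing delimiter is the "dot matches newline" modifier; it affects the . in (.+?) and has nothing to do with matching whitespace between the tags.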

regex: find new line character in string that isn't in textarea

heya, so I'm looking for a regex that would allow me to basically replace a newline character with whatever (eg. 'xxx'), but only if the newline character isn't within textarea tags.
For example, the following:
<strong>abcd
efg</strong>
<textarea>curious
george
</textarea>
<span>happy</span>
Would become:
<strong>abcdxxxefg</strong>xxx<textarea>curious
george
</textarea>xxx<span>happy</span>
Anyone have any idea on where I should start? I'm kinda clueless here :(
Thanks for any help possible.
I've got it, but you're not gonna like it. ;)
$result = preg_replace(
'~[\r\n]++(?=(?>[^<]++|<(?!/?textarea\b))*+(?!</textarea\b))~',
'XYZ', $source);
After matching a line break, the lookahead scans ahead, consuming any character that's not a left angle bracket, or any left angle bracket that's not the beginning of a <textarea> or </textarea> tag. When it runs out of those, the next thing it sees has to be one of those tags or the end of the string. If it's a </textarea> tag, that means the line break was found inside a textarea element, so the match fails, and that line break is not replaced.
I've included an expanded version below, and you can see it in action on ideone. You can adapt it to handle those other tags too, if you really want to. But it sounds to me like what you need is an HTML minimizer (or minifier); there are plenty of those available.
$re = <<<EOT
~
[\r\n]++
(?=
    (?>
        [^<]++                  # not left angle brackets, or
      |
        <(?!/?textarea\b)       # bracket if not for TA tag (opening or closing)
    )*+
    (?!</textarea\b)            # first TA tag found must be opening, not closing
)
~x
EOT;
If you still want to go with regexp, you may try this: escape newlines inside special tags, delete the remaining newlines and then unescape:
<?php // PHP 5.3+ syntax (closures)
// Regex matches everything within textarea, pre or code tags.
// The lazy .*? stops at the first matching closing tag, so text between
// two separate elements is not swallowed.
$str = preg_replace_callback('#<(?P<tag>textarea|pre|code)[^>]*?>.*?</(?P=tag)>#si',
    function ($matches) {
        // and then replaces every newline by some escape sequence
        return str_replace("\n", "%ESCAPED_NEWLINE%", $matches[0]);
    }, $str);
// after that we can safely remove the remaining newlines
// and then replace escape sequences by newlines
$str = str_replace(array("\n", "%ESCAPED_NEWLINE%"), array('', "\n"), $str);
Why use a regex for this? Why not use a very simple state machine to do it? Work through the string looking for opening <textarea> tags, and when inside them look for the closing tag instead. When you come across a newline, convert it or not based on whether you're currently inside a <textarea> or not.
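A rough sketch of such a state machine, assuming well-formed, non-nested textarea tags. The function name replace_newlines_outside_textarea is hypothetical.

```php
<?php
// Hypothetical helper: replaces line breaks that are outside <textarea> elements.
function replace_newlines_outside_textarea(string $html, string $replacement): string
{
    $out = '';
    $pos = 0;
    $len = strlen($html);
    $inside = false; // state: are we currently inside a <textarea> element?

    while ($pos < $len) {
        // Look for the tag that ends the current state.
        $marker = $inside ? '</textarea' : '<textarea';
        $next = stripos($html, $marker, $pos);
        $end = ($next === false) ? $len : $next;
        $chunk = substr($html, $pos, $end - $pos);

        // Convert line breaks only while outside a textarea.
        // strtr with an array works in one pass and prefers the longest key.
        $out .= $inside
            ? $chunk
            : strtr($chunk, array("\r\n" => $replacement, "\r" => $replacement, "\n" => $replacement));

        if ($next === false) {
            break; // no further tag: the rest of the input has been copied
        }
        // Copy the marker itself and flip the state.
        $out .= substr($html, $next, strlen($marker));
        $pos = $next + strlen($marker);
        $inside = !$inside;
    }
    return $out;
}

$source = "<strong>abcd\nefg</strong>\n<textarea>curious\ngeorge\n</textarea>\n<span>happy</span>";
echo replace_newlines_outside_textarea($source, 'xxx'), "\n";
```

On the example from the question, the line breaks inside the textarea survive and the three outside it become xxx. Line breaks inside the attributes of the opening <textarea ...> tag are also left alone, because the state flips as soon as the tag name is seen.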
What you are doing is parsing HTML. You cannot parse HTML with a regular expression.
