regex: change html before saving in database - php

Before saving to the database I need to:
delete all tags
delete all runs of more than one whitespace character
delete all runs of more than one newline
To do this I use the following:
$content = preg_replace('/<[^>]+>/', "", $content);
$content = preg_replace('/\n/', "NewLine", $content); // placeholder, so newlines are not lost when collapsing whitespace
$content = preg_replace('/(\&nbsp\;){1,}/', " ", $content);
$content = preg_replace('/[\s]{2,}/', " ", $content);
And finally I must delete the repeated "NewLine" words.
After the first two steps I get text in this format:
NewLineWordOfText
NewLine
NewLine
NewLine NewLine WordOfText "WordOfText WordOfText" WordOfText NewLine"WordOfText
...
How do I delete more than one newline from such content?
Thanks

First of all, while HTML is not regular and thus it is a bad idea to use regular expressions to parse it, PHP has a function that will remove tags for you: strip_tags
To squeeze spaces while preserving newlines:
$content = preg_replace('/[^\n\S]{2,}/', " ", $content);
$content = preg_replace('/\n{2,}/', "\n", $content);
The first line will squeeze all whitespace other than \n ([^\n\S] means all characters that aren't \n and not a non-whitespace character) into one space. The second will squeeze multiple newlines into a single newline.
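A minimal sketch of how these pieces might fit together, assuming the input uses \n line endings. The function name squeezeWhitespace is made up for illustration and is not part of the original answer:

```php
<?php
// Hypothetical helper combining strip_tags() with the two squeezes above.
function squeezeWhitespace(string $content): string
{
    $content = strip_tags($content);                          // drop all tags
    $content = preg_replace('/[^\n\S]{2,}/', ' ', $content);  // runs of non-newline whitespace -> one space
    $content = preg_replace('/\n{2,}/', "\n", $content);      // runs of newlines -> one newline
    return $content;
}

echo squeezeWhitespace("<p>Hello   world</p>\n\n\n<b>again</b>"), PHP_EOL;
```

Because \r counts as non-newline whitespace here, text with \r\n line endings would probably need normalizing first.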

Why not use nl2br(), then preg_replace every <br /><br /> with a single <br />, then convert all <br />s back to \n?

Related

PHP Remove extra spaces between break lines

I have a string in PHP. I'm able to remove multiple consecutive line breaks and multiple spaces, but what I still can't do is remove multiple line breaks when there is a space between them.
For example:
Text \r\n \r\n extra text
I would like to clean this text as:
Text \r\nextra text
This would also be acceptable:
Text \r\n \r\nextra text
There doesn't need to be an extra space after the line break.
What I have right now is:
function clearHTML($text){
$text = strip_tags($text);
$text = str_replace("&nbsp;", " ", $text);
$text = preg_replace("/[[:blank:]]+/"," ",$text);
$text = preg_replace("/([\r\n]{4,}|[\n]{2,}|[\r]{2,})/", "\r\n", $text);
$text = trim($text);
return $text;
}
Any suggestions?
To remove extra whitespace between lines, you can use
preg_replace('~\h*(\R)\s*~', '$1', $text)
The regex matches:
\h* - 0 or more horizontal whitespaces
(\R) - Group 1: any line ending sequence (the replacement is $1, just this group value)
\s* - zero or more whitespaces
The whitespace shrinking part can be merged into a single preg_replace call with the (?:&nbsp;|\h)+ regex, which matches one or more occurrences of an &nbsp; string or a horizontal whitespace.
NOTE: If you have Unicode texts, you will need the u flag.
The whole cleaning function can look like
function clearHTML($text){
$text = strip_tags($text);
$text = preg_replace("~(?:&nbsp;|\h)+~u", " ", $text);
$text = preg_replace('~\h*(\R)\s*~u', '$1', $text);
return trim($text);
}
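As a quick check, the same steps can be run against the sample from the question. This is only a sketch: the function is repeated under the made-up name clearHtmlSketch so the block runs on its own, and the &nbsp; entity in the pattern is an assumption about what the editor emits.

```php
<?php
// Hypothetical standalone copy of the cleaning steps, for experimenting.
function clearHtmlSketch(string $text): string
{
    $text = strip_tags($text);
    // one or more "&nbsp;" entities or horizontal whitespaces -> a single space
    $text = preg_replace('~(?:&nbsp;|\h)+~u', ' ', $text);
    // optional horizontal whitespace, one line ending, then any whitespace -> that line ending
    $text = preg_replace('~\h*(\R)\s*~u', '$1', $text);
    return trim($text);
}

var_dump(clearHtmlSketch("Text \r\n \r\n extra text") === "Text\r\nextra text");
var_dump(clearHtmlSketch("<p>A&nbsp;&nbsp; B</p>\n\n  <p>C</p>") === "A B\nC");
```

Note that the space before the line break is removed as well, which the question says is acceptable.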

PHP: Clean HTML by merging line breaks and removing whitespaces properly

I am using a WYSIWYG editor and have a bunch of regular expressions that take care of dirty HTML. Reason: My users often hit the enter key too often and produce many redundant new lines such as:
<br><br><br> ...
<p> <br /> </p>
<p> <br /><br /> </p>
<p> <br /> </p>
<p> <br /> </p>
<p> <br /> </p>
and many more varieties including p, and br
This is how I try to fight such inputs currently, trying to merge many successive line breaks into 1, using many different regular expressions:
// merge empty p tags into one
// http://stackoverflow.com/q/16809336/1066234
$content = preg_replace('/((<p\s*\/?>\s*)&nbsp;(<\/p\s*\/?>\s*))+/im', "<p>&nbsp;</p>\n", $content);
// remove sceditor's: <p>\n<br>\n</p> from end of string
// http://stackoverflow.com/questions/25269584/how-to-replace-pbr-p-from-end-of-string-that-contain-whitespaces-linebrea
// \s* matches any number of whitespace characters (" ", \t, \n, etc)
// (?:...)+ matches one or more (without capturing the group)
// $ forces match to only be made at the end of the string
$content = preg_replace("/(?:<p>\s*(<br>\s*)+\s*<\/p>\s*)+$/", "", $content);
// remove sceditor's double: http://http://
$content = str_replace('http://http://', 'http://', $content);
// remove &nbsp; spaces from end of string
$content = preg_replace('/(&nbsp;)+$/', '', $content);
// remove also <p><br></p> from end of string
$content = preg_replace('/(<p><br><\/p>)+$/', '', $content);
// remove line breaks from end of string - $ is end of line, +$ is end of line including \n
// html with <p>&nbsp;</p>
$content = preg_replace('/(<p>&nbsp;<\/p>)+$/', '', $content);
$content = preg_replace('/(<br>)+$/', '', $content);
// remove line breaks from beginning of string
$content = preg_replace('/^(<p>&nbsp;<\/p>)+/', '', $content);
I am searching for a new solution. Is there any HTML parser that I can tell to merge line breaks and whitespaces? Or maybe someone has another approach to that problem.
The regex solutions above do not seem proper enough because new combinations of line break "attempts" by my users slip through.
I have developed the following snippet, which removes duplicate br tags.
<?php
$content = "<h1>Hello World</h1><p>Test\r\n<br>\r\n<br >\r\n<br >\r\n<br/>Test\r\n<br />\r\n<br /></p>";
echo "<code>{$content}</code><hr>\r\n\r\n\r\n\r\n";
$contentStripped = preg_replace('/(<br {0,}\/{0,1}>(\\r|\\n){0,}){2,}/', '<br class="reduced" />', $content);
echo "<code>{$contentStripped}</code>\r\n\r\n\r\n\r\n";
You may have to add more test cases.
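A few such test cases could be sketched like this. The helper name collapseBr is hypothetical, and the pattern is a variant that uses \s* and the i flag instead of the explicit \r|\n alternation, so it is not identical to the snippet above:

```php
<?php
// Hypothetical helper: collapse two or more consecutive <br> tags into one.
function collapseBr(string $html): string
{
    // (?:<br\s*/?>\s*){2,} : two or more <br>, <br/> or <br /> tags,
    // each optionally followed by whitespace (including \r and \n)
    return preg_replace('~(?:<br\s*/?>\s*){2,}~i', '<br />', $html);
}

$cases = [
    "a<br><br />\n<BR/>b",   // mixed spellings and case
    "a<br>b",                // a single break must stay untouched
    "x<br >\r\n<br/>\r\ny",  // Windows line endings between the tags
];
foreach ($cases as $case) {
    echo json_encode($case), ' => ', json_encode(collapseBr($case)), PHP_EOL;
}
```

Whitespace that follows the last tag of a run is swallowed too, which may or may not be what you want.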
You can use nl2br(strip_tags($content)) instead of the long code above.

Php display as a html text the new line \n

I'm using echo to display a result, but the result contains line breaks and tabs (\n, \t, \r).
I want to know whether the result contains \n, \t or \r, and how many of each. I need to know so I can replace them with an HTML tag like <p> or <div>.
The result is coming from another website.
In pattern CreditTransaction/CustomerData:
Email does not contain any text
In pattern RecurUpdate/CustomerData:
Email does not contain any text
In pattern AccountInfo:
I want it like this:
In pattern CreditTransaction/CustomerData:
\n
\n
\n
\n\tEmail does not contain any text
\n
In pattern RecurUpdate/CustomerData:
\n
\n
\n
\n\tEmail does not contain any text
\n\tIn pattern AccountInfo:
Your question is quite unclear but I'll do my best to provide an answer.
If you want to make \n, \r, and \t visible in the output you could just manually escape them:
str_replace("\n", '\n', str_replace("\r", '\r', str_replace("\t", '\t', $string)));
Or, if you want to escape all control characters at once (addslashes() only escapes quotes, backslashes and NUL bytes, so it will not do this):
addcslashes($string, "\0..\37");
To count how many times a specific character/substring occurs:
substr_count($string, $character_or_substring);
To check if the string contains a specific character/substring:
if (substr_count($string, $character_or_substring) > 0) {
// your code
}
Or:
if (strpos($string, $character_or_substring) !== false) { // notice the !==
// your code
}
As mentioned earlier by someone else in a comment, if you want to convert the newlines to br tags:
nl2br($string);
If you want tabs to show as indentation you could replace all tabs with &nbsp;:
str_replace("\t", '&nbsp;', $string);
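Putting the counting and the escaping together might look like the sketch below. Both function names are invented for this example, and the sample string is only an approximation of the output shown in the question:

```php
<?php
// Hypothetical helper: count how often each control character occurs.
function countControlChars(string $s): array
{
    return [
        '\n' => substr_count($s, "\n"),
        '\r' => substr_count($s, "\r"),
        '\t' => substr_count($s, "\t"),
    ];
}

// Hypothetical helper: make the control characters visible as literal \n, \r, \t.
function showControlChars(string $s): string
{
    // double-quoted keys are the real characters, single-quoted values are literal text
    return strtr($s, ["\n" => '\n', "\r" => '\r', "\t" => '\t']);
}

$sample = "In pattern AccountInfo:\n\n\tEmail does not contain any text\n";
print_r(countControlChars($sample));
echo showControlChars($sample), PHP_EOL;
```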
Use double quotes to find newline and tab characters.
$s = "In pattern CreditTransaction/CustomerData:
Email does not contain any text
In pattern RecurUpdate/CustomerData: ";
echo str_replace("\t", "*", $s); // Replace all tabs with '*'
echo str_replace("\n", "*", $s); // Replace all newlines with '*'

How to remove &nbsp; from inside tags, inside a string

I'm outputting a huge string which is created by a WYSIWYG editor. The following is an example of an output that the editor makes. I'm trying to prevent all the spaces created by the &nbsp's from the beginning and end (inside each tag), so that the tags tightly wrap the content. Rtrim and Ltrim don't work because they trim the whole string, not the tags inside it.
Here is an example of a string.
<div> Small amount of text, should be alot.</div>
<div>Small amount of text, should be alot. </div>
This will output something along the lines of the following (the div has been left in to show the extent of the spaces, but would be hidden on output.)
<div> Small amount of text, should be alot.</div>
<div>Small amount of text, should be alot. </div>
I would prefer this output:
<div>Small amount of text, should be alot.</div>
<div>Small amount of text, should be alot.</div>
How can this be achieved?
Replace repetitions of spaces and/or &nbsp; with a single &nbsp;:
preg_replace('/(?: |&nbsp;){2,}/', '&nbsp;', $string);
If you want to convert all &nbsp; to spaces, and then collapse the spaces:
preg_replace('/ {2,}/', ' ', str_replace('&nbsp;', ' ', $string));
Note that this is not going to remove single spaces from after the opening tag, or before the closing tag. For something like that you're getting into some pretty nasty territory with regular expressions and you'll want to parse the document using DOM or XML instead.
However, leading and trailing whitespace is generally insignificant in HTML, so this should get you where you're going.
I assume that your final goal is to:
replace &nbsp; with " "
squeeze whitespace
remove trailing and leading whitespace inside tags.
You can achieve that with:
//1 replace hard spaces with spaces
$text = str_replace('&nbsp;', ' ', $text);
//2 squeeze spaces
$text = preg_replace('/\s+/', ' ', $text);
//replace "> " & " <" with ">" & "<" respectively
$result = str_replace(array('> ', ' <'), array('>', '<'), $text);
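Wrapped into one function and run against markup like the question's, the three steps can be sketched as follows. The name tightenTags and the exact number of &nbsp; entities in the sample are assumptions made for the example:

```php
<?php
// Hypothetical helper wrapping the three steps above.
function tightenTags(string $html): string
{
    $html = str_replace('&nbsp;', ' ', $html);            // 1. hard spaces -> plain spaces
    $html = preg_replace('/\s+/', ' ', $html);            // 2. squeeze all whitespace runs
    return str_replace(['> ', ' <'], ['>', '<'], $html);  // 3. trim just inside the tags
}

$input = "<div>&nbsp;&nbsp; Small amount of text, should be alot.</div>\n"
       . "<div>Small amount of text, should be alot.&nbsp;&nbsp;</div>";
echo tightenTags($input), PHP_EOL;
```

One caveat of this approach: a space between two adjacent inline tags is removed as well, so `<b>a</b> <i>b</i>` becomes `<b>a</b><i>b</i>` and the two words run together.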
Why not try replacing the string?
$newStr = str_replace("&nbsp;", "", $oldStr);

Removing various sorts of whitespace in PHP

I am looking to remove multiple line breaks using regular expression. Say I have this text:
"On the Insert tab\n \n\nthe galleries include \n\n items that are designed"
then I want to replace it with
"On the Insert tab\nthe galleries include\nitems that are designed"
So my requirement is:
it will remove all multiple newlines and will replace with one newline
It will remove all multiple spaces and will replace with one space
Spaces will be trimmed as well
I did search a lot but couldn't find a solution; the closest I got was this one: Removing redundant line breaks with regular expressions.
Use this :
echo trim(preg_replace('#(\s)+#',"$1",$string));
$text = str_replace("\r\n", "\n", $text); // converts Windows new lines to Linux ones
while (strpos($text, "\n\n") !== false)
{
$text = str_replace("\n\n", "\n", $text);
}
That will sort out newline characters.
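$text = trim($text);
// collapse runs of horizontal whitespace only, so newlines survive this step
$text = preg_replace('/\h+/u', ' ', $text);
// collapse each run of line breaks (and the whitespace around it) into one \n
$text = preg_replace('/\s*(?:\r\n|\r|\n)\s*/', "\n", $text);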
Thanks to Removing redundant line breaks with regular expressions
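To check an approach against the three requirements, it can be run on the sample string from the question. The sketch below uses a made-up function name, normalizeWhitespace, and assumes valid UTF-8 input because of the u flag:

```php
<?php
// Hypothetical helper: one space between words, one \n between lines, trimmed ends.
function normalizeWhitespace(string $text): string
{
    $text = preg_replace('/\h+/u', ' ', $text);        // runs of spaces/tabs -> one space
    $text = preg_replace('/\s*\R\s*/u', "\n", $text);  // line-break runs plus surrounding whitespace -> one \n
    return trim($text);
}

$in  = "On the Insert tab\n \n\nthe galleries include \n\n items that are designed";
$out = "On the Insert tab\nthe galleries include\nitems that are designed";
var_dump(normalizeWhitespace($in) === $out);
```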
