Hello guys, I currently have a problem with my preg_replace:
preg_replace('#[^a-zA-Z\s]#', '', $string)
It keeps all alphabetic letters and whitespace, but I want runs of more than one whitespace character reduced to a single one. Any idea how this can be done?
$output = preg_replace('!\s+!', ' ', $input);
From Regular Expression Basic Syntax Reference
\d, \w and \s
Shorthand character classes matching digits, word characters (letters, digits, and underscores), and whitespace (spaces, tabs, and line breaks). Can be used inside and outside character classes.
The character type \s stands for five different characters: horizontal tab (9), line feed (10), form feed (12), carriage return (13) and ordinary space (32); newer PCRE versions also include vertical tab (11). The following code finds every substring of $string that consists of two or more \s characters and keeps only the first of them. For example, if a line feed, a horizontal tab and an ordinary space occur immediately after one another, only the line feed remains after the replacement.
$string = preg_replace('#(\s)\s+#', '\1', $string);
preg_replace(array('#\s+#', '#[^a-zA-Z\s]#'), array(' ', ''), $string);
Note that this replaces every run of whitespace with a single space. If you want to collapse consecutive whitespace of the same kind (for example, two newlines into one newline), you will have to work out the logic for that yourself, because \s+ will match "\n \n \n" (5 whitespace characters in a row) as a single run.
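A minimal sketch of how the two steps could be combined, assuming the goal is "ASCII letters separated by single spaces". The function name clean_letters is hypothetical, and the class is written as a-zA-Z because the range A-z would also admit characters such as [ and _ that sit between Z and a.

```php
<?php
// Hypothetical helper, not taken from the answers above.
function clean_letters($string) {
    // Step 1: drop everything that is not an ASCII letter or whitespace.
    $lettersOnly = preg_replace('#[^a-zA-Z\s]#', '', $string);
    // Step 2: replace every run of whitespace with a single space.
    return preg_replace('#\s+#', ' ', $lettersOnly);
}

echo json_encode(clean_letters("Hello,   world!\n\tIt's 42.")), "\n";
```

Stripping first and collapsing second is a deliberate choice in this sketch: removing a character that stood between two spaces (as in "a 1 b") leaves two adjacent spaces, and the second step then cleans those up as well.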
try using trim instead
<?php
$something = " Error";
echo $something."\n";
echo "------"."\n";
echo trim($something);
?>
output
Error
------
Error
The question is old and is missing some details. Let's assume the OP wanted to replace each run of consecutive horizontal whitespace with a single space.
Example:
"\t\t \t \t" => " "
"\t\t \t\t" => "\t \t"
One possible solution would be simply to use the generic character type \h, which stands for horizontal whitespace:
preg_replace('/\h+/', ' ', $text)
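To make the difference between \h and \s concrete, here is a small sketch; the sample text is made up for illustration.

```php
<?php
// Sample input (an assumption): spaces, tabs and two line breaks.
$text = "one \t\t two\n\n\tthree";

// \h+ collapses only horizontal whitespace; the line breaks survive.
$horizontal = preg_replace('/\h+/', ' ', $text);

// \s+ collapses line breaks as well.
$any = preg_replace('/\s+/', ' ', $text);

echo json_encode($horizontal), "\n"; // "one two\n\n three"
echo json_encode($any), "\n";        // "one two three"
```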
Related
I was scouring through SO answers and found that the solution that most gave for replacing multiple spaces is:
$new_str = preg_replace("/\s+/", " ", $str);
But in many cases the whitespace includes Unicode characters such as line feed, form feed, carriage return, non-breaking space, etc. This wiki says that Unicode defines twenty-five characters as whitespace.
So how do we replace all these characters as well using regular expressions?
When passing u modifier, \s becomes Unicode-aware. So, a simple solution is to use
$new_str = preg_replace("/\s+/u", " ", $str);
                             ^^
See the PHP online demo.
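A quick way to check this locally, as a sketch: the sample string is an assumption, and the result relies on the PHP build making \s Unicode-aware under the u modifier, as described above.

```php
<?php
// "\xC2\xA0" is the UTF-8 encoding of U+00A0 (no-break space).
$str = "a\xC2\xA0\xC2\xA0 b";

// With /u, the two no-break spaces and the ordinary space form one run.
$withU = preg_replace('/\s+/u', ' ', $str);

// true when /u makes \s match U+00A0
var_dump($withU === "a b");
```

Without the u modifier the subject is treated as a byte string, so the two bytes of each no-break space are generally not seen as whitespace.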
The first thing to do is to read this explanation of how unicode can be treated in regex. Coming specifically to PHP, we need to first of all include the PCRE modifier 'u' for the engine to recognize UTF characters. So this would be:
$pattern = "/<our-pattern-here>/u";
The next thing is to note that in PHP unicode characters have the pattern \x{00A0} where 00A0 is hex representation for non-breaking space. So if we want to replace consecutive non-breaking spaces with a single space we would have:
$pattern = "/\x{00A0}+/u";
$new_str = preg_replace($pattern," ",$str);
And if we were to include other types of spaces mentioned in the wiki like:
\x{000D} carriage return
\x{000C} form feed
\x{0085} next line
Our pattern becomes:
$pattern = "/[\x{00A0}\x{000D}\x{000C}\x{0085}]+/u";
This works, and a character class followed by + is matched in a single linear pass, so it is not inherently slow. If it still proves slow on your input, an alternative is to first replace every occurrence of each of these characters with a normal space, and then replace multiple spaces with a single normal space. We remove the [ ]+ and instead separate the characters with the or operator | :
$pattern = "/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u";
$new_str = preg_replace($pattern," ",$str); // we have one-to-one replacement of character by a normal space, so 5 unicode chars give 5 normal spaces
$final_str = preg_replace("/\s+/", " ", $new_str); // multiple normal spaces now become single normal space
A pattern that matches all Unicode whitespace is [\pZ\pC] (note that \pC also covers control and format characters that are not whitespace). Here is a unit test to prove it.
If you're parsing user input in UTF-8 and need to normalize it, it's important to base your match on that list. So to answer your question that would be:
$new_str = preg_replace("/[\pZ\pC]+/u", " ", $str);
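A short sketch of what this pattern does in practice. The sample strings are assumptions; the second one shows that \pC reaches beyond whitespace, since it also matches control characters such as BEL (0x07).

```php
<?php
// "\xE2\x80\x83" is the UTF-8 encoding of U+2003 (em space, category Zs).
$spaced = "a\xE2\x80\x83b\tc";
$normalized = preg_replace('/[\pZ\pC]+/u', ' ', $spaced);

// BEL is a control character (category Cc), not whitespace,
// but it is replaced too because it falls under \pC.
$withBell = preg_replace('/[\pZ\pC]+/u', ' ', "x\x07y");

echo $normalized, "\n"; // a b c
echo $withBell, "\n";   // x y
```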
I manage to remove the spaces but I can't understand why it would remove my returns as well. I have a textarea in my form and I want to allow up to two returns maximum. Here is what I have been using so far.
$string = preg_replace('/\s\s+/', ' ', $string); // supposed to remove more than one consecutive space - but also deletes my returns ...
$string = preg_replace('/\n\n\n+/', '\n\n', $string); // using this one by itself does not do as expected and removes all returns ...
It seems the first line already gets rid of runs of spaces AND all returns, which is strange. I'm not sure that I am doing it right ...
Because \s also matches newline characters. So I suggest you use \h, which matches any kind of horizontal whitespace.
$string = preg_replace('/\h\h+/', ' ', $string);
\s matches any whitespace character [\r\n\t\f ]
See the definition of \s. It includes \n. Use \h instead:
\h matches any horizontal whitespace character (equal to [[:blank:]])
Use \h for horizontal whitespace.
For those of you who will need it, that's how you remove two carriage returns from a textarea.
preg_replace('/\n\r(\n\r)+/', "\n\r", $str);
For the space issue, as it has been posted above, replace \s by \h
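Putting the pieces together for the textarea case, as a sketch only: the function name tidy_textarea is hypothetical, and normalizing line endings first is my own assumption about the input. Note also that a replacement written in single quotes, such as '\n\n', is inserted as literal backslash-n characters rather than real line breaks, which may be part of what the question ran into.

```php
<?php
// Hypothetical helper for the textarea case; the name is illustrative.
function tidy_textarea($string) {
    // Normalize Windows (\r\n) and lone \r line endings to \n.
    $string = preg_replace('/\r\n?/', "\n", $string);
    // Collapse runs of horizontal whitespace; line breaks are untouched.
    $string = preg_replace('/\h+/', ' ', $string);
    // Allow at most two consecutive line breaks. The replacement is in
    // double quotes so that PHP turns \n into real newline characters.
    return preg_replace('/\n{3,}/', "\n\n", $string);
}

echo json_encode(tidy_textarea("a   b\r\n\r\n\r\n\r\nc\t\td")), "\n"; // "a b\n\nc d"
```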
Consider the following strings
breaking out a of a simple prison
this is b moving up
following me is x times better
All strings are lowercased already. I would like to remove any "loose" a-z characters, resulting in:
breaking out of simple prison
this is moving up
following me is times better
Is this possible with a single regex in php?
$str = "breaking out a of a simple prison
this is b moving up
following me is x times better";
$res = preg_replace("#\\b[a-z]\\b ?#i", "", $str);
echo $res;
How about:
preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);
Note this also catches single characters that are at the beginning or end of the string, but not single characters that are adjacent to punctuation (they must be surrounded by whitespace).
If you also want to remove characters immediately before punctuation (e.g. 'the x.'), then this should work properly in most (English) cases:
preg_replace('/(^|\s)[a-z]\b/', '$1', $string);
As a one-liner:
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);
This matches a single lowercase letter (\p{Ll}) which is preceded or followed by whitespace (\s), removing both. The word boundaries (\b) ensure that only single letters are indeed matched. The /u modifier makes the regex Unicode-aware.
The result: A single letter surrounded by spaces on both sides is reduced to a single space. A single letter preceded by whitespace but not followed by whitespace is removed completely, as is a single letter only followed but not preceded by whitespace.
So
This a is my test sentence a. o How funny (what a coincidence a) this is!
is changed to
This is my test sentence. How funny (what coincidence) this is!
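For the original, whitespace-delimited case, a lookaround-based variant is another option. This is a sketch under the assumption that a "loose" letter is one with whitespace or a string boundary on both sides; the function name is hypothetical.

```php
<?php
// Hypothetical helper: remove single a-z letters that stand alone
// between whitespace (or at the start/end of the string).
function remove_loose_letters($str) {
    // (?<!\S) : not preceded by a non-whitespace character
    // (?!\S)  : not followed by a non-whitespace character
    // \h*     : also swallow the spaces or tabs that follow the letter
    return preg_replace('/(?<!\S)[a-z](?!\S)\h*/', '', $str);
}

$str = "breaking out a of a simple prison\n"
     . "this is b moving up\n"
     . "following me is x times better";

echo remove_loose_letters($str), "\n";
```

Because the lookarounds do not consume the surrounding whitespace, several loose letters in a row (as in "a b c") are all removed, and \h* leaves the line breaks in place.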
You could try something like this:
preg_replace('/\b\S\s\b/', "", $subject);
This is what it means:
\b # Assert position at a word boundary
\S # Match a single character that is a “non-whitespace character”
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
\b # Assert position at a word boundary
Update
As raised by Radu, because I've used the \S this will match more than just a-zA-Z. It will also match 0-9_. Normally, it would match a lot more than that, but because it's preceded by \b, it can only match word characters.
As mentioned in the comments by Tim Pietzcker, be aware that this won't work if your subject string needs to remove single characters that are followed by non word characters like test a (hello). It will also fall over if there are extra spaces after the single character like this
test a hello
but you could fix that by changing the expression to \b\S\s*\b
Try this one:
$sString = preg_replace("#\b[a-z]{1}\b#m", ' ', $sString);
I'm writing a WordPress plugin, and one of the features is removing duplicate whitespace.
My code looks like this:
return preg_replace('/\s\s+/u', ' ', $text, -1, $count);
I don't understand why I need the u modifier. I've seen other plugins that use preg_replace and don't need to modify it for Unicode. I believe I have a default installation of WordPress.
Without the modifier, the code replaces all the spaces with Unicode replacement glyphs instead of spaces.
With the u modifier, I don't get the glyphs, and it doesn't replace all the whitespace.
Each gap below contains 1 to 10 spaces. The regex only removes one space from each group.
Before:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
After:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
$count = 9
How can I make the regex replace the whole match with the one space?
Update: If I try this with regular php, it works fine
$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
It only breaks when I use it within the wordpress plugin.
I'm using this function in a filter:
function jje_test( $text ) {
$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
echo "Count: $count";
return $new_text;
}
add_filter('the_content', 'jje_test');
I have tried:
Removing all other filters on the_content
remove_all_filters('the_content');
Changing the priority of the filter added to the_content, earlier or later
All kinds of permutations of \s+, \s\s+, [ ]+ etc.
Even replacing all single spaces with an empty string, will not replace the spaces
This will replace all sequences of two or more spaces, tabs, and/or line breaks with a single space:
return preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);
You need the /u flag if $text holds text encoded as UTF-8. Even if there are no Unicode characters in your regex, PCRE has to interpret $text correctly.
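I added \p{Z} to the character class because, in PCRE itself, shorthands such as \s match only ASCII characters by default, even in UTF-8 mode. (Newer PHP versions also turn on Unicode properties for these shorthands when /u is given, in which case \p{Z} is merely redundant.) Adding \p{Z} makes sure all Unicode whitespace is matched. There might be other spaces such as non-breaking spaces in your string.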
I'm not sure if using echo in a WordPress filter is a good idea.
The u modifier simply puts it into UTF-8 mode, which is useful if you need to do anything specific with characters that have a code point above 0x7f. You can still work on UTF-8 encoded strings without using that modifier, you just won't be able to specifically match or transform such characters easily.
There are some whitespace characters in Unicode that are above 0x7f. It's pretty rare to encounter them in most data. But you may see, for example, a non-breaking space character, which is Unicode U+00A0, or some rarer characters.
I don't know why using it would cause Unicode "replacement" glyphs to be output. I'd say it would be a problem elsewhere... what character encoding are you outputting your script as?
To answer jjeaton's follow-up question in the comments to my first reply, the following replaces each sequence of spaces, tabs, and/or line breaks with the first character in that sequence. Effectively, this deletes the second and following whitespace characters in each sequence of two or more whitespace characters. A run of spaces is replaced with a single space, a run of tabs is replaced with a single tab, etc. A run of a space and a tab (in that order) is replaced with a space, and a run of a tab and a space is replaced with a tab, etc.
return preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);
This regex works by first matching one whitespace character and capturing it with a capturing group, followed by one or more further whitespace characters. The replacement text simply reinserts the text matched by the first (and only) capturing group.
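The behaviour described above can be checked with a few short inputs. This is a sketch: the function name is hypothetical, and the fallback for a null result is my own addition, based on the fact that preg_replace() with /u returns null when the subject is not valid UTF-8.

```php
<?php
// Hypothetical wrapper around the pattern from the answer above.
function squeeze_keep_first($text) {
    $result = preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);
    // With /u, invalid UTF-8 input makes preg_replace() return null;
    // in that case this sketch hands back the text unchanged.
    return $result === null ? $text : $result;
}

echo json_encode(squeeze_keep_first("a   b")), "\n";    // "a b"
echo json_encode(squeeze_keep_first("a\t\t\tb")), "\n"; // "a\tb"
echo json_encode(squeeze_keep_first("a \tb")), "\n";    // "a b"
echo json_encode(squeeze_keep_first("a\t b")), "\n";    // "a\tb"
```

A null result that goes unnoticed is one possible way for text to end up mangled or empty inside a filter, so checking for it is cheap insurance.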
Don't know about any modifiers, but this did the trick:
<?php
$text = ' Hi, my name is Andrés. ';
echo preg_replace(array('/^\s+/', '/\s+$/', '/\s{2,}/'), ' ', $text);
/*
Hi, my name is Andrés.
*/
?>
preg_replace('!\s+!', ' ', 'This sentence has extra space. This doesn’t. Extra space, Lots of extra space.');
I need to remove blank lines (with whitespace or absolutely blank) in PHP. I use this regular expression, but it does not work:
$str = ereg_replace('^[ \t]*$\r?\n', '', $str);
$str = preg_replace('^[ \t]*$\r?\n', '', $str);
I want a result of:
blahblah
blahblah
adsa
sad asdasd
will:
blahblah
blahblah
adsa
sad asdasd
// New line is required to split non-blank lines
preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $string);
The above regular expression says:
/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/
1st Capturing group (^[\r\n]*|[\r\n]+)
  1st Alternative: ^[\r\n]*
    ^ assert position at start of the string
    [\r\n]* match a single character present in the list below
      Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
      \r matches a carriage return (ASCII 13)
      \n matches a line-feed (newline) character (ASCII 10)
  2nd Alternative: [\r\n]+
    [\r\n]+ match a single character present in the list below
      Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
      \r matches a carriage return (ASCII 13)
      \n matches a line-feed (newline) character (ASCII 10)
[\s\t]* match a single character present in the list below
  Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
  \s matches any whitespace character [\r\n\t\f ]
  \t matches a tab character (ASCII 9)
[\r\n]+ match a single character present in the list below
  Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
  \r matches a carriage return (ASCII 13)
  \n matches a line-feed (newline) character (ASCII 10)
Your ereg_replace() solution should be avoided because the ereg/eregi functions are deprecated (and were removed in PHP 7). Your preg_replace() won't even compile, but if you add delimiters and set multiline mode, it will work fine:
$str = preg_replace('/^[ \t]*[\r\n]+/m', '', $str);
The m modifier allows ^ to match the beginning of a logical line rather than just the beginning of the whole string. The start-of-line anchor is necessary because without it the regex would match the newline at the end of every line, not just the blank ones. You don't need the end-of-line anchor ($) because you're actively matching the newline characters, but it doesn't hurt.
The accepted answer gets the job done, but it's more complicated than it needs to be. The regex has to match either the beginning of the string (^[\r\n]*, multiline mode not set) or at least one newline ([\r\n]+), followed by at least one newline ([\r\n]+). So, in the special case of a string that starts with one or more blank lines, they'll be replaced with one blank line. I'm pretty sure that's not the desired outcome.
But most of the time it replaces two or more consecutive newlines, along with any horizontal whitespace (spaces or tabs) that lies between them, with one linefeed. That's the intent, anyway. The author seems to expect \s to match just the space character (\x20), when in fact it matches any whitespace character. That's a very common mistake. The actual list varies from one regex flavor to the next, but at minimum you can expect \s to match whatever [ \t\f\r\n] matches.
Actually, in PHP you have a better option:
$str = preg_replace('/^\h*\v+/m', '', $str);
\h matches any horizontal whitespace character, and \v matches vertical whitespace.
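Applied to input shaped like the one in the question, this looks as follows. The sample string is reconstructed for illustration, so treat its exact blank lines as an assumption.

```php
<?php
// Sample with empty lines and lines that contain only spaces or a tab.
$str = "blahblah\n\n  \nblahblah\n\t\nadsa\n\nsad asdasd";

// ^ (with /m) anchors at each line start, \h* eats spaces and tabs,
// and \v+ removes the line break(s) that end the blank line.
$clean = preg_replace('/^\h*\v+/m', '', $str);

echo $clean, "\n";
```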
Just explode the lines of the text to an array, remove empty lines using array_filter and implode the array again.
$tmp = explode("\n", $str);
$tmp = array_filter($tmp);
$str = implode("\n", $tmp);
Or in one line:
$str = implode("\n", array_filter(explode("\n", $str)));
I don't know, but this is maybe faster than preg_replace.
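One caveat with the plain array_filter() call: without a callback it drops every "falsy" element, so a line that contains only 0 disappears, while a line of spaces is kept. A sketch of a variant with an explicit callback (the function name is hypothetical):

```php
<?php
// Hypothetical helper: drop lines that are empty or whitespace-only.
function remove_blank_lines($str) {
    $lines = explode("\n", $str);
    // Keep a line only if something remains after trimming whitespace.
    $kept = array_filter($lines, function ($line) {
        return trim($line) !== '';
    });
    return implode("\n", $kept);
}

$input = "a\n\n0\n  \nb";
echo json_encode(remove_blank_lines($input)), "\n";                               // "a\n0\nb"
echo json_encode(implode("\n", array_filter(explode("\n", $input)))), "\n";       // "a\n  \nb"
```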
The comment from Bythos from Jamie's link above worked for me:
/^\n+|^[\t\s]*\n+/m
I didn't want to strip all of the new lines, just the empty/whitespace ones. This does the trick!
There is no need to overcomplicate things. This can be achieved with a simple short regular expression:
$text = preg_replace("/(\R){2,}/", "$1", $text);
The (\R) matches all newlines.
The {2,} matches two or more occurrences.
The $1 uses the first backreference (platform specific EOL) as the replacement.
This was answered a long time ago, but it can greatly benefit from preg_replace and a much simplified pattern:
$result = preg_replace('/\s*($|\n)/', '\1', $subject);
Pattern: Remove all white-space before a new-line -or- at the end of the string.
Longest match wins:
As the white-space \s has a greedy quantifier * and contains \n consecutive empty lines are matched.
As \s contains \r as well, \r\n new-line sequences are supported, however single \r (without \n) are not.
And when $ matches the end of the buffer the backreference \1 is empty allowing to handle trailing whitespace at the very end, too.
If leading (empty) lines need to be removed as well, they have to match while not capturing, too (this was not directly asked for but could be appropriate):
$result = preg_replace('/^(?:\s*\n)+|\s*($|\n)/', '\1', $subject);
#                        '---------'
Pattern: Also remove all leading white-space (first line(s) are empty).
And if the new-line at the end of the buffer should be normalized differently (always a newline at the end instead of never), it needs to be added: . "\n".
This variant is portable to \r\n, \r and \n new-line sequences ((?>\r\n|\r|\n)) or \R:
$result = preg_replace('/^(?> |\t|\r\n|\r|\n)+|(?> |\t|\r\n|\r|\n)*($|(?>\r\n|\r|\n))/', '\1', $subject);
# or:
$result = preg_replace('/^(?:\s*\R)+|\s*($|\R)/', '\1', $subject);
Pattern: Support all new-line sequences.
This is with the downside that the new-lines can not be normalized (e.g. any of the three to \n).
Therefore, it can make sense to normalize new-lines before removing:
$result = preg_replace(['/(?>\r\n|\n|\r)/', '/\s*($|\n)/'], ["\n", '\1'], $subject);
# or:
$result = preg_replace(['/\R/u', '/\s*($|\n)/'], ["\n", '\1'], $subject);
It ships with the opportunity to do some normalization apart from the line handling.
For example removal of the trailing white-space and fixing the missing new-line at the end of file.
Then doing more advanced line normalization, for example zero empty lines at the beginning and end; otherwise not more than two consecutive empty lines:
$result = preg_replace(
['/[ \t]*($|\R)/u', '/^\n*|(\n)\n*$|(\n{3})\n+/'],
["\n" , '\1\2' ],
$subject
);
The second pattern already benefits from the replacements made by the first pattern.
The power of preg_replace here lies in wisely choosing the backreference(s) to replace with.
Also, using multiple patterns can greatly simplify things and keep the process maintainable.
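As a small sketch of the simplest pattern from this answer, combined with the "always a newline at the end" option mentioned earlier (the function name is hypothetical):

```php
<?php
// Hypothetical helper built on the first pattern of this answer.
function strip_blank_lines($subject) {
    // Remove all whitespace before a newline or before the end of the
    // string; \1 puts the newline back (it is empty at the very end).
    $result = preg_replace('/\s*($|\n)/', '\1', $subject);
    // Normalize the end of the buffer to exactly one newline.
    return $result . "\n";
}

echo json_encode(strip_blank_lines("a  \n\n\nb \n")), "\n"; // "a\nb\n"
```

Leading empty lines are not removed completely by this variant; that is what the extra ^(?:\s*\n)+ alternative shown earlier is for.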
Try this one:
$str = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\r\n", $str);
If you output this to a text file, it will give the same output in the simple Notepad, WordPad and also in text editors, for example Notepad++.
Use this:
$str = preg_replace('/^\s+\r?\n$/D', '', $str);
function trimblanklines($str) {
return preg_replace('`\A[ \t]*\r?\n|\r?\n[ \t]*\Z`','',$str);
}
This one only removes them from the beginning and end, not the middle (if anyone else was looking for this).
The accepted answer leaves an extra line-break at the end of the string. Using rtrim() will remove this final linebreak:
rtrim(preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $string));
From this answer, the following works fine for me!
$str = "<html>
<body>";
echo str_replace(array("\r", "\n"), '', $str);
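<?php
// Return only the lines that contain at least one visible character.
function del_blanklines_in_array_q($ar){
    $strip = array();
    foreach($ar as $k => $v){
        $ll = strlen($v);
        // Scan the line from its last byte backwards.
        while($ll--){
            // Anything above ASCII 32 (0x20, SPACE) counts as non-blank.
            if(ord($v[$ll]) > 32){
                $strip[] = $v;
                break;
            }
        }
    }
    return $strip;
}
// Read file $in, drop its blank lines and write the result to file $out.
function del_blanklines_in_file_q($in, $out){
    // file() keeps each line's trailing newline, so lines are written back as-is.
    $strip = del_blanklines_in_array_q(file($in));
    file_put_contents($out, $strip);
}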
$file = "file_name.txt";
$file_data = file_get_contents($file);
$file_data_after_remove_blank_line = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $file_data );
file_put_contents($file,$file_data_after_remove_blank_line);
nl2br(preg_replace('/^\v+/m', '', $r_msg))