Normalize spaces in a string? - php

I need to normalize the spaces in a string:
Remove multiple adjacent spaces
Remove spaces at the beginning and end of the string
E.g. "  my   name  is  " => "my name is"
I tried
str_replace('  ', ' ', $str);
I also tried the answers to "php Replacing multiple spaces with a single space", but that didn't work either.

Replace any occurrence of 2 or more spaces with a single space, and trim:
$str = preg_replace('/ {2,}/', ' ', trim($input));
Note: using the whitespace character class \s here is a fairly bad idea since it will match linebreaks and other whitespace that you might not expect.
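To see what that one-liner does in practice, here is a minimal sketch. The function name normalize_spaces is made up for illustration, and the type declarations assume PHP 7 or later:

```php
<?php
// Hypothetical helper wrapping the one-liner above.
function normalize_spaces(string $input): string
{
    // trim() strips leading/trailing whitespace;
    // / {2,}/ matches runs of two or more literal spaces only.
    return preg_replace('/ {2,}/', ' ', trim($input));
}

echo normalize_spaces("   my    name   is   "), "\n"; // my name is

// Tabs and newlines inside the string are left untouched:
var_dump(normalize_spaces("a  b\n\tc") === "a b\n\tc"); // bool(true)
```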

Use a regex
$text = preg_replace("~\\s{2,}~", " ", $text);

The \s approach strips away newlines too, and the / {2,}/ approach ignores tabs, as well as spaces at the beginning of a line right after a newline.
If you want to keep newlines and get a more accurate result, I'd suggest this impressive answer to a similar question, and this improvement on that answer. Following their notes, the answer to your question is:
$norm_str = preg_replace('/[^\S\r\n]+/', ' ', trim($str));
In short, this is taking advantage of double negation. Read the links to get an in-depth explanation of the trick.
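A small sketch of the double negation at work, assuming PHP 7 or later. The function name normalize_keep_newlines is hypothetical. [^\S\r\n] reads as "not (non-whitespace, CR, or LF)", which leaves whitespace other than line breaks:

```php
<?php
// Hypothetical helper: collapse horizontal whitespace, keep line breaks.
function normalize_keep_newlines(string $str): string
{
    // [^\S\r\n]+ matches runs of spaces, tabs, etc., but never \r or \n.
    return preg_replace('/[^\S\r\n]+/', ' ', trim($str));
}

$in = "  first \t line\nsecond    line  ";
var_dump(normalize_keep_newlines($in) === "first line\nsecond line"); // bool(true)
```

Note that whitespace directly after a newline is collapsed to one space rather than removed, so indented lines keep a single leading space.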

Related

How to correctly replace multiple white spaces with a single white space in PHP?

I was scouring through SO answers and found that the solution that most gave for replacing multiple spaces is:
$new_str = preg_replace("/\s+/", " ", $str);
But in many cases the whitespace in a string includes Unicode characters such as line feed, form feed, carriage return, non-breaking space, etc. This wiki says that Unicode defines twenty-five characters as whitespace.
So how do we replace all these characters as well using regular expressions?
When you pass the u modifier, \s becomes Unicode-aware. So, a simple solution is to use
$new_str = preg_replace("/\s+/u", " ", $str);
See the PHP online demo.
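As a quick check of the u modifier, here is a sketch that assumes PHP 7 or later (for the "\u{...}" string escape) and valid UTF-8 input. The sample string is invented:

```php
<?php
// U+00A0 is a non-breaking space, U+2003 is an em space.
$str = "one\u{00A0}\u{00A0}two \u{2003} three";

// With /u the subject is treated as UTF-8 and \s matches Unicode whitespace.
$new_str = preg_replace('/\s+/u', ' ', $str);

var_dump($new_str === "one two three"); // bool(true)
```

Keep in mind that with /u, preg_replace() returns null if the subject is not valid UTF-8, so it may be worth checking the result.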
The first thing to do is to read this explanation of how Unicode can be treated in regex. Coming specifically to PHP, we first need to include the PCRE modifier 'u' so that the engine treats the pattern and subject as UTF-8. So this would be:
$pattern = "/<our-pattern-here>/u";
The next thing to note is that in a PCRE pattern a Unicode character is written as \x{00A0}, where 00A0 is the hexadecimal code point of the non-breaking space. So if we want to replace consecutive non-breaking spaces with a single space we would have:
$pattern = "/\x{00A0}+/u";
$new_str = preg_replace($pattern," ",$str);
And if we were to include other types of spaces mentioned in the wiki like:
\x{000D} carriage return
\x{000C} form feed
\x{0085} next line
Our pattern becomes:
$pattern = "/[\x{00A0}\x{000D}\x{000C}\x{0085}]+/u";
But this is really not great since the regex engine will take forever to find out all combinations of these characters. This is because the characters are included in square brackets [ ] and we have a + for one or more occurrences.
A better way to get faster results is to first replace every occurrence of each of these characters with a normal space, and then replace multiple spaces with a single normal space. We remove the [ ]+ and instead separate the characters with the or operator | :
$pattern = "/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u";
$new_str = preg_replace($pattern," ",$str); // we have one-to-one replacement of character by a normal space, so 5 unicode chars give 5 normal spaces
$final_str = preg_replace("/\s+/", " ", $new_str); // multiple normal spaces now become single normal space
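A pattern that matches all Unicode whitespace is [\pZ\pC]. \pZ covers the separator characters; \pC covers control characters such as tab and newline, but also other invisible characters such as format characters. Here is a unit test to prove it.
If you're parsing user input in UTF-8 and need to normalize it, it's important to base your match on that list. So to answer your question, that would be: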
$new_str = preg_replace("/[\pZ\pC]+/u", " ", $str);
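A short sketch of that pattern on a made-up string, assuming PHP 7 or later and valid UTF-8 input. Because tab and newline fall under \pC, they are collapsed as well:

```php
<?php
// NBSP (U+00A0) and em space (U+2003) are in \pZ; tab and newline are in \pC.
$str = "a\u{00A0}\u{2003}b\t\nc";

$new_str = preg_replace('/[\pZ\pC]+/u', ' ', $str);

var_dump($new_str === "a b c"); // bool(true)
```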

PHP Trim() of everything that is not "text"

I have a source of strings that typically looks like this
word1
phrase with more words than one
a phrase prefaced by whitespace that is not whitespace in code
wordX
NOTE! The whitespace before the words and phrases looks like whitespace to the naked eye but is not removed by trim().
Is there any way to use either trim() or preg_replace() to KEEP the whitespace within the phrases but trim what is outside them (which looks like whitespace but isn't)?
EDIT: I have no idea what character the whitespace-looking gaps before and after the words and phrases are.
This will replace all whitespace characters (spaces, tabs, and line breaks) with a single space:
$output = preg_replace('!\s+!', ' ', $input);
EDIT:
For the first-whitespace, you can either trim() it, or use this instead:
$output = preg_replace('!^\s+!', '', preg_replace('!\s+!', ' ', $input));
I think it could be done as a single regex; if a regex guru manages to do it, I'd want that person's answer to be accepted instead.
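If the invisible characters turn out to be something like a non-breaking space, one possible single-regex trim is sketched below. This is an assumption about the input, not something the question confirms. The function name unicode_trim is hypothetical, the [\pZ\pC] class is borrowed from the Unicode discussion earlier on this page, and PHP 7 or later is assumed:

```php
<?php
// Hypothetical helper: strip Unicode separators and control characters
// from both ends, leaving whitespace inside the phrase alone.
function unicode_trim(string $s): string
{
    // preg_replace() returns null on invalid UTF-8; fall back to the input.
    return preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $s) ?? $s;
}

$raw = "\u{00A0}\u{00A0}phrase with more words than one\u{00A0}";

// Dumping the leading bytes as hex helps identify the mystery character.
echo bin2hex(substr($raw, 0, 2)), "\n"; // c2a0 (UTF-8 for U+00A0)

var_dump(unicode_trim($raw) === "phrase with more words than one"); // bool(true)
```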

How to remove multiple spaces and new lines from a string in PHP?

I have a form with a text area, and I need to remove any multiple spaces and multiple new lines from the string entered there.
I have written this function to remove the multiple spaces:
function fix_multi_spaces($string)
{
    $reg_exp = '/\s+/';
    return preg_replace($reg_exp, " ", $string);
}
This function works well for spaces, but it also replaces the new lines, turning them into a single space.
I need to change multiple spaces into 1 space and multiple new lines into 1 new line.
How can I do this?
Use
preg_replace('/(( )+|(\\n)+)/', '$2$3', $string);
This will work specifically for spaces and newlines; you will have to add other whitespace characters (such as \t for tabs) to the regex if you want to target them as well.
This regex works by matching either one or more spaces or one or more newlines and replacing the match with a space (but only if spaces were matched) and a newline (but only if newlines were matched).
Update: Turns out there's some regex functionality tailored for such cases which I didn't know about (many thanks to craniumonempty for the comment!). You can write the regex perhaps more appropriately as
preg_replace('/(?|( )+|(\\n)+)/', '$1', $string);
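A quick run of the branch-reset version on an invented sample string. Inside (?| ... ) both alternatives capture into group 1, so '$1' is a space when spaces matched and a newline when newlines matched:

```php
<?php
$string = "one   two\n\n\nthree  four";

// In a single-quoted PHP string, '\\n' reaches the regex engine as \n.
$result = preg_replace('/(?|( )+|(\\n)+)/', '$1', $string);

var_dump($result === "one two\nthree four"); // bool(true)
```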
As you know, \s in a regex matches all whitespace: spaces, newlines, tabs, etc.
If you would like to replace multiple spaces with one and multiple newlines with one, you would have to rewrite the function to call preg_replace twice, once replacing spaces and once replacing newlines...
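The two-call idea could look roughly like this. It is only a sketch: the function name is hypothetical, PHP 7 or later is assumed, and the optional \r is an addition on the assumption that the textarea submits Windows-style \r\n line endings:

```php
<?php
// Hypothetical helper: one preg_replace() call per kind of whitespace.
function collapse_spaces_and_newlines(string $s): string
{
    // First pass: runs of two or more spaces become one space.
    $s = preg_replace('/ {2,}/', ' ', $s);
    // Second pass: runs of line breaks become one; $1 keeps the last
    // line break of the run, so \r\n stays \r\n and \n stays \n.
    return preg_replace('/(\r?\n)+/', '$1', $s);
}

$in = "a   b\r\n\r\n\r\nc  d\n\ne";
var_dump(collapse_spaces_and_newlines($in) === "a b\r\nc d\ne"); // bool(true)
```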
You can use the following function to replace newlines, tabs, and multiple spaces with a single space:
function test($content_area){
    // Newlines and tabs to a single space
    $content_area = str_replace(array("\r\n", "\r", "\n", "\t"), ' ', $content_area);
    // Multiple spaces to a single space (preg_replace, since ereg_replace was removed in PHP 7)
    $content_area = preg_replace('/ {2,}/', ' ', $content_area);
    return $content_area;
}

Regex to add spacing between sentences in a string in php

I use a Spanish dictionary API that returns definitions with small issues. This specific problem happens when the definition has more than one sentence. Sometimes the sentences are not properly separated by a space character, so I receive something like this:
This is a sentence.Some other sentence.Sometimes there are no spaces between dots. See?
I'm looking for a regex that replaces "." with ". " when the dot is immediately followed by a character other than a space. The preg_replace() should return:
This is a sentence. Some other sentence. Sometimes there are no spaces between dots. See?
So far I have this:
echo preg_replace('/(?<=[a-zA-Z])[.]/','. ',$string);
The problem is that it also adds a space when there is already a space after the dot. Any ideas? Thanks!
Try this regular expression:
echo preg_replace('/(?<!\.)\.(?!(\s|$|\,|\w\.))/', '. ', $string);
echo preg_replace( '/\.([^, ])/', '. $1', $string);
It works!
You just need to add a look-ahead so that a space is added only if the next character is not whitespace and the dot is not at the end of the string:
$string = preg_replace('/(?<=[a-zA-Z])[.](?!\s|$)/','. ',$string);
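Here is a sketch of the look-ahead idea applied to the sample text from the question. The look-ahead is written as (?!\s|$) because inside a character class $ is a literal dollar sign, not an end-of-string anchor:

```php
<?php
$string = 'This is a sentence.Some other sentence.Sometimes there are no spaces between dots. See?';

// Add a space after a dot that follows a letter, unless whitespace
// or the end of the string comes next.
$fixed = preg_replace('/(?<=[a-zA-Z])\.(?!\s|$)/', '. ', $string);

echo $fixed, "\n";
// This is a sentence. Some other sentence. Sometimes there are no spaces between dots. See?
```

Since [a-zA-Z] does not match accented letters, Spanish text may need \p{L} together with the u modifier instead. The pattern will also split things like domain names ("example.com"), so treat it as a starting point.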

regex: delete white characters

I am trying to collapse runs of more than one whitespace character in my string:
$content = preg_replace('/\s+/', " ", $content); //in some cases it doesn't work
but when I wrote
$content = preg_replace('/\s\s+/', " ", $content); //works fine
Could somebody explain why?
When I write /\s+/ it should match every run of one or more whitespace characters, so why doesn't it work?
Thanks
What is the minimum number of whitespace characters you want to match?
\s+ is equivalent to \s\s* -- one mandatory whitespace character followed by any number more of them.
\s\s+ is equivalent to \s\s\s* -- two mandatory whitespace characters followed by any number more (if this is what you want, it might be clearer as \s{2,}).
Also note that $content = preg_replace('/\s+/', " ", $content); will replace any single spaces in $content with a single space. In other words, if your string only contains single spaces, the result will be no change.
I just wanted to add that the reason your /\s+/ worked sometimes and not others is that regex quantifiers are greedy by default: \s+ tries to match one or more whitespace characters, and as many as it can. I think that is where you got tripped up in finding a solution.
Sorry I'm not yet able to add comments, or I would have just added this comment to Daniel's answer, which is good.
Are you using the Ungreedy option (/U)? It doesn't say so in your code, but if so, it would explain why the first preg_replace() is replacing each single space with a single space (no change). In that case, the second preg_replace() would be replacing each double space with a single space. If you try the second one on a string of four spaces and the result is a double space, I would suspect ungreediness.
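The four-space experiment described above can be run directly. This sketch uses an invented sample string with exactly four spaces between the letters:

```php
<?php
$four = "a    b"; // four spaces between a and b

// Greedy (default): \s+ takes the whole run in one match.
var_dump(preg_replace('/\s+/', ' ', $four));    // string(3) "a b"

// Ungreedy (/U): \s+ stops after one character, so each space
// is replaced by a single space and nothing changes.
var_dump(preg_replace('/\s+/U', ' ', $four));   // string(6) "a    b"

// Ungreedy \s\s+ stops after two characters, so four spaces become two.
var_dump(preg_replace('/\s\s+/U', ' ', $four)); // string(4) "a  b"
```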
try preg_replace("/([\s]{2,})/", " ", $text)
