PHP Trim() of everything that is not "text" - php

I have a source of strings that typically looks like this
word1
phrase with more words than one
a phrase prefaced by whitespace that is not whitespace in code
wordX
NOTE! The whitespace before the words and phrases comes out as whitespace to the naked eye but is not being trimmed by using "trim()".
Is there any way to use either Trim() or preg_replace() to KEEP the whitespaces within the phrases but trim it outside (which looks like whitespaces but isn't).
EDIT: I have no idea what "char" the whitespacelooking spaces before and after the words and phrases are.

This will replace all whitespace characters (spaces, tabs, and line breaks) to a single space:
$output = preg_replace('!\s+!', ' ', $input);
EDIT:
For the first-whitespace, you can either trim() it, or use this instead:
$output = preg_replace('!^\s+!', '', preg_replace('!\s+!', ' ', $input));
I think it could be done as a single RegExp, if a RegExpu guru manages to do it, I'd want this person to have his answer accepted instead.

Related

Multiple preg_replace on a variable

It is safe to use multiple preg_replace and str_replace on a variable?
$this->document->setDescription(tokenTruncate(str_replace(array("\r", "\n"), ' ', preg_replace( '/\s+/', ' ',preg_replace("/[^\w\d ]/ui", ' ', $custom_meta_description))),160));
This is a code which I am using to remove newlines, whitespaces and all non-alphanumeric characters (excluding unicode). The last preg_replace is for the non-alphanumeric characters, but dots are removed too. Is there any way to keep dots, commas, - separators?
Thanks!
What you want can be done in a single expression:
preg_replace('/(?:\s|[^\w.,-])+/u', ' ', $custom_meta_description);
It replaces either spaces (tabs, newlines as well) or things that aren't word-like, digits or punctuation.
What you're trying to do can be achieved with a single preg_replace statement:
$str = preg_replace('#\P{Xwd}++#', '', $str);
$this->document->setDescription($desc, tokenTruncate($str, 160));
The above preg_replace() statement will replace anything that's not a Unicode digit, letter or whitespace from the supplied string.
See the Unicode Reference for more details.

How to remove more than one whitespace

Hello guys I currently have a problem with my preg_replace :
preg_replace('#[^a-zA-z\s]#', '', $string)
It keeps all alphabetic letters and white spaces but I want more than one white space to be reduced to only one. Any idea how this can be done ?
$output = preg_replace('!\s+!', ' ', $input);
From Regular Expression Basic Syntax Reference
\d, \w and \s
Shorthand character classes matching digits, word characters (letters, digits, and underscores), and whitespace (spaces, tabs, and line breaks). Can be used inside and outside character classes.
The character type \s stands for five different characters: horizontal tab (9), line feed (10), form feed (12), carriage return (13) and ordinary space (32). The following code will find every substring of $string which is composed entirely of \s. Only the first \s in the substring will be preserved. For example, if line feed, horizontal tab and ordinary space occur immediately after one another in a substring, line feed alone will remain after the replacement is done.
$string = preg_replace('#(\s)\s+#', '\1', $string);
preg_replace(array('#\s+#', '#[^a-zA-z\s]#'), array(' ', ''), $string);
Though it will replace all of whitespaces with spaces. If you want to replace consequent whitespaces (like two newlines with only one newline) - you should figure out logic for that, coz \s+ will match "\n \n \n" (5 whitespaces in a row).
try using trim instead
<?php
$something = " Error";
echo $something."\n";
echo "------"."\n";
echo trim($something);
?>
output
Error
------
Error
Question is old and miss some details. Let's assume OP wanted to reduce all consecutive horizontal whitespaces and replace by a space.
Exemple:
"\t\t \t \t" => " "
"\t\t \t\t" => "\t \t"
One possible solution would be simply to use the generic character type \h which stands for horizontal whitespace space:
preg_replace('/\h+/', ' ', $text)

Normalize spaces in a string?

I need to normalize the spaces in a string:
Remove multiple adjacent spaces
Remove spaces at the beginning and end of the string
E.g. " my name is " => my name is
I tried
str_replace(' ',' ',$str);
I also tried php Replacing multiple spaces with a single space but that didn't work either.
Replace any occurrence of 2 or more spaces with a single space, and trim:
$str = preg_replace('/ {2,}/', ' ', trim($input));
Note: using the whitespace character class \s here is a fairly bad idea since it will match linebreaks and other whitespace that you might not expect.
Use a regex
$text = preg_replace("~\\s{2,}~", " ", $text);
The \s approach strips away newlines too, and / {2,}/ approach ignores tabs and spaces at beginning of line right after a newline.
If you want to save newlines and get a more accurate result, I'd suggest this impressive answer to similar question, and this improvement of the previous answer. According to their note, the answer to your question is:
$norm_str = preg_replace('/[^\S\r\n]+/', ' ', trim($str));
In short, this is taking advantage of double negation. Read the links to get an in-depth explanation of the trick.

How to remove multiple spaces and new lines from a string in PHP?

I have a form with a text area, I need to remove from the string entered here eventuals multiple spaces and multiple new lines.
I have written this function to remove the multiple spaces
function fix_multi_spaces($string)
{
$reg_exp = '/\s+/';
return preg_replace($reg_exp," ",$string);
}
This function works good for spaces, but it also replace the new lines changing them into a single space.
I need to change multiple spaces into 1 space and multiple new lines into 1 new line.
How can I do?
Use
preg_replace('/(( )+|(\\n)+)/', '$2$3', $string);
This will work specifically for spaces and newlines; you will have to add other whitespace characters (such as \t for tabs) to the regex if you want to target them as well.
This regex works by matching either one or more spaces or one or more newlines and replacing the match with a space (but only if spaces were matched) and a newline (but only if newlines were matched).
Update: Turns out there's some regex functionality tailored for such cases which I didn't know about (many thanks to craniumonempty for the comment!). You can write the regex perhaps more appropriately as
preg_replace('/(?|( )+|(\\n)+)/', '$1', $string);
You know that \s in regex is for all whitepsaces, this means spaces, newlines, tab etc.
If You would like to replace multiple spaces by one and multiple newlines by one, You would have to rwrite the function to call preg_replace twice - once replacing spaces and once replacing newlines...
You can use following function for replace multiple space and lines with single space...
function test($content_area){
//Newline and tab space to single space
$content_area = str_replace(array("\r\n", "\r", "\n", "\t"), ' ', $content_area);
// Multiple spaces to single space ( using regular expression)
$content_area = ereg_replace(" {2,}", ' ',$content_area);
return $content_area;
}

Matching duplicate whitespace with preg_replace

I'm writing a WordPress plugin, and one of the features is removing duplicate whitespace.
My code looks like this:
return preg_replace('/\s\s+/u', ' ', $text, -1, $count);
I don't understand why I need the u
modifier. I've seen other plugins
that use preg_replace and don't
need to modify it for Unicode. I
believe I have a default installation
of WordPress .
Without the modifier, the code
replaces all the spaces with Unicode
replacement glyphs instead of spaces.
With the u modifier, I don't get
the glyphs, and it doesn't replace all the whitespace.
Each space below has from 1-10 spaces. The regex only removes on space from each group.
Before:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
After:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
$count = 9
How can I make the regex replace the whole match with the one space?
Update: If I try this with regular php, it works fine
$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
It only breaks when I use it within the wordpress plugin.
I'm using this function in a filter:
function jje_test( $text ) {
$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
echo "Count: $count";
return $new_text;
}
add_filter('the_content', 'jje_test');
I have tried:
Removing all other filters on the_content
remove_all_filters('the_content');
Changing the priority of the filter added to the_content, earlier or later
All kinds of permutations of \s+, \s\s+, [ ]+ etc.
Even replacing all single spaces with an empty string, will not replace the spaces
This will replace all sequences of two or more spaces, tabs, and/or line breaks with a single space:
return preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);
You need the /u flag if $text holds text encoded as UTF-8. Even if there are no Unicode characters in your regex, PCRE has to interpret $text correctly.
I added \p{Z} to the character class because PCRE only matches ASCII characters when using shorthands such as \s, even when using /u. Adding \p{Z} makes sure all Unicode whitespace is matched. There might be other spaces such as non-breaking spaces in your string.
I'm not sure if using echo in a WordPress filter is a good idea.
The u modifier simply puts it into UTF-8 mode, which is useful if you need to do anything specific with characters that have a code point above 0x7f. You can still work on UTF-8 encoded strings without using that modifier, you just won't be able to specifically match or transform such characters easily.
There are some whitespace characters in Unicode that are above 0x7f. It's pretty rare to encounter them in most data. But you may see, for example, a non-breaking space character, which is unicode \uA0, or some rarer characters.
I don't know why using it would cause Unicode "replacement" glyphs to be output. I'd say it would be a problem elsewhere... what character encoding are you outputting your script as?
To answer jjeaton's follow-up question in the comments to my first reply, the following replaces each sequence of spaces, tabs, and/or line breaks with the first character in that sequence. Effectively, this deletes the second and following whitespace characters in each sequence of two or more whitespace characters. A run of spaces is replaced with a single space, a run of tabs is replaced with a single tab, etc. A run of a space and a tab (in that order) is replaced with a space, and a run of a tab and a space is replaced with a tab, etc.
return preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);
This regex works by first matching one space and capturing it with a capturing group, followed by one or more spaces. The replacement text is simply reinserts the text matched byt the first (and only) capturing group.
Don't know about any modifiers, but this did the trick:
<?php
$text = ' Hi, my name is Andrés. ';
echo preg_replace(array('/^\s+/', '/\s+$/', '/\s{2,}/'), ' ', $text);
/*
Hi, my name is Andrés.
*/
?>
preg_replace('!\s+!', ' ', 'This sentence has extra space. This doesn’t. Extra space, Lots of extra space.');

Categories