How can I identify what characters are showing up as whitespace in a string?
The string is (There's actually a blank line before it, but it doesn't show up in StackOverflow's parser):
\n
\n
\n
When I paste it in regex101.com to try to add a regex to eliminate this spacing/characters it pastes as:
... which explains why trim() is not seeing it as empty spaces.
How can I find out which characters are producing these bullets so I can trim them?
What I would do is split the string and print the numeric value of each byte:
$str = str_split('your string here');
foreach ($str as $char) echo ord($char), ' ';
You'll then have the code of each byte (str_split and ord work on bytes, so a multi-byte UTF-8 character shows up as several numbers). You can work backwards from there to identify the character.
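A minimal sketch of the same idea for multi-byte text, assuming the string is valid UTF-8. The helper name char_hex_dump is hypothetical; it splits on characters rather than bytes and returns the hex bytes of each one, so a non-breaking space shows up as the single entry c2a0.

```php
<?php
// Hypothetical helper (not from the original answer): list the hex bytes of
// each UTF-8 character so invisible characters can be identified.
function char_hex_dump(string $str): array
{
    // The u modifier makes preg_split() split between characters, not bytes.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // preg_split() fails on invalid UTF-8; fall back to single bytes.
        $chars = str_split($str);
    }
    return array_map('bin2hex', $chars);
}

// "\u{00A0}" (PHP 7+ syntax) is a non-breaking space, encoded in UTF-8 as c2 a0.
echo implode(' ', char_hex_dump("a\u{00A0}b")), "\n"; // prints "61 c2a0 62"
```

Looking up the hex bytes (or the code point they encode) then tells you which character to add to your trim or regex.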
I was scouring through SO answers and found that the solution that most gave for replacing multiple spaces is:
$new_str = preg_replace("/\s+/", " ", $str);
But in many cases the whitespace includes Unicode characters such as line feed, form feed, carriage return, non-breaking space, etc. This wiki explains that Unicode defines twenty-five characters as whitespace.
So how do we replace all these characters as well using regular expressions?
When the u modifier is passed, \s becomes Unicode-aware. So, a simple solution is to use
$new_str = preg_replace("/\s+/u", " ", $str);
                             ^^
See the PHP online demo.
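A short sketch of what that looks like in practice, assuming a PHP build where the u modifier also enables Unicode character properties (the sample string is made up for illustration):

```php
<?php
// Between "a" and "b": a non-breaking space (U+00A0), an em space (U+2003),
// an ordinary space and a line feed.
$str = "a\u{00A0}\u{2003} \nb";

// With the u modifier the whole run collapses into one ordinary space.
$collapsed = preg_replace('/\s+/u', ' ', $str);

var_dump($collapsed === 'a b');
```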
The first thing to do is to read this explanation of how Unicode can be treated in regex. Coming specifically to PHP, we first of all need to include the PCRE modifier 'u' so that the engine treats the pattern and subject as UTF-8. So this would be:
$pattern = "/<our-pattern-here>/u";
The next thing to note is that in PHP patterns a Unicode character is written as \x{00A0}, where 00A0 is the hex code point of the non-breaking space. So if we want to replace consecutive non-breaking spaces with a single space we would have:
$pattern = "/\x{00A0}+/u";
$new_str = preg_replace($pattern," ",$str);
And if we were to include other types of spaces mentioned in the wiki like:
\x{000D} carriage return
\x{000C} form feed
\x{0085} next line
Our pattern becomes:
$pattern = "/[\x{00A0}\x{000D}\x{000C}\x{0085}]+/u";
It may look as if this forces the regex engine to try all combinations of these characters, because they sit in square brackets [ ] followed by a + for one or more occurrences. In fact a character class followed by + is matched in a single pass, so this pattern is not slow.
An alternative, if you prefer to keep the steps separate, is to first replace every occurrence of each of these characters with a normal space, and then replace multiple spaces with a single normal space. We remove the [ ]+ and instead separate the characters with the or operator | :
$pattern = "/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u";
$new_str = preg_replace($pattern," ",$str); // we have one-to-one replacement of character by a normal space, so 5 unicode chars give 5 normal spaces
$final_str = preg_replace("/\s+/", " ", $new_str); // multiple normal spaces now become single normal space
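The two variants discussed above can be compared side by side. This is only a sketch: the function names collapse_two_pass and collapse_one_pass are hypothetical, and the one-pass class adds \s so that ordinary spaces and line feeds are folded in as well.

```php
<?php
// Two-pass version: map each special character to a space,
// then collapse runs of whitespace.
function collapse_two_pass(string $str): string
{
    $spaced = preg_replace('/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u', ' ', $str);
    return preg_replace('/\s+/', ' ', $spaced);
}

// Single-pass version: one character class, one replacement.
function collapse_one_pass(string $str): string
{
    return preg_replace('/[\s\x{00A0}\x{000D}\x{000C}\x{0085}]+/u', ' ', $str);
}

// Two non-breaking spaces, then a CR LF pair.
$input = "one\u{00A0}\u{00A0}two\r\nthree";

var_dump(collapse_two_pass($input) === collapse_one_pass($input));
```

Both functions return "one two three" for this input, so the choice between them is mostly a matter of readability.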
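A pattern that matches all Unicode whitespace is [\pZ\pC]: \pZ covers the separator characters and \pC the control and other invisible characters, so it also matches invisible non-whitespace such as format characters. Here is a unit test to prove it.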
If you're parsing user input in UTF-8 and need to normalize it, it's important to base your match on that list. So to answer your question that would be:
$new_str = preg_replace("/[\pZ\pC]+/u", " ", $str);
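As a quick illustration (the sample string is an assumption, not taken from the unit test mentioned above), here is the pattern applied to a mix of separator, format and control characters:

```php
<?php
// U+200B (zero-width space) is a format character (\pC), U+00A0 is a
// separator (\pZ), and the tab is a control character (\pC).
$str = "a\u{200B}\u{00A0}\tb";

$new_str = preg_replace('/[\pZ\pC]+/u', ' ', $str);

var_dump($new_str === 'a b');
```

Note that \pC also covers the line feed, so line breaks are collapsed along with everything else.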
I'm using the following regex to remove all invisible characters from an UTF-8 string:
$string = preg_replace('/\p{C}+/u', '', $string);
This works fine, but how do I alter it so that it removes all invisible characters EXCEPT newlines? I tried some stuff using [^\n] etc. but it doesn't work.
Thanks for helping out!
Edit: newline character is '\n'
Use a "double negation":
$string = preg_replace('/[^\P{C}\n]+/u', '', $string);
Explanation:
\P{C} is the same as [^\p{C}].
Therefore [^\P{C}] is the same as \p{C}.
Since we now have a negated character class, we can subtract other characters like \n from it simply by listing them inside the class.
By using a negative assertion you can match a character class except what the assertion matches, so:
$res = preg_replace('/(?!\n)\p{C}/u', '', $input);
(PHP's dialect of regular expressions doesn't support character class subtraction which would, otherwise, be another approach: [\p{C}-[\n]].)
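A small sketch comparing the double negation and the negative lookahead on the same made-up input; both should strip the invisible characters while keeping the newline:

```php
<?php
// Zero-width space (U+200B) and tab are in \p{C}; so is the newline,
// but both patterns exclude it from the match.
$input = "a\u{200B}\nb\tc";

$viaNegatedClass = preg_replace('/[^\P{C}\n]+/u', '', $input);
$viaLookahead    = preg_replace('/(?!\n)\p{C}/u', '', $input);

var_dump($viaNegatedClass === "a\nbc");
var_dump($viaLookahead === "a\nbc");
```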
Before you do it, replace newlines (I suppose you are using something like \n) with a random string like ++++++++ (any string that will not be removed by your regular expression and does not naturally occur in your string in the first place), then run your preg_replace, then replace ++++++++ with \n again.
$string=str_replace("\n",'++++++++',$string); //Replace newlines (double quotes, so \n is a real line feed)
$string=preg_replace('/\p{C}+/u', '', $string); //Use your regexp
$string=str_replace('++++++++',"\n",$string); //Insert newlines again
That should do. If you are using <br/> instead of \n simply use nl2br to preserve line breaks and replace <br/> instead of \n
I have this code for saving line breaks in text area input to database:
$input = preg_replace("/(\n)+/m", '\n', $input);
On examining the input in the database, the line breaks are actually saved.
But the problem is when I want to echo it out, the line breaks do not appear
in the output, how do I preserve the line breaks in input and also echo them
out. I don't want to use <pre></pre>.
You are replacing actual sequences of newlines in the $input with the literal two character sequence \n (backslash + n) - not a newline. These need to be converted back to newlines when read from the database. Although I suspect you intend to keep these actual newlines in the database and should be using a double quoted string instead...
$input = preg_replace('/(\n)+/m', "\n", $input);
Note that I have replaced the first string delimiters with single quotes and the second string with double quotes. \n is a special sequence in a regular expression to indicate a newline (ASCII 10), but it is also a character escape in PHP to indicate the same - the two can sometimes conflict.
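The difference between the two quoting styles is easy to check directly. A minimal sketch (the sample input is made up):

```php
<?php
// Single quotes: backslash followed by n, two characters.
var_dump(strlen('\n')); // int(2)
// Double quotes: one line feed character (ASCII 10).
var_dump(strlen("\n")); // int(1)
var_dump(ord("\n"));    // int(10)

// Collapsing blank lines while keeping real newlines in the result.
$input = "first\n\n\nsecond";
$input = preg_replace('/\n+/', "\n", $input);
var_dump($input === "first\nsecond"); // bool(true)
```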
I think PHP has the solution:
nl2br — Inserts HTML line breaks before all newlines in a string
Edit: You may need to replace CR / LF to make it work properly.
// Convert line endings to Unix style (NL)
$input = str_replace(array("\r\n", "\r"), "\n", $input);
// Remove multiple lines
$input = preg_replace('/(\n)+/m', "\n", $input);
// Print formatted text
echo nl2br($input);
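Put together as one function, the steps might look like the sketch below. The name format_textarea_output is hypothetical, and the htmlspecialchars() call is an addition that assumes the text is untrusted user input being printed into an HTML page.

```php
<?php
function format_textarea_output(string $input): string
{
    // Convert Windows and old Mac line endings to Unix style.
    $input = str_replace(array("\r\n", "\r"), "\n", $input);
    // Collapse runs of newlines into a single real newline.
    $input = preg_replace('/\n+/', "\n", $input);
    // Assumption: escape HTML before output so user markup is not rendered.
    $input = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
    // Insert <br /> before each remaining newline.
    return nl2br($input);
}

echo format_textarea_output("a\r\n\r\n<b>");
```

For the sample input this prints "a<br />", a newline, then "&lt;b&gt;".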
I am trying to get my site feed working.
I need to select some content and display it in my feed. After selecting, I strip tags and then display.
The problem is this:
The data still displays as if the tags still exist (but with no visible HTML tag), e.g. after stripping, in my source I'll have:
Hello (just illustrating)
----There will be a gap in between as if an HTML character still exists, but I can't see any when I view my source-----
Hi
How can I fix this? Thanks
EDIT:
To make it clearer, after stripping i still get text like this:
This is my first line
This is my second line with a gap in between the first line and second line as if there is a paragraph tag
UPDATE
I am using this:
$body=substr(strip_tags(preg_replace('/\n{2,}/',"\n",$row["post_content"])),0,150);
When I echo $body, it still keeps the new lines.
You may have a \n left over at the end of the paragraphs, after the closing tags you stripped.
preg_replace('/[\p{Z}\s]{2,}/u',' ',$string);
will replace every run of two or more whitespace characters (spaces, tabs, new lines) with a single space. The u modifier is needed so that \p{Z} is applied to UTF-8 characters rather than to single bytes.
For reference, the .NET regex documentation defines \s like this: \s Matches any white-space character. Equivalent to the Unicode character categories [\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \s is equivalent to [ \f\n\r\t\v].
strip_tags will literally strip the tags, leaving any other whitespace behind.
You could get rid of extra newlines and whitespace with regular expressions, but depending on your content, you might mangle it.
Collapse repeated newlines:
$string = preg_replace('/\n{2,}/',"\n",$string);
Remove extra spaces:
$string = preg_replace('/ {2,}/',' ',$string);
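For a feed excerpt where line breaks do not matter, the stripping and the cleanup can be combined. This is a sketch under that assumption; the function name plain_text_excerpt is hypothetical and the sample markup is made up.

```php
<?php
function plain_text_excerpt(string $html): string
{
    // strip_tags() removes the tags but leaves the whitespace between them.
    $text = strip_tags($html);
    // Collapse every run of whitespace (including newlines) to one space.
    // preg_replace() returns null on invalid UTF-8, so fall back to $text.
    $text = preg_replace('/\s+/u', ' ', $text) ?? $text;
    return trim($text);
}

echo plain_text_excerpt("<p>Hello</p>\n\n<p>Hi</p>"); // prints "Hello Hi"
```

If single line breaks should survive, use the two separate replacements above instead of \s+.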
I was experiencing something annoyingly similar. Solved it with trim:
$body=strip_tags(trim($row["post_content"]));
I'm writing a WordPress plugin, and one of the features is removing duplicate whitespace.
My code looks like this:
return preg_replace('/\s\s+/u', ' ', $text, -1, $count);
I don't understand why I need the u modifier. I've seen other plugins that use preg_replace and don't need to modify it for Unicode. I believe I have a default installation of WordPress.
Without the modifier, the code replaces all the spaces with Unicode replacement glyphs instead of spaces.
With the u modifier, I don't get the glyphs, and it doesn't replace all the whitespace.
Each gap below has from 1-10 spaces. The regex only removes one space from each group.
Before:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
After:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
$count = 9
How can I make the regex replace the whole match with the one space?
Update: If I try this with regular php, it works fine
$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
It only breaks when I use it within the wordpress plugin.
I'm using this function in a filter:
function jje_test( $text ) {
    $new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
    echo "Count: $count";
    return $new_text;
}
add_filter('the_content', 'jje_test');
I have tried:
Removing all other filters on the_content
remove_all_filters('the_content');
Changing the priority of the filter added to the_content, earlier or later
All kinds of permutations of \s+, \s\s+, [ ]+ etc.
Even replacing all single spaces with an empty string, will not replace the spaces
This will replace all sequences of two or more spaces, tabs, and/or line breaks with a single space:
return preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);
You need the /u flag if $text holds text encoded as UTF-8. Even if there are no Unicode characters in your regex, PCRE has to interpret $text correctly.
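I added \p{Z} to the character class because PCRE by default only matches ASCII characters when using shorthands such as \s, even in UTF-8 mode. (Recent PHP versions also turn on Unicode properties for \s when /u is given, so there the extra \p{Z} is harmless but redundant.) Adding \p{Z} makes sure all Unicode whitespace is matched. There might be other spaces such as non-breaking spaces in your string.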
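A small sketch of that pattern on a made-up string, also using the $count argument from the question:

```php
<?php
// Two ordinary spaces, then two non-breaking spaces (U+00A0),
// then a single space that should be left alone.
$text = "a  b\u{00A0}\u{00A0}c d";

$result = preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text, -1, $count);

var_dump($result === 'a b c d'); // bool(true)
var_dump($count);                // int(2)
```

Only the two runs of length two are replaced; the single space between "c" and "d" does not match {2,} and is not counted.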
I'm not sure if using echo in a WordPress filter is a good idea.
The u modifier simply puts it into UTF-8 mode, which is useful if you need to do anything specific with characters that have a code point above 0x7f. You can still work on UTF-8 encoded strings without using that modifier, you just won't be able to specifically match or transform such characters easily.
There are some whitespace characters in Unicode that are above 0x7f. It's pretty rare to encounter them in most data. But you may see, for example, a non-breaking space character, which is Unicode U+00A0, or some rarer characters.
I don't know why using it would cause Unicode "replacement" glyphs to be output. I'd say it would be a problem elsewhere... what character encoding are you outputting your script as?
To answer jjeaton's follow-up question in the comments to my first reply, the following replaces each sequence of spaces, tabs, and/or line breaks with the first character in that sequence. Effectively, this deletes the second and following whitespace characters in each sequence of two or more whitespace characters. A run of spaces is replaced with a single space, a run of tabs is replaced with a single tab, etc. A run of a space and a tab (in that order) is replaced with a space, and a run of a tab and a space is replaced with a tab, etc.
return preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);
This regex works by first matching one whitespace character and capturing it with a capturing group, followed by one or more further whitespace characters. The replacement text simply reinserts the text matched by the first (and only) capturing group.
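To see the "keep the first character" behaviour, here is a minimal sketch on a made-up string that mixes tabs and spaces:

```php
<?php
// First run: tab, tab, space. Second run: space, tab.
$text = "a\t\t b \tc";

$result = preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);

// The first run keeps its leading tab, the second keeps its leading space.
var_dump($result === "a\tb c"); // bool(true)
```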
Don't know about any modifiers, but this did the trick:
<?php
$text = ' Hi, my name is Andrés. ';
echo preg_replace(array('/^\s+/', '/\s+$/', '/\s{2,}/'), ' ', $text);
/*
Hi, my name is Andrés.
*/
?>
preg_replace('!\s+!', ' ', 'This sentence has extra space. This doesn’t. Extra space, Lots of extra space.');