UTF 8 String remove all invisible characters except newline


I'm using the following regex to remove all invisible characters from a UTF-8 string:
$string = preg_replace('/\p{C}+/u', '', $string);
This works fine, but how do I alter it so that it removes all invisible characters EXCEPT newlines? I tried some stuff using [^\n] etc. but it doesn't work.
Thanks for helping out!
Edit: newline character is '\n'

Use a "double negation":
$string = preg_replace('/[^\P{C}\n]+/u', '', $string);
Explanation:
\P{C} is the same as [^\p{C}].
Therefore [^\P{C}] is the same as \p{C}.
Since we now have a negated character class, we can subtract other characters like \n from it.
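As a quick check of the double negation, here is a minimal sketch. The sample string is made up for illustration and assumes PHP 7 or later for the \u{...} escape:

```php
<?php
// Hypothetical sample input: a zero-width space (U+200B, category Cf),
// a BEL control character (0x07, category Cc), and a newline to keep.
$string = "foo\u{200B}bar\x07\nbaz";

// [^\P{C}\n] reads as "not (a non-C character or \n)",
// i.e. any \p{C} character except the newline.
$clean = preg_replace('/[^\P{C}\n]+/u', '', $string);

var_dump($clean === "foobar\nbaz"); // bool(true)
```

Note that \r and \t are also in category C, so they are removed unless they are added to the class as well.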

By using a negative lookahead assertion you can match a character class except what the assertion matches, so (note the u modifier, which is needed for UTF-8 input):
$res = preg_replace('/(?!\n)\p{C}/u', '', $input);
(PHP's dialect of regular expressions doesn't support character class subtraction which would, otherwise, be another approach: [\p{C}-[\n]].)
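A small comparison of the lookahead form with the negated-class form, as a sketch. The input string is an assumption for illustration, and both patterns use the u modifier because the input is UTF-8:

```php
<?php
// Hypothetical input: zero-width space, NUL byte, newline, and a BOM (U+FEFF).
$input = "a\u{200B}b\x00\nc\u{FEFF}";

// Negated character class: any \p{C} character except \n.
$viaClass = preg_replace('/[^\P{C}\n]+/u', '', $input);

// Negative lookahead: a \p{C} character, provided it is not \n.
$viaLookahead = preg_replace('/(?!\n)\p{C}/u', '', $input);

var_dump($viaClass === $viaLookahead); // bool(true)
var_dump($viaLookahead === "ab\nc");   // bool(true)
```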

Before you do it, replace newlines (I suppose you are using something like \n) with a placeholder string like ++++++++ (any string that will not be removed by your regular expression and does not naturally occur in your string in the first place), then run your preg_replace, then replace ++++++++ with \n again. Note that "\n" must be in double quotes; in single quotes PHP treats it as a literal backslash followed by n.
$string = str_replace("\n", '++++++++', $string); // Protect newlines with a placeholder
$string = preg_replace('/\p{C}+/u', '', $string); // Use your regexp
$string = str_replace('++++++++', "\n", $string); // Restore the newlines
That should do. If you are using <br/> instead of \n, simply use nl2br to preserve line breaks and replace <br/> instead of \n.

Related

Replace all non printable characters except newline characters

I want to replace all non printable characters, especially emojis from a text but want to retain the newline characters like \n and \r
I currently have this for escaping the non printable characters but it escapes \n and \r also:
preg_replace('/[[:^print:]]/', '', $value);

[:print:] is a POSIX character class for printable chars. If you use it in a negated character class, you can further add characters that you do not want to match with this pattern, i.e. you can use
preg_replace('/[^\r\n[:print:]]/', '', $value)
See the PHP demo:
$value = "One\tline\r\nThe second line";
echo preg_replace('/[^\r\n[:print:]]/', '', $value);
// => Oneline
// The second line
The [^\r\n[:print:]] pattern matches all chars but printable, CR and LF chars.

The general idea for a regex to "match something, but not when something else" is to first match the "something else" and then instruct the engine to skip it.
So something like...
preg_replace('/[\r\n](*SKIP)(*FAIL)|[[:^print:]]/', '', $value);
This matches newline characters, and then discards the match. Any other non-printable characters are still matched by the second half, and replaced with the empty string.
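A minimal sketch of the skip-and-fail pattern in action; the input string is invented for illustration:

```php
<?php
// Hypothetical input with a tab, a CRLF pair, and a BEL control character.
$value = "One\tline\r\nSecond\x07 line";

// CR and LF are matched by the first branch and then discarded via (*SKIP)(*FAIL);
// every other non-printable character falls through to the second branch.
$result = preg_replace('/[\r\n](*SKIP)(*FAIL)|[[:^print:]]/', '', $value);

var_dump($result === "Oneline\r\nSecond line"); // bool(true)
```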

I think this would do it:
preg_replace('/(?![\r\n])[[:^print:]]/', '', $value);
(?![\r\n]) - make sure the next char is neither \r nor \n
[[:^print:]] - match the non-printable char
An alternate solution with reversed logic to achieve the same goal would look like this:
preg_replace('/(?=[^\r\n])[[:^print:]]/', '', $value);
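Both lookahead variants can be compared side by side. This is only a sketch: the input is made up, it assumes PHP 7 or later for the \u{...} escape, and it assumes the default "C" locale, where every byte above 0x7F counts as non-printable. Without the u modifier the pattern works byte by byte, which is why the four bytes of the emoji are all stripped:

```php
<?php
// Hypothetical input: an emoji (U+1F600), a CRLF pair, and a tab.
$value = "Hi \u{1F600}\r\nthere\t!";

// Negative lookahead: the next byte must not be CR or LF.
$negative = preg_replace('/(?![\r\n])[[:^print:]]/', '', $value);

// Positive lookahead with a negated class: same condition, reversed logic.
$positive = preg_replace('/(?=[^\r\n])[[:^print:]]/', '', $value);

var_dump($negative === $positive);        // bool(true)
var_dump($negative === "Hi \r\nthere!");  // bool(true), under the locale assumption above
```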

How to replace all occurrences of \ only if they are not followed by n?

Seems to be relatively simple task that gets me stuck in one PHP application. I have a string which has a bunch of \n in it. Those are fine and need to stay there. However, there are also single occurrences of the character \ and those I need to replace or remove, let's just say with empty character without removing the ones that are followed by n.
The best I came up with was to first replace all \n with something else, some weird character, then replace the remaining \ with empty space and then convert back the weird character to \n. However, that seems to be a waste of time and resources, besides, nothing guarantees me that I'll find weird enough character that will never be encountered in the rest of the string...
Any tips?

You need
$s = preg_replace('~\\\\(?!n)~', '', $s);
See the PHP demo:
$s = '\\n \\t \\';
$s = preg_replace('~\\\\(?!n)~', '', $s);
echo $s; // => \n t
We need 4 backslashes to pass 2 literal backslashes to the regex engine to match 1 literal backslash in the input string. The (?!n) is a negative lookahead that fails all matches of a backslash that is immediately followed with n.

You should be able to do this with a negative lookahead assertion:
\\(?!n)
The \\ looks for the backslash, the (?!n) asserts that the next character is not an n, but does not match the character.
To use this in PHP:
$text = 'foo\nbar\nb\az\n';
$newtext = preg_replace('/\\\\(?!n)/', '', $text);
Details here: https://regex101.com/r/F2qhAP/1
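For completeness, a runnable sketch of the snippet above with its result spelled out. The single-quoted literals keep every backslash as a literal character:

```php
<?php
// In single quotes, \n and \a are two characters each: a backslash plus a letter.
$text = 'foo\nbar\nb\az\n';

// Four backslashes in the PHP literal become two for PCRE, which match one
// literal backslash; (?!n) rejects a backslash that is followed by n.
$newtext = preg_replace('/\\\\(?!n)/', '', $text);

var_dump($newtext === 'foo\nbar\nbaz\n'); // bool(true)
```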

PHP Regex: How to match \r and \n without using [\r\n]?

I have tested \v (vertical white space) for matching \r\n and their combinations, but I found out that \v does not match \r and \n. Below is my code that I am using..
$string = "
Test
";
if (preg_match("#\v+#", $string )) {
echo "Matched";
} else {
echo "Not Matched";
}
To be more clear, my question is, is there any other alternative to match \r\n?

PCRE and newlines
PCRE has a superfluity of newline related escape sequences and alternatives.
Well, a nifty escape sequence that you can use here is \R. By default \R will match Unicode newlines sequences, but it can be configured using different alternatives.
To match any Unicode newline sequence that fits in a single byte (plus CRLF):
preg_match('~\R~', $string);
This is equivalent to the following group:
(?>\r\n|\n|\r|\f|\x0b|\x85)
To match any Unicode newline sequence; including newline characters outside the ASCII range and both the line separator (U+2028) and paragraph separator (U+2029), you want to turn on the u (unicode) flag.
preg_match('~\R~u', $string);
The u (unicode) modifier turns on additional PCRE functionality, and both pattern and subject strings are treated as UTF-8.
This is equivalent to the following group:
(?>\r\n|\n|\r|\f|\x0b|\x85|\x{2028}|\x{2029})
It is possible to restrict \R to match CR, LF, or CRLF only:
preg_match('~(*BSR_ANYCRLF)\R~', $string);
This is equivalent to the following group:
(?>\r\n|\n|\r)
Additional
Five different conventions for indicating line breaks in strings are supported:
(*CR) carriage return
(*LF) linefeed
(*CRLF) carriage return, followed by linefeed
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
Note: \R does not have special meaning inside of a character class. Like other unrecognized escape sequences, it is treated as the literal character "R" by default.
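The three \R variants described above can be tried out with preg_split. This is a sketch with made-up sample strings; it assumes PHP 7 or later for the \u{...} escape:

```php
<?php
// Plain \R treats CRLF as a single line break and also matches lone LF and CR.
$mixed = preg_split('~\R~', "a\r\nb\nc\rd");
var_dump($mixed === ['a', 'b', 'c', 'd']); // bool(true)

// U+2028 (line separator) is only recognised when the u modifier is set.
$sep = "x\u{2028}y";
var_dump(count(preg_split('~\R~', $sep)));  // int(1)
var_dump(count(preg_split('~\R~u', $sep))); // int(2)

// (*BSR_ANYCRLF) restricts \R to CR, LF, and CRLF, so a vertical tab no longer splits.
$vt = "a\x0Bz";
var_dump(count(preg_split('~\R~', $vt)));               // int(2)
var_dump(count(preg_split('~(*BSR_ANYCRLF)\R~', $vt))); // int(1)
```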

This doesn't answer the question for alternatives, because \v works perfectly well
\v matches any character considered vertical whitespace; this includes the platform's carriage return and line feed characters (newline) plus several other characters, all listed in the table below.
You only need to change "#\v+#" to either
"#\\v+#" escape the backslash
or
'#\v+#' use single quotes
In both cases, you will get a match for any combination of \r and \n.
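The difference between the two quoting styles can be made visible with a short sketch. The variable names are chosen for illustration only:

```php
<?php
$string = "\nTest\n";

// In double quotes PHP itself converts \v into a vertical tab (0x0B)
// before PCRE ever sees the pattern.
$doubleQuoted = "#\v+#";

// In single quotes the backslash survives, so PCRE receives \v
// and interprets it as "any vertical whitespace".
$singleQuoted = '#\v+#';

var_dump(strlen($doubleQuoted)); // int(4)
var_dump(strlen($singleQuoted)); // int(5)

var_dump(preg_match($doubleQuoted, $string)); // int(0)
var_dump(preg_match($singleQuoted, $string)); // int(1)
```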
Update:
Just to make the scope of \v clear in comparison to \R, from perlrebackslash
\R
\R matches a generic newline; that is, anything considered a linebreak sequence by Unicode. This includes all characters matched by \v (vertical whitespace), ...

If there is some strange requirement that prevents you from using a literal [\r\n] in your pattern, you can always use hexadecimal escape sequences instead:
preg_match('#[\xD\xA]+#', $string)
This pattern is equivalent to [\r\n]+.

To match every line of a given string, simply use the ^ and $ anchors and instruct your regex engine to operate in multi-line mode. Then ^ and $ will match the start and end of each line instead of the start and end of the whole string.
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
In PHP, that is the m modifier after the pattern. /^(.*?)$/m will simply match each line of the given string, with lines separated by \n.
Btw: for line splitting, you could also use explode() and the PHP_EOL constant:
$lines = explode(PHP_EOL, $string);
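A short sketch of both ideas, using an invented three-line string. Since PHP_EOL depends on the platform the script runs on, the sketch uses preg_split with \R as one possible platform-independent alternative; that substitution is a suggestion, not part of the answer above:

```php
<?php
$string = "first\nsecond\nthird";

// With the m modifier, ^ and $ anchor at every line, so each line is one match.
preg_match_all('/^(.*?)$/m', $string, $matches);
var_dump($matches[1] === ['first', 'second', 'third']); // bool(true)

// Splitting on \R handles LF, CR, and CRLF alike, regardless of PHP_EOL.
$lines = preg_split('/\R/', "first\r\nsecond\nthird");
var_dump($lines === ['first', 'second', 'third']); // bool(true)
```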

The problem is that you need the multiline option, or dotall option if using dot. It goes at the end of the delimiter.
http://www.php.net/manual/en/regexp.reference.internal-options.php
$string = "
Test
";
if(preg_match("#\v+#m", $string ))
echo "Matched";
else
echo "Not Matched";

To match a newline in PHP, use the php constant PHP_EOL. This is crossplatform.
if (preg_match('/\v+' . PHP_EOL ."/", $text, $matches ))
print_R($matches );

This regex also matches newline \n and carriage return \r characters.
(?![ \t\f])\s
To match one or more newline or carriage return characters, you could use the below regex.
(?:(?![ \t\f])\s)+

Regular expressions - remove all non-alpha-numeric characters CRLF problem

First off, if it's not clear from the tag, I'm doing this in PHP - but that probably doesn't matter much.
I have this code:
$inputStr = strip_tags($inputStr);
$inputStr = preg_replace("/[^a-zA-Z\s]/", " ", $inputStr);
Which seems to remove all HTML tags and virtually all special and non-alphabetic characters perfectly. The one problem is, for some reason, it doesn't filter out carriage return/line feeds (just the combination).
If I add this line:
$inputStr = preg_replace("/\s+/", " ", $inputStr);
at the end, however, it works great. Can someone tell me:
Why doesn't the first preg_replace filter out the CR/LFs?
What is this second preg_replace actually doing? I understand the first one for the most part, but the second one is confusing me - it works but I don't know why.
Can I combine them into 1 line somehow?

You told it to remove everything except letters and whitespace. Newlines are whitespace, so they don't get removed. You could use \h instead of \s to only exclude horizontal whitespace.
It simply means "replace every sequence of one or more whitespace characters (\s+) with a single space."
preg_replace("/[^A-Za-z]+/", " ", ...) might do.
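To illustrate the combined one-liner, here is a sketch with a made-up input. Because whitespace is no longer excluded from the class, runs of punctuation, digits, and line breaks all collapse into a single space in one pass:

```php
<?php
// Hypothetical input containing punctuation, a CRLF pair, extra spaces, and digits.
$inputStr = "Hello,\r\n  world! 123";

// Every run of non-letters (including whitespace) becomes exactly one space.
$clean = preg_replace("/[^A-Za-z]+/", " ", $inputStr);

var_dump($clean === "Hello world "); // bool(true)
```

A trailing space remains when the input ends in non-letters; trim() would remove it if that matters.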

Your first regex is removing all characters that are not letters or whitespace. CRLFs are whitespace, so they aren't filtered out.
The second one is replacing whitespace with a space character. Essentially it condenses sequences of whitespace into a single space (due to the quantifier being greedy).
I suggest removing the \s from the first regex, see if that works.

\s matches whitespace such as \n.
It is replacing all whitespace characters with a space.
You could make it one unreadable line, but probably not one regex.

Matching duplicate whitespace with preg_replace

I'm writing a WordPress plugin, and one of the features is removing duplicate whitespace.
My code looks like this:
return preg_replace('/\s\s+/u', ' ', $text, -1, $count);
I don't understand why I need the u modifier. I've seen other plugins that use preg_replace and don't need to modify it for Unicode. I believe I have a default installation of WordPress.
Without the modifier, the code replaces all the spaces with Unicode replacement glyphs instead of spaces.
With the u modifier, I don't get the glyphs, and it doesn't replace all the whitespace.
Each space below has from 1-10 spaces. The regex only removes one space from each group.
Before:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
After:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
$count = 9
How can I make the regex replace the whole match with the one space?
Update: If I try this with regular php, it works fine
$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
It only breaks when I use it within the wordpress plugin.
I'm using this function in a filter:
function jje_test( $text ) {
$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
echo "Count: $count";
return $new_text;
}
add_filter('the_content', 'jje_test');
I have tried:
Removing all other filters on the_content
remove_all_filters('the_content');
Changing the priority of the filter added to the_content, earlier or later
All kinds of permutations of \s+, \s\s+, [ ]+ etc.
Even replacing all single spaces with an empty string, will not replace the spaces

This will replace all sequences of two or more spaces, tabs, and/or line breaks with a single space:
return preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);
You need the /u flag if $text holds text encoded as UTF-8. Even if there are no Unicode characters in your regex, PCRE has to interpret $text correctly.
I added \p{Z} to the character class because, at least in older PHP/PCRE versions, shorthands such as \s match only ASCII whitespace, even when using /u. Adding \p{Z} makes sure all Unicode space separators are matched. There might be other spaces, such as non-breaking spaces, in your string.
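A minimal sketch of this pattern on a string that mixes an ordinary space, a non-breaking space, and a tab. The sample text is an assumption, and the \u{...} escape needs PHP 7 or later:

```php
<?php
// Hypothetical input: "a", then space + NBSP (U+00A0) + tab, then "b", two spaces, "c".
$text = "a \u{00A0}\tb  c";

// Any run of two or more Unicode separators or whitespace characters becomes one space.
$collapsed = preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);

var_dump($collapsed === "a b c"); // bool(true)
```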
I'm not sure if using echo in a WordPress filter is a good idea.

The u modifier simply puts it into UTF-8 mode, which is useful if you need to do anything specific with characters that have a code point above 0x7f. You can still work on UTF-8 encoded strings without using that modifier, you just won't be able to specifically match or transform such characters easily.
There are some whitespace characters in Unicode that are above 0x7f. It's pretty rare to encounter them in most data. But you may see, for example, a non-breaking space character, which is Unicode U+00A0, or some rarer characters.
I don't know why using it would cause Unicode "replacement" glyphs to be output. I'd say it would be a problem elsewhere... what character encoding are you outputting your script as?

To answer jjeaton's follow-up question in the comments to my first reply, the following replaces each sequence of spaces, tabs, and/or line breaks with the first character in that sequence. Effectively, this deletes the second and following whitespace characters in each sequence of two or more whitespace characters. A run of spaces is replaced with a single space, a run of tabs is replaced with a single tab, etc. A run of a space and a tab (in that order) is replaced with a space, and a run of a tab and a space is replaced with a tab, etc.
return preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);
This regex works by first matching one whitespace character and capturing it with a capturing group, followed by one or more further whitespace characters. The replacement text simply reinserts the text matched by the first (and only) capturing group.
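The keep-the-first-character behaviour can be seen in a short sketch with an invented input:

```php
<?php
// Hypothetical input: "a", two spaces + tab, "b", two tabs + space, "c".
$text = "a  \tb\t\t c";

// Each run of two or more whitespace characters is replaced by its first character.
$result = preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);

var_dump($result === "a b\tc"); // bool(true)
```

The first run starts with a space and so collapses to a space; the second starts with a tab and collapses to a tab.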

Don't know about any modifiers, but this did the trick:
<?php
$text = ' Hi, my name is Andrés. ';
echo preg_replace(array('/^\s+/', '/\s+$/', '/\s{2,}/'), ' ', $text);
/*
Hi, my name is Andrés.
*/
?>

preg_replace('!\s+!', ' ', 'This sentence has extra space. This doesn’t. Extra space, Lots of extra space.');
