Replace all non printable characters except newline characters - php

I want to remove all non-printable characters, especially emojis, from a text, but keep newline characters such as \n and \r.
I currently have this for stripping the non-printable characters, but it also removes \n and \r:
preg_replace('/[[:^print:]]/', '', $value);

[:print:] is a POSIX character class for printable chars. If you use it in a negated character class, you can further add characters that you do not want to match with this pattern, i.e. you can use
preg_replace('/[^\r\n[:print:]]/', '', $value)
See the PHP demo:
$value = "One\tline\r\nThe second line";
echo preg_replace('/[^\r\n[:print:]]/', '', $value);
// => Oneline
// The second line
The [^\r\n[:print:]] pattern matches any char other than printable chars, CR, and LF.

The general idea for a regex to "match something, but not when something else" is to first match the "something else" and then instruct the engine to skip it.
So something like...
preg_replace('/[\r\n](*SKIP)(*FAIL)|[[:^print:]]/', '', $value);
This matches newline characters, and then discards the match. Any other non-printable characters are still matched by the second half, and replaced with the empty string.

I think this would do it:
preg_replace('/(?![\r\n])[[:^print:]]/', '', $value);
(?![\r\n]) - make sure the next char is neither \r nor \n
[[:^print:]] - match the non-printable char
An alternate solution with reversed logic to achieve the same goal would look like this:
preg_replace('/(?=[^\r\n])[[:^print:]]/', '', $value);
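To check that the approaches above behave the same way, here is a small comparison sketch. The sample string and the variable names are made up for illustration. It assumes no u modifier and the default "C" locale, so every byte of the UTF-8 encoded emoji counts as non-printable.

```php
<?php
// Sample input (hypothetical): a tab, an emoji, and a CRLF line break.
$value = "One\tline \u{1F600}\r\nThe second line";

$patterns = [
    'negated class' => '/[^\r\n[:print:]]/',
    'skip/fail'     => '/[\r\n](*SKIP)(*FAIL)|[[:^print:]]/',
    'lookahead'     => '/(?![\r\n])[[:^print:]]/',
];

$results = [];
foreach ($patterns as $name => $pattern) {
    // Each pattern removes non-printable bytes but leaves \r and \n alone.
    $results[$name] = preg_replace($pattern, '', $value);
}

// json_encode makes the surviving \r\n visible in the output.
echo json_encode($results, JSON_PRETTY_PRINT), PHP_EOL;
```

All three entries should come out identical: the tab and the emoji bytes are gone, the space and the \r\n pair remain.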

Related

How to remove special characters except for characters "ñ / Ñ" and dash "-" in PHP Laravel

I want to remove special characters in a string but have "ñ / Ñ" and "-" remain.
So if I have:
»¿Antonio Ramon-Peñaą
Result would be:
Antonio Ramon-Peña
You could use PCRE verbs to allow a set list of characters: if a character is on the list, skip it; every other character (one you do not care about) is removed.
preg_replace('/[\w\h-](*SKIP)(*FAIL)|./u', '', '»¿Antonio Ramon-Peñaą')
For a longer regex explanation see https://regex101.com/r/pCwGuy/1/. Basically, [\w\h-] matches any word character, horizontal whitespace character, or hyphen. The u modifier after the closing delimiter extends \w to word characters outside the ASCII set. The . matches any single character (except a newline by default).
Alternatively you could match all valid characters then rejoin them.
preg_match_all('/[\w\h-]+/u', '»¿Antonio Ramon-Peñaą', $matches);
echo implode('', $matches[0]);
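If the goal is strictly "ASCII letters, ñ / Ñ, spaces and dashes", an explicit whitelist is another option. The sketch below is not taken from the answers above. It assumes that nothing else should survive and that the input is valid UTF-8. Note that with the u modifier, \w may also accept other Unicode letters such as ą, which is why the letters are listed explicitly here.

```php
<?php
// Sample input built with escapes so the file encoding does not matter:
// U+00BB, U+00BF, then the name, then U+0105 (a with ogonek).
$input = "\u{BB}\u{BF}Antonio Ramon-Pe\u{F1}a\u{105}";

// Keep ASCII letters, n-tilde (both cases), horizontal whitespace and the hyphen;
// remove everything else. The u modifier treats pattern and subject as UTF-8.
$clean = preg_replace('/[^A-Za-z\x{F1}\x{D1}\h-]/u', '', $input);

echo $clean, PHP_EOL;
```

Digits are dropped by this pattern as well; add 0-9 to the class if they should be kept.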

PHP Regex: How to match \r and \n without using [\r\n]?

I have tested \v (vertical white space) for matching \r\n and their combinations, but I found out that \v does not match \r and \n. Below is my code that I am using..
$string = "
Test
";
if (preg_match("#\v+#", $string )) {
echo "Matched";
} else {
echo "Not Matched";
}
To be more clear, my question is, is there any other alternative to match \r\n?
PCRE and newlines
PCRE has a superfluity of newline related escape sequences and alternatives.
Well, a nifty escape sequence that you can use here is \R. By default \R will match Unicode newlines sequences, but it can be configured using different alternatives.
To match any Unicode newline sequence that fits in a single byte (no u modifier):
preg_match('~\R~', $string);
This is equivalent to the following group:
(?>\r\n|\n|\r|\f|\x0b|\x85)
To match any Unicode newline sequence, including newline characters outside the ASCII range such as the line separator (U+2028) and paragraph separator (U+2029), turn on the u (unicode) flag.
preg_match('~\R~u', $string);
The u (unicode) modifier turns on additional PCRE functionality; the pattern and subject strings are treated as UTF-8.
This is equivalent to the following group:
(?>\r\n|\n|\r|\f|\x0b|\x85|\x{2028}|\x{2029})
It is possible to restrict \R to match CR, LF, or CRLF only:
preg_match('~(*BSR_ANYCRLF)\R~', $string);
This is equivalent to the following group:
(?>\r\n|\n|\r)
Additional
Five different conventions for indicating line breaks in strings are supported:
(*CR) carriage return
(*LF) linefeed
(*CRLF) carriage return, followed by linefeed
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
Note: \R does not have special meaning inside of a character class. Like other unrecognized escape sequences, older PCRE versions treat it as the literal character "R" by default, while PCRE2 (used by PHP 7.3 and later) rejects it with a compile error.
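As a quick illustration of \R in practice, the following sketch splits a string with mixed line endings into lines. The sample text and variable names are invented for the example.

```php
<?php
// Hypothetical sample mixing Windows (\r\n), Unix (\n) and old Mac (\r) line endings.
$text = "first\r\nsecond\nthird\rfourth";

// \R treats \r\n as one line break, so no empty entries appear between lines.
$lines = preg_split('~\R~', $text);

print_r($lines);
```

Because \r\n is tried first as a unit, the Windows line ending does not produce an empty element.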
This doesn't answer the question for alternatives, because \v works perfectly well
\v matches any character considered vertical whitespace; this includes the platform's carriage return and line feed characters (newline) plus several other characters, all listed in the table below.
You only need to change "#\v+#" to either
"#\\v+#" escape the backslash
or
'#\v+#' use single quotes
In both cases, you will get a match for any combination of \r and \n.
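The difference between the two quoting styles can be checked directly. This is a minimal sketch with made-up variable names; it relies on PHP's double-quoted strings translating \v into a vertical tab before the pattern reaches PCRE.

```php
<?php
$string = "\nTest\n";

// Double quotes: PHP turns \v into a vertical tab (byte 0x0B) before PCRE sees it.
$doubleQuoted = "#\v+#";
// Single quotes: the backslash survives, so PCRE receives the \v escape.
$singleQuoted = '#\v+#';

var_dump(bin2hex($doubleQuoted));             // shows 0b in place of \v
var_dump(preg_match($doubleQuoted, $string)); // literal vertical tab, not found
var_dump(preg_match($singleQuoted, $string)); // vertical whitespace class, found
```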
Update:
Just to make the scope of \v clear in comparison to \R, from perlrebackslash
\R
\R matches a generic newline; that is, anything considered a linebreak sequence by Unicode. This includes all characters matched by \v (vertical whitespace), ...
If there is some strange requirement that prevents you from using a literal [\r\n] in your pattern, you can always use hexadecimal escape sequences instead:
preg_match('#[\xD\xA]+#', $string)
This pattern is equivalent to [\r\n]+.
To match every line of a given string, simply use the ^ and $ anchors and tell your regex engine to operate in multi-line mode. Then ^ and $ will match the start and end of each line, instead of the start and end of the whole string.
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
In PHP, that is the m modifier after the pattern. /^(.*?)$/m will simply match each line of the given string, with lines separated by newlines.
Btw: for line splitting, you could also use explode() and the PHP_EOL constant (this assumes the string uses the platform's own line ending):
$lines = explode(PHP_EOL, $string);
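A short sketch of the multi-line approach, using an invented sample string that contains only \n line endings. It splits on "\n" instead of PHP_EOL so the result does not depend on the platform; that substitution is an assumption made for this example.

```php
<?php
// Hypothetical sample using \n line endings only.
$string = "alpha\nbeta\ngamma";

// With the m modifier, ^ and $ match at the start and end of every line.
preg_match_all('/^(.*?)$/m', $string, $matches);
$linesFromRegex = $matches[1];

// Splitting on a known separator gives the same lines for this input.
$linesFromExplode = explode("\n", $string);

print_r($linesFromRegex);
```

For input with \r\n line endings, the captured lines would still carry a trailing \r, so some extra trimming would be needed there.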
The problem is that you need the multiline option, or the dotall option if using a dot. It goes after the closing delimiter.
http://www.php.net/manual/en/regexp.reference.internal-options.php
$string = "
Test
";
// Single quotes keep \v intact so that PCRE sees the escape sequence.
if (preg_match('#\v+#m', $string)) {
    echo "Matched";
} else {
    echo "Not Matched";
}
To match a newline in PHP, use the PHP constant PHP_EOL. It holds the line ending of the platform PHP runs on.
if (preg_match('/\v+' . PHP_EOL . '/', $text, $matches)) {
    print_r($matches);
}
This regex also matches newline \n and carriage return \r characters.
(?![ \t\f])\s
To match one or more newline or carriage return characters, you could use the below regex.
(?:(?![ \t\f])\s)+

remove in php any character but not symbols and letters

How can I use str_ireplace or other functions to remove any characters except letters, numbers, and symbols that are commonly used in HTML, such as " ' ; : . - + = etc.? I also want to remove \n, white space, tabs and the like.
I need the text that comes from ("textContent"). innerHTML in IE10 and Chrome to end up the same size in a PHP variable, regardless of which browser produced it. Therefore I need the same encoding in both texts, with rare or differing characters removed.
I tried this, but it doesn't work for me:
$textForMatch=iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);
$textoForMatc = str_replace(array('\s', "\n", "\t", "\r"), '', $textoForMatch);
$text contains the result of ("textContent"). innerHTML; I want to delete characters such as �é³.
The easiest option is to simply use preg_replace with a whitelist. I.e. use a pattern listing the things you want to keep, and replace anything not in that list:
$input = 'The quick brown 123 fox said "�é³". Man was I surprised';
$stripped = preg_replace('/[^-\w:";:+=\.\']/', '', $input);
$output = 'Thequickbrown123foxsaid"".ManwasIsurprised';
regex explanation
/ - start regex
[^ - begin negated character class: match any character NOT listed
- - literal hyphen
\w - match word characters. Equivalent to A-Za-z0-9_
:";:+= - literal characters
\. - escaped period (because a dot has meaning in a regex)
\' - escaped quote (because the string is in single quotes)
] - end character class
/ - end of regex
This will therefore remove anything that isn't a word character (letter, digit or underscore) or one of the specific characters listed in the regex.

UTF 8 String remove all invisible characters except newline

I'm using the following regex to remove all invisible characters from an UTF-8 string:
$string = preg_replace('/\p{C}+/u', '', $string);
This works fine, but how do I alter it so that it removes all invisible characters EXCEPT newlines? I tried some stuff using [^\n] etc. but it doesn't work.
Thanks for helping out!
Edit: newline character is '\n'
Use a "double negation":
$string = preg_replace('/[^\P{C}\n]+/u', '', $string);
Explanation:
\P{C} is the same as [^\p{C}].
Therefore [^\P{C}] is the same as \p{C}.
Since we now have a negated character class, we can subtract other characters like \n from it.
By using a negative assertion you can match a character class except what the assertion matches, so:
$res = preg_replace('/(?!\n)\p{C}/u', '', $input);
(PHP's dialect of regular expressions doesn't support character class subtraction which would, otherwise, be another approach: [\p{C}-[\n]].)
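Both ideas, the double negation and the negative lookahead, can be compared on a small input. The sample string is invented for illustration, and both patterns are written with the u modifier on the assumption that the input is UTF-8.

```php
<?php
// Hypothetical sample: zero-width space (U+200B), a newline and a tab.
$input = "a\u{200B}b\n\tc";

// Double negation: everything that is neither a non-control character nor \n.
$viaClass = preg_replace('/[^\P{C}\n]+/u', '', $input);

// Negative lookahead: any \p{C} character that is not \n.
$viaLookahead = preg_replace('/(?!\n)\p{C}/u', '', $input);

// json_encode makes the surviving newline visible.
echo json_encode([$viaClass, $viaLookahead]), PHP_EOL;
```

In both results the zero-width space and the tab are removed while the newline stays.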
Before you do it, replace newlines (I suppose you are using something like \n) with a random string like ++++++++ (any string that will not be removed by your regular expression and does not naturally occur in your string in the first place), then run your preg_replace, then replace ++++++++ with \n again.
$string = str_replace("\n", '++++++++', $string); // Replace newlines with the placeholder
$string = preg_replace('/\p{C}+/u', '', $string); // Use your regexp
$string = str_replace('++++++++', "\n", $string); // Insert the newlines again
That should do. If you are using <br/> instead of \n simply use nl2br to preserve line breaks and replace <br/> instead of \n

Regex to remove single characters from string

Consider the following strings
breaking out a of a simple prison
this is b moving up
following me is x times better
All strings are lowercased already. I would like to remove any "loose" a-z characters, resulting in:
breaking out of simple prison
this is moving up
following me is times better
Is this possible with a single regex in php?
$str = "breaking out a of a simple prison
this is b moving up
following me is x times better";
$res = preg_replace("#\\b[a-z]\\b ?#i", "", $str);
echo $res;
How about:
preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);
Note this also catches single characters that are at the beginning or end of the string, but not single characters that are adjacent to punctuation (they must be surrounded by whitespace).
If you also want to remove characters immediately before punctuation (e.g. 'the x.'), then this should work properly in most (English) cases:
preg_replace('/(^|\s)[a-z]\b/', '$1', $string);
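To try the two patterns above, here is a small sketch. The first three strings come from the question; the string 'see the x.' and the variable names are made up for the example.

```php
<?php
// Strings from the question.
$samples = [
    'breaking out a of a simple prison',
    'this is b moving up',
    'following me is x times better',
];

// Whitespace-delimited variant: keeps the leading whitespace via $1.
$cleaned = preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $samples);

// Word-boundary variant: also catches a letter directly before punctuation.
$beforePunctuation = preg_replace('/(^|\s)[a-z]\b/', '$1', 'see the x.');

print_r($cleaned);
echo $beforePunctuation, PHP_EOL;
```

Note that the word-boundary variant keeps the whitespace before the removed letter, so 'see the x.' ends up with a space before the period.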
As a one-liner:
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);
This matches a single lowercase letter (\p{Ll}) which is preceded or followed by whitespace (\s), removing both. The word boundaries (\b) ensure that only single letters are indeed matched. The /u modifier makes the regex Unicode-aware.
The result: A single letter surrounded by spaces on both sides is reduced to a single space. A single letter preceded by whitespace but not followed by whitespace is removed completely, as is a single letter only followed but not preceded by whitespace.
So
This a is my test sentence a. o How funny (what a coincidence a) this is!
is changed to
This is my test sentence. How funny (what coincidence) this is!
You could try something like this:
preg_replace('/\b\S\s\b/', "", $subject);
This is what it means:
\b # Assert position at a word boundary
\S # Match a single character that is a “non-whitespace character”
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
\b # Assert position at a word boundary
Update
As raised by Radu, because I've used the \S this will match more than just a-zA-Z. It will also match 0-9_. Normally, it would match a lot more than that, but because it's preceded by \b, it can only match word characters.
As mentioned in the comments by Tim Pietzcker, be aware that this won't work if your subject string needs to remove single characters that are followed by non word characters like test a (hello). It will also fall over if there are extra spaces after the single character like this
test a    hello
but you could fix that by changing the expression to \b\S\s*\b
Try this one:
$sString = preg_replace("#\b[a-z]{1}\b#m", ' ', $sString);
