PHP Regex: How to match \r and \n without using [\r\n]? - php

I have tested \v (vertical white space) for matching \r\n and their combinations, but I found out that \v does not match \r and \n. Below is the code I am using.
$string = "
Test
";
if (preg_match("#\v+#", $string )) {
echo "Matched";
} else {
echo "Not Matched";
}
To be more clear, my question is, is there any other alternative to match \r\n?

PCRE and newlines
PCRE has a superfluity of newline related escape sequences and alternatives.
Well, a nifty escape sequence that you can use here is \R. By default \R will match Unicode newline sequences, but it can be configured using different alternatives.
To match any newline sequence in the single-byte range (no u modifier):
preg_match('~\R~', $string);
This is equivalent to the following group:
(?>\r\n|\n|\r|\f|\x0b|\x85)
To match any Unicode newline sequence, including newline characters outside the ASCII range such as the line separator (U+2028) and paragraph separator (U+2029), turn on the u (unicode) flag:
preg_match('~\R~u', $string);
The u (unicode) modifier turns on additional functionality of PCRE, and pattern and subject strings are treated as UTF-8.
This is equivalent to the following group:
(?>\r\n|\n|\r|\f|\x0b|\x85|\x{2028}|\x{2029})
It is possible to restrict \R to match CR, LF, or CRLF only:
preg_match('~(*BSR_ANYCRLF)\R~', $string);
This is equivalent to the following group:
(?>\r\n|\n|\r)
Additional
Five different conventions for indicating line breaks in strings are supported:
(*CR) carriage return
(*LF) linefeed
(*CRLF) carriage return, followed by linefeed
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
Note: \R does not have special meaning inside of a character class. Like other unrecognized escape sequences, it is treated as the literal character "R" by default.
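As a rough illustration of the three variants above, one can count how many newline sequences each pattern finds. This is only a sketch: the sample string and variable names are made up for the example, it assumes PHP 7 or later (for the \u{...} string escape), and it assumes the PCRE build keeps the default "any Unicode newline" meaning of \R.

```php
<?php
// Hypothetical sample: CRLF, LF, CR, vertical tab, and U+2028 (line separator).
$sample = "a\r\nb\nc\rd\x0Be\u{2028}f";

// Plain \R works on bytes, so the UTF-8 encoding of U+2028 is not seen as a newline.
$byteMode = preg_match_all('~\R~', $sample);

// With the u modifier, U+2028 and U+2029 are matched as well.
$utf8Mode = preg_match_all('~\R~u', $sample);

// (*BSR_ANYCRLF) restricts \R to CR, LF, or CRLF.
$crlfOnly = preg_match_all('~(*BSR_ANYCRLF)\R~', $sample);

printf("%d %d %d\n", $byteMode, $utf8Mode, $crlfOnly); // 4 5 3
```

Note that CRLF counts as a single match in every variant, because \R treats the pair as one unit.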

This doesn't answer the question for alternatives, because \v works perfectly well
\v matches any character considered vertical whitespace; this includes the platform's carriage return and line feed characters (newline) plus several other characters, all listed in the table below.
You only need to change "#\v+#" to either
"#\\v+#" escape the backslash
or
'#\v+#' use single quotes
In both cases, you will get a match for any combination of \r and \n.
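The reason is that, inside a double-quoted PHP string, \v is itself an escape sequence for a vertical tab (0x0B), so PCRE never sees the backslash. A minimal sketch comparing the three spellings (the sample string is an assumption standing in for the one in the question):

```php
<?php
// Assumed sample with CRLF line endings around the word.
$string = "\r\nTest\r\n";

// In double quotes PHP converts \v to a vertical tab before PCRE sees the pattern.
var_dump("#\v+#" === "#\x0B+#"); // bool(true)

$doubleQuoted = preg_match("#\v+#", $string);  // looks for a literal vertical tab
$escaped      = preg_match("#\\v+#", $string); // PCRE receives \v
$singleQuoted = preg_match('#\v+#', $string);  // PCRE receives \v

var_dump($doubleQuoted, $escaped, $singleQuoted); // int(0) int(1) int(1)
```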
Update:
Just to make the scope of \v clear in comparison to \R, from perlrebackslash
\R
\R matches a generic newline; that is, anything considered a linebreak sequence by Unicode. This includes all characters matched by \v (vertical whitespace), ...

If there is some strange requirement that prevents you from using a literal [\r\n] in your pattern, you can always use hexadecimal escape sequences instead:
preg_match('#[\xD\xA]+#', $string)
This pattern is equivalent to [\r\n]+.
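A quick way to convince yourself of the equivalence is to run both patterns over the same input and compare the matches. The sample string below is hypothetical:

```php
<?php
// Hypothetical sample: a double CRLF followed later by a single LF.
$string = "a\r\n\r\nb\nc";

$hexCount   = preg_match_all('#[\xD\xA]+#', $string, $hexMatches);
$plainCount = preg_match_all('#[\r\n]+#', $string, $plainMatches);

var_dump($hexCount, $plainCount);        // int(2) int(2)
var_dump($hexMatches === $plainMatches); // bool(true)
```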

To match every line of a given string, simply use the ^ and $ anchors and tell your regex engine to operate in multi-line mode. Then ^ and $ will match the start and end of each line, instead of the start and end of the whole string.
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
In PHP, that would be the m modifier after the pattern. /^(.*?)$/m will simply match each line, separated by newlines inside the given string.
Btw: for line splitting, you could also use explode() and the PHP_EOL constant (this assumes the string uses the newline convention of the platform PHP runs on):
$lines = explode(PHP_EOL, $string);
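If the input may mix newline styles, splitting on a fixed delimiter will miss some of them. One possible alternative, not part of the answer above, is preg_split() with \R. A small sketch with a made-up sample string:

```php
<?php
// Hypothetical input mixing Windows, Unix, and old Mac line endings.
$text = "one\r\ntwo\nthree\rfour";

// \R treats CRLF as a single separator, so no empty lines appear between "one" and "two".
$lines = preg_split('~\R~', $text);

print_r($lines); // four elements: one, two, three, four
```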

The problem is that you need the multiline option, or dotall option if using dot. It goes at the end of the delimiter.
http://www.php.net/manual/en/regexp.reference.internal-options.php
$string = "
Test
";
if(preg_match("#\v+#m", $string ))
echo "Matched";
else
echo "Not Matched";

To match a newline in PHP, use the PHP constant PHP_EOL. This is cross-platform.
if (preg_match('/\v+' . PHP_EOL . "/", $text, $matches))
print_r($matches);

This regex also matches newline \n and carriage return \r characters.
(?![ \t\f])\s
To match one or more newline or carriage return characters, you could use the below regex.
(?:(?![ \t\f])\s)+

Related

Unexpected behavior of preg_replace() with regular expression containing \h on à [duplicate]

I want to replace all empty spaces on the beginning of all new lines. I have two regex replacements:
$txt = preg_replace("/^ +/m", '', $txt);
$txt = preg_replace("/^[^\S\r\n]+/m", '', $txt);
Each of them matches different kinds of empty spaces. However, there may be chances that both of the empty spaces exist and in different orders, so I want to match occurences of all of them at the beginning of new lines. How can I do that?
NOTE: The first regex matches an ideographic space, \u3000 char, which is only possible to check in the question raw body (SO rendering is not doing the right job here). The second regex matches only ASCII whitespace chars other than LF and CR. Here is a demo proving the second regex does not match what the first regex matches.
Since you want to remove any horizontal whitespace from a Unicode string you need to use
\h regex escape ("any horizontal whitespace character (since PHP 5.2.4)")
u modifier (see Pattern Modifiers)
Use
$txt = preg_replace("/^\h+/mu", '', $txt);
Details
^ - start of a line (m modifier makes ^ match all line start positions, not just string start position)
\h+ - one or more horizontal whitespaces
u modifier will make sure the Unicode text is treated as a sequence of Unicode code points, not just code units, and will make all regex escapes in the pattern Unicode aware.
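A short sketch of that replacement in action. The sample text is invented for the example and relies on the PHP 7+ \u{...} escape to produce the ideographic space (U+3000):

```php
<?php
// Hypothetical input: lines start with a mix of ideographic and ASCII spaces.
$txt = "\u{3000} first\n \u{3000}second\nthird";

// \h with the u modifier covers both kinds of horizontal whitespace, in any order.
$clean = preg_replace('/^\h+/mu', '', $txt);

var_dump($clean === "first\nsecond\nthird"); // bool(true)
```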

UTF 8 String remove all invisible characters except newline

I'm using the following regex to remove all invisible characters from an UTF-8 string:
$string = preg_replace('/\p{C}+/u', '', $string);
This works fine, but how do I alter it so that it removes all invisible characters EXCEPT newlines? I tried some stuff using [^\n] etc. but it doesn't work.
Thanks for helping out!
Edit: newline character is '\n'
Use a "double negation":
$string = preg_replace('/[^\P{C}\n]+/u', '', $string);
Explanation:
\P{C} is the same as [^\p{C}].
Therefore [^\P{C}] is the same as \p{C}.
Since we now have a negated character class, we can subtract other characters like \n from it.
By using a negative assertion you can match a character class except what the assertion matches, so:
$res = preg_replace('/(?!\n)\p{C}/u', '', $input);
(PHP's dialect of regular expressions doesn't support character class subtraction which would, otherwise, be another approach: [\p{C}-[\n]].)
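Both approaches can be checked side by side. The following is a sketch only, with an invented sample containing a zero-width space (U+200B), a bell character, a newline, and a tab; it uses the u modifier in both patterns because the input is UTF-8:

```php
<?php
// Hypothetical input: zero-width space, bell, newline, and tab mixed with letters.
$input = "a\u{200B}b\x07\nc\t";

// Double negation: everything in \p{C} except \n.
$viaClass = preg_replace('/[^\P{C}\n]+/u', '', $input);

// Negative lookahead: any \p{C} character that is not \n.
$viaLookahead = preg_replace('/(?!\n)\p{C}/u', '', $input);

var_dump($viaClass === "ab\nc", $viaLookahead === "ab\nc"); // bool(true) bool(true)
```

Note that the tab is removed too, since it belongs to the control category; add \t to the exclusions if it should survive.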
Before you do it, replace newlines (I suppose you are using something like \n) with a placeholder string like ++++++++ (any string that will not be removed by your regular expression and does not naturally occur in your string in the first place). Then run your preg_replace, and finally replace ++++++++ with \n again. Note the double quotes around "\n": in single quotes, '\n' is a literal backslash followed by n, not a newline.
$string = str_replace("\n", '++++++++', $string); // Replace \n with the placeholder
$string = preg_replace('/\p{C}+/u', '', $string); // Use your regexp
$string = str_replace('++++++++', "\n", $string); // Insert \n again
That should do. If you are using <br/> instead of \n, simply use nl2br to preserve line breaks and replace <br/> instead of \n.

How to replace different newline styles in PHP the smartest way?

I have a text which might have different newline styles.
I want to replace all newlines '\r\n', '\n','\r' with the same newline (in this case \r\n ).
What's the fastest way to do this? My current solution looks like this which is way sucky:
$sNicetext = str_replace("\r\n",'%%%%somthing%%%%', $sNicetext);
$sNicetext = str_replace(array("\r","\n"),array("\r\n","\r\n"), $sNicetext);
$sNicetext = str_replace('%%%%somthing%%%%',"\r\n", $sNicetext);
Problem is that you can't do this with one replace because the \r\n will be duplicated to \r\n\r\n .
Thank you for your help!
$string = preg_replace('~\R~u', "\r\n", $string);
If you don't want to replace all Unicode newlines but only CRLF style ones, use:
$string = preg_replace('~(*BSR_ANYCRLF)\R~', "\r\n", $string);
\R matches these newlines, u is a modifier to treat the input string as UTF-8.
From the PCRE docs:
What \R matches
By default, the sequence \R in a pattern matches any Unicode newline
sequence, whatever has been selected as the line ending sequence. If
you specify
--enable-bsr-anycrlf
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is selected when PCRE is built can be overridden when the library
functions are called.
and
Newline sequences
Outside a character class, by default, the escape sequence \R matches
any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the
following:
(?>\r\n|\n|\x0b|\f|\r|\x85)
This is an example of an "atomic group", details of which are given
below. This particular group matches either the two-character sequence
CR followed by LF, or one of the single characters LF (linefeed,
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
return, U+000D), or NEL (next line, U+0085). The two-character sequence
is treated as a single unit that cannot be split.
In UTF-8 mode, two additional characters whose codepoints are greater
than 255 are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). Unicode character property support is not needed for
these characters to be recognized.
It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
(BSR is an abbrevation for "backslash R".) This can be made the default
when PCRE is built; if this is the case, the other behaviour can be
requested via the PCRE_BSR_UNICODE option. It is also possible to
specify these settings by starting a pattern string with one of the
following sequences:
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
These override the default and the options given to pcre_compile() or
pcre_compile2(), but they can be overridden by options given to
pcre_exec() or pcre_dfa_exec(). Note that these special settings, which
are not Perl-compatible, are recognized only at the very start of a
pattern, and that they must be in upper case. If more than one of them
is present, the last one is used. They can be combined with a change of
newline convention; for example, a pattern can start with:
(*ANY)(*BSR_ANYCRLF)
They can also be combined with the (*UTF8) or (*UCP) special sequences.
Inside a character class, \R is treated as an unrecognized escape
sequence, and so matches the letter "R" by default, but causes an error
if PCRE_EXTRA is set.
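Wrapped into a small helper, the (*BSR_ANYCRLF) variant might look like the sketch below. The function name, its default argument, and the sample string are hypothetical and only serve the illustration:

```php
<?php
// Hypothetical helper: converts CR, LF, and CRLF to one chosen newline style.
function normalize_newlines(string $text, string $eol = "\r\n"): string
{
    // preg_replace returns null on failure; fall back to the input in that case.
    return preg_replace('~(*BSR_ANYCRLF)\R~', $eol, $text) ?? $text;
}

$mixed = "a\r\nb\nc\rd";

var_dump(normalize_newlines($mixed) === "a\r\nb\r\nc\r\nd"); // bool(true)
var_dump(normalize_newlines($mixed, "\n") === "a\nb\nc\nd"); // bool(true)
```

Because \R consumes an existing CRLF as one unit, already normalized line endings are not doubled.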
To normalize newlines I always use:
$str = preg_replace('~\r\n?~', "\n", $str);
It replaces the old Mac (\r) and the Windows (\r\n) newlines with the Unix equivalent (\n).
I prefer using \n because it only takes one byte instead of two, but you can easily change it to \r\n.
How about
$sNicetext = preg_replace('/\r\n|\r|\n/', "\r\n", $sNicetext);
I think the smartest/simplest way to convert to CRLF is:
$output = str_replace("\n", "\r\n", str_replace("\r", '', $input));
to convert to LF only:
$output = str_replace("\r", '', $input);
It's much easier than regular expressions.
$sNicetext = str_replace(["\r\n", "\r"], "\n", $sNicetext);
also works
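The two str_replace() variants behave differently when the input contains a lone \r (old Mac style). A sketch with an invented sample string:

```php
<?php
// Hypothetical input with one CRLF, one lone CR, and one LF.
$input = "a\r\nb\rc\nd";

// Array form: "\r\n" is replaced first, then any remaining "\r".
$toLf = str_replace(["\r\n", "\r"], "\n", $input);
var_dump($toLf === "a\nb\nc\nd"); // bool(true)

// Nested form: every "\r" is deleted first, so the lone CR between b and c is lost.
$toCrlf = str_replace("\n", "\r\n", str_replace("\r", '', $input));
var_dump($toCrlf === "a\r\nbc\r\nd"); // bool(true): b and c end up on one line
```

So the nested form is only safe when the text is known to contain no bare \r line endings.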

How to replace a string which contains any sequence of linefeeds and spaces with a single line feed

I'm trying to construct a PHP string replacement / regex function that takes a string with any sequence of linefeeds and spaces and replaces it with a single line feed.
Is this possible and, if so, how would it be done ?
You could try:
$text = preg_replace('/\s*[\r\n]+\s*/', "\n", $text);
It looks for at least one CR or LF, which covers Unix, Windows and old Mac line breaks, and removes any whitespace (space, tab, CR, LF) around it.
I would however remove the first \s* to ignore spaces on the preceding line.
The last \s* could also be [\r\n ]* if you want to keep tabs.
See also https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world if you want to refine this regex.
If the spaces/breaks can come in any sequence and from any potential OS, then this would be a shotgun approach:
$fixed_string = preg_replace('/[\s\n\r]+/', "\n", $bad_string);
It'll look for one-or-more whitespace (\s), newline (\n) and carriage return (\r) characters and replace them with a newline.
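The two patterns differ in what they do to spaces that are not next to a line break. A sketch with a made-up input:

```php
<?php
// Hypothetical input: blank lines and stray spaces between three lines of text.
$text = "line one  \r\n\r\n   line two\n\n\nline three";

// Requires at least one CR or LF, so spaces between words are left alone.
$collapsed = preg_replace('/\s*[\r\n]+\s*/', "\n", $text);
var_dump($collapsed === "line one\nline two\nline three"); // bool(true)

// Shotgun version: any whitespace run becomes a newline, including single spaces.
$shotgun = preg_replace('/[\s\n\r]+/', "\n", $text);
var_dump($shotgun === "line\none\nline\ntwo\nline\nthree"); // bool(true)
```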
Try this:
http://www.regular-expressions.info/tutorial.html
Great source material!

Matching duplicate whitespace with preg_replace

I'm writing a WordPress plugin, and one of the features is removing duplicate whitespace.
My code looks like this:
return preg_replace('/\s\s+/u', ' ', $text, -1, $count);
I don't understand why I need the u modifier. I've seen other plugins that use preg_replace and don't need to modify it for Unicode. I believe I have a default installation of WordPress.
Without the modifier, the code replaces all the spaces with Unicode replacement glyphs instead of spaces.
With the u modifier, I don't get the glyphs, and it doesn't replace all the whitespace.
Each space below has from 1-10 spaces. The regex only removes one space from each group.
Before:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
After:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
$count = 9
How can I make the regex replace the whole match with the one space?
Update: If I try this with regular php, it works fine
$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
It only breaks when I use it within the wordpress plugin.
I'm using this function in a filter:
function jje_test( $text ) {
$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
echo "Count: $count";
return $new_text;
}
add_filter('the_content', 'jje_test');
I have tried:
Removing all other filters on the_content
remove_all_filters('the_content');
Changing the priority of the filter added to the_content, earlier or later
All kinds of permutations of \s+, \s\s+, [ ]+ etc.
Even replacing all single spaces with an empty string, will not replace the spaces
This will replace all sequences of two or more spaces, tabs, and/or line breaks with a single space:
return preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);
You need the /u flag if $text holds text encoded as UTF-8. Even if there are no Unicode characters in your regex, PCRE has to interpret $text correctly.
I added \p{Z} to the character class because PCRE only matches ASCII characters when using shorthands such as \s, even when using /u. Adding \p{Z} makes sure all Unicode whitespace is matched. There might be other spaces such as non-breaking spaces in your string.
I'm not sure if using echo in a WordPress filter is a good idea.
The u modifier simply puts it into UTF-8 mode, which is useful if you need to do anything specific with characters that have a code point above 0x7f. You can still work on UTF-8 encoded strings without using that modifier, you just won't be able to specifically match or transform such characters easily.
There are some whitespace characters in Unicode that are above 0x7f. It's pretty rare to encounter them in most data. But you may see, for example, a non-breaking space character, which is U+00A0, or some rarer characters.
I don't know why using it would cause Unicode "replacement" glyphs to be output. I'd say it would be a problem elsewhere... what character encoding are you outputting your script as?
To answer jjeaton's follow-up question in the comments to my first reply, the following replaces each sequence of spaces, tabs, and/or line breaks with the first character in that sequence. Effectively, this deletes the second and following whitespace characters in each sequence of two or more whitespace characters. A run of spaces is replaced with a single space, a run of tabs is replaced with a single tab, etc. A run of a space and a tab (in that order) is replaced with a space, and a run of a tab and a space is replaced with a tab, etc.
return preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);
This regex works by first matching one space and capturing it with a capturing group, followed by one or more spaces. The replacement text simply reinserts the text matched by the first (and only) capturing group.
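To make the difference between the two replacements concrete, here is a sketch with an invented string that contains a non-breaking space (U+00A0) and tabs:

```php
<?php
// Hypothetical input: space + NBSP + space, then two tabs + space.
$text = "one \u{00A0} two\t\t three";

// Every run of two or more whitespace characters becomes a single space.
$collapsed = preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);
var_dump($collapsed === "one two three"); // bool(true)

// Every run is reduced to its own first character instead.
$keepFirst = preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);
var_dump($keepFirst === "one two\tthree"); // bool(true)
```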
Don't know about any modifiers, but this did the trick:
<?php
$text = ' Hi, my name is Andrés. ';
echo preg_replace(array('/^\s+/', '/\s+$/', '/\s{2,}/'), ' ', $text);
/*
Hi, my name is Andrés.
*/
?>
preg_replace('!\s+!', ' ', 'This sentence has extra space. This doesn’t. Extra space, Lots of extra space.');
