How to replace different newline styles in PHP the smartest way? - php

I have a text which might have different newline styles.
I want to replace all newlines '\r\n', '\n','\r' with the same newline (in this case \r\n ).
What's the fastest way to do this? My current solution looks like this which is way sucky:
$sNicetext = str_replace("\r\n",'%%%%somthing%%%%', $sNicetext);
$sNicetext = str_replace(array("\r","\n"),array("\r\n","\r\n"), $sNicetext);
$sNicetext = str_replace('%%%%somthing%%%%',"\r\n", $sNicetext);
Problem is that you can't do this with one replace because the \r\n will be duplicated to \r\n\r\n .
Thank you for your help!

$string = preg_replace('~\R~u', "\r\n", $string);
If you don't want to replace all Unicode newlines but only CRLF style ones, use:
$string = preg_replace('~(*BSR_ANYCRLF)\R~', "\r\n", $string);
\R matches these newlines, u is a modifier to treat the input string as UTF-8.
From the PCRE docs:
What \R matches
By default, the sequence \R in a pattern matches any Unicode newline
sequence, whatever has been selected as the line ending sequence. If
you specify
--enable-bsr-anycrlf
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is selected when PCRE is built can be overridden when the library
functions are called.
and
Newline sequences
Outside a character class, by default, the escape sequence \R matches
any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the
following:
(?>\r\n|\n|\x0b|\f|\r|\x85)
This is an example of an "atomic group", details of which are given
below. This particular group matches either the two-character sequence
CR followed by LF, or one of the single characters LF (linefeed,
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
return, U+000D), or NEL (next line, U+0085). The two-character sequence
is treated as a single unit that cannot be split.
In UTF-8 mode, two additional characters whose codepoints are greater
than 255 are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). Unicode character property support is not needed for
these characters to be recognized.
It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
(BSR is an abbreviation for "backslash R".) This can be made the default
when PCRE is built; if this is the case, the other behaviour can be
requested via the PCRE_BSR_UNICODE option. It is also possible to
specify these settings by starting a pattern string with one of the
following sequences:
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
These override the default and the options given to pcre_compile() or
pcre_compile2(), but they can be overridden by options given to
pcre_exec() or pcre_dfa_exec(). Note that these special settings, which
are not Perl-compatible, are recognized only at the very start of a
pattern, and that they must be in upper case. If more than one of them
is present, the last one is used. They can be combined with a change of
newline convention; for example, a pattern can start with:
(*ANY)(*BSR_ANYCRLF)
They can also be combined with the (*UTF8) or (*UCP) special sequences.
Inside a character class, \R is treated as an unrecognized escape
sequence, and so matches the letter "R" by default, but causes an error
if PCRE_EXTRA is set.
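To see the difference between the two \R behaviours described in the quoted documentation, here is a minimal sketch. The helper names toCrlfAny and toCrlfStrict are made up for this illustration, and the first pattern starts with (*BSR_UNICODE) so the result does not depend on how PCRE was built.

```php
<?php
// Hypothetical helpers, named only for this illustration.

// Full \R in 8-bit mode: CR, LF, CRLF, but also VT, FF and NEL.
function toCrlfAny(string $s): string
{
    return preg_replace('~(*BSR_UNICODE)\R~', "\r\n", $s);
}

// (*BSR_ANYCRLF) restricts \R to CR, LF and CRLF only.
function toCrlfStrict(string $s): string
{
    return preg_replace('~(*BSR_ANYCRLF)\R~', "\r\n", $s);
}

// Mixed newline styles plus one vertical tab (0x0B).
$input = "a\rb\nc\r\nd\x0Be";

var_dump(bin2hex(toCrlfAny($input)));    // the VT is converted as well
var_dump(bin2hex(toCrlfStrict($input))); // the VT is left alone
```

An existing CRLF is matched as one unit, so it is not doubled by either helper.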

To normalize newlines I always use:
$str = preg_replace('~\r\n?~', "\n", $str);
It replaces the old Mac (\r) and the Windows (\r\n) newlines with the Unix equivalent (\n).
I prefer using \n because it takes only one byte instead of two. If you want \r\n as the target instead, the pattern also has to match a lone \n, for example '~\r\n?|\n~'.

How about
$sNicetext = preg_replace('/\r\n|\r|\n/', "\r\n", $sNicetext);

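I think the simplest way to convert to CRLF is:
$output = str_replace("\n", "\r\n", str_replace("\r", '', $input));
To convert to LF only:
$output = str_replace("\r", '', $input);
It's much easier than regular expressions. Note that both versions simply delete every \r, so a lone \r (old Mac style) is lost rather than converted, and the two lines around it are joined.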

$sNicetext = str_replace(["\r\n", "\r"], "\n", $sNicetext);
also works, if \n is the newline you want to end up with.

Related

Replace any combination of (CR) and (LF) with a single (CRLF) in a text file using php

In different text files (or also inside single text file) I have different end-of-lines combinations (see example below).
How can I unify all combinations of CR and LF into one simple CRLF?
In short, I need to replace every combination of CR and LF with one single CRLF using PHP.
With str_replace I can replace them, but my problem is finding the right search string to use.
$textfile=str_replace("search string i need","CRLF to replace", $textfile);
Example of a generic text file to fix:
text line 1 CRLFLFCRCRLF
text line 2 LFLFCRLFCRCR
text line 3 CRLF
text line 4 CR
text line 5 LF
I need to replace all the \r \n random combinations with only \r\n like this:
text line 1 CRLF
text line 2 CRLF
text line 3 CRLF
text line 4 CRLF
text line 5 CRLF
PCRE has an alias for any newline combination: \R
You can do that:
$text = preg_replace('~\R~', "\r\n", $text);
In 8 bit mode, \R matches CR, LF, or CRLF, but also the vertical tabulation (VT), the form feed (FF) and the next line character (NEL).
In other words, \R is an alias for (?>\r\n|\n|\x0b|\f|\r|\x85). VT, FF and NEL are rarely (if ever) used today, but you can restrict \R to only CR, LF and CRLF by putting (*BSR_ANYCRLF) at the start of the pattern:
$text = preg_replace('~(*BSR_ANYCRLF)\R~', "\r\n", $text);
If you want to extend the meaning of \R to any Unicode newline sequence, use the u modifier:
$text = preg_replace('~\R~u', "\r\n", $text);
Concretely, it adds the Line Separator U+2028 and the Paragraph Separator U+2029 to the list of newline sequences.
Take care that \R is an alias and not a shorthand character class. So you can't put it inside a character class.
With intl transliterator.
It can be interesting to use the intl transliterator instead of a simple replacement function with regex or not, in particular if you need to include other modifications to your strings. All of them can be centralized in a unique set of rules:
$tls = Transliterator::createFromRules('[\r\n]+ > \r\n;');
$text = $tls->transliterate($text);
To replace any run of '\r' and '\n' characters with a single '\r\n', use:
$result = preg_replace('/[\r\n]+/', "\r\n", $text);
This will also replace single '\r' or '\n' with '\r\n'.
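You don't really need regex for that, but a single str_replace call with all three newline styles in one array does not work: str_replace processes the search strings one after another, each on the result of the previous replacement, so the \r\n it has just written gets replaced again. Normalize to \n first and convert to \r\n in a second pass:
str_replace("\n", "\r\n", str_replace(["\r\n", "\r"], "\n", $str));
Unlike the [\r\n]+ regex, this converts each newline individually and does not collapse a run of several newlines into one.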
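Note that str_replace with an array of search strings works through them in order, and each pass sees the output of the previous one. A short sketch of what that means for newline conversion (the variable names are only illustrative):

```php
<?php
$input = "a\r\nb\rc\nd";
$want  = "a\r\nb\r\nc\r\nd";

// One call, three needles: the "\r" pass rewrites the "\r" of every CRLF,
// and the "\n" pass then rewrites every "\n" again.
$naive = str_replace(["\r\n", "\r", "\n"], "\r\n", $input);

// Two passes that cannot interfere: everything to "\n", then "\n" to "\r\n".
$safe = str_replace("\n", "\r\n", str_replace(["\r\n", "\r"], "\n", $input));

var_dump($naive === $want); // false
var_dump($safe === $want);  // true
```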

PHP Regex: How to match \r and \n without using [\r\n]?

I have tested \v (vertical whitespace) for matching \r\n and their combinations, but I found that \v does not match \r and \n. Below is the code I am using:
$string = "
Test
";
if (preg_match("#\v+#", $string )) {
echo "Matched";
} else {
echo "Not Matched";
}
To be more clear, my question is, is there any other alternative to match \r\n?
PCRE and newlines
PCRE has a superfluity of newline related escape sequences and alternatives.
Well, a nifty escape sequence that you can use here is \R. By default \R will match Unicode newlines sequences, but it can be configured using different alternatives.
To match any Unicode newline sequence that is in the 8-bit range:
preg_match('~\R~', $string);
This is equivalent to the following group:
(?>\r\n|\n|\r|\f|\x0b|\x85)
To match any Unicode newline sequence, including the line separator (U+2028) and paragraph separator (U+2029), which lie outside the 8-bit range, turn on the u (unicode) modifier:
preg_match('~\R~u', $string);
The u modifier turns on additional PCRE functionality, and both pattern and subject strings are treated as UTF-8.
This is equivalent to the following group:
(?>\r\n|\n|\r|\f|\x0b|\x85|\x{2028}|\x{2029})
It is possible to restrict \R to match CR, LF, or CRLF only:
preg_match('~(*BSR_ANYCRLF)\R~', $string);
This is equivalent to the following group:
(?>\r\n|\n|\r)
Additional
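Five different conventions for indicating line breaks in strings are supported. They control where ^, $ and the dot see a line break; what \R matches is set separately, through the BSR options shown above: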
(*CR) carriage return
(*LF) linefeed
(*CRLF) carriage return, followed by linefeed
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
Note: \R does not have special meaning inside of a character class. Like other unrecognized escape sequences, it is treated as the literal character "R" by default.
This doesn't answer the question for alternatives, because \v works perfectly well
\v matches any character considered vertical whitespace; this includes the platform's carriage return and line feed characters (newline) plus several other characters, all listed in the table below.
You only need to change "#\v+#" to either
"#\\v+#" escape the backslash
or
'#\v+#' use single quotes
In both cases, you will get a match for any combination of \r and \n.
Update:
Just to make the scope of \v clear in comparison to \R, from perlrebackslash
\R
\R matches a generic newline; that is, anything considered a linebreak sequence by Unicode. This includes all characters matched by \v (vertical whitespace), ...
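The quoting difference is easy to check directly. A small sketch, with variable names chosen only for this example, that prints the bytes PCRE actually receives in each case and then runs both patterns:

```php
<?php
// In a double-quoted PHP string, "\v" is converted to a vertical tab (0x0B)
// before PCRE ever sees the pattern; in single quotes the backslash survives.
var_dump(bin2hex("#\v+#")); // string(8) "230b2b23"
var_dump(bin2hex('#\v+#')); // string(10) "235c762b23"

$string = "\r\nTest\r\n";

$doubleQuoted = preg_match("#\v+#", $string); // looks for a literal VT character
$singleQuoted = preg_match('#\v+#', $string); // PCRE's vertical-whitespace class

var_dump($doubleQuoted); // int(0)
var_dump($singleQuoted); // int(1)
```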
If there is some strange requirement that prevents you from using a literal [\r\n] in your pattern, you can always use hexadecimal escape sequences instead:
preg_match('#[\xD\xA]+#', $string)
This pattern is equivalent to [\r\n]+.
To match every line of a given string, simply use the ^ and $ anchors and tell the regex engine to operate in multi-line mode. Then ^ and $ match at the start and end of each line instead of only at the start and end of the whole string.
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
In PHP, that is the m modifier after the pattern. /^(.*?)$/m will simply match each line of the given string, with lines separated according to PCRE's newline convention (typically LF).
Btw: for line splitting, you could also use explode() and the PHP_EOL constant, as long as the string uses the newline style of the platform PHP runs on:
$lines = explode(PHP_EOL, $string);
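When the input may mix newline styles, splitting on \R is one alternative to explode(PHP_EOL, ...). This is only a sketch; the sample text and variable names are invented for the example.

```php
<?php
$text = "one\r\ntwo\nthree\rfour";

// explode() splits on exactly one separator, the platform's PHP_EOL,
// so the result depends on where the script runs.
$viaEol = explode(PHP_EOL, $text);

// \R restricted to CR, LF and CRLF splits on all three styles.
$viaRegex = preg_split('/(*BSR_ANYCRLF)\R/', $text);

print_r($viaRegex); // four elements: one, two, three, four
```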
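The multiline (m) and dotall (s) options go after the closing delimiter, but they only change how ^, $ and the dot behave, so they make no difference to \v. What does matter is the quoting: inside double quotes PHP turns \v into a literal vertical tab before PCRE sees the pattern, so use single quotes.
http://www.php.net/manual/en/regexp.reference.internal-options.php
$string = "
Test
";
if (preg_match('#\v+#', $string)) {
    echo "Matched";
} else {
    echo "Not Matched";
}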
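PHP_EOL holds the newline sequence of the platform PHP is running on, so it only matches text that uses that same convention; it does not make a pattern cross-platform. If that is what you want, quote it into the pattern:
if (preg_match('/(?:' . preg_quote(PHP_EOL, '/') . ')+/', $text, $matches))
    print_r($matches);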
This regex also matches newline \n and carriage return \r characters.
(?![ \t\f])\s
To match one or more newline or carriage return characters, you could use the below regex.
(?:(?![ \t\f])\s)+

UTF 8 String remove all invisible characters except newline

I'm using the following regex to remove all invisible characters from an UTF-8 string:
$string = preg_replace('/\p{C}+/u', '', $string);
This works fine, but how do I alter it so that it removes all invisible characters EXCEPT newlines? I tried some stuff using [^\n] etc. but it doesn't work.
Thanks for helping out!
Edit: newline character is '\n'
Use a "double negation":
$string = preg_replace('/[^\P{C}\n]+/u', '', $string);
Explanation:
\P{C} is the same as [^\p{C}].
Therefore [^\P{C}] is the same as \p{C}.
Since we now have a negated character class, we can subtract other characters like \n from what it matches.
By using a negative lookahead assertion you can match a character class except what the assertion matches, so:
$res = preg_replace('/(?!\n)\p{C}/u', '', $input);
(PHP's dialect of regular expressions doesn't support character class subtraction which would, otherwise, be another approach: [\p{C}-[\n]].)
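The two approaches, the negated class and the lookahead, can be compared side by side. A minimal sketch, assuming UTF-8 input (hence the u modifier on both patterns); the sample string is made up for the example:

```php
<?php
// Zero-width space (U+200B, category Cf) and BEL (0x07, category Cc)
// are both in \p{C}. So is "\n" (Cc), which we want to keep.
$input = "a\u{200B}b\nc\x07";

// Double negation: "not (a non-C character or a newline)".
$viaClass = preg_replace('/[^\P{C}\n]+/u', '', $input);

// Lookahead: any C character that is not a newline.
$viaLookahead = preg_replace('/(?!\n)\p{C}/u', '', $input);

var_dump($viaClass === "ab\nc");     // true
var_dump($viaLookahead === "ab\nc"); // true
```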
Before you do it, replace newlines (I suppose you are using something like \n) with a random string like ++++++++ (any string that will not be removed by your regular expression and does not naturally occur in your string in the first place), then run your preg_replace, then replace ++++++++ with \n again.
$string = str_replace("\n", '++++++++', $string); // Replace newlines (double quotes, so \n is a real newline)
$string = preg_replace('/\p{C}+/u', '', $string); // Use your regexp
$string = str_replace('++++++++', "\n", $string); // Insert the newlines again
That should do. If you are using <br/> instead of \n simply use nl2br to preserve line breaks and replace <br/> instead of \n

How to replace a string which contains any sequence of linefeeds and spaces with a single line feed

I'm trying to construct a PHP string replacement / regex function that takes a string with any sequence of linefeeds and spaces and replaces it with a single line feed.
Is this possible and, if so, how would it be done ?
You could try:
$text = preg_replace('/\s*[\r\n]+\s*/', "\n", $text);
It will look out for a single CR or LF to detect Unix, Windows and old Mac line breaks. And after that any whitespace (space, tab, CR, LF) will be removed.
I would however remove the first \s* to ignore spaces on the preceding line.
The last \s* could also be [\r\n ]* if you want to keep tabs.
See also https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world if you want to refine this regex.
If the spaces/breaks can come in any sequence and from any potential OS, then this would be a shotgun approach:
$fixed_string = preg_replace('/[\s\n\r]+/', "\n", $bad_string);
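It'll look for one or more whitespace characters and replace each run with a newline. (\s already includes \n and \r, so listing them again is redundant.) Be aware that this also turns the ordinary spaces between words into newlines.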
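To make the difference between the two patterns concrete, here is a small sketch that runs both on the same input. The sample text and variable names are only illustrative.

```php
<?php
$text = "first line  \r\n \n   second line";

// Only runs of whitespace that contain at least one CR or LF are collapsed.
$collapsed = preg_replace('/\s*[\r\n]+\s*/', "\n", $text);

// Every run of whitespace is replaced, including single spaces between words.
$shotgun = preg_replace('/[\s\n\r]+/', "\n", $text);

var_dump($collapsed === "first line\nsecond line");   // true
var_dump($shotgun === "first\nline\nsecond\nline");   // true
```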
Try this:
http://www.regular-expressions.info/tutorial.html
Great source material!

Remove control characters from PHP string

How can I remove control characters like STX from a PHP string? I played around with
preg_replace("/[^a-zA-Z0-9 .\-_;!:?äÄöÖüÜß<>='\"]/","",$pString)
but found that it removed way too much. Is there a way to remove only control chars?
If you mean by control characters the first 32 ascii characters and \x7F (that includes the carriage return, etc!), then this will work:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
(Note the single quotes: inside double quotes PHP itself would turn \x00 into a real NUL byte, and older PHP versions reject a pattern that contains a NUL byte.)
The line feed and carriage return (often written \n and \r) may be saved from removal like so:
preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
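A quick check of the two character classes above, as a sketch with an invented sample string. Note that the tab (0x09) is removed by both:

```php
<?php
// STX (0x02), tab (0x09), CRLF and DEL (0x7F) mixed into plain text.
$input = "A\x02B\tC\r\nD\x7F";

// All ASCII control characters, including CR and LF.
$all = preg_replace('/[\x00-\x1F\x7F]/', '', $input);

// Same range with 0x0A (LF) and 0x0D (CR) left out of the class.
$keepNewlines = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $input);

var_dump($all === "ABCD");              // true
var_dump($keepNewlines === "ABC\r\nD"); // true
```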
I must say that I think Bobby's answer is better, in the sense that [:cntrl:] better conveys what the code does than [\x00-\x1F\x7F].
WARNING: ereg_replace is deprecated in PHP >= 5.3.0 and removed in PHP >= 7.0.0!, please use preg_replace instead of ereg_replace:
preg_replace('/[[:cntrl:]]/', '', $input);
For Unicode input, this will remove all control characters, unassigned, private use, formatting and surrogate code points (that are not also space characters, such as tab, new line) from your input text. I use this to remove all non-printable characters from my input.
<?php
$clean = preg_replace('/[^\PC\s]/u', '', $input);
for more info on \p{C} see http://www.regular-expressions.info/unicode.html#category
PHP does support POSIX-Classes so you can use [:cntrl:] instead of some fancy character-magic-stuff:
ereg_replace("[:cntrl:]", "", $pString);
Edit:
An extra pair of square brackets might be needed in 5.3.
ereg_replace("[[:cntrl:]]", "", $pString);
TLDR Answer
Use this Regex...
/[\p{Cc}\p{Cn}\p{Cs}]/u
Like this...
$text = preg_replace('/[\p{Cc}\p{Cn}\p{Cs}]/u', '', $text);
TLDR Explanation
\p{Cc} : Match control characters.
\p{Cn} : Match unassigned code points.
\p{Cs} : Match surrogate code points. (These cannot occur in valid UTF-8, and with the u modifier preg_replace returns null for invalid UTF-8 input, so this part is only a safeguard.)
Everything the class matches is replaced with the empty string, i.e. removed.
Working Demo
Simple demo to demonstrate:
$text = "\u{0019}hello";
print($text . "\n\n");
$text = preg_replace('/[\p{Cc}\p{Cn}\p{Cs}]/u', '', $text);
print($text);
Output:
(-Broken-Character)hello
hello
Alternatives
\p{C} : Remove all invisible characters (the whole "Other" category), keeping only visible ones.
\p{Cc} : Remove control characters only.
[\p{Cc}\p{Cn}] : Remove control characters and unassigned code points.
[\p{Cc}\p{Cn}\p{Cs}] : Remove control characters, unassigned code points and surrogates.
[\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Remove control characters, unassigned code points, surrogates and formatting characters.
Source and Explanation
Take a look at the Unicode character properties that can be tested within a regex. Many regex engines support them in some form (PHP, Perl, Java, .NET and Ruby among them), though the exact syntax and the set of supported properties vary. Knowing Unicode character classes is fairly transferable knowledge, so I recommend using it!
This class will match anything visible, given in its short-hand form...
[\p{L}\p{M}\p{N}\p{P}\p{S}\p{Z}]
The long-hand names (\p{Letter}, \p{Mark}, \p{Number}, \p{Punctuation}, \p{Symbol}, \p{Separator}) are accepted by some regex flavors but not necessarily by the PCRE version behind PHP, so the short forms are the safer choice.
\p{...} matches characters that have the property, and \P{...} (capitalized) matches characters that do not. PHP supports both, and either can be used inside a character class.
A simpler regex then would be \p{C}, but this might be too aggressive because it also deletes invisible formatting characters. You may want to look closely and see what's best, but one of the alternatives should fit your needs.
All Matchable Unicode Character Sets
If you want to know any other character sets available, check out regular-expressions.info...
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
To keep the control characters but make them compatible with JSON, I had to do:
$str = preg_replace(
array(
'/\x00/', '/\x01/', '/\x02/', '/\x03/', '/\x04/',
'/\x05/', '/\x06/', '/\x07/', '/\x08/', '/\x09/', '/\x0A/',
'/\x0B/','/\x0C/','/\x0D/', '/\x0E/', '/\x0F/', '/\x10/', '/\x11/',
'/\x12/','/\x13/','/\x14/','/\x15/', '/\x16/', '/\x17/', '/\x18/',
'/\x19/','/\x1A/','/\x1B/','/\x1C/','/\x1D/', '/\x1E/', '/\x1F/'
),
array(
"\u0000", "\u0001", "\u0002", "\u0003", "\u0004",
"\u0005", "\u0006", "\u0007", "\u0008", "\u0009", "\u000A",
"\u000B", "\u000C", "\u000D", "\u000E", "\u000F", "\u0010", "\u0011",
"\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018",
"\u0019", "\u001A", "\u001B", "\u001C", "\u001D", "\u001E", "\u001F"
),
$str
);
(The JSON rules state: “All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).”)
regex free method
If you are only zapping the control characters I'm familiar with (those under 32 and 127), try this out:
for ($control = 0; $control < 32; $control++) {
    $pString = str_replace(chr($control), "", $pString);
}
$pString = str_replace(chr(127), "", $pString);
The loop gets rid of all but DEL, which we just add to the end.
I'm thinking this will be a lot less stressful on you and the script than dealing with regex and the regex library.
Updated regex free method
Just for kicks, I came up with another way to do it. This one does it using an array of control characters:
$ctrls = range(chr(0), chr(31));
$ctrls[] = chr(127);
$clean_string = str_replace($ctrls, "", $string);
