How do I write a regex that matches whitespace but not tabs or newlines?
Thanks, everyone.
[[:blank:]]{2,} <-- This isn't good enough for me, because it matches spaces and tabs (though not newlines).
As per my original comment, you can use this.
Code
See regex in use here
Note: The link contains whitespace characters: tab, newline, and space. Only space is matched.
[^\S\t\n\r]
So your regex would be [^\S\t\n\r]{2,}
Explanation
[^\S\t\n\r] Match any character not present in the set.
\S Matches any non-whitespace character. Since it's a double negative it will actually match any whitespace character. Adding \t, \n, and \r to the negated set ensures we exclude those specific characters as well. Basically, this regex is saying:
Match any whitespace character except \t\n\r
This principle in regex is often used with word characters \w to negate the underscore _ character: [^\W_]
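A minimal sketch of the pattern above in PHP, in case it helps. The function name collapse_spaces and the sample string are made up for illustration. Note that the class still matches the rarer whitespace characters (such as form feed) that are not listed in the negated set.

```php
<?php
// Hypothetical helper: collapse runs of 2+ spaces into one space,
// leaving tabs and line breaks untouched.
function collapse_spaces(string $text): string
{
    // [^\S\t\n\r] = any whitespace that is not a tab, line feed, or carriage return.
    return preg_replace('/[^\S\t\n\r]{2,}/', ' ', $text);
}

$sample = "a    b\t\tc\n\nd";
// json_encode() is used only to make the remaining tabs and newlines visible.
echo json_encode(collapse_spaces($sample)), "\n";
```

The double tab and the double newline survive because \t and \n are excluded from the class; only the run of spaces is collapsed.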
[ ]{2,} works normally (not sure about php)
or even / {2,}/
I need to sanitize some user input by removing every character that can cause problems (such as the null byte) or that is useless (such as \n or \t). The inputs are either plain strings or HTML code.
At the moment I'm using this to remove tabs, line breaks, etc.:
preg_replace('/\s+/','',$_POST['id'])
but that isn't sufficient. I have found this:
preg_replace( '/[^[:print:]]/',' ',$_POST['val'])
But I don't understand whether it also strips characters that shouldn't be deleted, such as German or Arabic characters, punctuation, or symbols.
According to the PHP documentation on regex character classes, [:print:] includes "printing characters, including space".
That means "Visible characters and spaces (i.e. anything except control characters, etc.)" (see http://www.regular-expressions.info/posixbrackets.html)
For ASCII characters this is equivalent to: [\x20-\x7E]
The Unicode counterpart (used with the u modifier) is: \P{C}
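To make the difference concrete, here is a small sketch comparing the two approaches on a UTF-8 string. The sample input is made up, and the byte-wise result assumes PHP's default "C" locale, where bytes above 0x7E do not count as printable.

```php
<?php
// UTF-8 "Grüße" followed by an embedded NUL byte and "!".
$input = "Gr\u{00FC}\u{00DF}e\x00!";

// Byte-oriented: without the u modifier every byte of a multibyte
// character is tested separately against [:print:].
$bytewise = preg_replace('/[^[:print:]]/', ' ', $input);

// Unicode-aware: \p{C} is the "Other" category (control, format,
// unassigned, private use, surrogate), so letters like ü and ß are kept.
$unicode = preg_replace('/\p{C}/u', ' ', $input);

var_dump($bytewise === $input); // was the non-ASCII text left alone?
var_dump($unicode);
```

So for input that may contain German or Arabic text, the \p{C} form with the u modifier is the safer choice; the [:print:] form is only appropriate for pure ASCII input.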
I am looking to remove all non-alphanumeric characters from a string and replace them with a space (using PHP). The input comes from a textarea that has data pasted into it from various places such as Word, Excel, websites, emails, etc.
I was using this regex
/[^a-zA-Z0-9\s]/
But I found that there are still Vertical Tabs (ascii #13). I want my end result to only include letters and numbers, no newline, tab, vertical tabs etc
Many thanks!
Vertical tabs (and carriage returns, which are ASCII 13) are matched by the whitespace class \s, so your negated class [^a-zA-Z0-9\s] leaves them in place.
If you want to replace every non-alpha-numeric character with a space, use
preg_replace('/[^a-zA-Z0-9]/', ' ', $string)
If you want to replace every group (consecutive characters) of non-alnums with a single space, use
preg_replace('/[^a-zA-Z0-9]+/', ' ', $string)
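For what it's worth, the second pattern can be wrapped up roughly like this. The function name alnum_only is made up, and the trim() call is my own addition to drop the space left at either end.

```php
<?php
// Hypothetical helper: keep only ASCII letters and digits, with every run
// of other characters (tabs, vertical tabs, CR/LF, punctuation) collapsed
// into a single space.
function alnum_only(string $s): string
{
    $collapsed = preg_replace('/[^a-zA-Z0-9]+/', ' ', $s);
    // trim() removes the space produced by leading or trailing junk.
    return trim($collapsed);
}

echo alnum_only("Item\x0B#42:\r\n\tready!"), "\n"; // Item 42 ready
```

Note that this is ASCII-only: accented or non-Latin letters are treated as non-alphanumeric and replaced as well.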
try removing the \s
/[^a-zA-Z0-9]/
\s matches whitespace, including vertical whitespace such as line breaks, so keeping it in the negated class preserves those characters.
So just remove that:
/[^a-zA-Z0-9]/
I know that there are many types of space (em space, en space, thin space, non-breaking space, etc.), but all the ones I mentioned have HTML entities (at least, PHP's htmlentities() returns an entity for them).
But, what about those spaces that have no HTML entities?
Example: [example URL not valid anymore]
Look at the nickname of this account. It has many " " (spaces) at the front, which are visible for us (this doesn't happen with the ).
I have already tried filtering with regular expressions using \x escapes, and with str_replace() using the space as the argument, with no luck at all!
Do you have any suggestion on how to filter ALL types of whitespace?
By default, \s will not match whitespace characters outside the ASCII range (values of 128 and above). To get at those, you can instead make good use of other UTF-8-aware sequences.
(Standard disclaimer: I'm skimming the PCRE source code to compile the lists below, I may miss a character or type something incorrectly. Please forgive me.)
\p{Zs} matches:
U+0020 Space
U+00A0 No-break space
U+1680 Ogham space mark
U+180E Mongolian vowel separator
U+2000 En quad
U+2001 Em quad
U+2002 En space
U+2003 Em space
U+2004 Three-per-em space
U+2005 Four-per-em space
U+2006 Six-per-em space
U+2007 Figure space
U+2008 Punctuation space
U+2009 Thin space
U+200A Hair space
U+202F Narrow no-break space
U+205F Medium mathematical space
U+3000 Ideographic space
\h (Horizontal whitespace) matches the same as \p{Zs} above, plus
U+0009 Horizontal tab.
Similarly for matching vertical whitespace there are a few options.
\p{Zl} matches U+2028 Line separator.
\p{Zp} matches U+2029 Paragraph separator.
\v (Vertical whitespace) matches \p{Zl}, \p{Zp} and the following
U+000A Linefeed
U+000B Vertical tab
U+000C Formfeed
U+000D Carriage return
U+0085 Next line
Going back to the beginning, in UTF-8 mode (i.e. using the u pattern modifier) \s will match any character that \p{Z} matches (which is anything that \p{Zs}, \p{Zl} and \p{Zp} will match), plus
U+0009 Horizontal tab
U+000A Linefeed
U+000C Formfeed
U+000D Carriage return
To cut a long story short (I bet you read all of the above, didn't you?) you might want to use \s but make sure to be in UTF-8 mode like /\s/u. Putting that to some practical use, to filter out those matching whitespace characters from a string you would do something like
$new_string = preg_replace('/\s/u', '', $old_string);
Finally, if you really, really care about the vertical whitespace characters which aren't included in the \s list above (VT and NEL), then you can use the character class [\s\v] to match all 26 of the whitespace characters listed above.
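As a quick sanity check of the [\s\v] class with the u modifier, here is a small sketch. The function name strip_all_whitespace and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: remove every whitespace character listed above.
function strip_all_whitespace(string $s): string
{
    // The u modifier makes \s Unicode-aware; \v adds the vertical
    // whitespace characters in case \s does not cover them.
    return preg_replace('/[\s\v]+/u', '', $s);
}

// No-break space, em space, ideographic space, line separator,
// next line (NEL), and vertical tab mixed into ordinary text.
$s = "\u{00A0}\u{2003}foo\u{3000}bar\u{2028}\u{0085}\x0B";
var_dump(strip_all_whitespace($s) === "foobar");
```

The input must be valid UTF-8 for this to work; with the u modifier, preg_replace() returns null on malformed input.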
They are all plain spaces (returning character code 32) that can be caught with regular expressions or trim().
Try this:
preg_replace("/\s{2,}/", " ", $text);
$result = preg_replace('/\s/', '', $yourString);
See http://www.php.net/manual/en/regexp.reference.backslash.php for more information on \s.
How can I remove control characters like STX from a PHP string? I played around with
preg_replace("/[^a-zA-Z0-9 .\-_;!:?äÄöÖüÜß<>='\"]/","",$pString)
but found that it removed way too much. Is there a way to remove only control chars?
If you mean by control characters the first 32 ascii characters and \x7F (that includes the carriage return, etc!), then this will work:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
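(Note the single quotes: inside double quotes PHP itself turns \x00 into a literal NUL byte before the pattern reaches the regex engine, and older PHP versions reject a pattern that contains a NUL byte.)
The line feed and carriage return (often written \n and \r) may be saved from removal like so: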
preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
I must say that I think Bobby's answer is better, in the sense that [:cntrl:] better conveys what the code does than [\x00-\x1F\x7F].
WARNING: ereg_replace is deprecated as of PHP 5.3.0 and removed as of PHP 7.0.0. Please use preg_replace instead of ereg_replace:
preg_replace('/[[:cntrl:]]/', '', $input);
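A variation on the patterns above, in case you also want to keep the tab along with LF and CR. This is only a sketch; the function name strip_ascii_controls is made up.

```php
<?php
// Hypothetical helper: remove ASCII control characters but keep
// tab (\x09), line feed (\x0A), and carriage return (\x0D).
function strip_ascii_controls(string $input): string
{
    // Single quotes, so the regex engine (not PHP) interprets the \x escapes.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
}

// json_encode() is used only to make the surviving control characters visible.
echo json_encode(strip_ascii_controls("\x02name\tvalue\x00\r\n\x7F")), "\n";
```

Because all of these are single bytes below 0x80, which never occur inside a multibyte UTF-8 sequence, the pattern is safe on UTF-8 input even without the u modifier.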
For Unicode input, this will remove all control characters, unassigned, private use, formatting and surrogate code points (that are not also space characters, such as tab, new line) from your input text. I use this to remove all non-printable characters from my input.
<?php
$clean = preg_replace('/[^\PC\s]/u', '', $input);
For more information on \p{C}, see http://www.regular-expressions.info/unicode.html#category
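One thing to watch out for with the u modifier: if the input is not valid UTF-8, preg_replace() returns null instead of a string. A minimal sketch of handling that follows. The function name strip_nonprintable and the sample strings are made up.

```php
<?php
// Hypothetical wrapper around the pattern above. Returns null when the
// input is not valid UTF-8, so the caller can decide what to do with it.
function strip_nonprintable(string $input): ?string
{
    // [^\PC\s] = characters in the "Other" category that are not whitespace.
    $clean = preg_replace('/[^\PC\s]/u', '', $input);
    // On malformed UTF-8, $clean is null and preg_last_error()
    // reports PREG_BAD_UTF8_ERROR.
    return $clean;
}

// Zero-width space (a format character) and BEL are removed; space and \n stay.
var_dump(strip_nonprintable("ok\u{200B}\x07 line\n"));

// \xFF can never appear in valid UTF-8, so this call yields null.
var_dump(strip_nonprintable("bad \xFF byte"));
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```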
PHP does support POSIX-Classes so you can use [:cntrl:] instead of some fancy character-magic-stuff:
ereg_replace("[:cntrl:]", "", $pString);
Edit:
An extra pair of square brackets might be needed in 5.3.
ereg_replace("[[:cntrl:]]", "", $pString);
TLDR Answer
Use this regex:
/[\p{Cc}\p{Cn}\p{Cs}]/u
Like this:
$text = preg_replace('/[\p{Cc}\p{Cn}\p{Cs}]/u', '', $text);
TLDR Explanation
\p{Cc} : Matches control characters.
\p{Cn} : Matches unassigned code points.
\p{Cs} : Matches surrogate code points, which are not valid in UTF-8.
Working Demo
Simple demo to demonstrate: IDEOne Demo
$text = "\u{0019}hello";
print($text . "\n\n");
$text = preg_replace('/[\p{Cc}\p{Cn}\p{Cs}]/u', '', $text);
print($text);
Output:
(-Broken-Character)hello
hello
Alternatives
\p{C} : Removes every invisible "Other" character (control, format, private-use, surrogate, and unassigned), keeping only visible characters and separators.
\p{Cc} : Removes control characters only.
[\p{Cc}\p{Cn}] : Removes control and unassigned characters.
[\p{Cc}\p{Cn}\p{Cs}] : Removes control, unassigned, and surrogate (UTF-8-invalid) characters.
[\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Removes control, unassigned, surrogate, and formatting characters.
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these properties, in some form, in Microsoft .NET, JavaScript, Java, PHP, Ruby, Perl, Golang, and even Adobe (Python needs the third-party regex module rather than the built-in re module). Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
This character class will match anything visible, given in both its short-hand and long-hand form:
[\p{L}\p{M}\p{N}\p{P}\p{S}\p{Z}]
[\p{Letter}\p{Mark}\p{Number}\p{Punctuation}\p{Symbol}\p{Separator}]
\p{...} matches a character that has the given property, and \P{...} (capital P) matches a character that does not. PHP supports both (use the u modifier for UTF-8 input), although its documentation lists only the short property names. A leading ^ inside a character class negates the whole class.
A simpler regex then would be \p{C}, but this might be too aggressive, because it also deletes invisible formatting characters. You may want to look closely and see what's best, but one of the alternatives should fit your needs.
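To see why the choice between the narrow and the broad class matters, here is a small sketch. The sample string is made up: it contains a zero-width space (a format character, category Cf) and a BEL (a control character, category Cc).

```php
<?php
$text = "a\u{200B}b\u{0007}c";

// Control characters only: BEL is removed, the zero-width space stays.
$no_controls = preg_replace('/\p{Cc}/u', '', $text);

// The whole "Other" category: both invisible characters are removed.
$no_other = preg_replace('/\p{C}/u', '', $text);

var_dump($no_controls === "a\u{200B}bc"); // bool(true)
var_dump($no_other === "abc");            // bool(true)
```

Format characters are not always junk: the zero-width joiner used in some emoji sequences is in Cf too, so stripping all of \p{C} can change how such text renders.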
All Matchable Unicode Character Sets
If you want to know any other character sets available, check out regular-expressions.info...
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
To keep the control characters but make them compatible with JSON, I had to do this:
$str = preg_replace(
array(
'/\x00/', '/\x01/', '/\x02/', '/\x03/', '/\x04/',
'/\x05/', '/\x06/', '/\x07/', '/\x08/', '/\x09/', '/\x0A/',
'/\x0B/','/\x0C/','/\x0D/', '/\x0E/', '/\x0F/', '/\x10/', '/\x11/',
'/\x12/','/\x13/','/\x14/','/\x15/', '/\x16/', '/\x17/', '/\x18/',
'/\x19/','/\x1A/','/\x1B/','/\x1C/','/\x1D/', '/\x1E/', '/\x1F/'
),
array(
"\u0000", "\u0001", "\u0002", "\u0003", "\u0004",
"\u0005", "\u0006", "\u0007", "\u0008", "\u0009", "\u000A",
"\u000B", "\u000C", "\u000D", "\u000E", "\u000F", "\u0010", "\u0011",
"\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018",
"\u0019", "\u001A", "\u001B", "\u001C", "\u001D", "\u001E", "\u001F"
),
$str
);
(The JSON rules state: “All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).”)
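The same mapping can probably be written more compactly with preg_replace_callback(). This is only a sketch under that assumption; the function name escape_controls_for_json is made up. Like the version above, it does not escape quotation marks or backslashes; json_encode() takes care of all of that if you can use it.

```php
<?php
// Hypothetical helper: rewrite each control character U+0000..U+001F
// as its JSON \uXXXX escape sequence.
function escape_controls_for_json(string $str): string
{
    return preg_replace_callback(
        '/[\x00-\x1F]/',
        static function (array $m): string {
            // ord() gives the byte value; %04X pads it to four uppercase hex digits.
            return sprintf('\\u%04X', ord($m[0]));
        },
        $str
    );
}

echo escape_controls_for_json("a\x01b\n"), "\n"; // a\u0001b\u000A
```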
regex free method
If you are only zapping the control characters I'm familiar with (those under 32 and 127), try this out:
for ($control = 0; $control < 32; $control++) {
    $pString = str_replace(chr($control), "", $pString);
}
$pString = str_replace(chr(127), "", $pString);
The loop gets rid of all but DEL, which we just add to the end.
I'm thinking this will be a lot less stressful on you and the script than dealing with regex and the regex library.
Updated regex free method
Just for kicks, I came up with another way to do it. This one does it using an array of control characters:
$ctrls = range(chr(0), chr(31));
$ctrls[] = chr(127);
$clean_string = str_replace($ctrls, "", $string);
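A slightly more explicit way to build the same array, if you'd rather not rely on how range() treats single-character strings, is to map chr() over an integer range. This is just a sketch; the function name strip_controls_no_regex is made up.

```php
<?php
// Hypothetical helper: regex-free removal of ASCII control characters.
function strip_controls_no_regex(string $s): string
{
    // Code points 0-31 plus DEL (127), each converted to a one-byte string.
    $controls = array_map('chr', array_merge(range(0, 31), [127]));
    return str_replace($controls, '', $s);
}

var_dump(strip_controls_no_regex("\x02abc\x7F\n") === "abc"); // bool(true)
```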