Remove control characters from string in PHP [duplicate] - php

This question already has answers here:
Remove control characters from PHP string
(6 answers)
Closed 3 years ago.
I have a lot of strings in our MySQL database that have control characters such as ^M. I want a regex that removes it in PHP, but leaves alone things such as new lines, eg: "\n".
I've tried the following:
preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $bad);
This seems to leave it in place.
What's the best way to get rid of these control characters?

I want a regex that removes it in PHP, but leaves alone things such as
new lines, eg: "\n"
Use the following approach:
preg_replace("/(\x0A)|[[:cntrl:]]/", "$1", $bad);
\x0A - points to a newline character
[[:cntrl:]] - represents all control characters
(\x0A)|[[:cntrl:]] - alternation which matches either a newline character (captured in group 1) or any other single control character.
$1 holds the first capturing group that is newline character only if it was matched
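To see the effect of the capture group, the pattern can be run against a string that mixes ^M (carriage return), a newline, a tab and a bell character. This is only a sketch: the function name stripControlKeepNewline and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: removes control characters but keeps "\n".
function stripControlKeepNewline(string $s): string
{
    // Group 1 captures a newline and is written back by '$1'.
    // Any other control character matches the second alternative,
    // where group 1 is empty, so it is replaced by nothing.
    return preg_replace('/(\x0A)|[[:cntrl:]]/', '$1', $s);
}

$bad = "line one\r\nline\ttwo\x07";
echo stripControlKeepNewline($bad); // CR, TAB and BEL are gone, "\n" stays
```

Note that TAB is also a control character, so it is removed here as well.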

You can use this replacement:
$result = preg_replace('~[^\P{Cc}\r\n]+~u', '', $str);
\p{Cc} is the Unicode character class for control characters. \P{Cc} is the opposite (everything that is not a control character).
[^\P{Cc}\r\n] is everything that isn't \P{Cc}, \r or \n, in other words every control character except \r and \n.
The u modifier ensures that the string and the pattern are read as UTF-8 strings.
If you want to preserve another control character, for example the TAB, add it to the negated character class: [^\P{Cc}\r\n\t]
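A short check of both variants, assuming PHP 7 or later (for the "\u{...}" string escape) and a made-up sample string:

```php
<?php
// NUL, TAB, CRLF and U+0085 (NEXT LINE, also category Cc)
$str = "a\x00b\tc\r\nd\u{0085}e";

// Keep only \r and \n among the control characters.
$keepNewlines = preg_replace('~[^\P{Cc}\r\n]+~u', '', $str);

// Additionally keep TAB.
$keepTabToo = preg_replace('~[^\P{Cc}\r\n\t]+~u', '', $str);

var_dump($keepNewlines === "abc\r\nde"); // bool(true)
var_dump($keepTabToo === "ab\tc\r\nde"); // bool(true)
```

Because of the u modifier, preg_replace returns null if $str is not valid UTF-8, so real code may want to check for that.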

Related

Regex validation pattern only allows one character [duplicate]

This question already has answers here:
How to regex match entire string instead of a single character
(2 answers)
Closed 3 years ago.
I'm trying to make functions for validating usernames, emails, and passwords and my regex isn't working. My regex for usernames is ^[a-zA-Z0-9 _-]$ and when I put anything through that should work it always returns false.
As I understand it, the ^ and $ at the beginning and the end means that it makes sure the entire string matches this regular expression, the a-z and A-Z allows all letters, 0-9 allows all numbers, and the last three characters (the space, underscore, and dash) allow the respective characters.
Why is my regular expression not evaluating properly?
You need a quantifier, + or *. As written, the pattern only allows exactly 1 character from the character class.
Your a-zA-Z0-9_ also can be replaced with \w. Try:
^[\w -]+$
+ requires 1 or more matches. * requires 0 or more matches so if an empty string is valid use *.
Additionally you could use \h in place of the space character if tabs are allowed. That is the metacharacter for a horizontal space. I find it easier to read than the literal space.
Per comment, Update:
Since it looks like you want the string to be between a certain number of characters we can get more specific with the regex. A range can be created with {x,y} which will replace the quantifier.
^[\w -]{3,30}$
Additionally in PHP you must provide delimiters at the start and end of the regex.
preg_match("/^[\w -]{3,30}$/", $username);
Additionally, you should enable error reporting so you get these useful errors in the future. See https://stackoverflow.com/a/21429652/3783243
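Put together, the check could be wrapped like this. A minimal sketch: the function name isValidUsername is invented here, and the 3 to 30 length limit is simply the one from the pattern above.

```php
<?php
// Hypothetical validator built around the pattern from the answer.
function isValidUsername(string $username): bool
{
    // \w = letters, digits, underscore; the class also allows space and hyphen.
    // {3,30} requires between 3 and 30 such characters.
    return preg_match('/^[\w -]{3,30}$/', $username) === 1;
}

var_dump(isValidUsername('john_doe-1')); // bool(true)
var_dump(isValidUsername('ab'));         // bool(false): too short
var_dump(isValidUsername('bad!name'));   // bool(false): "!" is not allowed
```

Comparing with === 1 matters because preg_match returns 0 for "no match" and false on error.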
You're not specifying the character count. Let's try this instead:
^[A-Za-z0-9]*$
Where [A-Za-z0-9] states that you can use any alphanumeric character, upper or lower case. (Avoid the shorthand [A-z]: that range also covers the characters [, \, ], ^, _ and `, which sit between Z and a in ASCII.)
The * specifies how many characters, and in this case is unlimited. If you wanted to max out your username length at 10 characters, then you could change it to:
^[A-Za-z0-9]{1,10}$
Whereby the {1,10} specifies a minimum of 1 and a maximum of 10 characters ({10} alone would require exactly 10).
UPDATE
To also allow the use of underscores, hyphens and blank spaces (anywhere in the string), use the below:
^[A-Za-z0-9 _-]{1,10}$

Allow Underscores at the start and end of Usernames [duplicate]

This question already has answers here:
Regular expression for alphanumeric and underscores
(21 answers)
Closed 3 years ago.
Looking to see how I can edit the Username creation process to allow underscores and hyphens at the beginning and end of usernames.
Currently, if you end your username with a _, it drops it from the creation process.
$regex = '/^[A-Za-z0-9]+[A-Za-z0-9_.]*[A-Za-z0-9]+$/';
if(!preg_match($regex, $_POST['username'])) {
$_SESSION['error'][] = $language->register->error_message->username_characters;
}
You just need to add underscore _ and hyphen - to your first and last character set to allow your username to start or end with those two new characters and write your regex like this,
^[A-Za-z0-9_-]+[A-Za-z0-9_.]*[A-Za-z0-9_-]+$
and as \w is same as writing [a-zA-Z0-9_] hence you can compact your regex to this,
^[\w-]+[\w.]*[\w-]+$
One more point: whenever you write a hyphen - in a character set, place it as either the first or last character in the set; otherwise it may be read as a range specifier rather than a literal hyphen. In the above regex the hyphen is already the last character of each set, so we don't need to worry about its placement here.
Also, I am not sure if you want to allow usernames (unlike a variable name which generally is allowed to be just one character) of just one character, but if you do, then you can modify your regex to this,
^[\w-]+([\w.]*[\w-]+)?$
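The two patterns can be compared side by side on a few sample names. This is just a sketch; the names in the list are made up.

```php
<?php
$twoOrMore = '/^[\w-]+[\w.]*[\w-]+$/';     // needs at least two characters
$oneOrMore = '/^[\w-]+([\w.]*[\w-]+)?$/';  // also accepts a single character

// Print 1 (match) or 0 (no match) for each pattern.
foreach (['_alice_', '-bob-', 'a.b', 'x', 'carol.'] as $name) {
    printf("%-8s %d %d\n", $name, preg_match($twoOrMore, $name), preg_match($oneOrMore, $name));
}
```

Only 'x' differs between the two patterns; 'carol.' is rejected by both because a dot is not allowed as the last character.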

PHP preg_match not working for new line [duplicate]

This question already has answers here:
PHP Regex: How to match \r and \n without using [\r\n]?
(7 answers)
Closed 1 year ago.
I have this nice preg_match regex:
if(preg_match("%^[A-Za-z0-9ążśźęćńółĄŻŚŹĘĆŃÓŁ\.\,\-\?\!\(\)\"\ \/\t\/\n]{2,50}$%", stripslashes(trim($_POST['x'])))){...}
Which should allow all characters that could be used in the eventual text content of a post. The problem is that, despite the \n, the function still doesn't accept new lines in my post, so input such as
foo
bar
would not work.
Does anybody know why the function would not work properly?
Any help would be gratefully appreciated.
By default a preg_match() with a pattern using ^ and $ will consider the whole string, even if it contains newlines.
This behaviour can be altered using Pattern Modifiers, of which I will list the ones that fit this topic:
s (PCRE_DOTALL): by default, the dot (.) will not match newlines, but by using the modifier s it will. However, character classes (e.g. [a-z] and [^a-z]) never treat the newline as a special character anyway, thus this modifier will not affect their behaviour like it will for the dot (.).
m (PCRE_MULTILINE): by default, the start (^) and end ($) anchors match only at the start and end of the whole subject string, even if that string contains newlines. When this modifier is used, they also match immediately after and before each newline, so every line is treated as a string of its own: "foo\nbar\nbar" gives three matches (1: foo, 2: bar, 3: bar) against the pattern /^[a-z]+$/m. Without the m modifier, /^[a-z]+$/ gives no match at all for that string, because the newlines are not in the character class.
D (PCRE_DOLLAR_ENDONLY): by default, the end ($) anchor matches not only at the very end of the string but also right before a trailing newline (trailing meaning: at the very end of the string). To make it match strictly at the very end only, use this pattern modifier. It is ignored when m is set.
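The effect of m and D can be checked with a few calls. A small sketch using the "foo\nbar\nbar" example from above:

```php
<?php
$text = "foo\nbar\nbar";

// Without m: ^ and $ anchor to the whole string; "\n" is not in [a-z], so no match.
var_dump(preg_match_all('/^[a-z]+$/', $text));      // int(0)

// With m: ^ and $ also match at line boundaries, giving one match per line.
var_dump(preg_match_all('/^[a-z]+$/m', $text, $m)); // int(3)

// $ tolerates one trailing newline; D switches that off.
var_dump(preg_match('/^[a-z]+$/', "foo\n"));        // int(1)
var_dump(preg_match('/^[a-z]+$/D', "foo\n"));       // int(0)
```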
YOUR PROBLEM:
if(preg_match("%^[A-Za-z0-9ążśźęćńółĄŻŚŹĘĆŃÓŁ\.\,\-\?\!\(\)\"\ \/\t\/\n]{2,50}$%m", stripslashes(trim($_POST['x'])))){...}
I don't see much wrong with your pattern, except that it is not required that you escape characters other than \, -, ^ (only at the start of the character class) and ] (only when not at the start of the character class), but the PHP doc says it's not a violation to still do so.
It might be, though, that your text snippet contains newlines in the form of \r\n and since \r is not included in the character class of your pattern, it will not be matched.
Since my original post mentioned the use of the Pattern Modifier m, and you replied that it worked, I wonder what the real issue might have been.
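The \r\n hypothesis is easy to test. In the sketch below a simplified character class (letters, space, tab, newline) stands in for the full pattern, and the sample input is an assumption about what the form sends.

```php
<?php
$posted = "foo\r\nbar"; // a textarea line break typically arrives as CRLF

// Simplified stand-ins for the original class, without and with "\r".
$withoutCr = '%^[A-Za-z \t\n]{2,50}$%';
$withCr    = '%^[A-Za-z \t\r\n]{2,50}$%';

var_dump(preg_match($withoutCr, $posted)); // int(0): "\r" is not allowed
var_dump(preg_match($withCr, $posted));    // int(1)

// Alternative: normalise line endings first, then keep the class as it was.
$normalised = str_replace(["\r\n", "\r"], "\n", $posted);
var_dump(preg_match($withoutCr, $normalised)); // int(1)
```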

How can I use PHP's preg_replace function to convert Unicode code points to actual characters/HTML entities?

I want to convert a set of Unicode code points in string format to actual characters and/or HTML entities (either result is fine).
For example, if I have the following string assignment:
$str = '\u304a\u306f\u3088\u3046';
I want to use the preg_replace function to convert those Unicode code points to actual characters and/or HTML entities.
As per other Stack Overflow posts I saw for similar issues, I first attempted the following:
$str = '\u304a\u306f\u3088\u3046';
$str2 = preg_replace('/\u[0-9a-f]+/', '&#x$1;', $str);
However, whenever I attempt to do this, I get the following PHP error:
Warning: preg_replace() [function.preg-replace]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u
I tried all sorts of things like adding the u flag to the regex or changing /\u[0-9a-f]+/ to /\x{[0-9a-f]+}/, but nothing seems to work.
Also, I've looked at all sorts of other relevant pages/posts I could find on the web related to converting Unicode code points to actual characters in PHP, but either I'm missing something crucial, or something is wrong because I can't fix the issue I'm having.
Can someone please offer me a concrete solution on how to convert a string of Unicode code points to actual characters and/or a string of HTML entities?
From the PHP manual:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
First of all, in your regular expression, you're only using one backslash (\). As explained in the PHP manual, you need to use \\\\ to match a literal backslash (with some exceptions).
Second, you are missing the capturing group in your original expression. preg_replace() searches the given string for matches of the supplied pattern and replaces each whole match with the replacement string; $1 in the replacement refers to the text captured by the first group, so without a group there is nothing for it to refer to.
The updated regular expression with proper escaping and correct capturing groups would look like:
$str2 = preg_replace('/\\\\u([0-9a-f]+)/i', '&#x$1;', $str);
Output:
&#x304a;&#x306f;&#x3088;&#x3046;
which a browser renders as おはよう
Expression: \\\\u([0-9a-f]+)
\\\\ - matches a literal backslash
u - matches the literal u character
( - beginning of the capturing group
[0-9a-f]+ - character class with quantifier -- matches a digit (0 - 9) or a letter (from a - f) one or more times
) - end of capturing group
i modifier - used for case-insensitive matching
Replacement: &#x$1;
& - literal ampersand character (&)
# - literal pound character (#)
x - literal character x
$1 - contents of the first capturing group -- in this case, the strings of the form 304a etc.
; - literal semicolon that closes the entity
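If actual characters are wanted rather than HTML entities, the same pattern can be combined with preg_replace_callback and html_entity_decode. This is a sketch, not a complete decoder: the function name decodeUnicodeEscapes is made up, it assumes exactly four hex digits per escape, and it does not combine surrogate pairs (characters outside the Basic Multilingual Plane).

```php
<?php
// Hypothetical helper: turns literal \uXXXX sequences into UTF-8 characters.
function decodeUnicodeEscapes(string $s): string
{
    return preg_replace_callback(
        '/\\\\u([0-9a-f]{4})/i',
        function (array $m): string {
            // Build the numeric entity for the captured hex digits, then let
            // html_entity_decode() turn it into the actual character.
            return html_entity_decode("&#x{$m[1]};", ENT_QUOTES | ENT_HTML5, 'UTF-8');
        },
        $s
    );
}

echo decodeUnicodeEscapes('\u304a\u306f\u3088\u3046'); // おはよう
```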
This page here—titled Escaping Unicode Characters to HTML Entities in PHP—seems to tackle it with this nice function:
function unicode_escape_sequences($str){
$working = json_encode($str);
$working = preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $working);
return json_decode($working);
}
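It runs the string through json_encode, which escapes non-ASCII characters as \uXXXX sequences, rewrites those sequences as HTML entities, and decodes the result with json_decode. Very nice technique. For your example, where the string already contains \uXXXX sequences, this is enough: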
$str = '\u304a\u306f\u3088\u3046';
echo preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $str);
The output is:
&#x304a;&#x306f;&#x3088;&#x3046;
Which a browser renders as:
おはよう
Which translates to:
Good morning

Remove control characters from PHP string

How can I remove control characters like STX from a PHP string? I played around with
preg_replace("/[^a-zA-Z0-9 .\-_;!:?äÄöÖüÜß<>='\"]/","",$pString)
but found that it removed way too much. Is there a way to remove only control chars?
If you mean by control characters the first 32 ascii characters and \x7F (that includes the carriage return, etc!), then this will work:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
(Note the single quotes: inside double quotes PHP itself converts \x00 into a literal NUL byte before the pattern reaches the regex engine, which older PHP versions reject with an error.)
The line feed and carriage return (often written \n and \r) may be saved from removal like so:
preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
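A quick comparison of the two patterns on a made-up sample string shows the difference:

```php
<?php
$input = "A\x02B\r\nC\tD\x7F"; // STX, CRLF, TAB and DEL mixed into text

// Strip every ASCII control character, including CR, LF and TAB.
$all = preg_replace('/[\x00-\x1F\x7F]/', '', $input);

// Same, but skip \x0A (LF) and \x0D (CR).
$keepCrLf = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $input);

var_dump($all === "ABCD");          // bool(true)
var_dump($keepCrLf === "AB\r\nCD"); // bool(true)
```

TAB (\x09) is still removed by the second pattern; to keep it as well, the first range would have to end at \x08.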
I must say that I think Bobby's answer is better, in the sense that [:cntrl:] better conveys what the code does than [\x00-\x1F\x7F].
WARNING: ereg_replace is deprecated as of PHP 5.3.0 and removed as of PHP 7.0.0. Use preg_replace instead:
preg_replace('/[[:cntrl:]]/', '', $input);
For Unicode input, this will remove all control characters, unassigned, private use, formatting and surrogate code points (that are not also space characters, such as tab, new line) from your input text. I use this to remove all non-printable characters from my input.
<?php
$clean = preg_replace('/[^\PC\s]/u', '', $input);
For more info on \p{C}, see http://www.regular-expressions.info/unicode.html#category
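As an illustration of what [^\PC\s] removes and keeps, here is a sketch with a made-up input that contains a zero-width space (a format character) and a bell (a control character) next to ordinary whitespace:

```php
<?php
// U+200B ZERO WIDTH SPACE is a format character (Cf), \x07 is a control (Cc).
$input = "one\u{200B}two\x07\tthree\n";

// Remove everything in category C (Other) unless it is whitespace.
$clean = preg_replace('/[^\PC\s]/u', '', $input);

var_dump($clean === "onetwo\tthree\n"); // bool(true)
```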
PHP does support POSIX-Classes so you can use [:cntrl:] instead of some fancy character-magic-stuff:
ereg_replace("[:cntrl:]", "", $pString);
Edit:
An extra pair of square brackets might be needed in 5.3.
ereg_replace("[[:cntrl:]]", "", $pString);
TLDR Answer
Use this regex:
/[\p{Cc}\p{Cn}\p{Cs}]/u
Like this:
$text = preg_replace('/[\p{Cc}\p{Cn}\p{Cs}]/u', '', $text);
TLDR Explanation
\p{Cc} : Match control characters.
\p{Cn} : Match unassigned code points.
\p{Cs} : Match surrogate code points. These cannot occur in valid UTF-8, so with the u modifier this part never matches; it is harmless but redundant. If the input is not valid UTF-8, preg_replace with the u modifier returns null instead of a cleaned string.
Working Demo
Simple demo to demonstrate:
$text = "\u{0019}hello";
print($text . "\n\n");
$text = preg_replace('/[\p{Cc}\p{Cn}\p{Cs}]/u', '', $text);
print($text);
Output:
(-Broken-Character)hello
hello
Alternatives
\p{C} : Match all invisible and unused characters (control, format, private use, surrogate, unassigned).
\p{Cc} : Match only control characters.
[\p{Cc}\p{Cn}] : Match control characters and unassigned code points.
[\p{Cc}\p{Cn}\p{Cs}] : Match control characters, unassigned code points and surrogates.
[\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Match control characters, unassigned code points, surrogates and formatting characters.
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
This character class matches anything visible (plus separators), given in its short-hand form:
[\p{L}\p{M}\p{N}\p{P}\p{S}\p{Z}]
The long-hand names are Letter, Mark, Number, Punctuation, Symbol and Separator; not every regex engine accepts the long names, so the short forms are the safer choice.
\p indicates something we want to match, and \P (capitalized) indicates something that must not match; PHP supports both. A property name of more than one letter has to be written in braces: without them, \PCc is read as \PC followed by a literal c. A ^ negates only when it is the first character inside a character class.
A simpler regex would be \p{C}, but this might be too aggressive, since it also deletes invisible formatting characters. You may want to look closely and see what's best, but one of the alternatives should fit your needs.
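How PCRE reads property names with and without braces can be checked directly. A small sketch with made-up test strings:

```php
<?php
// Without braces only the first letter is taken as the property name:
// \PCc means "\PC followed by a literal c".
var_dump(preg_match('/^\PCc$/u', 'ac'));     // int(1): "a" is not in C, then "c"
var_dump(preg_match('/^\PCc$/u', 'a'));      // int(0): the literal "c" is missing

// With braces, \P{Cc} is a single "not a control character" test.
var_dump(preg_match('/^\P{Cc}$/u', 'a'));    // int(1)
var_dump(preg_match('/^\p{Cc}$/u', "\x07")); // int(1): BEL is a control character
```

So a pattern that is meant to test the two-letter categories Cc, Cn or Cs needs the braced form.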
All Matchable Unicode Character Sets
If you want to know any other character sets available, check out regular-expressions.info...
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
To keep the control characters but make them compatible with JSON, I had to do
$str = preg_replace(
array(
'/\x00/', '/\x01/', '/\x02/', '/\x03/', '/\x04/',
'/\x05/', '/\x06/', '/\x07/', '/\x08/', '/\x09/', '/\x0A/',
'/\x0B/','/\x0C/','/\x0D/', '/\x0E/', '/\x0F/', '/\x10/', '/\x11/',
'/\x12/','/\x13/','/\x14/','/\x15/', '/\x16/', '/\x17/', '/\x18/',
'/\x19/','/\x1A/','/\x1B/','/\x1C/','/\x1D/', '/\x1E/', '/\x1F/'
),
array(
"\u0000", "\u0001", "\u0002", "\u0003", "\u0004",
"\u0005", "\u0006", "\u0007", "\u0008", "\u0009", "\u000A",
"\u000B", "\u000C", "\u000D", "\u000E", "\u000F", "\u0010", "\u0011",
"\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018",
"\u0019", "\u001A", "\u001B", "\u001C", "\u001D", "\u001E", "\u001F"
),
$str
);
(The JSON rules state: “All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).”)
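The lookup table above can be written more compactly with a callback that formats the escape from each character's byte value. A sketch with an invented function name, escapeControlForJson; when a whole PHP value is being serialised, json_encode already takes care of this escaping on its own.

```php
<?php
// Hypothetical helper: rewrites each control character U+0000-U+001F as a
// literal six-character \uXXXX sequence, like the two arrays above.
function escapeControlForJson(string $s): string
{
    return preg_replace_callback(
        '/[\x00-\x1F]/',
        function (array $m): string {
            // Single quotes keep "\u" literal; %04X pads the hex value to 4 digits.
            return sprintf('\u%04X', ord($m[0]));
        },
        $s
    );
}

echo escapeControlForJson("a\tb\x1Fc"); // a\u0009b\u001Fc
```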
regex free method
If you are only zapping the control characters I'm familiar with (those under 32 and 127), try this out:
for($control = 0; $control < 32; $control++) {
$pString = str_replace(chr($control), "", $pString);
}
$pString = str_replace(chr(127), "", $pString);
The loop gets rid of all but DEL, which we remove separately at the end.
I'm thinking this will be a lot less stressful on you and the script than dealing with regex and the regex library.
Updated regex free method
Just for kicks, I came up with another way to do it. This one does it using an array of control characters:
$ctrls = range(chr(0), chr(31));
$ctrls[] = chr(127);
$clean_string = str_replace($ctrls, "", $string);
