I'd like to limit the user's input to a string that contains between 2 and 1024 letters, numbers, spaces, periods, underscores, dashes, carriage returns (new lines) and tabs. The carriage returns and tabs do not work in my regex. I do realize that there are other ways to check the length.
if (!preg_match('/^[a-zA-Z0-9 ._-\r\t]{2,1024}$/', $userstring))
{
echo '<p>Bad string</p>';
}
Thanks ahead of time.
The page has a form with a control on it.
If I type: 1CR2 (that is 1 followed by a carriage return and then a 2), and submit the page, the error message will be displayed and the box will have 1rn2 in it.
As you're trying to match strings that may have line breaks in them, you need to enable multi-line in your regex using a Pattern Modifier. m will enable multi-line, e.g.:
if (!preg_match('/^[a-zA-Z0-9 ._-\r\t]{2,1024}/m', $userstring))
{
echo '<p>Bad string</p>';
}
The $ was also removed in case there is trailing white space. Suggestions that others have made to use \s instead of \r\t seem reasonable to me.
Reserved characters (including . and -) need to be escaped with a backslash (\). Try the following regex: ^[\w\s\.\-]{2,1024}$. \w matches word characters ([a-zA-Z0-9_]) and \s matches whitespace characters ([ \t\r\n\f]). That leaves you with the . and - that need to be escaped. Final PHP code:
if (!preg_match('/^[\w\s.\-]{2,1024}$/', $userstring))
{
echo '<p>Bad string</p>';
}
More info on shorthand classes and reserved characters.
Edit: . does not need to be escaped in a character class ([]), thanks Barmar.
Try:
if (!preg_match("/^[a-zA-Z0-9 ._\-\n\t]{2,1024}$/", $userstring))
{
echo '<p>Bad string</p>';
}
I believe you'll want your regular expression wrapped in double-quotes so that your newline (you should be using \n instead of \r) and tab characters are properly interpolated. Also, you should escape the - because it otherwise is used to define a range when used within brackets.
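As a rough sketch of the points raised in this thread: in the original pattern the sequence `_-\r` is read as a range whose end comes before its start, which PCRE rejects, so `preg_match` fails for every input. Placing the hyphen last in the class makes it literal. The function name `is_valid_input` is hypothetical, and including both `\r` and `\n` is an assumption based on the form submitting CR+LF line breaks.

```php
<?php
// is_valid_input() is a hypothetical helper name used only for this sketch.
function is_valid_input(string $s): bool
{
    // "-" sits last in the class, so it is a literal hyphen, not a range operator.
    // \A and \z anchor to the very start and very end of the whole string.
    // Single quotes are fine here: PCRE itself understands \r, \n and \t.
    return preg_match('/\A[a-zA-Z0-9 ._\r\n\t-]{2,1024}\z/', $s) === 1;
}

var_dump(is_valid_input("1\r\n2")); // bool(true)
var_dump(is_valid_input("a"));      // bool(false): shorter than 2 characters
var_dump(is_valid_input("bad!"));   // bool(false): "!" is not in the allowed set
```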
I am trying to set a validation rule for a field in my form that checks that the input only contains letters.
At first I tried to make a function that returned true if there were no numbers in the string, for that I used preg_match:
function my_format($str)
{
return preg_match('/^([^0-9])$', $str);
}
It doesn't matter how many times I look at the php manual, it seems like I won't get to understand how to create the pattern I want. What's wrong with what I made?
But I'd like to extend the question: I want the input text to contain any letter but no numbers or symbols, like question marks, exclamation marks, and all those you can imagine. BUT the letters I want are not only a-z; I want letters with all kinds of accents, such as those used in Spanish, Portuguese, Swedish, Polish, Serbian, Icelandic...
I guess this is no easy task and hard or impossible to do with preg_match. Is there any library that covers my exact needs?
If you're using UTF-8 encoded input, use a Unicode regex by adding the u modifier.
This one would match a string that only consists of letters and any kind of whitespace/invisible separators:
preg_match('~^[\p{L}\p{Z}]+$~u', $str);
function my_format($str)
{
return preg_match('/^\p{L}+$/u', $str);
}
Simpler than you think!
\p{L} matches any kind of letter from any language
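A small sketch of the `\p{L}` approach, assuming the input is valid UTF-8 and uses precomposed characters (a letter followed by a separate combining accent would need `\p{M}` as well). The function name `letters_only` is made up for this example.

```php
<?php
// letters_only() is a hypothetical name; it returns true only when every
// character in the string is a Unicode letter.
function letters_only(string $str): bool
{
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_match('/^\p{L}+$/u', $str) === 1;
}

var_dump(letters_only("Jos\u{00E9}"));  // bool(true)
var_dump(letters_only("\u{00C5}sa"));   // bool(true)
var_dump(letters_only("abc123"));       // bool(false): digits are not letters
var_dump(letters_only("hola?"));        // bool(false): punctuation is not a letter
```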
First of all, Merry Christmas.
You are on the right track with the first one, just missing a + to match one or more non-number characters:
preg_match('/^([^0-9]+)$/', $str);
As you can see, 0-9 is a range, from 0 to 9. The same applies to other cases, like a-z or A-Z: the '-' is special and indicates a range. For 0-9, you can use the shorthand \d:
preg_match('/^([^\d]+)$/', $str);
For symbols, if your list is punctuations . , " ' ? ! ; : # $ % & ( ) * + - / < > = # [ ] \ ^ _ { } | ~, there is a shorthand.
preg_match('/^([^[:punct:]]+)$/', $str);
Combined you get:
preg_match('/^([^[:punct:]\d]+)$/', $str);
Use the [:alpha:] POSIX expression.
function my_format($str) {
return preg_match('/^[[:alpha:]]+$/u', $str);
}
The extra [] wraps the POSIX class in a character class, and the + makes it match one or more alphabetic characters. The ^ and $ anchors make sure the whole string is checked rather than just some part of it. With the u modifier, [:alpha:] matches accented characters as well.
If you want to include whitespace, just add \s to the character class:
preg_match('/^[[:alpha:]\s]+$/u', $str);
EDIT: Sorry, I misread your question when I looked over it a second time and thought you wanted punctuation. I've taken it back out.
I use PHP.
My string can look like this
This is a string-test width åäö and some über+strange characters: _like this?
Question
Is there a way to remove non-alphanumeric characters and replace them with a space? Here are some non-alphanumeric characters:
-
+
:
_
?
I've read many threads about it but they don't support other languages, like this one:
preg_replace("/[^A-Za-z0-9 ]/", '', $string);
Requirements
My list of non-letter characters might not be complete.
My content contains characters in different languages, like åäöü. There could be many more.
The non-alphanumeric characters should be replaced with a space. Otherwise the words would be glued to each other.
You can try this:
preg_replace('~[^\p{L}\p{N}]++~u', ' ', $string);
\p{L} stands for all alphabetic characters (whatever the alphabet).
\p{N} stands for numbers.
With the u modifier characters of the subject string are treated as unicode characters.
Or this:
preg_replace('~\P{Xan}++~u', ' ', $string);
\p{Xan} contains unicode letters and digits.
\P{Xan} contains all that is not unicode letters and digits. (Be careful, it contains white spaces too that you can preserve with ~[^\p{Xan}\s]++~u )
If you want a more specific set of allowed letters you must replace \p{L} with ranges in unicode table.
Example:
preg_replace('~[^a-zÀ-ÖØ-öÿŸ\d]++~ui', ' ', $string);
Why use a possessive quantifier (++) here?
~\P{Xan}+~u will give you the same result as ~\P{Xan}++~u. The difference is that with the first, the engine records each backtracking position (which we don't need), whereas with the second it doesn't (as in an atomic group). The result is a small performance gain.
I think it's good practice to use possessive quantifiers and atomic groups whenever possible.
However, the PCRE regex engine automatically makes a quantifier possessive in obvious situations (example: a+b => a++b), unless the PCRE module has been compiled with the option PCRE_NO_AUTO_POSSESS. (http://www.pcre.org/pcre.txt)
More information about possessive quantifiers and atomic groups here (possessive quantifiers) and here (atomic groups) or here
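To make the `\p{L}`/`\p{N}` replacement concrete, here is a short sketch applied to the sample string from the question. The helper name `replace_non_alnum` is hypothetical, the accented letters are written as `\u{...}` escapes so the example does not depend on the file's encoding, and the final `trim()` is an addition to drop the space left by the trailing "?".

```php
<?php
// replace_non_alnum() is a hypothetical helper: every run of characters that
// are neither Unicode letters nor Unicode digits becomes a single space.
function replace_non_alnum(string $string): string
{
    $out = preg_replace('~[^\p{L}\p{N}]++~u', ' ', $string);
    // preg_replace() returns null on invalid UTF-8; fall back to an empty string.
    return trim($out ?? '');
}

$sample = "This is a string-test width \u{E5}\u{E4}\u{F6} and some \u{FC}ber+strange characters: _like this?";
echo replace_non_alnum($sample), "\n";
// This is a string test width åäö and some über strange characters like this
```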
Are you perhaps looking for \W?
Something like:
/[\W_]+/
Matches all non-alphanumeric characters and underscores.
\w matches all word characters (letters, digits, underscores).
\W matches anything not in \w.
So \W matches any non-alphanumeric character, and you add the underscore since \W doesn't match underscores.
EDIT: This makes your line of code become:
preg_replace("/[\W_]+/", ' ', $string);
The ' ' means that each run of matching characters (those that are neither letters nor digits) is replaced with a space. The quantifier is + rather than *, because * also matches the empty string and would insert a space between every pair of characters.
reEDIT: You might additionally want to use another preg_replace to remove all the consecutive spaces and replace them with a single space, otherwise you'll end up with:
This is a string test width and some ber strange characters like this
You can use:
preg_replace("/\s+/", ' ', $string);
And lastly trim the beginning and end spaces if any.
I am not entirely sure which variety of regex you are using. However, POSIX regexes allow you to express an alphabetical class, where [:alpha:] represents any alphabetic character.
So try:
preg_replace("/[^[:alpha:]0-9 ]/", '', $string);
Actually, I forgot about [:alnum:] - that makes it simpler:
preg_replace("/[^[:alnum:] ]/", '', $string);
\p{xx} is what you are looking for, I believe, see here
So, try:
preg_replace("/\P{L}+/u", ' ', $string);
How can I remove control characters like STX from a PHP string? I played around with
preg_replace("/[^a-zA-Z0-9 .\-_;!:?äÄöÖüÜß<>='\"]/","",$pString)
but found that it removed way too much. Is there a way to remove only control chars?
If you mean by control characters the first 32 ascii characters and \x7F (that includes the carriage return, etc!), then this will work:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
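(Note the single quotes: inside double quotes PHP itself expands \x00 to a literal NUL byte before the pattern reaches the regex engine, and older PHP versions reject a pattern that contains a NUL byte.)
The line feed and carriage return (often written \n and \r) may be saved from removal like so: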
preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
I must say that I think Bobby's answer is better, in the sense that [:cntrl:] better conveys what the code does than [\x00-\x1F\x7F].
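A brief sketch of the second pattern in use, assuming byte-oriented input where only the ASCII control range matters. The function name `strip_control_chars` is made up for illustration. Note that the tab (\x09) is removed along with the other control characters; only \n and \r survive.

```php
<?php
// strip_control_chars() is a hypothetical helper name.
// It removes ASCII control characters but keeps line feed (\x0A) and
// carriage return (\x0D), which are skipped in the ranges below.
function strip_control_chars(string $input): string
{
    return preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
}

$raw = "\x02Hello\tWorld\x03\r\nBye\x7F";
var_dump(strip_control_chars($raw) === "HelloWorld\r\nBye"); // bool(true)
```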
WARNING: ereg_replace is deprecated in PHP >= 5.3.0 and removed in PHP >= 7.0.0!, please use preg_replace instead of ereg_replace:
preg_replace('/[[:cntrl:]]/', '', $input);
For Unicode input, this will remove all control characters, unassigned, private use, formatting and surrogate code points (that are not also space characters, such as tab, new line) from your input text. I use this to remove all non-printable characters from my input.
<?php
$clean = preg_replace('/[^\PC\s]/u', '', $input);
For more info on \p{C}, see http://www.regular-expressions.info/unicode.html#category
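The following sketch shows what the `[^\PC\s]` class does in practice, assuming valid UTF-8 input. The name `strip_non_printable` is hypothetical. The class reads as "not (a non-Other character or whitespace)", that is, characters in Unicode category C that are not whitespace.

```php
<?php
// strip_non_printable() is a hypothetical helper name.
// It returns null if the input is not valid UTF-8, as preg_replace() does.
function strip_non_printable(string $input): ?string
{
    return preg_replace('/[^\PC\s]/u', '', $input);
}

// U+200B (zero width space, a format character) and U+0007 (bell, a control
// character) are removed; the tab and the newline are whitespace and stay.
$raw = "a\u{200B}b\tc\u{0007}d\n";
var_dump(strip_non_printable($raw) === "ab\tcd\n"); // bool(true)
```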
PHP does support POSIX-Classes so you can use [:cntrl:] instead of some fancy character-magic-stuff:
ereg_replace("[:cntrl:]", "", $pString);
Edit:
An extra pair of square brackets might be needed in 5.3.
ereg_replace("[[:cntrl:]]", "", $pString);
TLDR Answer
Use this regex...
/[\p{Cc}\p{Cn}\p{Cs}]/u
Like this...
$text = preg_replace('/[\p{Cc}\p{Cn}\p{Cs}]/u', '', $text);
TLDR Explanation
\p{Cc} : Match control characters.
\p{Cn} : Match unassigned code points.
\p{Cs} : Match surrogate code points. These cannot occur in valid UTF-8, so with the u modifier this part rarely has any effect.
Working Demo
Simple demo to demonstrate: IDEOne Demo
$text = "\u{0019}hello";
print($text . "\n\n");
$text = preg_replace('/[\p{Cc}\p{Cn}\p{Cs}]/u', '', $text);
print($text);
Output:
(-Broken-Character)hello
hello
Alternatives
\p{C} : Match every invisible "Other" character: control, format, unassigned, private-use, and surrogate.
\p{Cc} : Match control characters only.
[\p{Cc}\p{Cn}] : Match control characters and unassigned code points.
[\p{Cc}\p{Cn}\p{Cs}] : Match control characters, unassigned code points, and surrogates.
[\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Match control characters, unassigned code points, surrogates, and formatting characters.
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. Many engines support them (.NET, JavaScript, Java, PHP, Ruby, Perl, Go), though syntax and coverage vary, so check your engine's documentation. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
This character class will match anything visible...
[\p{L}\p{M}\p{N}\p{P}\p{S}\p{Z}]
\p{...} matches a character that has the named property, and \P{...} (capital P) matches a character that does not. PHP supports both forms. Be careful when leaving out the braces: only a single letter is then read as the property name, so \PCc means "\PC followed by a literal c", not "anything that is not a control character". Always use braces for two-letter properties.
A simpler regex then would be \p{C}, but this might be too aggressive because it also deletes invisible formatting characters. You may want to look closely and see what's best, but one of the alternatives should fit your needs.
All Matchable Unicode Character Sets
If you want to know any other character sets available, check out regular-expressions.info. The short form is listed first, with the long name in parentheses; PCRE, which PHP uses, has traditionally accepted only the short form.
\p{L} (Letter): any kind of letter from any language.
\p{Ll} (Lowercase_Letter): a lowercase letter that has an uppercase variant.
\p{Lu} (Uppercase_Letter): an uppercase letter that has a lowercase variant.
\p{Lt} (Titlecase_Letter): a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} (Cased_Letter): a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} (Modifier_Letter): a special character that is used like a letter.
\p{Lo} (Other_Letter): a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} (Mark): a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} (Non_Spacing_Mark): a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} (Spacing_Combining_Mark): a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} (Enclosing_Mark): a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} (Separator): any kind of whitespace or invisible separator.
\p{Zs} (Space_Separator): a whitespace character that is invisible, but does take up space.
\p{Zl} (Line_Separator): line separator character U+2028.
\p{Zp} (Paragraph_Separator): paragraph separator character U+2029.
\p{S} (Symbol): math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} (Math_Symbol): any mathematical symbol.
\p{Sc} (Currency_Symbol): any currency sign.
\p{Sk} (Modifier_Symbol): a combining character (mark) as a full character on its own.
\p{So} (Other_Symbol): various symbols that are not math symbols, currency signs, or combining characters.
\p{N} (Number): any kind of numeric character in any script.
\p{Nd} (Decimal_Digit_Number): a digit zero through nine in any script except ideographic scripts.
\p{Nl} (Letter_Number): a number that looks like a letter, such as a Roman numeral.
\p{No} (Other_Number): a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} (Punctuation): any kind of punctuation character.
\p{Pd} (Dash_Punctuation): any kind of hyphen or dash.
\p{Ps} (Open_Punctuation): any kind of opening bracket.
\p{Pe} (Close_Punctuation): any kind of closing bracket.
\p{Pi} (Initial_Punctuation): any kind of opening quote.
\p{Pf} (Final_Punctuation): any kind of closing quote.
\p{Pc} (Connector_Punctuation): a punctuation character such as an underscore that connects words.
\p{Po} (Other_Punctuation): any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} (Other): invisible control characters and unused code points.
\p{Cc} (Control): an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} (Format): invisible formatting indicator.
\p{Co} (Private_Use): any code point reserved for private use.
\p{Cs} (Surrogate): one half of a surrogate pair in UTF-16 encoding.
\p{Cn} (Unassigned): any code point to which no character has been assigned.
To keep the control characters but make them compatible with JSON, I had to do:
$str = preg_replace(
array(
'/\x00/', '/\x01/', '/\x02/', '/\x03/', '/\x04/',
'/\x05/', '/\x06/', '/\x07/', '/\x08/', '/\x09/', '/\x0A/',
'/\x0B/','/\x0C/','/\x0D/', '/\x0E/', '/\x0F/', '/\x10/', '/\x11/',
'/\x12/','/\x13/','/\x14/','/\x15/', '/\x16/', '/\x17/', '/\x18/',
'/\x19/','/\x1A/','/\x1B/','/\x1C/','/\x1D/', '/\x1E/', '/\x1F/'
),
array(
"\u0000", "\u0001", "\u0002", "\u0003", "\u0004",
"\u0005", "\u0006", "\u0007", "\u0008", "\u0009", "\u000A",
"\u000B", "\u000C", "\u000D", "\u000E", "\u000F", "\u0010", "\u0011",
"\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018",
"\u0019", "\u001A", "\u001B", "\u001C", "\u001D", "\u001E", "\u001F"
),
$str
);
(The JSON rules state: “All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).”)
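The same mapping can be sketched more compactly with a callback instead of two 32-element arrays. This is only an alternative illustration, not the original poster's code; the function name `escape_control_chars` is hypothetical, and the arrow function assumes PHP 7.4 or later.

```php
<?php
// escape_control_chars() is a hypothetical helper name.
// Each byte in the range 0x00-0x1F is replaced by the six-character text
// \uXXXX (a literal backslash, "u", and four hex digits).
function escape_control_chars(string $str): string
{
    return preg_replace_callback(
        '/[\x00-\x1F]/',
        static fn(array $m): string => sprintf('\\u%04X', ord($m[0])),
        $str
    );
}

var_dump(escape_control_chars("a\x01b\n") === 'a\u0001b\u000A'); // bool(true)
```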
regex free method
If you are only zapping the control characters I'm familiar with (those under 32 and 127), try this out:
for ($control = 0; $control < 32; $control++) {
$pString = str_replace(chr($control), "", $pString);
}
$pString = str_replace(chr(127), "", $pString);
The loop gets rid of all but DEL, which we just add to the end.
I'm thinking this will be a lot less stressful on you and the script than dealing with regex and the regex library.
Updated regex free method
Just for kicks, I came up with another way to do it. This one does it using an array of control characters:
$ctrls = range(chr(0), chr(31));
$ctrls[] = chr(127);
$clean_string = str_replace($ctrls, "", $string);
How can I match a space character in a PHP regular expression?
I mean like "gavin schulz", the space in between the two words. I am using a regular expression to make sure that I only allow letters, number and a space. But I'm not sure how to find the space. This is what I have right now:
$newtag = preg_replace("/[^a-zA-Z0-9s|]/", "", $tag);
If you're looking for a space, that would be " " (one space).
If you're looking for one or more, it's "  *" (that's two spaces and an asterisk) or " +" (one space and a plus).
If you're looking for common spacing, use "[ X]" or "[ X][ X]*" or "[ X]+" where X is the physical tab character (and each is preceded by a single space in all those examples).
These will work in every regex engine I've ever seen (some of which don't even have the one-or-more "+" character, ugh).
If you know you'll be using one of the more modern regex engines, "\s" and its variations are the way to go. In addition, I believe word boundaries match start and end of lines as well, important when you're looking for words that may appear without preceding or following spaces.
For PHP specifically, this page may help.
From your edit, it appears you want to remove all invalid characters. The start of this is (note the space inside the regex):
$newtag = preg_replace ("/[^a-zA-Z0-9 ]/", "", $tag);
# ^ space here
If you also want trickery to ensure there's only one space between each word and none at the start or end, that's a little more complicated (and probably another question) but the basic idea would be:
$newtag = preg_replace ("/ +/", " ", $newtag); # convert all multispaces to space
$newtag = preg_replace ("/^ /", "", $newtag); # remove space from start
$newtag = preg_replace ("/ $/", "", $newtag); # and end
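Put together, the steps above might look like the following sketch. The function name `clean_tag` is hypothetical, and `trim()` is used in place of the two anchored replacements as a simplification.

```php
<?php
// clean_tag() is a hypothetical helper name.
function clean_tag(string $tag): string
{
    $tag = preg_replace('/[^a-zA-Z0-9 ]/', '', $tag); // drop everything except letters, digits, spaces
    $tag = preg_replace('/ +/', ' ', $tag);           // collapse runs of spaces into one
    return trim($tag, ' ');                           // strip leading and trailing spaces
}

var_dump(clean_tag("  gavin   schulz! ")); // string(12) "gavin schulz"
```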
Cheat Sheet
Here is a small cheat sheet of everything you need to know about whitespace in regular expressions:
[[:blank:]]
Space or tab only, not newline characters. It is the same as writing [ \t].
[[:space:]] & \s
[[:space:]] and \s are the same. They both match any whitespace character: spaces, newlines, tabs, etc.
\v
Matches vertical Unicode whitespace.
\h
Matches horizontal whitespace, including Unicode characters. It will also match spaces, tabs, non-breaking/mathematical/ideographic spaces.
x (eXtended flag)
Ignore all whitespace in the pattern. Keep in mind that this is a flag, so you add it after the closing delimiter, like /hello/mx.
For example, if you write an expression like /hello world/x, it will match helloworld, but not hello world. The extended flag also allows comments in your regex.
Example
/helloworld #hello this is a comment/x
If you need to match a space while the x flag is on, escape it: "\ " (a backslash followed by a space).
To match exactly the space character, you can use the octal value \040 (Unicode characters displayed as octal) or the hexadecimal value \x20 (Unicode characters displayed as hex).
Here is the regex syntax reference: https://www.regular-expressions.info/nonprint.html.
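A few quick checks of the cheat-sheet entries, written as a sketch for PHP's preg functions. The sample strings are made up for illustration; the non-breaking space is written as `\u{00A0}` and tested with the u modifier.

```php
<?php
// [[:blank:]] matches space and tab, but not a newline.
var_dump(preg_match('/^[[:blank:]]+$/', " \t"));  // int(1)
var_dump(preg_match('/[[:blank:]]/', "\n"));      // int(0)

// \s matches newlines as well.
var_dump(preg_match('/\s/', "\n"));               // int(1)

// \h matches horizontal whitespace, including the non-breaking space.
var_dump(preg_match('/^\h$/u', "\u{00A0}"));      // int(1)

// With the x flag, unescaped whitespace in the pattern is ignored.
var_dump(preg_match('/hello world/x', 'helloworld'));   // int(1)
var_dump(preg_match('/hello world/x', 'hello world'));  // int(0)
var_dump(preg_match('/hello\ world/x', 'hello world')); // int(1)

// \x20 is an explicit way to write the space character.
var_dump(preg_match('/^\x20$/', ' '));            // int(1)
```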
In Perl the switch is \s (whitespace).
I am using a regex to make sure that I
only allow letters, number and a space
Then it is as simple as adding a space to what you've already got:
$newtag = preg_replace("/[^a-zA-Z0-9 ]/", "", $tag);
(note, I removed the s| which seemed unintentional? Certainly the s was redundant; you can restore the | if you need it)
If you specifically want *a* space, as in only a single one, you will need a more complex expression than this, and might want to consider a separate non-regex piece of logic.
It seems to me like using a regex in this case would just be overkill. Why not just use strpos to find the space character? Also, there's nothing special about the space character in regular expressions; you should be able to search for it the same way you would search for any other character. That is, unless you disabled pattern whitespace, which would hardly be necessary in this case.
You can also use the \b for a word boundary. For the name I would use something like this:
[^\b]+\b[^\b]+(\b|$)
EDIT Modifying this to be a regex in Perl example
if( $fullname =~ /([^\b]+)\b[^\b]+([^\b]+)(\b|$)/ ) {
$first_name = $1;
$last_name = $2;
}
EDIT AGAIN Based on what you want:
$new_tag = preg_replace("/[\s\t]/","",$tag);
Use it like this to allow for a single space.
$newtag = preg_replace("/[^a-zA-Z0-9\s]/", "", $tag);
I'm trying out [[:space:]] in an instance where it looks like bloggers in WordPress are using non-standard space characters. It looks like it will work.
This matches tires better because not all vendors use the same size format. I deal with many vendors, each writing sizes in a different format. This is my expression for now:
/^[\d][\d](?:\d)?(?:\-|\/|\s)?([?:\d]+)?(?:\.)?(?:\d)?(?:\d)?(?:R|-|\s)?[1-3]([?:[\d]+)?(?:\.)?([?:\d])?(?:\s|-)/img
will catch all
35-12.50-22 HAIDA[AA]
35-12-22 HAIDA[AA]
35/35R20
35/35r20
thus uis a test
rrrrr
awdg
3345588
225-45-17 ACCELERA[AC]
195 50 16 KELLY
1955016 KELLY
CP671"
158 Buckshot
165-40-16-ACHILLES
11-24.5-16-LEAO-LLA08
11-24.5-LEAO-D37
11-22.5-14-LINGLONG-LLD37
11-22.5-HAPPYROAD[AA]