Meaning of a dash between mixed characters in regex? - php

I'm just getting my feet wet with regexes and I came across this within a PHP program that someone else had written:
[ -\w]. Note that the dash is not the first character, there is a space preceding it.
I can't make heads or tails of what it means. I know that the dash between characters inside brackets normally indicates a range, i.e. [a-z] matches any lowercase character "a" through "z", but what does it match when the dash is between characters of different types?
My first thought was that it just matches any space or alphanumeric character, but then the dash wouldn't be necessary. My second thought was that it's matching spaces, alphanumerics, and the dash; but then I realized that the dash would probably be either escaped or moved to the front or back for that.
I've googled around and can't find anything about using a dash in a character class with mixed characters. Maybe I'm using the wrong search terms.

This might help : http://www.regular-expressions.info/charclass.html in the section "Metacharacters Inside Character Classes" it says :
Hyphens at other positions in character classes where they can't
form a range may be interpreted as literals or as errors. Regex
flavors are quite inconsistent about this.
My guess would be that it is being intepreted as a literal, so the regexp would match a space, hyphen or \w .
As a reference, it looks invalid in PCRE:
Debuggex Demo

In the PCRE reference §16. we find:
Perl, when in warning mode, gives warnings for character classes
such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
als. PCRE has no warning features, so it gives an error in these cases
because they are almost certainly user mistakes.
[ -\w] produces a warning in perl but not in php.

Your regex [ -\w] seems to be a misplaced one as it will only match characters like this:
[ !"#$%&'()*+,./-]
As due to - appearing in the middle it will act as a range between space (32) and first \w (48) characters.

Related

Differences between two regular expressions

Do anybody know why this regex:
/^(([a-zA-Z0-9\(\)áéíóúÁÉÍÓÚñÑ,\.°-]+ *)+)$/
works but this one doesn't:
/^(([a-zA-Z0-9áéíóúÁÉÍÓÚñÑ,\.°-\(\)]+ *)+)$/
The difference is the place where the parenthesis are... I tryed with some online PHP regex testers and got the same result. The second one simply doesn't work...
PHP returns:
preg_match(): Compilation failed: range out of order in character class at offset 44 in...
This is not a critic question because I've managed to make it work but I have the curiosity!
Maybe the unicode characters are changing something?
When the - character is used inside of brackets (indicating a character set) it indicates a range unless it is the last character in the set, first character in the set, or directly after the opening negating character. Then it means a literal dash. By moving it from the end to the middle you changed its meaning. If you want to keep it in the middle you will need to escape it: \-.
If the hyphen is placed as the first or last character in the character class, it is treated as a literal - (as opposed to a range), and as a result do not require escaping.
These are the positions where the hyphen do not need to be escaped:
right after the opening bracket ([), or
right before the closing bracket (]), or
right after the negating caret (^)
In the second regular expression, you're placing the hyphen in the middle, and the regular expression engine tries to create a range with the character before the hyphen, the character after the hyphen, and all characters that lie between them in numerical order. As such a range isn't possible, an error message is triggered. See asciitable.com for the character table.
Putting the hyphen last in the expression actually causes it to not require escaping, as it then can't be part of a range, however you might still want to get into the habit of always escaping it.
At your first regex you've managed every thing correctly even that - hyphen which is at the end of it. well it should be there too! I mean it has two places if you don't want to escape it, one place is at the end of char class and the other one at the beginning of char class!
You guessed nice! otherwise you should escape it!

regex unable to allow apostrophe

I am experiencing a strange problem with a regular expression I have already used before.
The goal is to allow the user to enter his name, with letters, hyphen, and apostrophes if needed in a php form.
My regex is:
"/^[\w\s'àáâãäåçèéêëìíîïðòóôõöùúûüýÿ-]+$/i"
But... everything is allowed but the apostrophe. Escaping it will not change. Why?
To deal with unicode characters, you can do:
/^[\pN\pL\pP\pZ]+$/
where:
\pN stands for any number
\pL stands for any letter
\pP stands for any punctuation
\pZ stands for any space
It matches names like:
d'Alembert
d’Alembert (note the different apos from above)
Jean-François
O'Connors

Why does this regex not validate in the same way in PHP?

when I try preg_match with the following expression: /.{0,5}/, it still matches string longer than 5 characters.
It does, however, work properly when trying in online regexp matcher
The site you reference, myregexp.com, is focussed on Java.
Java has a specific function for matching an exact pattern, without needing to use anchor characters. This is the function which myregexp.com uses.
In most other languages, in order to match an exact pattern, you would need to add the anchoring characters ^ and $ at the start and end of the pattern respectively, otherwise the regex assumes it only needs to find the matched pattern somewhere within the string, rather than the whole string being the match.
This means that without the anchors, your pattern will match any string, of any length, because whatever the string, it will contain within it somewhere a match for "zero to five of any character".
So in PHP, and Perl, and virtually any other language, you need your pattern to look like this:
/^.{0,5}$/
Having explained all that, I would make one final observation though: this specific pattern really doesn't need to be a regular expression -- you could achieve the same thing with strlen(). In addition, the dot character in regex may not work exactly as you expect: it typically matches almost any character; some characters, including new line characters, are excluded by default, so if your string contains five characters, but one of them is a new line, it will fail your regex when you might have expected it to pass. With this in mind, strlen() would be a safer option (or mb_strlen() if you expect to have unicode characters).
If you need to match any character in regex, and the default behaviour of the dot isn't good enough, there are two options: One is to add the s modifier at the end of the expression (ie it becomes /^.{0,5}$/s). The s modifier tells regex to include new line characters in the dot "any character" match.
The other option (which is useful for languages that don't support the s modifier) is to use an expression and its negative together in a character class - eg [\s\S] - instead of the dot. \s matches any white space character, and \S is a negative of \s, so any character not matched by \s. So together in a character class they match any character. It's more long winded and less readable than a dot, but in some languages it's the only way to be sure.
You can find out more about this here: http://www.regular-expressions.info/dot.html
Hope that helps.
You need to anchor it with ^$. These symbols match the beginning and end of the string respectively, so it must be 0-5 characters between the beginning and end. Leaving out the anchors will match anywhere in the string so it could be longer.
/^.{0,5}$/
For better readability, I would probably also enclose the . in (), but that's kind of subjective.
/^(.){0,5}$/

Regex for netbios names

I got this issue figuring out how to build a regexp for verifying a netbios name. According to the ms standard these characters are illegal
\/:*?"<>|
So, thats what I'm trying to detect. My regex is looking like this
^[\\\/:\*\?"\<\>\|]$
But, that wont work.
Can anyone point me in the right direction? (not regexlib.com please...)
And if it matters, I'm using php with preg_match.
Thanks
Your regular expression has two problems:
you insist that the match should span the entire string. As Andrzej says, you are only matching strings of length 1.
you are quoting too many characters. In a character class (i.e. []), you only need to quote characters that are special within character classes, i.e. hyphen, square bracket, backslash.
The following call works for me:
preg_match('/[\\/:*?"<>|]/', "foo"); /* gives 0: does not include invalid characters */
preg_match('/[\\/:*?"<>|]/', "f<oo"); /* gives 1: does include invalid characters */
As it stands at the moment, your regex will match the start of the string (^), then exactly one of the characters in the square brackets (i.e. the illegal characters), then then end of the string ($).
So this likely isn't working because a string of length > 1 will trivially fail to match the regex, and thus be considered OK.
You likely don't need the start and end anchors (the ^ and $). If you remove these, then the regex should match one of the bracketed characters occurring anywhere on the input text, which is what you want.
(Depending on the exact regex dialect, you may canonically need less backslashes within the square brackets, but they are unlikely to do any harm in any case).

Remove control characters from PHP string

How can I remove control characters like STX from a PHP string? I played around with
preg_replace("/[^a-zA-Z0-9 .\-_;!:?äÄöÖüÜß<>='\"]/","",$pString)
but found that it removed way to much. Is there a way to remove only
control chars?
If you mean by control characters the first 32 ascii characters and \x7F (that includes the carriage return, etc!), then this will work:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
(Note the single quotes: with double quotes the use of \x00 causes a parse error, somehow.)
The line feed and carriage return (often written \r and \n) may be saved from removal like so:
preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
I must say that I think Bobby's answer is better, in the sense that [:cntrl:] better conveys what the code does than [\x00-\x1F\x7F].
WARNING: ereg_replace is deprecated in PHP >= 5.3.0 and removed in PHP >= 7.0.0!, please use preg_replace instead of ereg_replace:
preg_replace('/[[:cntrl:]]/', '', $input);
For Unicode input, this will remove all control characters, unassigned, private use, formatting and surrogate code points (that are not also space characters, such as tab, new line) from your input text. I use this to remove all non-printable characters from my input.
<?php
$clean = preg_replace('/[^\PC\s]/u', '', $input);
for more info on \p{C} see http://www.regular-expressions.info/unicode.html#category
PHP does support POSIX-Classes so you can use [:cntrl:] instead of some fancy character-magic-stuff:
ereg_replace("[:cntrl:]", "", $pString);
Edit:
A extra pair of square brackets might be needed in 5.3.
ereg_replace("[[:cntrl:]]", "", $pString);
TLDR Answer
Use this Regex...
/[^\PCc^\PCn^\PCs]/u
Like this...
$text = preg_replace('/[^\PCc^\PCn^\PCs]/u', '', $text);
TLDR Explanation
^\PCc : Do not match control characters.
^\PCn : Do not match unassigned characters.
^\PCs : Do not match UTF-8-invalid characters.
Working Demo
Simple demo to demonstrate: IDEOne Demo
$text = "\u{0019}hello";
print($text . "\n\n");
$text = preg_replace('/[^\PCc^\PCn^\PCs]/u', '', $text);
print($text);
Output:
(-Broken-Character)hello
hello
Alternatives
^\PC : Match only visible characters. Do not match any invisible characters.
^\PCc : Match only non-control characters. Do not match any control characters.
^\PCc^\PCn : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
^\PCc^\PCn^\PCs : Match only non-control characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, or UTF-8-invalid characters.
^\PCc^\PCn^\PCs^\PCf : Match only non-control, non-formatting characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, formatting, or UTF-8-invalid characters.
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
This regex will match anything visible, given in both its short-hand and long-hand form...
\PL\PM\PN\PP\PS\PZ
\PLetter\PMark\PNumber\PPunctuation\PSymbol\PSeparator
Normally, \p indicates that it's something we want to match and we use \P (capitalized) to indicate something that does not match. But PHP doesn't have this functionality, so we need to use ^ in the regex to do a manual negation.
A simpler regex then would be ^\PC, but this might be too restrictive in deleting invisible formatting. You may want to look closely and see what's best, but one of the alternatives should fit your needs.
All Matchable Unicode Character Sets
If you want to know any other character sets available, check out regular-expressions.info...
\PL or \PLetter: any kind of letter from any language.
\PLl or \PLowercase_Letter: a lowercase letter that has an uppercase variant.
\PLu or \PUppercase_Letter: an uppercase letter that has a lowercase variant.
\PLt or \PTitlecase_Letter: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\PL& or \PCased_Letter: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\PLm or \PModifier_Letter: a special character that is used like a letter.
\PLo or \POther_Letter: a letter or ideograph that does not have lowercase and uppercase
\PM or \PMark: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\PMn or \PNon_Spacing_Mark: a character intended to be combined with another
character without taking up extra space (e.g. accents, umlauts, etc.).
\PMc or \PSpacing_Combining_Mark: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\PMe or \PEnclosing_Mark: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\PZ or \PSeparator: any kind of whitespace or invisible separator.
\PZs or \PSpace_Separator: a whitespace character that is invisible, but does take up space.
\PZl or \PLine_Separator: line separator character U+2028.
\PZp or \PParagraph_Separator: paragraph separator character U+2029.
\PS or \PSymbol: math symbols, currency signs, dingbats, box-drawing characters, etc.
\PSm or \PMath_Symbol: any mathematical symbol.
\PSc or \PCurrency_Symbol: any currency sign.
\PSk or \PModifier_Symbol: a combining character (mark) as a full character on its own.
\PSo or \POther_Symbol: various symbols that are not math symbols, currency signs, or combining characters.
\PN or \PNumber: any kind of numeric character in any script.
\PNd or \PDecimal_Digit_Number: a digit zero through nine in any script except ideographic scripts.
\PNl or \PLetter_Number: a number that looks like a letter, such as a Roman numeral.
\PNo or \POther_Number: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\PP or \PPunctuation: any kind of punctuation character.
\PPd or \PDash_Punctuation: any kind of hyphen or dash.
\PPs or \POpen_Punctuation: any kind of opening bracket.
\PPe or \PClose_Punctuation: any kind of closing bracket.
\PPi or \PInitial_Punctuation: any kind of opening quote.
\PPf or \PFinal_Punctuation: any kind of closing quote.
\PPc or \PConnector_Punctuation: a punctuation character such as an underscore that connects words.
\PPo or \POther_Punctuation: any kind of punctuation character that is not a dash, bracket, quote or connector.
\PC or \POther: invisible control characters and unused code points.
\PCc or \PControl: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\PCf or \PFormat: invisible formatting indicator.
\PCo or \PPrivate_Use: any code point reserved for private use.
\PCs or \PSurrogate: one half of a surrogate pair in UTF-16 encoding.
\PCn or \PUnassigned: any code point to which no character has been assigned.
To keep the control characters but make them compatible for JSON, I had to to
$str = preg_replace(
array(
'/\x00/', '/\x01/', '/\x02/', '/\x03/', '/\x04/',
'/\x05/', '/\x06/', '/\x07/', '/\x08/', '/\x09/', '/\x0A/',
'/\x0B/','/\x0C/','/\x0D/', '/\x0E/', '/\x0F/', '/\x10/', '/\x11/',
'/\x12/','/\x13/','/\x14/','/\x15/', '/\x16/', '/\x17/', '/\x18/',
'/\x19/','/\x1A/','/\x1B/','/\x1C/','/\x1D/', '/\x1E/', '/\x1F/'
),
array(
"\u0000", "\u0001", "\u0002", "\u0003", "\u0004",
"\u0005", "\u0006", "\u0007", "\u0008", "\u0009", "\u000A",
"\u000B", "\u000C", "\u000D", "\u000E", "\u000F", "\u0010", "\u0011",
"\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018",
"\u0019", "\u001A", "\u001B", "\u001C", "\u001D", "\u001E", "\u001F"
),
$str
);
(The JSON rules state: “All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).”)
regex free method
If you are only zapping the control characters I'm familiar with (those under 32 and 127), try this out:
for($control = 0; $control < 32; $control++) {
$pString = str_replace(chr($control), "", $pString;
}
$pString = str_replace(chr(127), "", $pString;
The loop gets rid of all but DEL, which we just add to the end.
I'm thinking this will be a lot less stressful on you and the script then dealing with regex and the regex library.
Updated regex free method
Just for kicks, I came up with another way to do it. This one does it using an array of control characters:
$ctrls = range(chr(0), chr(31));
$ctrls[] = chr(127);
$clean_string = str_replace($ctrls, "", $string);

Categories