Differences between two regular expressions - php

Do anybody know why this regex:
/^(([a-zA-Z0-9\(\)áéíóúÁÉÍÓÚñÑ,\.°-]+ *)+)$/
works but this one doesn't:
/^(([a-zA-Z0-9áéíóúÁÉÍÓÚñÑ,\.°-\(\)]+ *)+)$/
The difference is the place where the parenthesis are... I tryed with some online PHP regex testers and got the same result. The second one simply doesn't work...
PHP returns:
preg_match(): Compilation failed: range out of order in character class at offset 44 in...
This is not a critic question because I've managed to make it work but I have the curiosity!
Maybe the unicode characters are changing something?

When the - character is used inside of brackets (indicating a character set) it indicates a range unless it is the last character in the set, first character in the set, or directly after the opening negating character. Then it means a literal dash. By moving it from the end to the middle you changed its meaning. If you want to keep it in the middle you will need to escape it: \-.

If the hyphen is placed as the first or last character in the character class, it is treated as a literal - (as opposed to a range), and as a result do not require escaping.
These are the positions where the hyphen do not need to be escaped:
right after the opening bracket ([), or
right before the closing bracket (]), or
right after the negating caret (^)
In the second regular expression, you're placing the hyphen in the middle, and the regular expression engine tries to create a range with the character before the hyphen, the character after the hyphen, and all characters that lie between them in numerical order. As such a range isn't possible, an error message is triggered. See asciitable.com for the character table.
Putting the hyphen last in the expression actually causes it to not require escaping, as it then can't be part of a range, however you might still want to get into the habit of always escaping it.

At your first regex you've managed every thing correctly even that - hyphen which is at the end of it. well it should be there too! I mean it has two places if you don't want to escape it, one place is at the end of char class and the other one at the beginning of char class!
You guessed nice! otherwise you should escape it!

Related

Meaning of a dash between mixed characters in regex?

I'm just getting my feet wet with regexes and I came across this within a PHP program that someone else had written:
[ -\w]. Note that the dash is not the first character, there is a space preceding it.
I can't make heads or tails of what it means. I know that the dash between characters inside brackets normally indicates a range, i.e. [a-z] matches any lowercase character "a" through "z", but what does it match when the dash is between characters of different types?
My first thought was that it just matches any space or alphanumeric character, but then the dash wouldn't be necessary. My second thought was that it's matching spaces, alphanumerics, and the dash; but then I realized that the dash would probably be either escaped or moved to the front or back for that.
I've googled around and can't find anything about using a dash in a character class with mixed characters. Maybe I'm using the wrong search terms.
This might help : http://www.regular-expressions.info/charclass.html in the section "Metacharacters Inside Character Classes" it says :
Hyphens at other positions in character classes where they can't
form a range may be interpreted as literals or as errors. Regex
flavors are quite inconsistent about this.
My guess would be that it is being intepreted as a literal, so the regexp would match a space, hyphen or \w .
As a reference, it looks invalid in PCRE:
Debuggex Demo
In the PCRE reference §16. we find:
Perl, when in warning mode, gives warnings for character classes
such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
als. PCRE has no warning features, so it gives an error in these cases
because they are almost certainly user mistakes.
[ -\w] produces a warning in perl but not in php.
Your regex [ -\w] seems to be a misplaced one as it will only match characters like this:
[ !"#$%&'()*+,./-]
As due to - appearing in the middle it will act as a range between space (32) and first \w (48) characters.

regex validation

I am trying to validate a string of 3 numbers followed by / then 5 more numbers
I thought this would work
(/^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])/i)
but it doesn't, any ideas what i'm doing wrong
Try this
preg_match('#^\d{3}/\d{5}#', $string)
The reason yours is not working is due to the + symbols which match "one or more" of the nominated character or character class.
Also, when using forward-slash delimiters (the characters at the start and end of your expression), you need to escape any forward-slashes in the pattern by prefixing them with a backslash, eg
/foo\/bar/
PHP allows you to use alternate delimiters (as in my answer) which is handy if your expression contains many forward-slashes.
First of all, you're using / as the regexp delimiter, so you can't use it in the pattern without escaping it with a backslash. Otherwise, PHP will think that you're pattern ends at the / in the middle (you can see that even StackOverflow's syntax highlighting thinks so).
Second, the + is "greedy", and will match as many characters as it can, so the first [0-9]+ would match the first 3 numbers in one go, leaving nothing for the next two to match.
Third, there's no need to use i, since you're dealing with numbers which aren't upper- or lowercase, so case-sensitivity is a moot point.
Try this instead
/^\d{3}\/\d{5}$/
The \d is shorthand for writing [0-9], and the {3} and {5} means repeat 3 or 5 times, respectively.
(This pattern is anchored to the start and the end of the string. Your pattern was only anchored to the beginning, and if that was on purpose, the remove the $ from my pattern)
I recently found this site useful for debugging regexes:
http://www.regextester.com/index2.html
It assumes use of /.../ (meaning you should not include those slashes in the regex you paste in).
So, after I put your regex ^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]) in the Regex box and 123/45678 in the Test box I see no match. When I put a backslash in front of the forward slash in the middle, then it recognizes the match. You can then try matching 1234/567890 and discover it still matches. Then you go through and remove all the plus signs and then it correctly stops matching.
What I particularly like about this particular site is the way it shows the partial matches in red, allowing you to see where your regex is working up to.

regex: remove all text within "double-quotes" (multiline included)

I'm having a hard time removing text within double-quotes, especially those spread over multiple lines:
$file=file_get_contents('test.html');
$replaced = preg_replace('/"(\n.)+?"/m','', $file);
I want to remove ALL text within double-quotes (included). Some of the text within them will be spread over multiple lines.
I read that newlines can be \r\n and \n as well.
Try this expression:
"[^"]+"
Also make sure you replace globally (usually with a g flag - my PHP is rusty so check the docs).
Another edit: daalbert's solution is best: a quote followed by one or more non-quotes ending with a quote.
I would make one slight modification if you're parsing HTML: make it 0 or more non-quote characters...so the regex will be:
"[^"]*"
EDIT:
On second thought, here's a better one:
"[\S\s]*?"
This says: "a quote followed by either a non-whitespace character or white-space character any number of times, non-greedily, ending with a quote"
The one below uses capture groups when it isn't necessary...and the use of a wildcard here isn't explicit about showing that wildcard matches everything but the new-line char...so it's more clear to say: "either a non-whitespace char or whitespace char" :) -- not that it makes any difference in the result.
there are many regexes that can solve your problem but here's one:
"(.*?(\s)*?)*?"
this reads as:
find a quote optionally followed by: (any number of characters that are not new-line characters non-greedily, followed by any number of whitespace characters non-greedily), repeated any number of times non-greedily
greedy means it will go to the end of the string and try matching it. if it can't find the match, it goes one from the end and tries to match, and so on. so non-greedy means it will find as little characters as possible to try matching the criteria.
great link on regex: http://www.regular-expressions.info
great link to test regexes: http://regexpal.com/
Remember that your regex may have to change slightly based on what language you're using to search using regex.
You can use single line mode (also know as dotall) and the dot will match even newlines (whatever they are):
/".+?"/s
You are using multiline mode which simply changes the meaning of ^ and $ from beginning/end of string to beginning/end of text. You don't need it here.
"[^"]+"
Something like below. s is dotall mode where . will match even newline:
/".+?"/s
$replaced = preg_replace('/"[^"]*"/s','', $file);
will do this for you. However note it won't allow for any quoted double quotes (e.g. A "test \" quoted string" B will result in A quoted string" B with a leading space, not in A B as you might expect.

Regex for netbios names

I got this issue figuring out how to build a regexp for verifying a netbios name. According to the ms standard these characters are illegal
\/:*?"<>|
So, thats what I'm trying to detect. My regex is looking like this
^[\\\/:\*\?"\<\>\|]$
But, that wont work.
Can anyone point me in the right direction? (not regexlib.com please...)
And if it matters, I'm using php with preg_match.
Thanks
Your regular expression has two problems:
you insist that the match should span the entire string. As Andrzej says, you are only matching strings of length 1.
you are quoting too many characters. In a character class (i.e. []), you only need to quote characters that are special within character classes, i.e. hyphen, square bracket, backslash.
The following call works for me:
preg_match('/[\\/:*?"<>|]/', "foo"); /* gives 0: does not include invalid characters */
preg_match('/[\\/:*?"<>|]/', "f<oo"); /* gives 1: does include invalid characters */
As it stands at the moment, your regex will match the start of the string (^), then exactly one of the characters in the square brackets (i.e. the illegal characters), then then end of the string ($).
So this likely isn't working because a string of length > 1 will trivially fail to match the regex, and thus be considered OK.
You likely don't need the start and end anchors (the ^ and $). If you remove these, then the regex should match one of the bracketed characters occurring anywhere on the input text, which is what you want.
(Depending on the exact regex dialect, you may canonically need less backslashes within the square brackets, but they are unlikely to do any harm in any case).

How bad is my regex?

Ok so I managed to solve a problem at work with regex, but the solution is a bit of a monster.
The string to be validated must be:
zero or more: A-Z a-z 0-9, spaces, or these symbols: . - = + ' , : ( ) /
But, the first and/or last characters must not be a forward slash /
This was my solution (used preg_match php function):
"/^[a-z\d\s\.\-=\+\',:\(\)][a-z\d\s\.\-=\+\',\/:\(\)]*[a-z\d\s\.\-=\+\',:\(\)]$|^[a-z\d\s\.\-=\+\',:\(\)]$/i"
A colleague thinks this is too big and complicated. Well it works, so is it really that bad? Anyone in the mood for some regex-golf?
You can simplify your expression to this:
/^(?:[a-z\d\s.\-=+',:()]+(?:/+[a-z\d\s.\-=+',:()]+)*)?$/i
The outer (?:…)? is to allow an empty string. The [a-z\d\s.\-=+',:()]+ allows to start with one or more of the specified characters except the /. If a / follows, it also must be followed by one or more of the other specified characters ((?:/[a-z\d\s.\-=+',:()]+)*).
Furthermore, inside a character set, you only need to escape the characters \, ], and depending on the position also ^ and -.
Try something like this instead
function validate($string) {
return (preg_match("/[a-zA-Z0-9.\-=+',:()/]*/", $string) && substr($string, 0,1) != '/' && substr($string, -1) != '/'))
}
It's a lot simpler to check the first and last character specifically. Otherwise you're left with dealing with a lot of overhead when it comes to empty strings and such. Your regex, for example, requires the string to be at least one character long, otherwise it doesn't validate. Despite "" fitting your criteria.
'#^(?!/)[a-z\d .=+\',:()/-]*$(?<!/)#i'
As others have observed, most of those characters don't need to be escaped inside a character class. Additionally, the hyphen doesn't need to be escaped if it's the last thing listed, and the slash doesn't need to be escaped if you use a different character as the regex delimiter (# in this case, but ~ is a popular choice, too).
I also ditched the double-quotes in favor of single-quotes, which meant I had to escape the single-quote in the regex. That's worth it because single-quoted strings are so much simpler to work with: no $variable interpolation, no embedded executable {code}, and the only characters you have to escape for them are the single-quote and the backslash.
But the main innovation here is the use of lookahead and lookbehind to exclude the slash as the first or last character. That's not just a code-golf tactic, either; I would write the regex this way anyway, because it expresses my intent so much better. Why force the next guy to parse those almost-identical character classes, when you can just say what you mean? "...but the first and last character can't be slashes."

Categories