what does the regular expression (?<!-) mean - php

I'm trying to understand a piece of code and came across this regular expression used in PHP's preg_replace function.
'/(?<!-)color[^{:]*:[^{#]*$/i'
This bit... (?<!-)
doesn't appear in any of my regexp manuals. Does anyone know what this means, please? (Google doesn't return anything; I don't think symbols work in Google.)

The ?<! at the start of a parenthetical group is a negative lookbehind. It asserts that the word color (strictly, the c in the engine) was not preceded by a - character.
So, for a more concrete example, it would match color in the strings:
color
+color
someTextColor
But it will fail on something like -color or background-color. Also note that the engine will not technically "match" whatever precedes the c, it simply asserts that it is not a hyphen. This can be an important distinction depending on the context (illustrated on Rubular with a trivial example; note that only the b in the last string is matched, not the preceding letter).
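To see this in practice, here is a minimal PHP sketch. It uses only the lookbehind and the word color from the original pattern (with the i modifier), and the variable names are just for illustration.

```php
<?php
// Only the (?<!-)color part of the original pattern, case-insensitive.
$pattern = '/(?<!-)color/i';

$subjects = ['color', '+color', 'someTextColor', '-color', 'background-color'];

$results = [];
foreach ($subjects as $subject) {
    // preg_match() returns 1 on a match and 0 when nothing matches.
    // $m[0] holds the matched text; the character checked by the
    // lookbehind is never part of it.
    $results[$subject] = preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

// Matched text per subject (null means the lookbehind rejected it).
var_export($results);
```

Note that for +color the matched text is just color: the + is looked at but not consumed.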

PHP uses Perl-compatible regular expressions (PCRE) for the preg_* functions. From perldoc perlre:
"(?<!pattern)"
A zero-width negative look-behind assertion. For example "/(?<!bar)foo/" matches any occurrence of "foo" that does not follow "bar". Works only for fixed-width look-behind.
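The perlre example carries over to preg_match() unchanged. A small sketch (the test strings are made up for illustration):

```php
<?php
// The perlre example: "foo" that does not follow "bar".
$pattern = '/(?<!bar)foo/';

$afterBar = preg_match($pattern, 'barfoo'); // "foo" directly follows "bar"
$afterBaz = preg_match($pattern, 'bazfoo'); // "foo" follows something else
$atStart  = preg_match($pattern, 'foo');    // nothing precedes "foo" at all

var_dump($afterBar, $afterBaz, $atStart);
```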

I'm learning regular expressions using Python's re module!
http://docs.python.org/library/re.html
Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

Related

Using Multiple Regular Expressions in PHP

I need to write a regular expression that will evaluate the following conditions:
2 consecutive lower case characters
at least 1 digit
at least 1 upper case character
2 consecutive identical punctuation characters
For example, the string 'aa1A!!' should match, as should '!!A1aa'.
I have written the following regular expression:
'/(?=([a-z]){2,})(?=[0-9])(?=[A-Z])(?=(\W)\1)/'
I have found each individual expression works, but I am struggling to put it all together. What am I missing?
First, your pattern must be anchored to be sure that lookaheads are only tested from the position at the start of string. Then, since your characters can be everywhere in the string, you need to start the subpatterns inside lookahead with .*.
\W is a character class for non-word characters (everything that is not [A-Za-z0-9_], which includes spaces, control characters, accented letters...). IMO, \pP or [[:punct:]] are more appropriate.
/^(?=.*[a-z]{2})(?=.*[0-9])(?=.*[A-Z])(?=.*(\pP)\1)/
About the idea to make 4 patterns instead of 1, it looks like a good idea, it tastes like a good idea, but it's useless and slower. However, it can be interesting if you want to know what particular rule fails.
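A rough sketch of both approaches in PHP, assuming the four rules from the question. The helper failedRules() and the rule labels are hypothetical names introduced here, not part of any library.

```php
<?php
// The single anchored pattern from above.
$combined = '/^(?=.*[a-z]{2})(?=.*[0-9])(?=.*[A-Z])(?=.*(\pP)\1)/';

// The same rules as separate patterns, useful only for reporting.
$rules = [
    'two lowercase'       => '/[a-z]{2}/',
    'digit'               => '/[0-9]/',
    'uppercase'           => '/[A-Z]/',
    'doubled punctuation' => '/(\pP)\1/',
];

// Hypothetical helper: returns the labels of the rules a string fails.
function failedRules(string $subject, array $rules): array
{
    $failed = [];
    foreach ($rules as $label => $pattern) {
        if (preg_match($pattern, $subject) !== 1) {
            $failed[] = $label;
        }
    }
    return $failed;
}

var_dump(preg_match($combined, 'aa1A!!'));  // all four rules satisfied
var_dump(preg_match($combined, '!!A1aa'));  // same characters, other order
var_dump(failedRules('aa1A!', $rules));     // only one "!" here
```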

Regex match section within string

I have a string foo-foo-AB1234-foo-AB12345678. The string can be in any format, is there a way of matching only the following pattern letter,letter,digits 3-5 ?
I have the following implementation:
preg_match_all('/[A-Za-z]{2}[0-9]{3,6}/', $string, $matches);
Unfortunately this finds a match on AB1234 AND AB12345678 which has more than 6 digits. I only wish to find a match on AB1234 in this instance.
I tried:
preg_match_all('/^[A-Za-z]{2}[0-9]{3,6}$/', $string, $matches);
You will notice ^ and $ to mark the beginning and end, but this only applies to the string, not the section, therefore no match is found.
I understand why the code is behaving like it is. It makes logical sense. I can't figure out the solution though.
You must be looking for word boundaries \b:
\b\p{L}{2}\p{N}{3,5}\b
See demo
Note that \p{L} matches a Unicode letter, and \p{N} matches a Unicode number.
You can just as well use your own regex with word boundaries added: \b[a-zA-Z]{2}[0-9]{3,5}\b. Note that using anchors makes your regex match only at the beginning of the string (with ^) and/or at the end of the string (with $).
In case you have underscored words (like foo-foo_AB1234_foo_AB12345678_string), you will need a slight modification:
(?<=\b|_)\p{L}{2}\p{N}{3,5}(?=\b|_)
You have to end your regular expression with a pattern for a non-digit. In Java this would be \D, this should be the same in PHP.
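Here is a short PHP sketch of the word-boundary approach, run against the two sample strings from this thread. The variable names are illustrative.

```php
<?php
$string = 'foo-foo-AB1234-foo-AB12345678';

// Word boundaries on both sides: AB12345678 has too many digits to fit.
preg_match_all('/\b[a-zA-Z]{2}[0-9]{3,5}\b/', $string, $matches);

$underscored = 'foo-foo_AB1234_foo_AB12345678_string';

// "_" is a word character, so there is no \b between "_" and "A".
preg_match_all('/\b[a-zA-Z]{2}[0-9]{3,5}\b/', $underscored, $plain);

// Lookarounds that accept either a word boundary or an underscore.
preg_match_all('/(?<=\b|_)\p{L}{2}\p{N}{3,5}(?=\b|_)/', $underscored, $withUnderscore);

// Index 0 of each result array holds the full matches.
var_dump($matches[0], $plain[0], $withUnderscore[0]);
```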

Character classes strange behavior in alternations in regular expressions

I'm trying to write a simple regular expression that recognizes a sequence of characters that are not colons or are escaped colons.
I.e:
foo:bar //Does not match
but
foo\:bar //Does match
By my knowledge of Regular Languages, such language can be described by the regular expression
/([^:]|\\[:])*/
You can see a graphical representation of this expression in the wonderful tool Regexper
Using PHP's preg_match (which is based on the PCRE engine), this expression does not match "foo\:bar".
However, if I substitute the class with the single char:
/([^:]|\\:)*/
the expression matches.
Do you have an explanation for this? Is this a sort of limitation of the PCRE engine on character classes?
PS: Testing the first expression on RegExr, that is based on AS3 Regexp engine, does not offer a match, while changing the alternation order:
/(\\[:]|[^:])*/
it does match, while the same expression does not match in PCRE.
preg_match() accepts a regular expression pattern as a string, so you need to double escape everything.
^(?:[^:\\\\]|\\\\:)+$
This matches one or more characters that are not colons or escape characters [^:\\\\], or an escaped colon \\\\:.
Why your first regular expression didn't work: /([^:]|\\[:])*/.
This matches a non-colon [^:], or it matches \\[:] which matches a literal [ followed by a literal : and then a literal ].
Why does this work: /([^:]|\\:)*/ ?
This matches a non-colon [^:], or it matches \\:, which reaches the engine as \: (just a colon), so it effectively matches everything.
Edit: Why won't /([^:]|E[:])*/ match fooE:bar?
This is what happens: [^:] matches the f, then the o, then the other o, then the E. Now it finds a colon : and can't match it, but since by default the PCRE engine doesn't look for the longest possible match, it is satisfied with what it has matched so far, stops right there and returns fooE as a match without trying the other alternative E[:] (which, by the way, is equal to E:) at all.
If you want to match the entire sequence then you will need to use an expression like this one:
/([^:E]|E[:])*/
This prevents [^:] from consuming that E.
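The escaping levels are easier to follow when run. The sketch below uses single-quoted PHP strings, where '\\' is one backslash; the variable names are made up for illustration.

```php
<?php
// In a single-quoted string '\\' is one backslash: this is foo\:bar.
$subject = 'foo\\:bar';

// PCRE receives \[:] here: a literal "[", then ":", then "]".
preg_match('/([^:]|\\[:])*/', $subject, $first);

// PCRE receives \: here: an escaped, i.e. plain, colon.
preg_match('/([^:]|\\:)*/', $subject, $second);

// PCRE receives \\: here: a literal backslash followed by a colon.
$escaped   = preg_match('/^(?:[^:\\\\]|\\\\:)+$/', $subject);
$unescaped = preg_match('/^(?:[^:\\\\]|\\\\:)+$/', 'foo:bar');

// The "E" variant: [^:] swallows the E unless it is excluded from the class.
preg_match('/([^:]|E[:])*/', 'fooE:bar', $stopsEarly);
preg_match('/([^:E]|E[:])*/', 'fooE:bar', $whole);

var_dump($first[0], $second[0], $escaped, $unescaped, $stopsEarly[0], $whole[0]);
```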
You can try this. It gives the sequence \\: a chance before the negated character class [^:].
^(?:\\:|[^:])+$
If you swap the alternatives, as in ^(?:[^:]|\\:)+$, the first alternative consumes the backslash (\) before the second one gets a chance to try. Because of the ^ and $ anchors the engine can still backtrack and match the escaped colon \:, but without the anchors the match would simply stop at the backslash.
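A small PHP sketch of how the order of the alternatives plays out, with and without anchors. It assumes \\: is meant at the regex level (a literal backslash plus a colon), so it is written as '\\\\:' in the PHP string.

```php
<?php
$subject = 'foo\\:bar'; // the eight characters foo\:bar

// Unanchored: at each position the first alternative that succeeds wins,
// and nothing forces the engine to go back and reconsider.
preg_match('/(?:\\\\:|[^:])+/', $subject, $escapedFirst);
preg_match('/(?:[^:]|\\\\:)+/', $subject, $classFirst);

// Anchored: $ fails after "foo\", so the engine backtracks and tries
// the other alternative at the backslash.
$anchoredEscapedFirst = preg_match('/^(?:\\\\:|[^:])+$/', $subject);
$anchoredClassFirst   = preg_match('/^(?:[^:]|\\\\:)+$/', $subject);

var_dump($escapedFirst[0], $classFirst[0], $anchoredEscapedFirst, $anchoredClassFirst);
```

Putting the more specific alternative first is still the safer habit, since it does not rely on backtracking to get the intended result.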

Understanding Regular Expressions

I am tired of being frightened of regular expressions. The topic of this post is limited to PHP implementation of regular expressions, however, any generic regular expression advice would obviously be appreciated (i.e. don't confuse me with scope that is not applicable to PHP).
The following (I believe) will remove any whitespace between numbers. Maybe there is a better way to do so, but I still want to understand what is going on.
$pat="/\b(\d+)\s+(?=\d+\b)/";
$sub="123 345";
$string=preg_replace($pat, "$1", $sub);
Going through the pattern, my interpretation is:
\b A word boundary
\d+ A subpattern of 1 or more digits
\s+ One or more whitespaces
(?=\d+\b) Lookahead assertion of one or more digit followed by a word boundary?
Putting it all together, search for any word boundary followed by one or more digits and then some whitespace, and then do some sort of lookahead assertion on it, and save the results in $1 so it can replace the pattern?
Questions:
Is my above interpretation correct?
What is that lookahead assertion all about?
What is the purpose of the leading / and trailing /?
Is my above interpretation correct?
Yes, your interpretation is correct.
What is that lookahead assertion all about?
That lookahead assertion is a way for you to match characters that are followed by a certain pattern, without actually including that pattern in the match.
So basically, using the regex abcd(?=e) to match the string abcde will give you the match: abcd.
The reason that this matches is that the string abcde does in fact contain:
An a
Followed by a b
Followed by a c
Followed by a d that has an e after it (this is a single character!)
It is important to note that after the 4th item it also contains an actual "e" character, which we didn't match.
On the other hand, trying to match the string against the regex abcd(?=f) will fail, since the sequence:
"a", followed by "b", followed by "c", followed by "d that has an f after it"
is not found.
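The same thing in PHP, as a minimal sketch. The last call drops the lookahead from the question's pattern purely for comparison; that variant is not from the original post.

```php
<?php
// The "e" is checked by the lookahead but is not part of the match.
preg_match('/abcd(?=e)/', 'abcde', $m);

// There is no "f" after "abcd", so nothing matches.
$noMatch = preg_match('/abcd(?=f)/', 'abcde');

// The pattern from the question: only "123 " is consumed and replaced
// by "123", so the second number stays in the string.
$withLookahead = preg_replace('/\b(\d+)\s+(?=\d+\b)/', '$1', '123 345');

// Without the lookahead the second number is consumed and lost.
$withoutLookahead = preg_replace('/\b(\d+)\s+\d+\b/', '$1', '123 345');

var_dump($m[0], $noMatch, $withLookahead, $withoutLookahead);
```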
What is the purpose of the leading / and trailing /
Those are delimiters, and are used in PHP to distinguish the pattern part of your string from the modifier part of your string. A delimiter can be any character that is not alphanumeric, a backslash or whitespace, although I prefer # signs myself. Remember that the character you are using as a delimiter needs to be escaped if it is used in your pattern.
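A quick illustration of why the choice of delimiter matters. The URL is a made-up example value.

```php
<?php
$url = 'http://example.com'; // hypothetical sample input

// With "/" as the delimiter, every "/" in the pattern must be escaped.
$withSlashes = preg_match('/^http:\/\//', $url);

// With "#" as the delimiter, the slashes can stay as they are.
$withHashes = preg_match('#^http://#', $url);

// Modifiers go after the closing delimiter, whichever one is used.
$withModifier = preg_match('#^HTTP://#i', $url);

// preg_quote() escapes the delimiter too when it is passed as the
// second argument.
$quoted = preg_quote('a/b', '/');

var_dump($withSlashes, $withHashes, $withModifier, $quoted);
```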
It would be a good idea to watch this video, and the 4 that follow this:
http://blog.themeforest.net/screencasts/regular-expressions-for-dummies/
The rest of the series is found here:
http://blog.themeforest.net/?s=regex+for+dummies
A colleague sent me the series and after watching them all I was much more comfortable using Regular Expressions.
Another good idea would be installing RegexBuddy or Regexr. Especially RegexBuddy is very useful for understanding the workings of a regular expression.

Why does this regex not validate in the same way in PHP?

When I try preg_match with the following expression: /.{0,5}/, it still matches strings longer than 5 characters.
It does, however, work properly when trying in online regexp matcher
The site you reference, myregexp.com, is focussed on Java.
Java has a specific function for matching an exact pattern, without needing to use anchor characters. This is the function which myregexp.com uses.
In most other languages, in order to match an exact pattern, you would need to add the anchoring characters ^ and $ at the start and end of the pattern respectively, otherwise the regex assumes it only needs to find the matched pattern somewhere within the string, rather than the whole string being the match.
This means that without the anchors, your pattern will match any string, of any length, because whatever the string, it will contain within it somewhere a match for "zero to five of any character".
So in PHP, and Perl, and virtually any other language, you need your pattern to look like this:
/^.{0,5}$/
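A minimal sketch of the difference, using made-up test strings:

```php
<?php
$long = 'abcdefgh';

// Unanchored: the engine is happy with the first five characters.
$unanchored = preg_match('/.{0,5}/', $long, $m);

// Anchored: the whole string has to fit between ^ and $.
$anchoredLong  = preg_match('/^.{0,5}$/', $long);
$anchoredShort = preg_match('/^.{0,5}$/', 'abc');

var_dump($unanchored, $m[0], $anchoredLong, $anchoredShort);
```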
Having explained all that, I would make one final observation though: this specific pattern really doesn't need to be a regular expression -- you could achieve the same thing with strlen(). In addition, the dot character in regex may not work exactly as you expect: it typically matches almost any character; some characters, including new line characters, are excluded by default, so if your string contains five characters, but one of them is a new line, it will fail your regex when you might have expected it to pass. With this in mind, strlen() would be a safer option (or mb_strlen() if you expect to have unicode characters).
If you need to match any character in regex, and the default behaviour of the dot isn't good enough, there are two options: One is to add the s modifier at the end of the expression (ie it becomes /^.{0,5}$/s). The s modifier tells regex to include new line characters in the dot "any character" match.
The other option (which is useful for languages that don't support the s modifier) is to use an expression and its negative together in a character class - eg [\s\S] - instead of the dot. \s matches any white space character, and \S is a negative of \s, so any character not matched by \s. So together in a character class they match any character. It's more long winded and less readable than a dot, but in some languages it's the only way to be sure.
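Both options, plus the strlen() alternative, in a short PHP sketch. The subject is an illustrative five-character string with a newline in the middle.

```php
<?php
// Five characters, one of which is a newline (double quotes so \n is real).
$subject = "ab\ncd";

// By default the dot does not match the newline, so this fails.
$default = preg_match('/^.{0,5}$/', $subject);

// The s modifier lets the dot match newlines as well.
$dotAll = preg_match('/^.{0,5}$/s', $subject);

// [\s\S] matches any character without needing a modifier.
$anyChar = preg_match('/^[\s\S]{0,5}$/', $subject);

// A plain length check gives the same answer without a regex.
$byLength = strlen($subject) <= 5;

var_dump($default, $dotAll, $anyChar, $byLength);
```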
You can find out more about this here: http://www.regular-expressions.info/dot.html
Hope that helps.
You need to anchor it with ^$. These symbols match the beginning and end of the string respectively, so it must be 0-5 characters between the beginning and end. Leaving out the anchors will match anywhere in the string so it could be longer.
/^.{0,5}$/
For better readability, I would probably also enclose the . in (), but that's kind of subjective.
/^(.){0,5}$/
