How exactly do Regular Expression word boundaries work in PHP?

How exactly do Regular Expression word boundaries work in PHP? - php

I'm currently writing a library for matching specific words in content.
Essentially the way it works is by compiling words into regular expressions, and running content through said regular expressions.
A feature I want to add is specifying whether a given word to match must start and/or end a word. For example, I have the word cat. I specify that it must start a word, so catering will match as cat is at the start, but ducat won't match as cat doesn't start the word.
I wanted to do this using word boundaries, but during some testing I found it doesn't work as I'd expect it to.
Take the following,
preg_match("/(^|\b)#nimal/i", "something#nimal", $match);
preg_match("/(^|\b)#nimal/i", "something!#nimal", $match);
In the statements above I would expect the following results,
> false
> 1 (#nimal)
But the result is instead the opposite,
> 1 (#nimal)
> false
In the first, I would expect it to fail as the group will eat the #, leaving nimal to match against #nimal, which obviously it doesn't. Instead, the group matchs an empty string, so #nimal is matched, meaning # is considered to be part of the word.
In the second, I would expect the group to eat the ! leaving #nimal to match the rest (which it should). Instead, it appears to combine the ! and # together to form a word, which is confirmed by the following matching,
preg_match("/g\b!#\bn/i", "something!#nimal", $match);
Any ideas why regular expression does this?
I'd just love a page that clearly documents how word boundaries are determined, I just can't find one for the life of me.

The word boundary \b matches on a change from a \w (a word character) to a \W a non word character. You want to match if there is a \b before your # which is a \W character. So to match you need a word character before your #
something#nimal
^^
==> Match because of the word boundary between g and #.
something!#nimal
^^
==> NO match because between ! and # there is no word boundary, both characters are \W

One problem I've encountered doing similar matching is words like can't and it's, where the apostrophe is considered a word/non-word boundary (as it is matched by \W and not \w). If that is likely to be a problem for you, you should exclude the apostrophe (and all of the variants such as ’ and ‘ that sometimes appear), for example by creating a class e.g. [\b^'].
You might also experience problems with UTF8 characters that are genuinely part of the word (i.e. what us humans mean by a word), for example test your regex against how you encode a word such as Svašek.
It is therefore often easier when parsing normal "linguistic" text to look for "linguistic" boundaries such as space characters (not just literally spaces, but the full class including newlines and tabs), commas, colons, full-stops, etc (and angle-brackets if you are parsing HTML). YMMV.

# is not part of a word character (in your locale probably it is, however, by default a "word" character is any letter or digit or the underscore character, Source - so # is not a word character, therefore not \w but \W and as linked any \w\W or \W\w combination marks a \b position), therefore it's always the word boundary that matches (in the OP's regex).
The following is similar to your regexes with the difference that instead of #, a is used. And beginning of line is a word boundary as well, so no need to specify it as well:
$r = preg_match("/\b(animal)/i", "somethinganimal", $match);
var_dump($r, $match);
$r = preg_match("/\b(animal)/i", "something!animal", $match);
var_dump($r, $match);
Output:
int(0)
array(0) {
}
int(1)
array(2) {
[0]=>
string(6) "animal"
[1]=>
string(6) "animal"
}

Related

Regex check using PHP preg_match_all function [duplicate]

I'm trying to use regexes to match space-separated numbers.
I can't find a precise definition of \b ("word boundary").
I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b) but it appears that this does not work. I'd be grateful to know of ways of .
[I am using Java regexes in Java 1.6]
Example:
Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());
String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());
pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());
This returns:
true
false
true

A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.

In the course of learning regular expression, I was really stuck in the metacharacter which is \b. I indeed didn't comprehend its meaning while I was asking myself "what it is, what it is" repetitively. After some attempts by using the website, I watch out the pink vertical dashes at the every beginning of words and at the end of words. I got it its meaning well at that time. It's now exactly word(\w)-boundary.
My view is merely to immensely understanding-oriented. Logic behind of it should be examined from another answers.

A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Word characters are alpha-numeric; a minus sign is not.
Taken from Regex Tutorial.

I would like to explain Alan Moore's answer
A word boundary is a position that is either preceded by a word character and not followed by one or followed by a word character and not preceded by one.
Suppose I have a string "This is a cat, and she's awesome", and I want to replace all occurrences of the letter 'a' only if this letter ('a') exists at the "Boundary of a word",
In other words: the letter a inside 'cat' should not be replaced.
So I'll perform regex (in Python) as
re.sub(r"\ba","e", myString.strip()) //replace a with e
Therefore,
Input; Output
This is a cat and she's awesome
This is e cat end she's ewesome

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.

I talk about what \b-style regex boundaries actually are here.
The short story is that they’re conditional. Their behavior depends on what they’re next to.
# same as using a \b before:
(?(?=\w) (?<!\w) | (?<!\W) )
# same as using a \b after:
(?(?<=\w) (?!\w) | (?!\W) )
Sometimes that isn’t what you want. See my other answer for elaboration.

I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.
Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time).
The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.
Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b.
Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.
Anyway, in Java, if you're searching text for the those weird-named languages, you need to replace the \b with before and after whitespace and punctuation designators. For example:
public static String grep(String regexp, String multiLineStringToSearch) {
String result = "";
String[] lines = multiLineStringToSearch.split("\\n");
Pattern pattern = Pattern.compile(regexp);
for (String line : lines) {
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
result = result + "\n" + line;
}
}
return result.trim();
}
Then in your test or main function:
String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";
String afterWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
text = "Programming in C, (C++) C#, Java, and .NET.";
System.out.println("text="+text);
// Here is where Java word boundaries do not work correctly on "cutesy" computer language names.
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));
System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
System.out.println("Should find: grep for case-insensitive java="+ grep("?i)\\bjava\\b", text));
System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text)); // Works Ok for this example, but see below
// Because of the stupid too-short cutsey name, searches find stuff it shouldn't.
text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
System.out.println("text="+text);
System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
// Make sure the first and last cases work OK.
text = "C is a language that should have been named differently.";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
text = "One language that should have been named differently is C";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
//Make sure we don't get false positives
text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
System.out.println("text="+text);
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
P.S. My thanks to http://regexpal.com/ without whom the regex world would be very miserable!

Check out the documentation on boundary conditions:
http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html
Check out this sample:
public static void main(final String[] args)
{
String x = "I found the value -12 in my string.";
System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
}
When you print it out, notice that the output is this:
[I found the value -, in my string.]
This means that the "-" character is not being picked up as being on the boundary of a word because it's not considered a word character. Looks like #brianary kinda beat me to the punch, so he gets an up-vote.

Reference: Mastering Regular Expressions (Jeffrey E.F. Friedl) - O'Reilly
\b is equivalent to (?<!\w)(?=\w)|(?<=\w)(?!\w)

Word boundary \b is used where one word should be a word character and another one a non-word character.
Regular Expression for negative number should be
--?\b\d+\b
check working DEMO

I believe that your problem is due to the fact that - is not a word character. Thus, the word boundary will match after the -, and so will not capture it. Word boundaries match before the first and after the last word characters in a string, as well as any place where before it is a word character or non-word character, and after it is the opposite. Also note that word boundary is a zero-width match.
One possible alternative is
(?:(?:^|\s)-?)\d+\b
This will match any numbers starting with a space character and an optional dash, and ending at a word boundary. It will also match a number starting at the beginning of the string.

when you use \\b(\\w+)+\\b that means exact match with a word containing only word characters ([a-zA-Z0-9])
in your case for example setting \\b at the begining of regex will accept -12(with space) but again it won't accept -12(without space)
for reference to support my words: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html

I think it's the boundary (i.e. character following) of the last match or the beginning or end of the string.

preg_replace doesnt not replace what I want

I have this regex that matches strings that I want to check on validity.
However recently I want to use this same regex to replace every character that is not valid to the regex with a character (let's say x).
My regex to match these types of strings is: '#^[\pL\'\’\d][\pL\.\-\ \'\/\,\’\d]*$#iu'
Which allows for the first character to be of any language or any digit and some determined special chars. And all the following letters to be slightly the same but slightly more special characters.
This is what I do (nothing special).
preg_replace($regex, 'x', $string);
Things I tried include trying to negate the regex:
'(?![\pL\'\’\d][\pL\.\-\ \'\/\,\’\d]*)'
'[^\pL\'\’\d][^\pL\.\-\ \'\/\,\’\d]*'
I've also tried splitting up the string into the firstchar and the rest of the string and split the regex in 2.
$validationRegex1 = '[^\pL\'\’\d]';
$validationRegex2 = '[^\pL\.\-\ \'\/\,\’\d]*';
$fixedStr1 = (string) preg_replace($validationRegex1, 'x', $firstChar)
. (string) preg_replace($validationRegex2, 'x', $theRest);
But this also did not seemed to work.
I've experimented a bit with this online tool: https://www.functions-online.com/preg_replace.html
Does anyone know what I am overlooking?
Examples of strings and their expected results
'-' should become 'x'.
'Random-morestuff' stays 'Random-morestuff'
'Random%morestuff' should become 'Randomxmorestuff'
'Rândôm' stays 'Rândôm'

Just an idea but if I got you right, you could use
(?(DEFINE)
(?<first>[\pL\d'’])
(?<other>[-\ \pL\d.'/,’])
)
\b(?&first)(?&other)+\b(*SKIP)(*FAIL)|.
This needs to be replaced by x. You do not have to escape everything in a character class, I changed this accordingly.
See a demo on regex101.com.
A bit more explanation: The (?(DEFINE)...) thingy lets you define subroutines that can be used afterwards and is just syntactic sugar in this case (maybe a bit showing off, really). As you have stated that other characters are allowed depending on theirs positions, I just called them first and other. The \b marks a word boundary, that is a boundary between \w (usually [a-zA-Z0-9_]) and \W (not \w). All of these "words" are allowed, so we let the engine "forget" what has been matched with the (*SKIP)(*FAIL) mechanism and match any other character on the right side of the alternation (|). See how (*SKIP)(*FAIL) works here on SO.

Use
$fixedStr1 = preg_replace('/[\p{L}\'\’\d][\p{L}\.\ \'\/\,\’\d-]*(*SKIP)(*FAIL)|./u', 'x', $input_string);
See regex proof.
Fail matches that match valid symbol words and replace every character appearing in other places.

Why is regex matching ":www.xxx.com" [duplicate]

I'm currently writing a library for matching specific words in content.
Essentially the way it works is by compiling words into regular expressions, and running content through said regular expressions.
A feature I want to add is specifying whether a given word to match must start and/or end a word. For example, I have the word cat. I specify that it must start a word, so catering will match as cat is at the start, but ducat won't match as cat doesn't start the word.
I wanted to do this using word boundaries, but during some testing I found it doesn't work as I'd expect it to.
Take the following,
preg_match("/(^|\b)#nimal/i", "something#nimal", $match);
preg_match("/(^|\b)#nimal/i", "something!#nimal", $match);
In the statements above I would expect the following results,
> false
> 1 (#nimal)
But the result is instead the opposite,
> 1 (#nimal)
> false
In the first, I would expect it to fail as the group will eat the #, leaving nimal to match against #nimal, which obviously it doesn't. Instead, the group matchs an empty string, so #nimal is matched, meaning # is considered to be part of the word.
In the second, I would expect the group to eat the ! leaving #nimal to match the rest (which it should). Instead, it appears to combine the ! and # together to form a word, which is confirmed by the following matching,
preg_match("/g\b!#\bn/i", "something!#nimal", $match);
Any ideas why regular expression does this?
I'd just love a page that clearly documents how word boundaries are determined, I just can't find one for the life of me.

The word boundary \b matches on a change from a \w (a word character) to a \W a non word character. You want to match if there is a \b before your # which is a \W character. So to match you need a word character before your #
something#nimal
^^
==> Match because of the word boundary between g and #.
something!#nimal
^^
==> NO match because between ! and # there is no word boundary, both characters are \W

One problem I've encountered doing similar matching is words like can't and it's, where the apostrophe is considered a word/non-word boundary (as it is matched by \W and not \w). If that is likely to be a problem for you, you should exclude the apostrophe (and all of the variants such as ’ and ‘ that sometimes appear), for example by creating a class e.g. [\b^'].
You might also experience problems with UTF8 characters that are genuinely part of the word (i.e. what us humans mean by a word), for example test your regex against how you encode a word such as Svašek.
It is therefore often easier when parsing normal "linguistic" text to look for "linguistic" boundaries such as space characters (not just literally spaces, but the full class including newlines and tabs), commas, colons, full-stops, etc (and angle-brackets if you are parsing HTML). YMMV.

# is not part of a word character (in your locale probably it is, however, by default a "word" character is any letter or digit or the underscore character, Source - so # is not a word character, therefore not \w but \W and as linked any \w\W or \W\w combination marks a \b position), therefore it's always the word boundary that matches (in the OP's regex).
The following is similar to your regexes with the difference that instead of #, a is used. And beginning of line is a word boundary as well, so no need to specify it as well:
$r = preg_match("/\b(animal)/i", "somethinganimal", $match);
var_dump($r, $match);
$r = preg_match("/\b(animal)/i", "something!animal", $match);
var_dump($r, $match);
Output:
int(0)
array(0) {
}
int(1)
array(2) {
[0]=>
string(6) "animal"
[1]=>
string(6) "animal"
}

Regex: how to match any string until whitespace, or until punctuation followed by whitespace?

I'm trying to write a regular expression which will find URLs in a plain-text string, so that I can wrap them with anchor tags. I know there are expressions already available for this, but I want to create my own, mostly because I want to know how it works.
Since it's not going to break anything if my regex fails, my plan is to write something fairly simple. So far that means: 1) match "www" or "http" at the start of a word 2) keep matching until the word ends.
I can do that, AFAICT. I have this: \b(http|www).?[^\s]+
Which works on foo www.example.com bar http://www.example.com etc.
The problem is that if I give it foo www.example.com, http://www.example.com it thinks that the comma is a part of the URL.
So, if I am to use one expression to do this, I need to change "...and stop when you see whitespace" to "...and stop when you see whitespace or a piece of punctuation right before whitespace". This is what I'm not sure how to do.
At the moment, a solution I'm thinking of running with is just adding another test – matching the URL, and then on the next line moving any sneaky punctuation. This just isn't as elegant.
Note: I am writing this in PHP.
Aside: why does replacing \s with \b in the expression above not seem to work?
ETA:
Thanks everyone!
This is what I eventually ended up with, based on Explosion Pills's advice:
function add_links( $string ) {
function replace( $arr ) {
if ( strncmp( "http", $arr[1], 4) == 0 ) {
return "<a href=$arr[1]>$arr[1]</a>$arr[2]$arr[3]";
} else {
return "$arr[1]$arr[2]$arr[3]";
}
}
return preg_replace_callback( '/\b((?:http|www).+?)((?!\/)[\p{P}]+)?(\s|$)/x', replace, $string );
}
I added a callback so that all of the links would start with http://, and did some fiddling with the way it handles punctuation.
It's probably not the Best way to do things, but it works. I've learned a lot about this in the last little while, but there is still more to learn!

preg_replace('/
\b # Initial word boundary
( # Start capture
(?: # Non-capture group
http|www # http or www (alternation)
) # end group
.+? # reluctant match for at least one character until...
) # End capture
( # Start capture
[,.]+ # ...one or more of either a comma or period.
# add more punctuation as needed
)? # End optional capture
(\s|$) # Followed by either a space character or end of string
/x', '\1\2\3'
...is probably what you are going for. I think it's still imperfect, but it should at least work for your needs.
Aside: I think this is because \b matches punctuation too

You can achieve this with a positive lookahead assertion:
\b(http:|www\.)(?:[^\s,.!?]|[,.!?](?!\s))+
See it here on Regexr.
Means, match anything, but whitespace ,.!? OR match ,.!? when it is not followed by whitespace.
Aside: A word boundary is not a character or a set of characters, you can't put it into a character class. It is a zero width assertion, that is matching on a change from a word character to a non-word character. Here, I believe, \b in a character class is interpreted as the backspace character (the string escape sequence).

The problem may lie in the dot, which means "any character" in regex-speak. You'll probably have to escape it:
\b(http|www)\.?[^\s]+
Then, the question mark means 0 or 1 so you've said "an optional dot" which is not what you want (right?):
\b(http|www)\.[^\s]+
Now, it will only match http. and www. so you need to tell what other characters you'll let it accept:
\b(http|www)\.[^\s\w]+
or
\b(http|www)\.[^\sa-zA-Z]+
So now you're saying,
at the boundary of a word
check for http or www
put a dot
allow any range a-z or A-Z, don't allow any whitespace character
one or more of those
Note - I haven't tested these but they are hopefully correct-ish.
Aside (my take on it) - the \s means 'whitespace'. The \b means 'word boundary'. The [] means 'an allowed character range'. The ^ means 'not'. The + means 'one or more'.
So when you say [^\b]+ you're saying 'don't allow word boundaries in this range of characters, and there must be one or more' and since there's nothing else there > nothing else is allowed > there's not one or more > it probably breaks.

You should try something like this:
\b(http|www).?[\w\.\/]+

Regexp look-behind to match internet speeds

So the user may search for "10 mbit" after which I want to capture the "10" so I can use it in a speed-search rather than a string-search. This isn't a problem, the below regexp does this fine:
if (preg_match("/(\d+)\smbit/", $string)){ ... }
But, the user may search for something like "10/10 mbit" or "10-100 mbit". I don't want to match those with the above regexp - they should be handled in another fashion. So I would like a regexp that matches "10 mbit" if the number is all-numeric as a whole word (i.e. contained by whitespace, newline or lineend/linestart)
Using lookbehind, I did this:
if (preg_match("#(?<!/)(\d+)\s+mbit#i", $string)){
Just to catch those that doesn't have "/" before them, but this matched true for this string: "10/10 mbit" so I'm obviously doing something wrong here, but what?

If the slash or hyphen is the only thing you care about, this should do it:
'#(?<![\d/-])(\d+)\s+mbit#i`
The problem with your regex is that \d+ is only required to match one digit. It can't match the 10 in 10/10 mbit because it's preceded by a slash, but the 0 isn't. To make sure it matches from the beginning of the number, you have to include \d in the list of things it can't be preceded by.

You lookback assertion is negative. It tells the string should not be preceded by /
So the / is matched inside the string (as the regex cannot match only "10" : you forbid it explicitely with the assertion). Maybe you wanted a positive lookbehind?

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

How exactly do Regular Expression word boundaries work in PHP? - php

Related

Regex check using PHP preg_match_all function [duplicate]

preg_replace doesnt not replace what I want

Why is regex matching ":www.xxx.com" [duplicate]

Regex: how to match any string until whitespace, or until punctuation followed by whitespace?

Regexp look-behind to match internet speeds

Categories

Resources