I'm trying to use regexes to match space-separated numbers.
I can't find a precise definition of \b ("word boundary").
I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b), but it appears that this does not work. I'd be grateful to know of ways of matching such numbers.
[I am using Java regexes in Java 1.6]
Example:
Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());
String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());
pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());
This returns:
true
false
true
A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.
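To make this concrete, here is a small sketch in Python (my choice for illustration, not the asker's Java; for these ASCII inputs the patterns should behave the same way). It lists the positions where \b matches in "-12" and replays the three checks from the question, with fullmatch standing in for Java's matches():

```python
import re

# Positions in "-12" where \b matches: before the 1 and after the 2.
boundary_positions = [m.start() for m in re.finditer(r"\b", "-12")]

# The question's two patterns; fullmatch plays the role of Java's matches().
with_boundary = re.compile(r"\s*\b-?\d+\s*")
without_boundary = re.compile(r"\s*-?\d+\s*")

results = [
    with_boundary.fullmatch(" 12 ") is not None,
    with_boundary.fullmatch(" -12 ") is not None,
    without_boundary.fullmatch(" -12 ") is not None,
]

print(boundary_positions)  # [1, 3]
print(results)             # [True, False, True]
```

The middle check fails because there is no boundary between the space and the dash, so \b has nowhere to succeed before -?.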
While learning regular expressions, I was really stuck on the \b metacharacter. I didn't understand what it meant and kept asking myself "what is it, what is it?". After some experiments on the website, I noticed the pink vertical dashes at the very beginning and at the end of each word, and at that point its meaning clicked: it is exactly a word (\w) boundary.
My answer is aimed purely at building intuition. For the logic behind it, see the other answers.
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Word characters are alphanumeric (plus the underscore); a minus sign is not.
Taken from Regex Tutorial.
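The three positions listed above can be checked by hand. The following is only a sketch: the helper names is_word_boundary and boundaries are made up for illustration, and it assumes ASCII word characters ([0-9A-Za-z_]). It compares a hand-rolled test against what the regex engine reports:

```python
import re
import string

# ASCII word characters, i.e. the class [0-9A-Za-z_].
WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def is_word_boundary(text, i):
    """True if position i (0..len(text)) is a word boundary in text."""
    before = i > 0 and text[i - 1] in WORD_CHARS
    after = i < len(text) and text[i] in WORD_CHARS
    # Exactly one side must be a word character.
    return before != after

def boundaries(text):
    return [i for i in range(len(text) + 1) if is_word_boundary(text, i)]

sample = "-12 ab"
by_hand = boundaries(sample)
by_regex = [m.start() for m in re.finditer(r"\b", sample, re.ASCII)]
print(by_hand, by_regex)
```

Both lists come out the same, and position 0 (in front of the minus sign) is in neither of them.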
I would like to explain Alan Moore's answer
A word boundary is a position that is either preceded by a word character and not followed by one or followed by a word character and not preceded by one.
Suppose I have a string "This is a cat, and she's awesome", and I want to replace all occurrences of the letter 'a' only if this letter ('a') exists at the "Boundary of a word",
In other words: the letter a inside 'cat' should not be replaced.
So I'll perform regex (in Python) as
re.sub(r"\ba", "e", myString.strip())  # replace a with e
Therefore,
Input: This is a cat and she's awesome
Output: This is e cat end she's ewesome
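For anyone who wants to run it, here is a self-contained version of the substitution above. The second pattern, \ba\b, is my own addition to show the difference between "word starts with a" and "the word is just a":

```python
import re

my_string = "This is a cat, and she's awesome"

# \ba matches an "a" only where a word starts, so the "a" in "cat" is untouched.
word_initial = re.sub(r"\ba", "e", my_string.strip())

# \ba\b additionally requires the word to end right after the "a".
standalone = re.sub(r"\ba\b", "e", my_string.strip())

print(word_initial)
print(standalone)
```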
A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.
I talk about what \b-style regex boundaries actually are here.
The short story is that they’re conditional. Their behavior depends on what they’re next to.
# same as using a \b before:
(?(?=\w) (?<!\w) | (?<!\W) )
# same as using a \b after:
(?(?<=\w) (?!\w) | (?!\W) )
Sometimes that isn’t what you want. See my other answer for elaboration.
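A quick way to see the conditional behavior, sketched in Python. The whitespace-based lookarounds at the end are just one possible alternative that I am assuming for illustration, not necessarily what the linked answer recommends:

```python
import re

# \b in front of "-" can only succeed if a word character precedes the "-".
after_letter = re.search(r"\b-12", "a-12")   # matches "-12"
after_space = re.search(r"\b-12", " -12")    # no match

# Assumed alternative: require "no non-space character" on either side.
number = re.compile(r"(?<!\S)-?\d+(?!\S)")
found = number.findall("12 -12 x-5 7b")

print(after_letter is not None, after_space is not None)
print(found)
```

With the lookarounds, a token counts only if it is delimited by whitespace or the string ends, regardless of whether it starts with a digit or a dash.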
I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.
Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time).
The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.
Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b.
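The flavor differences in \w are easy to observe. As an illustration only (this is Python, not Java): Python 3 treats \w as Unicode-aware for str patterns by default, and the re.ASCII flag restricts it to [A-Za-z0-9_]:

```python
import re

text = "caf\u00e9 na\u00efve"  # "café naïve", written with escapes

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w (and therefore \b) to [A-Za-z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)
```

Under re.ASCII the accented letters stop being word characters, so each word is cut at the accent.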
Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.
Anyway, in Java, if you're searching text for those weird-named languages, you need to replace the \b with explicit before-and-after whitespace and punctuation designators. For example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns every line of the input that contains a match for regexp.
public static String grep(String regexp, String multiLineStringToSearch) {
    String result = "";
    String[] lines = multiLineStringToSearch.split("\\n");
    Pattern pattern = Pattern.compile(regexp);
    for (String line : lines) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            result = result + "\n" + line;
        }
    }
    return result.trim();
}
Then in your test or main function:
String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";
String afterWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
String text = "Programming in C, (C++) C#, Java, and .NET.";
System.out.println("text="+text);
// Here is where Java word boundaries do not work correctly on "cutesy" computer language names.
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));
System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text)); // Works Ok for this example, but see below
// Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
System.out.println("text="+text);
System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
// Make sure the first and last cases work OK.
text = "C is a language that should have been named differently.";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
text = "One language that should have been named differently is C";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
//Make sure we don't get false positives
text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
System.out.println("text="+text);
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
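The same idea can be sketched in Python with lookarounds instead of consuming the delimiter characters. That is a deliberate change from the Java above, and the names DELIMS, BEFORE, AFTER and contains_term are made up for this example:

```python
import re

# Same delimiter set as the beforeWord/afterWord strings above.
DELIMS = r"\s.,!?()'" + '"'
BEFORE = "(?:^|(?<=[" + DELIMS + "]))"
AFTER = "(?:$|(?=[" + DELIMS + "]))"

def contains_term(term, text):
    """True if term occurs delimited by whitespace, punctuation, or the string ends."""
    return re.search(BEFORE + re.escape(term) + AFTER, text) is not None

langs = "Programming in C, (C++) C#, Java, and .NET."
canal = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger."

for term in ["C", "C++", "C#", ".NET"]:
    print(term, contains_term(term, langs))
print("C in canal text:", contains_term("C", canal))
```

Because the delimiters are only looked at and not consumed, the matched text is exactly the term, and re.escape takes care of the pluses and the period.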
P.S. My thanks to http://regexpal.com/ without whom the regex world would be very miserable!
Check out the documentation on boundary conditions:
http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html
Check out this sample:
public static void main(final String[] args)
{
String x = "I found the value -12 in my string.";
System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
}
When you print it out, notice that the output is this:
[I found the value -,  in my string.]
This means that the "-" character is not being picked up as being on the boundary of a word because it's not considered a word character. Looks like @brianary kinda beat me to the punch, so he gets an up-vote.
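The same experiment in Python, as a sketch. The second pattern, with the dash moved in front of \b, is one possible fix that I am assuming here rather than something from the Java sample:

```python
import re

x = "I found the value -12 in my string."

# \b cannot match between the space and "-", so the dash is left behind.
boundary_first = re.split(r"\b-?\d+\b", x)

# With the optional dash in front of \b, it is consumed together with the number.
dash_first = re.split(r"-?\b\d+\b", x)

print(boundary_first)
print(dash_first)
```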
Reference: Mastering Regular Expressions (Jeffrey E.F. Friedl) - O'Reilly
\b is equivalent to (?<!\w)(?=\w)|(?<=\w)(?!\w)
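That equivalence can be spot-checked by comparing match positions. This is a sketch in Python; the helper positions and the sample string are made up for the test:

```python
import re

# The lookaround expansion of \b quoted above.
EXPANDED = r"(?<!\w)(?=\w)|(?<=\w)(?!\w)"

def positions(pattern, text):
    """Start offsets of every (zero-width) match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

sample = "I found -12, ok_1!"
same = positions(r"\b", sample) == positions(EXPANDED, sample)
print(positions(r"\b", sample), same)
```

Note that the position in front of the "-" (offset 8) is not among the results for either pattern.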
The word boundary \b is used where one side should be a word character and the other a non-word character.
A regular expression for an optionally negative number should be
-?\b\d+\b
check working DEMO
I believe that your problem is due to the fact that - is not a word character. Thus, the word boundary will match after the -, and so will not capture it. Word boundaries match before the first and after the last word character in a string, as well as at any position where one side is a word character and the other side is a non-word character. Also note that a word boundary is a zero-width match.
One possible alternative is
(?:(?:^|\s)-?)\d+\b
This will match any numbers starting with a space character and an optional dash, and ending at a word boundary. It will also match a number starting at the beginning of the string.
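A short check of that pattern in Python, as a sketch with a made-up sample string. Note that each match includes the leading whitespace character, which int() happens to tolerate:

```python
import re

NUMBER = re.compile(r"(?:(?:^|\s)-?)\d+\b")

text = "12 -34 x-5 a6 7b"
raw = NUMBER.findall(text)        # matches keep the leading whitespace
numbers = [int(m) for m in raw]   # int() ignores surrounding whitespace

print(raw)
print(numbers)
```

"x-5" and "a6" are skipped because the number is not preceded by whitespace or the start of the string, and "7b" is skipped because there is no word boundary after the 7.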
When you use \\b(\\w+)+\\b, that means an exact match of a word containing only word characters ([a-zA-Z0-9_]).
In your case, for example, putting \\b at the beginning of the regex will accept -12 (with a space) but, again, it won't accept -12 (without a space).
for reference to support my words: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
I think it's the boundary (i.e. character following) of the last match or the beginning or end of the string.
I have a regular expression to escape all special characters in a search string. This works great, however I can't seem to get it to work with word boundaries. For example, with the haystack
add +
or
add (+)
and the needle
+
the regular expression /\+/gi matches the "+". However the regular expression /\b\+/gi doesn't. Any ideas on how to make this work?
Using
add (plus)
as the haystack and /\bplus/gi as the regex, it matches fine. I just can't figure out why the escaped characters are having problems.
\b is a zero-width assertion: it doesn't consume any characters, it just asserts that a certain condition holds at a given position. A word boundary asserts that the position is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. (A "word character" is a letter, a digit, or an underscore.) In your string:
add +
...there's a word boundary at the beginning because the a is not preceded by a word character, and there's one after the second d because it's not followed by a word character. The \b in your regex (/\b\+/) is trying to match between the space and the +, which doesn't work because neither of those is a word character.
Try changing it to:
/\b\s?\+/gi
Edit:
Extend this concept as far as you want. If you want the first + after any word boundary:
/\b[^+]*\+/gi
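Here is the same reasoning as a runnable sketch. It is in Python rather than JavaScript, which should not matter for these ASCII inputs. The last pattern, using lookarounds, is my own assumption about what "a + that stands alone" could mean, not part of the answer above:

```python
import re

# No word character on either side of the position before "+", so \b fails there.
plain = re.search(r"\b\+", "add +")

# With a word character directly before the "+", the same pattern succeeds.
adjacent = re.search(r"\b\+", "add+")

# Allowing an optional space between the boundary and the "+".
spaced = re.search(r"\b\s?\+", "add +")

# Assumed alternative: a "+" with no word character on either side.
lookaround = re.search(r"(?<!\w)\+(?!\w)", "add (+)")

print(plain, adjacent.group(), repr(spaced.group()), lookaround.group())
```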
Boundaries are very conditional assertions; what they anchor depends on what they touch. See this answer for a detailed explanation, along with what else you can do to deal with it.
I'm writing a function that should retrieve every occurrence of the terms I pass to it.
I'm Italian, so I think I can be clearer with an example.
I want to check whether my phrase contains certain fruits.
OK, so let's look at my PHP code:
$pattern='<apple|orange|pear|lemon|Goji berry>i';
$phrase="I will buy an apple to do an applepie!";
preg_match_all($pattern,$phrase,$match);
The result will be an array with two matches: "apple" and the "apple" inside "applepie".
How can I match only exact occurrences?
Reading the manual I found:
http://php.net/manual/en/regexp.reference.anchors.php
I tried \A, \Z, ^ and $, but none of them seems to work correctly in my case!
Can someone help me?
EDIT: After @cris85's answer, I'll try to improve my question ...
My real pattern contains over 200 alternatives and the phrase is over 10000 characters long, so the real case is too large to include here.
After some trials I found an error on the entry "microsoft exchange"! Are there special characters that I must escape?
At the moment I escape "+", "-", ".", "?", "$" and "*".
The anchors you tried to use are for the full string, not per word. You can use word boundaries to match individual words. This should allow you to find only complete fruit matches:
$pattern='<\b(?:apple|orange|pear|lemon|Goji berry)\b>i';
The ?: is so you don't make an additional capture group, it is a non-capture group.
Here's the definition from regex-expressions of what a boundary matches:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
PHP Demo: https://3v4l.org/h5GCf
Regex Demo: https://regex101.com/r/5aBaMO/1/
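For comparison, the same approach sketched in Python. Building the alternation from a list with re.escape is my own addition, prompted by the escaping question in the EDIT; the variable names are made up:

```python
import re

fruits = ["apple", "orange", "pear", "lemon", "Goji berry"]
phrase = "I will buy an apple to do an applepie!"

# re.escape guards against entries that contain regex metacharacters.
alternation = "|".join(re.escape(f) for f in fruits)

# Without boundaries, the "apple" inside "applepie" is matched too.
loose = re.findall(alternation, phrase, re.IGNORECASE)

# With \b on both sides, only whole-word occurrences match.
exact = re.findall(r"\b(?:" + alternation + r")\b", phrase, re.IGNORECASE)

print(loose)
print(exact)
```

Escaping each entry programmatically avoids having to maintain a hand-written list of special characters.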
What is going on with the \w character type? At the moment it outputs an array called $replace that contains every name, but with only the first letter of each first name. I don't really understand what it's doing to get to this point. \w is any word character, but that doesn't help me.
<?php
$rappers = array('Drake Themotto', 'Tom Ford', 'Lil Wayne');
$replace = preg_replace('/(\w)\w* (\w)/', '\1 \2', $rappers);
print_r($replace);
?>
From left to right your regex contains:
A group with one word character
Zero or more word characters
A space
A group with one word character
For "Drake Themotto" this means:
The first group \1 will be "D"
The following word characters "rake" match but will not be stored
The space will not be stored
The second group \2 will be "T"
For the replacement this means that the matching part of your string is "Drake T". This matching string will be replaced by "\1 \2" which is "D T" in this case.
After that, there are some other characters "hemotto". You did not mention them in your regex, but since it does not contain a $ to mark the end of the string (in this case the regex would not match) or another \w* to match (= in this case: to remove) the other characters of the string, this rest simply will be ignored. Because you just "replace" something, "ignored" means that nothing will be replaced here and it will be appended to the result.
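The walkthrough above can be replayed in Python, whose re.sub uses the same \1 and \2 backreferences. This is a sketch; the second pattern with a trailing \w* is a hypothetical variant showing what happens when the rest of the last name is matched as well:

```python
import re

rappers = ["Drake Themotto", "Tom Ford", "Lil Wayne"]

# Group 1: first letter of the first name; \w* eats the rest of it;
# group 2: first letter of the last name. Only the matched span is replaced.
replaced = [re.sub(r"(\w)\w* (\w)", r"\1 \2", name) for name in rappers]

# Hypothetical variant: a trailing \w* also consumes the rest of the last name.
initials = [re.sub(r"(\w)\w* (\w)\w*", r"\1 \2", name) for name in rappers]

print(replaced)
print(initials)
```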
I'm trying to find the string "C#" in a text using php and reg exp.
I'm using
\bc\x{0023}\b
But doesn't work at all.
\bc\x{0023}
works but that's not a solution for me
Any clue ?
It's because the escape sequence \b means a word boundary. Word is defined according to the PHP manual as:
"A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word".".
A word boundary is the boundary between a word and a non-word. In other words, it is the position between a character that is a word character and a character that is not a word character. The problem is that # is not a word character. Thus, unless # is followed by a word character, #\b will never match.
Perhaps you should define more clearly using character classes what you want. For example /\bc#(?![a-z])/i (that is, C# that is not followed by a-z character range)
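A small sketch in Python (not PHP, though these patterns should behave the same way) comparing the original pattern with the lookahead version suggested above. The sample sentences are made up:

```python
import re

pattern_b = re.compile(r"\bc#\b", re.IGNORECASE)
pattern_la = re.compile(r"\bc#(?![a-z])", re.IGNORECASE)

# "#" followed by a space: no word character on either side, so \b fails.
end_of_word = pattern_b.search("I write C# daily")

# "#" followed by "x": here \b succeeds, which is probably not what was wanted.
glued = pattern_b.search("see c#x here")

# The negative lookahead accepts "C#" unless a letter follows.
fixed = pattern_la.search("I write C# daily")
rejected = pattern_la.search("see c#x here")

print(end_of_word, glued.group(), fixed.group(), rejected)
```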