Regex check using PHP preg_match_all function [duplicate] - php

I'm trying to use regexes to match space-separated numbers.
I can't find a precise definition of \b ("word boundary").
I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b) but it appears that this does not work. I'd be grateful to know of ways of .
[I am using Java regexes in Java 1.6]
Example:
Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());
String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());
pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());
This returns:
true
false
true

A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.
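The three results from the question can be reproduced outside Java. This is a minimal sketch in Python, on the assumption that `re.fullmatch` plays the same role as Java's `matches()` here (both require the whole string to match) and that the inputs are plain ASCII:

```python
import re

# The two patterns from the question; fullmatch() must consume the whole string.
with_boundary = re.compile(r"\s*\b-?\d+\s*")
without_boundary = re.compile(r"\s*-?\d+\s*")

plus_ok = with_boundary.fullmatch(" 12 ") is not None
minus_ok = with_boundary.fullmatch(" -12 ") is not None
minus_no_b_ok = without_boundary.fullmatch(" -12 ") is not None

# \b cannot match between the space and the dash: neither is a word character.
print(plus_ok, minus_ok, minus_no_b_ok)  # True False True
```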

In the course of learning regular expressions, I was really stuck on the \b metacharacter. I didn't understand its meaning and kept asking myself "what is it, what is it?". After some attempts using the website, I noticed the pink vertical dashes at the beginning and at the end of every word, and its meaning finally clicked: it is exactly a word (\w) boundary.
My answer is only meant to build intuition. For the logic behind it, see the other answers.

A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Word characters are alpha-numeric; a minus sign is not.
Taken from Regex Tutorial.
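The three positions listed above can be checked by asking the regex engine where \b matches. A small sketch in Python; `boundary_positions` is a hypothetical helper name, and the sample strings are made up for illustration:

```python
import re

def boundary_positions(text):
    # Hypothetical helper: every index in text at which \b matches.
    return [m.start() for m in re.finditer(r"\b", text)]

print(boundary_positions("-12"))    # [1, 3]  before the 1, after the 2
print(boundary_positions("ab cd"))  # [0, 2, 3, 5]
print(boundary_positions("--"))     # []  no word characters, no boundaries
```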

I would like to explain Alan Moore's answer
A word boundary is a position that is either preceded by a word character and not followed by one or followed by a word character and not preceded by one.
Suppose I have a string "This is a cat, and she's awesome", and I want to replace all occurrences of the letter 'a' only if this letter ('a') exists at the "Boundary of a word",
In other words: the letter a inside 'cat' should not be replaced.
So I'll perform regex (in Python) as
re.sub(r"\ba", "e", myString.strip())  # replace a with e
Therefore,
Input: This is a cat, and she's awesome
Output: This is e cat, end she's ewesome
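For completeness, here is the same substitution as a self-contained snippet. The variable name `my_string` is only illustrative:

```python
import re

my_string = "This is a cat, and she's awesome"

# \ba matches an "a" only when it starts a word, so the "a" in "cat" is untouched.
result = re.sub(r"\ba", "e", my_string.strip())
print(result)  # This is e cat, end she's ewesome
```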

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.

I talk about what \b-style regex boundaries actually are here.
The short story is that they’re conditional. Their behavior depends on what they’re next to.
# same as using a \b before:
(?(?=\w) (?<!\w) | (?<!\W) )
# same as using a \b after:
(?(?<=\w) (?!\w) | (?!\W) )
Sometimes that isn’t what you want. See my other answer for elaboration.

I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.
Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time).
The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.
Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b.
Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.
Anyway, in Java, if you're searching text for those weird-named languages, you need to replace the \b with before and after whitespace and punctuation designators. For example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String grep(String regexp, String multiLineStringToSearch) {
String result = "";
String[] lines = multiLineStringToSearch.split("\\n");
Pattern pattern = Pattern.compile(regexp);
for (String line : lines) {
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
result = result + "\n" + line;
}
}
return result.trim();
}
Then in your test or main function:
String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";
String afterWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
String text = "Programming in C, (C++) C#, Java, and .NET.";
System.out.println("text="+text);
// Here is where Java word boundaries do not work correctly on "cutesy" computer language names.
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));
System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text)); // Works Ok for this example, but see below
// Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
System.out.println("text="+text);
System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
// Make sure the first and last cases work OK.
text = "C is a language that should have been named differently.";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
text = "One language that should have been named differently is C";
System.out.println("text="+text);
System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
//Make sure we don't get false positives
text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
System.out.println("text="+text);
System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
P.S. My thanks to http://regexpal.com/ without whom the regex world would be very miserable!
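The same idea can be sketched with lookarounds instead of capturing the surrounding delimiter. This is a Python sketch rather than the Java above. The delimiter set is copied from beforeWord/afterWord, while `find_token`, `DELIMS`, `BEFORE`, and `AFTER` are hypothetical names introduced for illustration:

```python
import re

# Delimiters taken from the beforeWord/afterWord strings: whitespace . , ! ? ( ) ' "
DELIMS = "\\s.,!?()'\""
BEFORE = f"(?<![^{DELIMS}])"  # start of text, or preceded by a delimiter
AFTER = f"(?![^{DELIMS}])"    # end of text, or followed by a delimiter

def find_token(token, text):
    # Hypothetical helper: True if token occurs with a delimiter (or edge) on both sides.
    return re.search(BEFORE + re.escape(token) + AFTER, text) is not None

text1 = "Programming in C, (C++) C#, Java, and .NET."
text2 = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp."

print(re.search(r"\b\.NET\b", text1))  # None: no word boundary between " " and "."
print(find_token(".NET", text1))       # True
print(find_token("C#", text1))         # True
print(find_token("C++", text1))        # True
print(find_token("C", text2))          # False: "C&O" is not a delimited "C"
```

Because the lookarounds are zero-width, the match contains only the token itself, which is convenient if the search is later used for replacement.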

Check out the documentation on boundary conditions:
http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html
Check out this sample:
public static void main(final String[] args)
{
String x = "I found the value -12 in my string.";
System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
}
When you print it out, notice that the output is this:
[I found the value -,  in my string.]
This means that the "-" character is not being picked up as being on the boundary of a word because it's not considered a word character. Looks like @brianary kinda beat me to the punch, so he gets an up-vote.
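The same split can be tried in Python, where `re.split` is assumed to be a close enough stand-in for Java's `String.split` for this pattern:

```python
import re

x = "I found the value -12 in my string."

# \b sits between "-" and "1", so the dash stays in the left-hand piece.
parts = re.split(r"\b-?\d+\b", x)
print(parts)  # ['I found the value -', ' in my string.']
```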

Reference: Mastering Regular Expressions (Jeffrey E.F. Friedl) - O'Reilly
\b is equivalent to (?<!\w)(?=\w)|(?<=\w)(?!\w)
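One way to convince yourself of this equivalence is to compare where both patterns match on a few strings. A sketch in Python; the sample strings and the `positions` helper are made up for illustration:

```python
import re

EXPANDED = r"(?<!\w)(?=\w)|(?<=\w)(?!\w)"

def positions(pattern, text):
    # Hypothetical helper: indexes of every (zero-width) match of pattern in text.
    return [m.start() for m in re.finditer(pattern, text)]

samples = ["-12", "C# and .NET", "foo_bar baz", ""]
same = all(positions(r"\b", s) == positions(EXPANDED, s) for s in samples)
print(same)  # True
```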

A word boundary \b matches where one adjacent character is a word character and the other is a non-word character.
Regular Expression for negative number should be
--?\b\d+\b
check working DEMO

I believe that your problem is due to the fact that - is not a word character. Thus, the word boundary will match after the -, and so will not capture it. Word boundaries match before the first and after the last word characters in a string, as well as any place where before it is a word character or non-word character, and after it is the opposite. Also note that word boundary is a zero-width match.
One possible alternative is
(?:(?:^|\s)-?)\d+\b
This will match any numbers starting with a space character and an optional dash, and ending at a word boundary. It will also match a number starting at the beginning of the string.
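A quick check of that alternative, sketched in Python with a made-up sample string. Note that the match includes the leading whitespace. The second pattern, which uses a lookbehind so the whitespace is not consumed, is my own variation and not part of the answer above:

```python
import re

sample = "12 -12 a-3 x7"

# The pattern from the answer: the leading space is part of the match.
numbers = re.findall(r"(?:(?:^|\s)-?)\d+\b", sample)
print(numbers)  # ['12', ' -12']

# Assumed variation: require "no non-space character before" without consuming it.
numbers_trimmed = re.findall(r"(?<!\S)-?\d+\b", sample)
print(numbers_trimmed)  # ['12', '-12']
```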

When you use \\b(\\w+)+\\b, that means an exact match with a word containing only word characters ([a-zA-Z0-9_]).
In your case, for example, setting \\b at the beginning of the regex will accept -12 (with space) but it won't accept -12 (without space).
for reference to support my words: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html

I think it's the boundary (i.e. character following) of the last match or the beginning or end of the string.

Related

How do I escape the brackets in a mysql REGEXP [duplicate]

I have a regular expression to escape all special characters in a search string. This works great, however I can't seem to get it to work with word boundaries. For example, with the haystack
add +
or
add (+)
and the needle
+
the regular expression /\+/gi matches the "+". However the regular expression /\b\+/gi doesn't. Any ideas on how to make this work?
Using
add (plus)
as the haystack and /\bplus/gi as the regex, it matches fine. I just can't figure out why the escaped characters are having problems.
\b is a zero-width assertion: it doesn't consume any characters, it just asserts that a certain condition holds at a given position. A word boundary asserts that the position is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. (A "word character" is a letter, a digit, or an underscore.) In your string:
add +
...there's a word boundary at the beginning because the a is not preceded by a word character, and there's one after the second d because it's not followed by a word character. The \b in your regex (/\b\+/) is trying to match between the space and the +, which doesn't work because neither of those is a word character.
Try changing it to:
/\b\s?\+/gi
Edit:
Extend this concept as far as you want. If you want the first + after any word boundary:
/\b[^+]*\+/gi
Boundaries are very conditional assertions; what they anchor depends on what they touch. See this answer for a detailed explanation, along with what else you can do to deal with it.
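The behaviour described in this answer can be checked against both haystacks. The question uses JavaScript-style literals, so the sketch below is a Python translation under the assumption that \b, \s, and an escaped \+ behave the same way for these ASCII strings:

```python
import re

haystacks = ["add +", "add (+)"]

plain = [bool(re.search(r"\b\+", h)) for h in haystacks]
optional_space = [bool(re.search(r"\b\s?\+", h)) for h in haystacks]
skip_to_plus = [bool(re.search(r"\b[^+]*\+", h)) for h in haystacks]

print(plain)           # [False, False]  "+" never has a word character next to it
print(optional_space)  # [True, False]   works after "add ", not after "("
print(skip_to_plus)    # [True, True]    first "+" after any word boundary
```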

Regex PHP word boundaries?

Why doesn't this regex:
$match = preg_grep("%^\w{2,5}\b[a-zA-Z]%", $randarray);
return '123 Main Street' from $randarray = array('123 Main Street')?
These word boundaries are confusing me. When I type %^\w{2,5}\b[a-zA-Z]\b%, nothing happens either. Why?
A word boundary is not a character
A word boundary is \b. A word boundary is not a space, or any character at all. It is the transition between a word and a non-word, so it's actually a point between characters, not a character itself.
If you want to match 123 Main street, you will have to match a sequence of numbers, followed by a space, followed by (I think) one or more words. So something like
/^\w{2,5}(\s[a-zA-Z]+\b)+/
So the second group matches a space (that comes after the street number or the previous word of the name), a sequence of alphabetical characters, and a word boundary. It will match '123 main street', and just 'main street'.
Greedy/ungreedy
By default a regular expression is greedy and will match as many characters as possible. So in this case you won't actually need the word boundary at all. It won't match str if it can match street. So the following regular expression will have the same effect as the one above (unless you add an ungreedy modifier).
/^\w{2,5}(\s[a-zA-Z]+)+/
But for an ungreedy regular expression it is important. Compare
^\w{2,5}(\s[a-zA-Z]+?)+
and
^\w{2,5}(\s[a-zA-Z]+?\b)+
The first one will match 123 M, while the second one will match 123 Main street again.
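The greedy and ungreedy variants can be compared side by side. A sketch in Python, where `+?` is the lazy quantifier just as in PCRE; the helper name `matched` is hypothetical:

```python
import re

subject = "123 Main street"

def matched(pattern):
    # Hypothetical helper: the text matched at the start of subject.
    return re.match(pattern, subject).group(0)

greedy = matched(r"^\w{2,5}(\s[a-zA-Z]+)+")
lazy = matched(r"^\w{2,5}(\s[a-zA-Z]+?)+")
lazy_with_boundary = matched(r"^\w{2,5}(\s[a-zA-Z]+?\b)+")

print(greedy)              # 123 Main street
print(lazy)                # 123 M
print(lazy_with_boundary)  # 123 Main street
```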
Test your regexes
If you like to test this or other regular expressions, you can visit http://www.phpliveregex.com/ It allows you to test regular expressions to see how they work with a couple of preg_* functions.
Your expression:
^\w{2,5}\b[a-zA-Z]
Will match "123 Main Street" up until here:
123 Main Street
   ^
Note that the word boundary actually takes up no space at all, so the caret is positioned at the character that follows it.
At that point, it tries to match [a-zA-Z] and fails. Instead, you should match space:
^\w{2,5}\s+[a-zA-Z]
The word boundary will naturally occur due to the transition between \w and \s so I've taken that out.
Assuming that you want to validate that your subject "starts with a word between 2 and 5 chars long"
preg_match('%^\w{2,5}\b[a-zA-Z]*%', '123 Main Street')
(you're missing the *)
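The original pattern and the two suggested fixes can be compared directly. A sketch in Python rather than PHP, assuming the same semantics for \w, \b, and \s on this ASCII subject:

```python
import re

subject = "123 Main Street"

# Original: after "123" the boundary matches, but the next character is a space.
original = re.search(r"^\w{2,5}\b[a-zA-Z]", subject)

# Fix 1: consume the whitespace explicitly.
with_space = re.search(r"^\w{2,5}\s+[a-zA-Z]", subject).group(0)

# Fix 2: make the letters optional, so only the leading word is validated.
with_star = re.search(r"^\w{2,5}\b[a-zA-Z]*", subject).group(0)

print(original)    # None
print(with_space)  # 123 M
print(with_star)   # 123
```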

With word boundaries (\b) in RegEx do I need to have it before AND after the word, or just before?

If I want to match all occurrences of the word foo would I use \bfoo\b or without the last one? It seems both work, but what's proper?
You would need to use both. Without the last \b you would get a match on strings such as:
"I love football"
"You foolishly left off your second word boundary"
However, note that word boundary \b's definition is based on the definition for \w: a word boundary is defined when it is between a non-word character and a word character, where word character is defined by \w. \w for ASCII string is equivalent to [A-Za-z0-9_], so \bfoo\b also rejects cases such as:
foo123
3foo
foo_bar
fun_foo
Since digits and _ are considered word characters, if they are right next to foo, no word boundary is formed there; therefore \bfoo\b will not match any of the above.
Given the following input:
foo foobar
This regular expression will match only the first foo:
\bfoo\b
This regular expression will match the first and also the foo in foobar:
\bfoo
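Both claims are easy to verify. A short sketch in Python using the inputs from the answers above:

```python
import re

both_sides = re.findall(r"\bfoo\b", "foo foobar")
leading_only = re.findall(r"\bfoo", "foo foobar")
print(both_sides)    # ['foo']
print(leading_only)  # ['foo', 'foo']

# Digits and underscores are word characters, so none of these contain a standalone foo.
candidates = ["foo123", "3foo", "foo_bar", "fun_foo"]
accepted = [s for s in candidates if re.search(r"\bfoo\b", s)]
print(accepted)      # []
```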

How exactly do Regular Expression word boundaries work in PHP?

I'm currently writing a library for matching specific words in content.
Essentially the way it works is by compiling words into regular expressions, and running content through said regular expressions.
A feature I want to add is specifying whether a given word to match must start and/or end a word. For example, I have the word cat. I specify that it must start a word, so catering will match as cat is at the start, but ducat won't match as cat doesn't start the word.
I wanted to do this using word boundaries, but during some testing I found it doesn't work as I'd expect it to.
Take the following,
preg_match("/(^|\b)#nimal/i", "something#nimal", $match);
preg_match("/(^|\b)#nimal/i", "something!#nimal", $match);
In the statements above I would expect the following results,
> false
> 1 (#nimal)
But the result is instead the opposite,
> 1 (#nimal)
> false
In the first, I would expect it to fail as the group will eat the #, leaving nimal to match against #nimal, which obviously it doesn't. Instead, the group matches an empty string, so #nimal is matched, meaning # is considered to be part of the word.
In the second, I would expect the group to eat the ! leaving #nimal to match the rest (which it should). Instead, it appears to combine the ! and # together to form a word, which is confirmed by the following matching,
preg_match("/g\b!#\bn/i", "something!#nimal", $match);
Any ideas why regular expression does this?
I'd just love a page that clearly documents how word boundaries are determined, I just can't find one for the life of me.
The word boundary \b matches at a change between a \w (a word character) and a \W (a non-word character), in either direction. You want to match if there is a \b before your #, which is a \W character. So for the pattern to match, you need a word character before your #.
something#nimal
        ^^
==> Match because of the word boundary between g and #.
something!#nimal
         ^^
==> NO match because between ! and # there is no word boundary, both characters are \W
One problem I've encountered doing similar matching is words like can't and it's, where the apostrophe is considered a word/non-word boundary (as it is matched by \W and not \w). If that is likely to be a problem for you, you should treat the apostrophe (and variants such as ’ and ‘ that sometimes appear) as part of the word, for example with lookarounds built on a class such as [\w'’‘] in place of \b. Note that a class like [\b^'] will not do this, because inside a character class \b means a backspace character.
You might also experience problems with UTF8 characters that are genuinely part of the word (i.e. what us humans mean by a word), for example test your regex against how you encode a word such as Svašek.
It is therefore often easier when parsing normal "linguistic" text to look for "linguistic" boundaries such as space characters (not just literally spaces, but the full class including newlines and tabs), commas, colons, full-stops, etc (and angle-brackets if you are parsing HTML). YMMV.
# is not a word character. (This may depend on your locale, but by default a "word" character is any letter or digit or the underscore character (Source), so # is matched by \W, not \w. As linked, any \w\W or \W\w combination marks a \b position.) Therefore it is always the word boundary that matches in the OP's regex.
The following is similar to your regexes, with the difference that a is used instead of #. The beginning of the line is a word boundary as well (when the line starts with a word character), so there is no need to specify it separately:
$r = preg_match("/\b(animal)/i", "somethinganimal", $match);
var_dump($r, $match);
$r = preg_match("/\b(animal)/i", "something!animal", $match);
var_dump($r, $match);
Output:
int(0)
array(0) {
}
int(1)
array(2) {
[0]=>
string(6) "animal"
[1]=>
string(6) "animal"
}
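The behaviour from the question can be reproduced in Python, assuming \b behaves the same as in PCRE for these ASCII strings. The second pattern is not from any answer above: it is one assumed way to get the results the asker expected, using a lookbehind for "not preceded by a word character" instead of \b:

```python
import re

with_b = re.compile(r"(^|\b)#nimal", re.IGNORECASE)
print(bool(with_b.search("something#nimal")))    # True:  boundary between g and #
print(bool(with_b.search("something!#nimal")))   # False: ! and # are both \W

# Assumed alternative: the needle must not be preceded by a word character.
starts_word = re.compile(r"(?<!\w)#nimal", re.IGNORECASE)
print(bool(starts_word.search("something#nimal")))   # False
print(bool(starts_word.search("something!#nimal")))  # True
```

The same lookbehind still does what is wanted for ordinary words: `(?<!\w)cat` finds cat in catering but not in ducat.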

Regular expression fails on unicode

I'm trying to find the string "C#" in a text using php and reg exp.
I'm using
\bc\x{0023}\b
But it doesn't work at all.
\bc\x{0023}
works, but that's not a solution for me.
Any clue?
It's because the escape sequence \b means a word boundary. Word is defined according to the PHP manual as:
"A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word".".
A word boundary is the boundary between a word and a non-word. In other words, it lies between a character that is a word character and a character that is not a word character. The problem is that # is not a word character. Thus, unless # is followed by a word character, #\b will never match.
Perhaps you should define more clearly using character classes what you want. For example /\bc#(?![a-z])/i (that is, C# that is not followed by a-z character range)
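A sketch of the difference in Python, using a literal # in place of PCRE's \x{0023} and made-up sample strings:

```python
import re

text = "I write C# daily"

trailing_b = re.search(r"\bc#\b", text, re.IGNORECASE)
lookahead = re.search(r"\bc#(?![a-z])", text, re.IGNORECASE)
print(trailing_b)          # None: "#" is followed by a space, so no boundary
print(lookahead.group(0))  # C#

# The trailing \b only succeeds when a word character follows the "#".
odd_case = re.search(r"\bc#\b", "C#x", re.IGNORECASE)
print(odd_case.group(0))   # C#
```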
