I have a regular expression to escape all special characters in a search string. This works great, however I can't seem to get it to work with word boundaries. For example, with the haystack
add +
or
add (+)
and the needle
+
the regular expression /\+/gi matches the "+". However the regular expression /\b\+/gi doesn't. Any ideas on how to make this work?
Using
add (plus)
as the haystack and /\bplus/gi as the regex, it matches fine. I just can't figure out why the escaped characters are having problems.
\b is a zero-width assertion: it doesn't consume any characters, it just asserts that a certain condition holds at a given position. A word boundary asserts that the position is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. (A "word character" is a letter, a digit, or an underscore.) In your string:
add +
...there's a word boundary at the beginning because the a is not preceded by a word character, and there's one after the second d because it's not followed by a word character. The \b in your regex (/\b\+/) is trying to match between the space and the +, which doesn't work because neither of those is a word character.
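To make the positions concrete, here is a minimal JavaScript sketch. The helper name boundaryPositions is made up for illustration; it lists every index at which \b would hold, then re-checks the two patterns from the question.

```javascript
// Hypothetical helper: return every index in `text` where \b holds.
function boundaryPositions(text) {
  // Positions outside the string count as "not a word character".
  const isWord = (ch) => ch !== undefined && /\w/.test(ch);
  const positions = [];
  for (let i = 0; i <= text.length; i++) {
    // A boundary exists when exactly one neighbour is a word character.
    if (isWord(text[i - 1]) !== isWord(text[i])) positions.push(i);
  }
  return positions;
}

console.log(boundaryPositions("add +"));   // [0, 3]
console.log(/\b\+/.test("add +"));         // false: no boundary next to "+"
console.log(/\bplus/.test("add (plus)"));  // true: boundary between "(" and "p"
```

Index 4, the position right before the +, is missing from the list, which is why \b\+ cannot match there.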
Try changing it to:
/\b\s?\+/gi
Edit:
Extend this concept as far as you want. If you want the first + after any word boundary:
/\b[^+]*\+/gi
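A quick check of both ideas against the haystacks from the question, as a JavaScript sketch. The + is written escaped as \+ here, because an unescaped + directly after ? or * is not read as a literal plus sign. The g flag is left off so that repeated test() calls do not carry state between strings.

```javascript
const haystacks = ["add +", "add (+)"];

// At most one whitespace character between the word boundary and the "+".
const afterOptionalSpace = /\b\s?\+/i;
// Any run of non-"+" characters between the word boundary and the "+".
const firstPlusAfterBoundary = /\b[^+]*\+/i;

for (const h of haystacks) {
  console.log(h, afterOptionalSpace.test(h), firstPlusAfterBoundary.test(h));
}
// add + true true
// add (+) false true
```

The first pattern misses "add (+)" because a "(" sits between the space and the "+". The second one matches it, but the match also includes everything from the boundary up to the "+".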
Boundaries are very conditional assertions; what they anchor depends on what they touch. See this answer for a detailed explanation, along with what else you can do to deal with it.
I have a website where users can have custom actions when a keyword is detected in a sentence. How I currently do matches is like the following:
$output = array();
preg_match('/\b' . $keyword . '\b/', $phrase, $output);
If there is a match (if(count($output) > 0) {), the custom action is run. This is for spoken sentences, so keywords are things like operator. We also have a custom one called [silence], so that an action runs when silence is detected.
However, when the keyword contains brackets, for example [silence], the regex fails because of the square brackets. I have tried escaping both, like \b\[silence\]\b, but this does not detect a match either.
Also this is in PHP
Thanks in advance,
Joe
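The \b "word boundary" assertion only matches at a position with a word character on one side and a non-word character (or the start or end of the string) on the other. [ and ] are not word characters, so the \b before \[ can only match if the bracket is directly preceded by a word character, and the \b after \] only if it is directly followed by one.
From the Regex tutorial: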
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b. A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.
So you need to "rewrite" the \b expression in a form that suits your needs, like:
(?<=[\s\.,;])\[silence\](?=[\s\.,;])
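First comes a lookbehind for a "delimiter character" (whitespace, dot, comma, semicolon; you probably need to add a few more), then your expression, then a lookahead for a delimiter character again. The lookarounds check the delimiters without including them in the match. Note that, as written, the pattern requires a delimiter on both sides, so it will not match a keyword at the very start or end of the string.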
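The delimiter idea can be sketched as follows, in JavaScript rather than PHP so it can be run as-is. Both helper names are hypothetical, and escapeRegex only plays the role that preg_quote plays in PHP. As an extra assumption beyond the pattern above, the start and end of the input are also accepted as delimiters.

```javascript
// Escape regex metacharacters so the keyword is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: true if `keyword` occurs in `phrase` delimited by
// start/end of input or one of the listed delimiter characters.
function containsKeyword(phrase, keyword) {
  const delimiters = "[\\s.,;]";
  const pattern = new RegExp(
    `(?<=^|${delimiters})${escapeRegex(keyword)}(?=${delimiters}|$)`
  );
  return pattern.test(phrase);
}

console.log(containsKeyword("operator [silence] please", "[silence]")); // true
console.log(containsKeyword("[silence]", "[silence]"));                 // true
console.log(containsKeyword("x[silence]", "[silence]"));                // false
console.log(/\b\[silence\]\b/.test("operator [silence] please"));       // false
```

The last line shows the original \b version failing on the same phrase.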
With the help of SO, I was able to make a regular expression for my purposes, it works great, but it completely ignores special characters.
$pattern='/(?=.*\b\Q'.str_replace(' ','\E\b)(?=.*\b\Q',$requestedservice).'\E\b)/i';
preg_match($pattern, $item)
Here $requestedservice is the string that is matched against $item from the database.
The $item is Walk - Dance so if the $requestedservice is Walk - Dance as well, it's not matched, but if the $requestedservice is Walk Dance it is matched.
I am not sure why it's ignoring special characters like - / %
I am using html_entity_decode for the $requestedservice so that's not an issue.
Any guidance would be really helpful.
Your word boundaries are working against you. If one of the parts is a single ".", for instance, your pattern is /(?=.*\b\Q.\E\b)/i, which asserts that there is a literal . with a word boundary before and after it. Since . is a non-word character, that means there has to be a word character directly before and directly after it.
Instead you could use (?<!\w) in place of the first and third \b and (?!\w) in place of the second and fourth \b to specifically assert there is not a word character before and after each of your string parts that need to match.
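A rough JavaScript sketch of that replacement, assuming the same "one lookahead per space-separated part" structure as the question. JavaScript has no \Q...\E, so a hypothetical escapeRegex helper stands in for it, and matchesAllParts is likewise a made-up name.

```javascript
// Escape regex metacharacters (stand-in for \Q...\E).
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: every space-separated part of `requested` must occur
// in `item`, not directly preceded or followed by a word character.
function matchesAllParts(requested, item) {
  const lookaheads = requested
    .split(" ")
    .filter((part) => part.length > 0)
    .map((part) => `(?=.*(?<!\\w)${escapeRegex(part)}(?!\\w))`)
    .join("");
  return new RegExp(lookaheads, "i").test(item);
}

console.log(matchesAllParts("Walk - Dance", "Walk - Dance")); // true
console.log(matchesAllParts("Walk Dance", "Walk - Dance"));   // true
console.log(matchesAllParts("Walk", "Walking - Dance"));      // false
```

The "-" part now matches because it only needs to be free of adjacent word characters, while "Walk" still does not match inside "Walking".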
I'm looking for a regex that matches words in which a letter (or several letters) is repeated multiple times in a row.
Here's an example:
This is an exxxmaple oooonnnnllllyyyyy!
So far I haven't found anything that matches exactly:
exxxmaple and oooonnnnllllyyyyy
I need to find it and place them in an array, like this:
preg_match_all('/\b(???)\b/', $str, $arr);
Can somebody explain which regexp I have to use?
You can use a very simple regex like
\S*(\w)(?=\1+)\S*
See how the regex matches at http://regex101.com/r/rF3pR7/3
\S matches any character other than whitespace
* quantifier, zero or more occurrences of \S
(\w) matches a single word character and captures it in \1
(?=\1+) positive lookahead; asserts that the captured character is followed by itself (\1)
+ quantifier, one or more occurrences of the repeated character
\S* matches the rest of the run of non-whitespace characters
EDIT
If the character must repeat more than once (that is, appear three or more times in a row), a slight modification of the regex does the trick:
\S*(\w)(?=\1{2,})\S*
for example http://regex101.com/r/rF3pR7/5
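The difference between the two variants is easy to see on a short input. This is a JavaScript sketch with a made-up sample string; the backreference and lookahead are used the same way as in the patterns above.

```javascript
const text = "apple exxxmaple oooonnnnllllyyyyy!";

// A character followed by at least one more copy of itself (2+ in a row).
const twoOrMore = /\S*(\w)(?=\1+)\S*/g;
// A character followed by at least two more copies of itself (3+ in a row).
const threeOrMore = /\S*(\w)(?=\1{2,})\S*/g;

console.log(text.match(twoOrMore));
// [ 'apple', 'exxxmaple', 'oooonnnnllllyyyyy!' ]
console.log(text.match(threeOrMore));
// [ 'exxxmaple', 'oooonnnnllllyyyyy!' ]
```

Because \S also matches punctuation, the trailing "!" ends up inside the last match.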
Use this if you want to discard words like apple:
\b\w*(\w)(?=\1\1+)\w*\b
or
\b(?=[^\s]*(\w)\1\1+)\w+\b
Try this. See the demos:
http://regex101.com/r/kP8uF5/20
http://regex101.com/r/kP8uF5/21
You can use this pattern:
\b\w*?(\w)\1{2}\w*
The \w class and the word boundary \b limit the search to words. The word boundary can be removed; like the lazy quantifier, it only reduces the number of steps needed to find a match. Note too that if you are looking for words in the everyday sense, you need to remove the word boundary and use [a-zA-Z] instead of \w.
(\w)\1{2} checks whether a character is repeated. A word character is captured in group 1 and must be followed twice by the content of the capture group (the backreference \1), which means three identical characters in a row.
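Run against the sentence from the question, this pattern behaves as follows (a JavaScript sketch; matchAll is used so the captured character is visible next to each full match):

```javascript
const text = "This is an exxxmaple oooonnnnllllyyyyy!";
const tripled = /\b\w*?(\w)\1{2}\w*/g;

// Collect [full match, captured repeated character] pairs.
const found = Array.from(text.matchAll(tripled), (m) => [m[0], m[1]]);
console.log(found);
// [ [ 'exxxmaple', 'x' ], [ 'oooonnnnllllyyyyy', 'o' ] ]

console.log("apple".match(tripled)); // null: "pp" is only two in a row
```

The lazy \w*? makes group 1 capture the first tripled character in each word.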
If I want to match all occurrences of the word foo would I use \bfoo\b or without the last one? It seems both work, but what's proper?
You would need to use both. Without the last \b you would get a match on strings such as:
"I love football"
"You foolishly left off your second word boundary"
However, note that the definition of the word boundary \b is based on the definition of \w: a word boundary is a position between a non-word character and a word character, where "word character" means whatever \w matches. For ASCII strings, \w is equivalent to [A-Za-z0-9_], so \bfoo\b also rejects cases such as:
foo123
3foo
foo_bar
fun_foo
Since digits and _ are considered word characters, there is no word boundary between them and an adjacent foo, so \bfoo\b will not match any of the above.
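These cases can be checked directly. The sample list in this JavaScript sketch is only illustrative:

```javascript
const wholeFoo = /\bfoo\b/;
const samples = [
  "foo", "a foo.", "foo-bar",                // boundaries on both sides
  "I love football",                         // letter directly after foo
  "foo123", "3foo", "foo_bar", "fun_foo",    // digit or underscore next to foo
];

for (const s of samples) {
  console.log(JSON.stringify(s), wholeFoo.test(s));
}
// The first three print true, the remaining five print false.
```

Note that "foo-bar" still matches, because "-" is not a word character.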
Given the following input:
foo foobar
This regular expression will match only the first foo:
\bfoo\b
This regular expression will match the first and also the foo in foobar:
\bfoo
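The same comparison as a small JavaScript sketch that reports where each pattern matches (indexes is a hypothetical helper):

```javascript
const input = "foo foobar";

// Hypothetical helper: start index of every match of a global regex.
const indexes = (re) => Array.from(input.matchAll(re), (m) => m.index);

console.log(indexes(/\bfoo\b/g)); // [ 0 ]
console.log(indexes(/\bfoo/g));   // [ 0, 4 ]
```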
Can someone explain what this function
preg_replace('/&\w;/', '', $buf)
does? I have looked at various tutorials and found that it replaces the pattern /&\w;/ with string ''. But I can't understand the pattern /&\w;/. What does it represent?
Similarly in
preg_match_all("/(\b[\w+]+\b)/", $buf, $words)
I can't understand what the string "/(\b[\w+]+\b)/" represents.
Please help. Thanks in advance :)
The explanation of your first expression is simple, it is:
& # Match the character “&” literally
\w # Match a single character that is a “word character” (letters, digits, and underscores)
; # Match the character “;” literally
The second one is:
( # Match the regular expression below and capture its match into backreference number 1
\b # Assert position at a word boundary
[\w+] # Match a single character present in the list below
# A word character (letters, digits, and underscores)
# The character “+”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
)
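Both breakdowns can be tried on small inputs. This is a JavaScript sketch with made-up sample strings. The g flag is added because JavaScript's replace and match only handle the first match without it, whereas preg_replace and preg_match_all process every match.

```javascript
const buf = "a &b; c &amp; d";

// \w without a quantifier matches exactly one word character,
// so "&b;" is removed while "&amp;" is left alone.
console.log(JSON.stringify(buf.replace(/&\w;/g, ""))); // "a  c &amp; d"

// [\w+]+ is a run of word characters and/or literal "+" signs,
// but both ends of the match must sit on a word boundary.
console.log("C++ and a+b".match(/(\b[\w+]+\b)/g)); // [ 'C', 'and', 'a+b' ]
```

In "C++" only the "C" is returned: there is no word boundary after a trailing "+", so the match has to end right after the letter.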
The preg_replace function makes use of regular expressions. Regular expressions allow you to find patterns in text in a really powerful way.
To be able to use functions like preg_replace or preg_match I recommend you to take a look first at how regular expressions work.
You can gather a lot of info on this site http://www.regular-expressions.info/
And you can use software tools to help you understand the regex (like RegexBuddy)
In regular expressions, \w stands for any "word" character. That is: a-z, A-Z, 0-9 and underscore. \b stands for "word boundary", that is the beginning and end of a word (a series of word characters).
So, /&\w;/ is a regular expression that matches the & sign, followed by exactly one word character (there is no quantifier after \w), followed by a ;. For example, &a; would match, while &foobar; would not. preg_replace replaces each match with an empty string.
In the same manner, /(\b[\w+]+\b)/ matches a word boundary, followed by one or more characters that are each either a word character or a literal +, followed by another word boundary. The parentheses capture each match. So this regular expression essentially returns the words in a string as an array.